Understanding Negative Sampling in Graph Representation Learning

KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, July 2020, pp. 1666–1676.

Keywords:
noise contrastive estimation, mean reciprocal ranking, Monte Carlo Negative Sampling, graph convolutional network, Markov chain Monte Carlo

Abstract:

Graph representation learning has been extensively studied in recent years, in which sampling is a critical point. Prior arts usually focus on sampling positive node pairs, while the strategy for negative sampling is left insufficiently explored. To bridge the gap, we systematically analyze the role of negative sampling from the perspecti...
Introduction
  • Recent years have seen graph representation learning gradually stepping into the spotlight of data mining research.
  • The mainstream graph representation learning algorithms include traditional network embedding methods (e.g., DeepWalk [24], LINE [29]) and graph neural networks (e.g., GCN [14], GraphSAGE [10]); the latter are sometimes trained end-to-end for classification tasks.
Highlights
  • Recent years have seen graph representation learning gradually stepping into the spotlight of data mining research
  • We propose an effective and scalable negative sampling strategy, Markov chain Monte Carlo Negative Sampling (MCNS), which applies our theory with an approximated positive distribution based on current embeddings
  • To compare the efficiency of different negative sampling methods, we report the runtime of MCNS and of hard-sample- or GAN-based strategies (PinSAGE, WARP, dynamic negative sampling, KBGAN) with the GraphSAGE encoder on the recommendation task in Figure 5
  • We study the effect of negative sampling, a practical approach adopted in the literature of graph representation learning
  • Unlike existing works that choose a negative sampling distribution heuristically, we theoretically analyze the objective and the risk of the negative sampling approach and conclude that the negative sampling distribution should be positively but sub-linearly correlated with the positive sampling distribution (formalized in the block after this list)
  • Extensive experiments show that MCNS outperforms 8 negative sampling strategies, regardless of the underlying graph representation learning method
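A compact restatement of the principle in the bullet above, as a math sketch; writing the sub-linear dependence as an exponent α with 0 < α < 1 is an assumption consistent with "positively but sub-linearly correlated":

```latex
% Guiding principle for negative sampling: for a context node v, the
% negative distribution p_n should follow a damped version of the
% positive (data) distribution p_d.
p_n(u \mid v) \;\propto\; p_d(u \mid v)^{\alpha}, \qquad 0 < \alpha < 1 .
```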
Methods
  • 4.1 The Self-contrast Approximation

    Although the authors deduced that p_n(u|v) ∝ p_d(u|v)^α, the real p_d is unknown and its approximation is often only implicitly defined. – How can the principle then help negative sampling?

    The authors propose a self-contrast approximation, replacing p_d with inner products computed from the current encoder, i.e.

    p_n(u|v) ∝ p_d(u|v)^α ≈ (E_θ(u) · E_θ(v))^α / Σ_{u′∈U} (E_θ(u′) · E_θ(v))^α

    The resulting form is similar to a technique in RotatE [28], but it is very time-consuming to evaluate.
  • Because the denominator sums over all candidate nodes u′ ∈ U, each sample drawn from this distribution requires O(n) time, which is impractical for medium- or large-scale graphs.
  • To avoid this cost, the authors turn to the Metropolis-Hastings algorithm [20], which is designed to obtain a sequence of random samples from unnormalized distributions.
  • The Metropolis-Hastings algorithm constructs a Markov chain {X(t)} that is ergodic and stationary with respect to π, meaning that X(t) ∼ π(x) as t → ∞ (see the sketch after this list).
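A minimal Python sketch of these two ideas together: draw negatives for an anchor node v from the unnormalized self-contrast density (E_θ(u) · E_θ(v))^α with Metropolis-Hastings, so the O(n) normalizer is never evaluated. The uniform proposal, the clipping constant, α = 0.75, and all variable names are illustrative assumptions; the authors' actual MCNS uses its own proposal distribution and traversal order, which this sketch does not reproduce.

```python
import numpy as np

def mh_negative_samples(emb, v, num_samples, alpha=0.75, burn_in=100, rng=None):
    """Draw negative nodes for anchor v with Metropolis-Hastings.

    Target (unnormalized): p_n(u | v) ∝ max(E(u)·E(v), eps)^alpha, so the
    O(n) normalizing sum over all nodes is never evaluated.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = emb.shape[0]

    def density(u):
        # Unnormalized target; clip so raising to a power stays well defined.
        return max(float(emb[u] @ emb[v]), 1e-8) ** alpha

    state = int(rng.integers(n))              # arbitrary starting node
    samples = []
    for step in range(burn_in + num_samples):
        proposal = int(rng.integers(n))       # symmetric (uniform) proposal
        accept_prob = min(1.0, density(proposal) / density(state))
        if rng.random() < accept_prob:
            state = proposal                  # accept the move
        if step >= burn_in:
            samples.append(state)             # keep post-burn-in states
    return samples

# Toy usage: 1,000 nodes with 32-dimensional embeddings.
emb = np.random.default_rng(0).normal(size=(1000, 32))
print(mh_negative_samples(emb, v=7, num_samples=5))
```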
Results
  • The authors demonstrate the results across the 19 settings.
  • The tests in the 19 settings give a maximum p-value (Amazon, GraphSAGE) of 0.0005 ≪ 0.01, quantitatively confirming the significance of the improvements.
  • Recommendation is one of the most important technologies in many e-commerce platforms, and it has evolved from collaborative filtering to graph-based models.
  • Graph-based recommender systems represent all users and items by embeddings, and recommend the items with the largest inner products for a given user (a minimal sketch follows)
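A minimal sketch of this retrieval step, assuming trained user and item embeddings; the names user_emb and item_emb are hypothetical, and a production system would typically replace the exhaustive scoring with approximate nearest-neighbor search:

```python
import numpy as np

def recommend_top_k(user_emb, item_emb, k=10):
    """Return indices of the k items with the largest inner product to the user."""
    scores = item_emb @ user_emb               # (num_items,) inner products
    top_k = np.argpartition(-scores, k)[:k]    # unordered top-k in O(num_items)
    return top_k[np.argsort(-scores[top_k])]   # order the k winners by score

# Toy usage: 5,000 items with 64-dimensional embeddings.
rng = np.random.default_rng(0)
user_emb = rng.normal(size=64)
item_emb = rng.normal(size=(5000, 64))
print(recommend_top_k(user_emb, item_emb, k=10))
```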
Conclusion
  • The authors study the effect of negative sampling, a practical approach adopted in the literature of graph representation learning.
  • Motivated by the theoretical results, the authors propose MCNS, approximating the ideal distribution by self-contrast and accelerating sampling by Metropolis-Hastings.
  • Extensive experiments show that MCNS outperforms 8 negative sampling strategies, regardless of the underlying graph representation learning methods
Tables
  • Table1: Statistics of the tasks and datasets
  • Table2: Recommendation Results of MCNS with various encoders on three datasets. GCN is not evaluated on the Amazon and Alibaba datasets due to its limited scalability
  • Table3: The results of link prediction with different negative sampling strategies on the Arxiv dataset
  • Table4: Micro-F1 scores for multi-label classification on the BlogCatalog dataset. Similar trends hold for Macro-F1 scores
Related work
  • Graph Representation Learning. The mainstream of graph representation learning diverges into two main topics: traditional network embedding and GNNs. Traditional network embedding cares more about the distribution of positive node pairs. Inspired by the skip-gram model [21], DeepWalk [24] learns embeddings by sampling "context" nodes for each vertex with random walks and maximizing the log-likelihood of the observed context nodes for the given vertex. LINE [29] and node2vec [8] extend DeepWalk with various positive distributions. GNNs are deep-learning-based methods that generalize the convolution operation to graph data. Kipf and Welling [14] design GCNs by approximating localized first-order spectral convolution. For scalability, GraphSAGE [10] employs neighbor sampling to alleviate receptive field expansion. FastGCN [3] further improves the sampling algorithm and adopts importance sampling in each layer.
References
  • [1] Liwei Cai and William Yang Wang. 2018. KBGAN: Adversarial Learning for Knowledge Graph Embeddings. In NAACL-HLT'18. 1470–1480.
  • [2] Hugo Caselles-Dupré, Florian Lesaint, and Jimena Royo-Letelier. 2018. Word2vec applied to recommendation: Hyperparameters matter. In RecSys'18. ACM, 352–356.
  • [3] Jie Chen, Tengfei Ma, and Cao Xiao. 2018. FastGCN: Fast learning with graph convolutional networks via importance sampling. In ICLR'18.
  • [4] Siddhartha Chib and Edward Greenberg. 1995. Understanding the Metropolis-Hastings algorithm. The American Statistician 49, 4 (1995), 327–335.
  • [5] Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of recommender algorithms on top-n recommendation tasks. In RecSys'10. ACM, 39–46.
  • [6] Ming Ding, Jie Tang, and Jie Zhang. 2018. Semi-supervised learning on graphs with generative adversarial nets. In CIKM'18. ACM, 913–922.
  • [7] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9 (2008), 1871–1874.
  • [8] Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In KDD'16. ACM, 855–864.
  • [9] Michael U Gutmann and Aapo Hyvärinen. 2012. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research 13 (2012), 307–361.
  • [10] Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In NIPS'17. 1024–1034.
  • [11] Henry Hsu and Peter A Lachenbruch. 2007. Paired t test. Wiley Encyclopedia of Clinical Trials (2007), 1–3.
  • [12] Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative filtering for implicit feedback datasets. In ICDM'08. IEEE, 263–272.
  • [13] Hong Huang, Jie Tang, Sen Wu, Lu Liu, and Xiaoming Fu. 2014. Mining triadic closure patterns in social networks. In WWW'14. 499–504.
  • [14] Thomas N Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In ICLR'17.
  • [15] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. 2007. Graph evolution: Densification and shrinking diameters. TKDD 1, 1 (2007), 2–es.
  • [16] Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In NIPS'14. 2177–2185.
  • [17] Qimai Li, Zhichao Han, and Xiao-Ming Wu. 2018. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI'18.
  • [18] Greg Linden, Brent Smith, and Jeremy York. 2003. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing 7, 1 (2003), 76–80.
  • [19] Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. 2015. Image-based recommendations on styles and substitutes. In SIGIR'15. ACM, 43–52.
  • [20] Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. 1953. Equation of state calculations by fast computing machines. The Journal of Chemical Physics 21, 6 (1953), 1087–1092.
  • [21] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS'13.
  • [22] Andriy Mnih and Koray Kavukcuoglu. 2013. Learning word embeddings efficiently with noise-contrastive estimation. In NIPS'13. 2265–2273.
  • [23] Rong Pan, Yunhong Zhou, Bin Cao, Nathan N Liu, Rajan Lukose, Martin Scholz, and Qiang Yang. 2008. One-class collaborative filtering. In ICDM'08. IEEE, 502–511.
  • [24] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: Online learning of social representations. In KDD'14. ACM, 701–710.
  • [25] Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, and Jie Tang. 2018. Network embedding as matrix factorization: Unifying DeepWalk, LINE, PTE, and node2vec. In WSDM'18.
  • [26] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In UAI'09. AUAI Press, 452–461.
  • [27] Kazunari Sugiyama and Min-Yen Kan. 2010. Scholarly paper recommendation via user's recent research interests. In JCDL'10. ACM, 29–38.
  • [28] Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. 2019. RotatE: Knowledge graph embedding by relational rotation in complex space. arXiv preprint arXiv:1902.10197 (2019).
  • [29] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. LINE: Large-scale information network embedding. In WWW'15. 1067–1077.
  • [30] Cunchao Tu, Han Liu, Zhiyuan Liu, and Maosong Sun. 2017. CANE: Context-aware network embedding for relation modeling. In ACL'17. 1722–1731.
  • [31] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks. In ICLR'18.
  • [32] Jun Wang, Lantao Yu, Weinan Zhang, Yu Gong, Yinghui Xu, Benyou Wang, Peng Zhang, and Dell Zhang. 2017. IRGAN: A minimax game for unifying generative and discriminative information retrieval models. In SIGIR'17. ACM, 515–524.
  • [33] Qinyong Wang, Hongzhi Yin, Zhiting Hu, Defu Lian, Hao Wang, and Zi Huang. 2018. Neural memory streaming recommender networks with adversarial training. In KDD'18. ACM, 2467–2475.
  • [34] Xiao Wang, Peng Cui, Jing Wang, Jian Pei, Wenwu Zhu, and Shiqiang Yang. 2017. Community preserving network embedding. In AAAI'17.
  • [35] Jason Weston, Samy Bengio, and Nicolas Usunier. 2011. WSABIE: Scaling up to large vocabulary image annotation. In IJCAI'11.
  • [36] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2018. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826 (2018).
  • [37] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. 2018. Graph convolutional neural networks for web-scale recommender systems. In KDD'18. ACM, 974–983.
  • [38] Weinan Zhang, Tianqi Chen, Jun Wang, and Yong Yu. 2013. Optimizing top-n collaborative filtering via dynamic negative item sampling. In SIGIR'13. ACM, 785–788.
  • [39] Zheng Zhang and Pierre Zweigenbaum. 2018. GNEG: Graph-based negative sampling for word2vec. In ACL'18. 566–571.
  • [40] Tong Zhao, Julian McAuley, and Irwin King. 2015. Improving latent factor models via personalized feature projection for one class recommendation. In CIKM'15. ACM, 821–830.
  • [41] Chang Zhou, Yuqiong Liu, Xiaofei Liu, Zhongyi Liu, and Jun Gao. 2017. Scalable graph embedding for asymmetric proximity. In AAAI'17.
  • Hits@k: (3) Examine whether the rank of u_i·v_i^⊤ (the inner-product score of the held-out item) is less than k (hit) or not (miss).
  • MRR assigns different scores to the different ranks of item v in the ranked list mentioned above when querying user u; a minimal sketch of both metrics follows.
  • Footnotes: 3 https://grouplens.org/datasets/movielens/100k/  4 http://jmcauley.ucsd.edu/data/amazon/links.html  5 We use the A+dataset to denote the real data we collected.
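A minimal sketch of the two metrics under their standard definitions (MRR as the mean reciprocal rank of the held-out item; Hits@k as the fraction of queries whose held-out item ranks within the top k); the 1-based ranking convention and the variable names are assumptions:

```python
import numpy as np

def mrr_and_hits_at_k(true_ranks, k=10):
    """true_ranks: 1-based rank of the held-out (ground-truth) item per query."""
    ranks = np.asarray(true_ranks, dtype=float)
    mrr = float(np.mean(1.0 / ranks))        # average reciprocal rank
    hits_at_k = float(np.mean(ranks <= k))   # share of queries ranked in the top k
    return mrr, hits_at_k

# Toy usage: ranks of the held-out item for five users.
print(mrr_and_hits_at_k([1, 3, 12, 2, 50], k=10))
```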
  • (1) We extract 30% of the true edges and remove them from the whole graph, while ensuring that the residual graph stays connected.
  • (2) We sample 30% false edges, and then combine the 30% of true edges and the false edges into a test set.
  • (3) Each graph embedding algorithm, with various negative sampling strategies, is trained on the residual graph, and then the embedding for each node is obtained.
  • (4) We predict the probability of a node pair being a true edge according to the inner product of its node embeddings, and finally calculate the AUC score from these probabilities via the roc_auc_score function (a minimal sketch follows).
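A minimal sketch of steps (3)–(4), assuming node embeddings have already been trained; roc_auc_score is the scikit-learn function mentioned above, while the array and helper names are illustrative:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def link_prediction_auc(emb, test_edges, test_labels):
    """Score each candidate edge (u, v) by the inner product of its endpoint
    embeddings, then compute AUC against the true/false edge labels."""
    scores = np.array([emb[u] @ emb[v] for u, v in test_edges])
    return roc_auc_score(test_labels, scores)

# Toy usage: 100 nodes, 16-dim embeddings, three true and three false test edges.
emb = np.random.default_rng(0).normal(size=(100, 16))
edges = [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9), (10, 11)]
labels = [1, 1, 1, 0, 0, 0]
print(link_prediction_auc(emb, edges, labels))
```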