# Understanding Negative Sampling in Graph Representation Learning

KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, July 2020, pp. 1666–1676.

Keywords:

noise contrastive estimation, mean reciprocal ranking, Monte Carlo Negative Sampling, graph convolutional network, Markov chain Monte Carlo

Abstract:

Graph representation learning has been extensively studied in recent years, in which sampling is a critical point. Prior work usually focuses on sampling positive node pairs, while the strategy for negative sampling is left insufficiently explored. To bridge the gap, we systematically analyze the role of negative sampling from the perspecti…

Introduction

- Recent years have seen graph representation learning gradually step into the spotlight of data mining research.
- Mainstream graph representation learning algorithms include traditional network embedding methods (e.g., DeepWalk [24], LINE [29]) and Graph Neural Networks (e.g., GCN [14], GraphSAGE [10]); the latter are sometimes trained end-to-end on classification tasks.

Highlights

- Recent years have seen graph representation learning gradually step into the spotlight of data mining research
- We propose an effective and scalable negative sampling strategy, Markov chain Monte Carlo Negative Sampling (MCNS), which applies our theory with an approximated positive distribution based on current embeddings
- To compare the efficiency of different negative sampling methods, we report the runtime of Monte Carlo Negative Sampling (MCNS) and of hard-sample- or generative-adversarial-net-based strategies (PinSAGE, WARP, dynamic negative sampling, KBGAN) with a GraphSAGE encoder on the recommendation task in Figure 5
- We study the effect of negative sampling, a practical approach adopted in the literature of graph representation learning
- Different from existing works that decide a proper distribution for negative sampling heuristically, we theoretically analyze the objective and risk of the negative sampling approach and conclude that the negative sampling distribution should be positively but sub-linearly correlated with the positive sampling distribution
- Extensive experiments show that Monte Carlo Negative Sampling outperforms 8 negative sampling strategies, regardless of the underlying graph representation learning methods
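The sub-linear correlation in the point above can be illustrated numerically: raising a positive distribution to a power α ∈ (0, 1) and renormalizing flattens it, so frequently co-occurring nodes are still sampled more often as negatives, but less disproportionately. A toy sketch (the function name and the concrete numbers are illustrative, not from the paper):

```python
# Toy illustration: a positive distribution p_d over candidate nodes,
# and the sub-linear negative distribution p_n ∝ p_d^alpha, alpha in (0, 1).
def sublinear_negative_dist(p_d, alpha=0.75):
    """Return p_n(u) proportional to p_d(u)**alpha, renormalized to sum to 1."""
    powered = [p ** alpha for p in p_d]
    z = sum(powered)
    return [p / z for p in powered]

p_d = [0.6, 0.3, 0.1]                 # skewed positive distribution
p_n = sublinear_negative_dist(p_d)
# p_n keeps the same ordering as p_d but is flatter:
assert p_n[0] > p_n[1] > p_n[2]
assert p_n[0] < p_d[0] and p_n[2] > p_d[2]
```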

Methods

- 4.1 The Self-contrast Approximation

The authors deduced that p_n(u|v) ∝ p_d(u|v)^α; however, the real p_d is unknown, and its approximation is often implicitly defined. How can the principle help negative sampling?

The authors propose a self-contrast approximation, replacing pd by inner products based on the current encoder, i.e.

p_n(u|v) ∝ p_d(u|v)^α ≈ (E_θ(u) · E_θ(v))^α / Σ_{u′∈U} (E_θ(u′) · E_θ(v))^α

The resulting form is similar to a technique in RotatE [28], but is very time-consuming.
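A naive implementation of the self-contrast approximation makes the cost explicit: every draw scores the query node against all n candidates before normalizing. This is a sketch under assumed names and shapes, not the authors' code:

```python
import random

def self_contrast_sample(emb, v, alpha=0.75):
    """Draw one negative for query node v from p_n(u|v) ∝ (E(u)·E(v))^alpha.
    Scores all n candidates per draw, hence O(n) time per sample."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    # Clamp scores to a tiny positive value so the power is well defined.
    scores = [max(dot(e, emb[v]), 1e-12) ** alpha for e in emb]
    z = sum(scores)
    r, acc = random.random() * z, 0.0
    for u, s in enumerate(scores):       # inverse-CDF sampling: O(n)
        acc += s
        if acc >= r:
            return u
    return len(emb) - 1

emb = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]   # toy embeddings
u = self_contrast_sample(emb, v=0)
assert u in (0, 1, 2)
```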
- Each sampling step requires O(n) time, making it impractical for medium- or large-scale graphs.
- The Metropolis-Hastings algorithm [20] is designed to obtain a sequence of random samples from unnormalized distributions.
- The Metropolis-Hastings algorithm constructs a Markov chain {X(t)} that is ergodic and stationary with respect to π, meaning that X(t) ∼ π(x) as t → ∞.

Results

- The authors demonstrate the results in 19 settings.
- The tests across the 19 settings give a maximum p-value (Amazon, GraphSAGE) of 0.0005 ≪ 0.01, quantitatively confirming the significance of the improvements.
- Recommendation is a core technology of many e-commerce platforms, and has evolved from collaborative filtering to graph-based models.
- Graph-based recommender systems represent all users and items by embeddings, and recommend the items with the largest inner products for a given user.
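The inner-product retrieval step described above can be sketched in a few lines (toy embeddings and helper names are illustrative, not the authors' implementation):

```python
def recommend_top_k(user_emb, item_embs, k=2):
    """Rank items by inner product with the user embedding; return top-k indices."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    scored = sorted(range(len(item_embs)),
                    key=lambda i: dot(user_emb, item_embs[i]),
                    reverse=True)
    return scored[:k]

user = [1.0, 0.2]
items = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.3]]
# Scores: 0.92, 0.28, 1.06 -> item 2 first, then item 0.
assert recommend_top_k(user, items) == [2, 0]
```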

Conclusion

- The authors study the effect of negative sampling, a practical approach adopted in the literature of graph representation learning.
- Motivated by the theoretical results, the authors propose MCNS, approximating the ideal distribution by self-contrast and accelerating sampling by Metropolis-Hastings.
- Extensive experiments show that MCNS outperforms 8 negative sampling strategies, regardless of the underlying graph representation learning methods

Tables

- Table1: Statistics of the tasks and datasets
- Table2: Recommendation results of MCNS with various encoders on three datasets. GCN is not evaluated on the Amazon and Alibaba datasets due to its limited scalability
- Table3: The results of link prediction with different negative sampling strategies on the Arxiv dataset
- Table4: Micro-F1 scores for multi-label classification on the BlogCatalog dataset. Similar trends hold for Macro-F1 scores

Related work

- Graph Representation Learning. Graph representation learning diverges into two main topics: traditional network embedding and GNNs. Traditional network embedding cares more about the distribution of positive node pairs. Inspired by the skip-gram model [21], DeepWalk [24] learns embeddings by sampling "context" nodes for each vertex with random walks, maximizing the log-likelihood of the observed context nodes for the given vertex. LINE [29] and node2vec [8] extend DeepWalk with various positive distributions. GNNs are deep-learning-based methods that generalize the convolution operation to graph data. Kipf and Welling [14] design GCNs by approximating localized first-order spectral convolution. For scalability, GraphSAGE [10] employs neighbor sampling to alleviate receptive-field expansion. FastGCN [3] further improves the sampling algorithm and adopts importance sampling in each layer.

Reference

- Liwei Cai and William Yang Wang. 2018. KBGAN: Adversarial Learning for Knowledge Graph Embeddings. In NAACL-HLT'18. 1470–1480.
- Hugo Caselles-Dupré, Florian Lesaint, and Jimena Royo-Letelier. 2018. Word2vec applied to recommendation: Hyperparameters matter. In RecSys’18. ACM, 352– 356.
- Jie Chen, Tengfei Ma, and Cao Xiao. 2018. FastGCN: fast learning with graph convolutional networks via importance sampling. ICLR’18 (2018).
- Siddhartha Chib and Edward Greenberg. 1995. Understanding the Metropolis-Hastings algorithm. The American Statistician 49, 4 (1995), 327–335.
- Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the fourth ACM conference on Recommender systems. ACM, 39–46.
- Ming Ding, Jie Tang, and Jie Zhang. 2018. Semi-supervised learning on graphs with generative adversarial nets. In CIKM’18. ACM, 913–922.
- Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of machine learning research 9, Aug (2008), 1871–1874.
- Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In KDD’16. ACM, 855–864.
- Michael U Gutmann and Aapo Hyvärinen. 2012. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research 13, Feb (2012), 307–361.
- Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In NIPS’17. 1024–1034.
- Henry Hsu and Peter A Lachenbruch. 2007. Paired t test. Wiley encyclopedia of clinical trials (2007), 1–3.
- Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative filtering for implicit feedback datasets. In ICDM'08. IEEE, 263–272.
- Hong Huang, Jie Tang, Sen Wu, Lu Liu, and Xiaoming Fu. 2014. Mining triadic closure patterns in social networks. In WWW’14. 499–504.
- Thomas N Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. ICLR’17 (2017).
- Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. 2007. Graph evolution: Densification and shrinking diameters. TKDD’07 1, 1 (2007), 2–es.
- Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In NIPS’14. 2177–2185.
- Qimai Li, Zhichao Han, and Xiao-Ming Wu. 2018. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI’18.
- Greg Linden, Brent Smith, and Jeremy York. 2003. Amazon. com recommendations: Item-to-item collaborative filtering. IEEE Internet computing 7, 1 (2003), 76–80.
- Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. 2015. Image-based recommendations on styles and substitutes. In SIGIR’15. ACM, 43–52.
- Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. 1953. Equation of state calculations by fast computing machines. The journal of chemical physics 21, 6 (1953), 1087–1092.
- Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013.
- Andriy Mnih and Koray Kavukcuoglu. 2013. Learning word embeddings efficiently with noise-contrastive estimation. In NIPS’13. 2265–2273.
- Rong Pan, Yunhong Zhou, Bin Cao, Nathan N Liu, Rajan Lukose, Martin Scholz, and Qiang Yang. 2008. One-class collaborative filtering. In ICDM’08. IEEE, 502– 511.
- Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. Deepwalk: Online learning of social representations. In KDD’14. ACM, 701–710.
- Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, and Jie Tang. 2018.
- Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In UAI’09. AUAI Press, 452–461.
- Kazunari Sugiyama and Min-Yen Kan. 2010. Scholarly paper recommendation via user’s recent research interests. In JCDL’10. ACM, 29–38.
- Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. 2019. Rotate: Knowledge graph embedding by relational rotation in complex space. arXiv preprint arXiv:1902.10197 (2019).
- Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. Line: Large-scale information network embedding. In WWW’15. 1067–1077.
- Cunchao Tu, Han Liu, Zhiyuan Liu, and Maosong Sun. 2017. Cane: Context-aware network embedding for relation modeling. In ACL’17. 1722–1731.
- Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks. ICLR’18 (2018).
- Jun Wang, Lantao Yu, Weinan Zhang, Yu Gong, Yinghui Xu, Benyou Wang, Peng Zhang, and Dell Zhang. 2017. Irgan: A minimax game for unifying generative and discriminative information retrieval models. In SIGIR’17. ACM, 515–524.
- Qinyong Wang, Hongzhi Yin, Zhiting Hu, Defu Lian, Hao Wang, and Zi Huang. 2018. Neural memory streaming recommender networks with adversarial training. In KDD'18. ACM, 2467–2475.
- Xiao Wang, Peng Cui, Jing Wang, Jian Pei, Wenwu Zhu, and Shiqiang Yang. 2017. Community preserving network embedding. In AAAI'17.
- Jason Weston, Samy Bengio, and Nicolas Usunier. 2011. Wsabie: Scaling up to large vocabulary image annotation. In IJCAI'11.
- Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2018. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826 (2018).
- Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. 2018. Graph convolutional neural networks for web-scale recommender systems. In KDD'18. ACM, 974–983.
- Weinan Zhang, Tianqi Chen, Jun Wang, and Yong Yu. 2013. Optimizing Top-N Collaborative Filtering via Dynamic Negative Item Sampling. In SIGIR'13. ACM, 785–788.
- Zheng Zhang and Pierre Zweigenbaum. 2018. GNEG: Graph-Based Negative Sampling for word2vec. In ACL'18. 566–571.
- Tong Zhao, Julian McAuley, and Irwin King. 2015. Improving latent factor models via personalized feature projection for one class recommendation. In CIKM'15. ACM, 821–830.
- Chang Zhou, Yuqiong Liu, Xiaofei Liu, Zhongyi Liu, and Jun Gao. 2017. Scalable graph embedding for asymmetric proximity. In AAAI'17.

Evaluation setup

- Hits@k examines whether the rank of the true item v in the ranked list for user u is less than k (a hit) or not (a miss). MRR instead assigns a score to each query based on the rank of item v in that ranked list.
- Datasets: MovieLens-100K (https://grouplens.org/datasets/movielens/100k/) and Amazon (http://jmcauley.ucsd.edu/data/amazon/links.html); "A+dataset" denotes the real data the authors collected.
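Given the (1-based) rank of the true item in each query's ranked list, both metrics reduce to one-liners. A small sketch (helper name and example ranks are illustrative):

```python
def hits_and_mrr(ranks, k):
    """Given 1-based ranks of the true item for each query,
    return (Hits@k, MRR) averaged over all queries."""
    hits = sum(1 for r in ranks if r <= k) / len(ranks)
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    return hits, mrr

hits, mrr = hits_and_mrr([1, 3, 10], k=3)
assert hits == 2 / 3                              # two of three queries hit top-3
assert abs(mrr - (1 + 1/3 + 1/10) / 3) < 1e-12
```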
- (1) We extract 30% of true edges and remove them from the whole graph while ensuring the residual graph is connected.
- (2) We sample 30% false edges, and then combine the 30% of true edges and the false edges into a test set.
- (3) Each graph embedding algorithm with various negative sampling strategies is trained using the residual graph, and then the embedding for each node is obtained.
- (4) We predict the probability of a node pair being a true edge according to the inner product, and finally calculate the AUC score from these probabilities via the roc_auc_score function.
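For the final step, AUC equals the probability that a randomly chosen true edge outscores a randomly chosen false edge (ties counted as 0.5). A dependency-free sketch of that computation (the authors use roc_auc_score; this pairwise form is an equivalent definition for binary labels, written here for illustration):

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs where the
    positive outscores the negative; ties count as 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]   # e.g. inner products of node pairs
# Pairs: (0.9,0.5) win, (0.9,0.1) win, (0.4,0.5) loss, (0.4,0.1) win -> 3/4.
assert auc_score(labels, scores) == 0.75
```

The quadratic pairwise loop is fine for a test set of this size; a rank-based formula is preferable at scale.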
