# GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training

KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, July 2020, pp. 1150–1160.

Keywords:

structural representation, Deep Graph Kernel, graph representation learning, graph dataset, pre-training

Abstract:

Graph representation learning has emerged as a powerful technique for addressing real-world problems. Various downstream graph learning tasks have benefited from its recent developments, such as node classification, similarity search, and graph classification. However, prior arts on graph representation learning focus on domain specific p...

Introduction

- Representative graph structural patterns are universal and transferable across networks.
- Barabasi and Albert show that several types of networks, e.g., World Wide Web, social, and biological networks, have the scale-free property, i.e., all of their degree distributions follow a power law [1].
- Other common patterns across networks include small world [58], motif distribution [31], community organization [34], and core-periphery structure [6], validating the hypothesis at the conceptual level
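The scale-free property cited above can be checked empirically. The sketch below grows a small Barabási–Albert graph via preferential attachment and estimates the tail exponent of its degree distribution with a simple maximum-likelihood fit; the graph size, attachment parameter, and cutoff are illustrative choices, not values from the paper.

```python
# Hedged sketch: check the scale-free property on a synthetic Barabasi-Albert
# graph. Parameters (n, m, kmin) are illustrative, not from the paper.
import collections
import math
import random

def barabasi_albert(n, m, seed=0):
    """Grow a graph by preferential attachment: each new vertex links to m
    existing vertices chosen proportionally to their current degree."""
    rng = random.Random(seed)
    targets = list(range(m))   # initial vertices
    repeated = []              # vertex list weighted by current degree
    edges = []
    for v in range(m, n):
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(repeated) if repeated else rng.choice(targets))
        for u in chosen:
            edges.append((v, u))
        repeated.extend(chosen)
        repeated.extend([v] * m)
    return edges

edges = barabasi_albert(2000, 2)
degree = collections.Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# A power-law degree distribution looks roughly linear on a log-log plot; here
# we just estimate the tail exponent with a simple continuous MLE.
kmin = 2
ks = [d for d in degree.values() if d >= kmin]
alpha = 1 + len(ks) / sum(math.log(k / (kmin - 0.5)) for k in ks)
print(f"estimated power-law exponent: {alpha:.2f}")  # theory predicts ~3 for BA
```

The estimate fluctuates with the seed and cutoff, but for Barabási–Albert graphs it should land in the vicinity of the theoretical exponent of 3.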

Highlights

- Recall that we focus on structural representation pre-training, while most graph neural network models require vertex features/attributes as input
- We want to emphasize that Deep Graph Kernel, graph2vec, and InfoGraph all need to be pre-trained on target-domain graphs, whereas Graph Contrastive Coding only relies on the graphs listed in Table 1 for pre-training
- We show that a graph neural network encoder pre-trained on several popular graph datasets can be directly adapted to new graph datasets and unseen graph learning tasks
- We study graph representation learning with the goal of characterizing and transferring structural features in social and information networks
- We present Graph Contrastive Coding (GCC), which is a graph-based contrastive learning framework to learn structural representations and similarity from data

Methods

- The authors evaluate GCC on three graph learning tasks — node classification, graph classification, and similarity search, which have been commonly used to benchmark graph learning algorithms [12, 43, 46, 60, 61].
- The authors first introduce the self-supervised pre-training settings in Section 4.1, and report GCC transfer learning results on three graph learning tasks in Section 4.2.
- The authors' self-supervised pre-training is performed on six graph datasets, which can be categorized into two groups — academic graphs and social graphs.
- As for academic graphs, the authors collect the Academia dataset from NetRep [44] as well as two DBLP datasets from SNAP [62] and NetRep [44], respectively.

Results

- The authors compare GCC with ProNE [65], GraphWave [12], and Struc2vec [43]. Table 2 presents the results.
- Compared with models trained from scratch, the reused model achieves competitive and sometimes better performance
- This demonstrates the transferability of graph structural patterns and the effectiveness of the GCC framework in capturing these patterns.
- It is still not clear whether GCC’s good performance is due to pre-training or to the expressive power of its GIN [60] encoder
- To answer this question, the authors fully fine-tune GCC with its GIN encoder randomly initialized, which is equivalent to training a GIN encoder from scratch.

Conclusion

**Discussion on graph sampling**

- In random walk with restart sampling, the restart probability controls the radius r of the ego-network on which GCC conducts data augmentation.
- The generalized positional embedding of a subgraph is defined as the top eigenvectors of its normalized graph Laplacian.
- For a subgraph with adjacency matrix A and degree matrix D, the authors conduct eigendecomposition on its normalized graph Laplacian such that I − D^{-1/2} A D^{-1/2} = UΛU^⊤, where the top eigenvectors in U [55] are defined as the generalized positional embedding.
- In this work, the authors study graph representation learning with the goal of characterizing and transferring structural features in social and information networks.
- The authors would like to explore applications of GCC on graphs in other domains, such as protein-protein association networks [47]
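The generalized positional embedding described above can be sketched in a few lines of NumPy: build the normalized Laplacian I − D^{-1/2} A D^{-1/2} of a subgraph, eigendecompose it, and keep the leading eigenvectors. This is a minimal sketch, not the paper's code; the toy adjacency matrix and embedding dimension are illustrative, and "top" is taken here as the eigenvectors with smallest eigenvalues, in the Laplacian-eigenmaps sense.

```python
# Minimal sketch of a generalized positional embedding from the normalized
# graph Laplacian. Toy matrix and dimension are illustrative assumptions.
import numpy as np

def positional_embedding(A, dim):
    """Leading `dim` eigenvectors of I - D^{-1/2} A D^{-1/2} for adjacency A."""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))  # guard deg=0
    L = np.eye(A.shape[0]) - d_inv_sqrt @ A @ d_inv_sqrt
    vals, vecs = np.linalg.eigh(L)   # eigenvalues returned in ascending order
    return vecs[:, :dim]             # smallest-eigenvalue eigenvectors first

# Toy subgraph: a 4-cycle.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
emb = positional_embedding(A, dim=2)
print(emb.shape)  # (4, 2): one 2-dimensional positional vector per vertex
```

Because the embedding depends only on subgraph structure, the same encoder input can be produced for vertices from entirely different graphs, which is what makes the pre-training transferable.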


- Table1: Datasets for pre-training, sorted by number of vertices
- Table2: Node classification
- Table3: Graph classification
- Table4: Top-k similarity search (k = 20, 40)
- Table5: Momentum ablation
- Table6: Pre-training hyper-parameters for E2E and MoCo
- Table7: Performance of GIN model under various hyperparameter configurations

Related work

- In this section, we review related work on vertex similarity, contrastive learning, and graph pre-training.

2.1 Vertex Similarity

Quantifying the similarity of vertices in networks/graphs has been studied extensively. The goal of vertex similarity is to answer questions [26] like “How similar are these two vertices?” or “Which other vertices are most similar to these vertices?” The definition of similarity differs across situations. We briefly review the following three types of vertex similarity.

**Neighborhood similarity.** The basic assumption of neighborhood similarity, a.k.a. proximity, is that closely connected vertices should be considered similar. Early neighborhood similarity measures include Jaccard similarity (counting common neighbors), random walk with restart (RWR) similarity [36], and SimRank [21]. Most recently developed network embedding algorithms, such as LINE [48], DeepWalk [39], and node2vec [14], also follow the neighborhood similarity assumption.
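The Jaccard measure mentioned above is simple enough to state in a few lines: the similarity of two vertices is the size of their shared neighborhood divided by the size of their combined neighborhood. The adjacency lists below are illustrative toy data.

```python
# Minimal sketch of neighborhood (Jaccard) similarity between two vertices.
# The adjacency lists are illustrative toy data, not from any dataset.
def jaccard_similarity(neighbors, u, v):
    nu, nv = set(neighbors[u]), set(neighbors[v])
    union = nu | nv
    return len(nu & nv) / len(union) if union else 0.0

neighbors = {
    "a": ["b", "c", "d"],
    "b": ["a", "c"],
    "e": ["c", "d", "f"],
}
print(jaccard_similarity(neighbors, "a", "e"))  # shares {c, d} of {b, c, d, f} -> 0.5
```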

Funding

- The work is supported by the National Key R&D Program of China (2018YFB1402600), NSFC for Distinguished Young Scholar (61825602), and NSFC (61836013)

Reference

- Réka Albert and Albert-László Barabási. 2002. Statistical mechanics of complex networks. Reviews of modern physics 74, 1 (2002), 47.
- J Ignacio Alvarez-Hamelin, Luca Dall’Asta, Alain Barrat, and Alessandro Vespignani. 2006. Large scale networks fingerprinting and visualization using the k-core decomposition. In Advances in neural information processing systems. 41–50.
- Lars Backstrom, Dan Huttenlocher, Jon Kleinberg, and Xiangyang Lan. 2006. Group formation in large social networks: membership, growth, and evolution. In KDD ’06. 44–54.
- Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. 2018. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261 (2018).
- Austin R Benson, David F Gleich, and Jure Leskovec. 2016. Higher-order organization of complex networks. Science 353, 6295 (2016), 163–166.
- Stephen P Borgatti and Martin G Everett. 2000. Models of core/periphery structures. Social networks 21, 4 (2000), 375–395.
- Ronald S Burt. 2009. Structural holes: The social structure of competition. Harvard university press.
- Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM transactions on intelligent systems and technology (TIST) 2, 3 (2011), 1–27.
- Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. 2020. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. In ICLR ’20.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT ’19. 4171–4186.
- Yuxiao Dong, Nitesh V Chawla, and Ananthram Swami. 2017. metapath2vec: Scalable representation learning for heterogeneous networks. In KDD ’17. 135– 144.
- Claire Donnat, Marinka Zitnik, David Hallac, and Jure Leskovec. 2018. Learning structural node embeddings via diffusion wavelets. In KDD ’18. 1320–1329.
- Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. 2017. Neural message passing for quantum chemistry. In ICML ’17. JMLR. org, 1263–1272.
- Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In KDD ’16. 855–864.
- Raia Hadsell, Sumit Chopra, and Yann LeCun. 2006. Dimensionality reduction by learning an invariant mapping. In CVPR ’06, Vol. 2. IEEE, 1735–1742.
- Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in neural information processing systems. 1024–1034.
- Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In CVPR ’20. 9729–9738.
- Keith Henderson, Brian Gallagher, Tina Eliassi-Rad, Hanghang Tong, Sugato Basu, Leman Akoglu, Danai Koutra, Christos Faloutsos, and Lei Li. 2012. Rolx: structural role extraction & mining in large graphs. In KDD ’12. 1231–1239.
- Weihua Hu, Bowen Liu, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay Pande, and Jure Leskovec. 2020. Strategies for Pre-training Graph Neural Networks. In ICLR ’20.
- Ziniu Hu, Changjun Fan, Ting Chen, Kai-Wei Chang, and Yizhou Sun. 2019. Unsupervised Pre-Training of Graph Convolutional Networks. ICLR 2019 Workshop: Representation Learning on Graphs and Manifolds (2019).
- Glen Jeh and Jennifer Widom. 2002. SimRank: a measure of structural-context similarity. In KDD ’02. 538–543.
- Yilun Jin, Guojie Song, and Chuan Shi. 2019. GraLSP: Graph Neural Networks with Local Structural Patterns. arXiv preprint arXiv:1911.07675 (2019).
- Kristian Kersting, Nils M. Kriege, Christopher Morris, Petra Mutzel, and Marion Neumann. 2016. Benchmark Data Sets for Graph Kernels. http://graphkernels.cs.tu-dortmund.de
- Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. ICLR ’15.
- Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In ICLR ’17.
- Elizabeth A Leicht, Petter Holme, and Mark EJ Newman. 2006. Vertex similarity in networks. Physical Review E 73, 2 (2006), 026120.
- Jure Leskovec and Christos Faloutsos. 2006. Sampling from large graphs. In KDD ’06. 631–636.
- Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. 2005. Graphs over time: densification laws, shrinking diameters and possible explanations. In KDD ’05. 177–187.
- Silvio Micali and Zeyuan Allen Zhu. 2016. Reconstructing markov processes from independent and anonymous experiments. Discrete Applied Mathematics 200 (2016), 108–122.
- Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. 3111–3119.
- Ron Milo, Shalev Itzkovitz, Nadav Kashtan, Reuven Levitt, Shai Shen-Orr, Inbal Ayzenshtat, Michal Sheffer, and Uri Alon. 2004. Superfamilies of evolved and designed networks. Science 303, 5663 (2004), 1538–1542.
- Ron Milo, Shai Shen-Orr, Shalev Itzkovitz, Nadav Kashtan, Dmitri Chklovskii, and Uri Alon. 2002. Network motifs: simple building blocks of complex networks. Science 298, 5594 (2002), 824–827.
- Annamalai Narayanan, Mahinthan Chandramohan, Rajasekar Venkatesan, Lihui Chen, Yang Liu, and Shantanu Jaiswal. 2017. graph2vec: Learning distributed representations of graphs. arXiv preprint arXiv:1707.05005 (2017).
- Mark EJ Newman. 2006. Modularity and community structure in networks. Proceedings of the national academy of sciences 103, 23 (2006), 8577–8582.
- Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018).
- Jia-Yu Pan, Hyung-Jeong Yang, Christos Faloutsos, and Pinar Duygulu. 2004. Automatic multimedia cross-modal correlation discovery. In KDD ’04. 653–658.
- Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems. 8024–8035.
- Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. Journal of machine learning research 12, Oct (2011), 2825–2830.
- Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. Deepwalk: Online learning of social representations. In KDD ’14. 701–710.
- Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Chi Wang, Kuansan Wang, and Jie Tang. 2019. Netsmf: Large-scale network embedding as sparse matrix factorization. In The World Wide Web Conference. 1509–1520.
- Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, and Jie Tang. 2018. Network embedding as matrix factorization: Unifying deepwalk, line, pte, and node2vec. In WSDM ’18. 459–467.
- Jiezhong Qiu, Jian Tang, Hao Ma, Yuxiao Dong, Kuansan Wang, and Jie Tang. 2018. Deepinf: Social influence prediction with deep learning. In KDD ’18. 2110–2119.
- Leonardo FR Ribeiro, Pedro HP Saverese, and Daniel R Figueiredo. 2017. struc2vec: Learning node representations from structural identity. In KDD ’17. 385–394.
- Scott C Ritchie, Stephen Watts, Liam G Fearnley, Kathryn E Holt, Gad Abraham, and Michael Inouye. 2016. A scalable permutation approach reveals replication and preservation patterns of network modules in large datasets. Cell systems 3, 1 (2016), 71–82.
- Daniel A Spielman and Shang-Hua Teng. 2013. A local clustering algorithm for massive graphs and its application to nearly linear time graph partitioning. SIAM Journal on computing 42, 1 (2013), 1–26.
- Fan-Yun Sun, Jordan Hoffman, Vikas Verma, and Jian Tang. 2019. InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization. In ICLR ’19.
- Damian Szklarczyk, John H Morris, Helen Cook, Michael Kuhn, Stefan Wyder, Milan Simonovic, Alberto Santos, Nadezhda T Doncheva, Alexander Roth, Peer Bork, et al. 2016. The STRING database in 2017: quality-controlled protein– protein association networks, made broadly accessible. Nucleic acids research (2016), gkw937.
- Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. Line: Large-scale information network embedding. In WWW ’15. 1067–1077.
- Shang-Hua Teng et al. 2016. Scalable algorithms for data and network analysis. Foundations and Trends® in Theoretical Computer Science 12, 1–2 (2016), 1–274.
- Yonglong Tian, Dilip Krishnan, and Phillip Isola. 2019. Contrastive multiview coding. arXiv preprint arXiv:1906.05849 (2019).
- Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan. 2006. Fast random walk with restart and its applications. In ICDM ’06. IEEE, 613–622.
- Johan Ugander, Lars Backstrom, Cameron Marlow, and Jon Kleinberg. 2012. Structural diversity in social contagion. Proceedings of the National Academy of Sciences 109, 16 (2012), 5962–5966.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998–6008.
- Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks. ICLR ’18 (2018).
- Ulrike Von Luxburg. 2007. A tutorial on spectral clustering. Statistics and computing 17, 4 (2007), 395–416.
- Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In ICLR ’19.
- Minjie Wang, Lingfan Yu, Da Zheng, Quan Gan, Yu Gai, Zihao Ye, Mufei Li, Jinjing Zhou, Qi Huang, Chao Ma, et al. 2019. Deep graph library: Towards efficient and scalable deep learning on graphs. arXiv preprint arXiv:1909.01315 (2019).
- Duncan J Watts and Steven H Strogatz. 1998. Collective dynamics of small-world networks. Nature 393, 6684 (1998), 440.
- Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. 2018. Unsupervised feature learning via non-parametric instance discrimination. In CVPR ’18. 3733– 3742.
- Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2019. How Powerful are Graph Neural Networks?. In ICLR ’19.
- Pinar Yanardag and SVN Vishwanathan. 2015. Deep graph kernels. In KDD ’15. 1365–1374.
- Jaewon Yang and Jure Leskovec. 2015. Defining and evaluating network communities based on ground-truth. Knowledge and Information Systems 42, 1 (2015), 181–213.
- Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. 2018. Graph convolutional neural networks for web-scale recommender systems. In KDD ’18. 974–983.
- Fanjin Zhang, Xiao Liu, Jie Tang, Yuxiao Dong, Peiran Yao, Jie Zhang, Xiaotao Gu, Yan Wang, Bin Shao, Rui Li, and et al. 2019. OAG: Toward Linking Large-Scale Heterogeneous Entity Graphs. In KDD ’19. 2585–2595.
- Jie Zhang, Yuxiao Dong, Yan Wang, Jie Tang, and Ming Ding. 2019. ProNE: fast and scalable network representation learning. In IJCAI ’19. 4278–4284.
- Jing Zhang, Jie Tang, Cong Ma, Hanghang Tong, Yu Jing, and Juanzi Li. 2015. Panther: Fast top-k similarity search on large networks. In KDD ’15. 1445–1454.
- Muhan Zhang, Zhicheng Cui, Marion Neumann, and Yixin Chen. 2018. An end-to-end deep learning architecture for graph classification. In AAAI ’18.
- [12] We download the authors’ official source code and keep all the training settings the same. The implementation requires a networkx graph and time points as input. We convert our dataset to the networkx format, and use the automatic selection of the range of scales provided by the authors. We set the output embedding dimension to 64.
- [43] We download the authors’ official source code and use default hyper-parameters provided by the authors: (1) walk length = 80; (2) number of walks = 10; (3) window size = 10; (4) number of iterations = 5.
- [12] Embeddings computed by the GraphWave method also have the ability to generalize across graphs. The authors evaluated on synthetic graphs in their paper, which are not publicly available. To compare with GraphWave on the co-author datasets, we compute GraphWave embeddings given two graphs G1 and G2 and follow the same procedure mentioned in Section 4.2.2 to compute the HITS@10 (top-10 accuracy) score.
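The HITS@k score used above can be sketched as follows: given embeddings for matched vertex pairs from G1 and G2, count the fraction of G1 vertices whose true counterpart in G2 appears among their k nearest neighbors. This is a hedged sketch of the metric, not the paper's evaluation code; the random embeddings are placeholders standing in for real GraphWave or GCC outputs.

```python
# Hedged sketch of HITS@k (top-k accuracy) between two embedded graphs.
# Random embeddings are placeholders, not outputs of any real model.
import numpy as np

def hits_at_k(emb1, emb2, k):
    """emb1[i] and emb2[i] embed the same underlying vertex in G1 and G2."""
    # Pairwise Euclidean distances between every G1 / G2 embedding pair.
    dists = np.linalg.norm(emb1[:, None, :] - emb2[None, :, :], axis=-1)
    topk = np.argsort(dists, axis=1)[:, :k]      # k closest G2 vertices per row
    truth = np.arange(emb1.shape[0])[:, None]
    return float((topk == truth).any(axis=1).mean())

rng = np.random.default_rng(0)
base = rng.normal(size=(50, 16))
emb1 = base + 0.01 * rng.normal(size=base.shape)  # two near-identical "views"
emb2 = base + 0.01 * rng.normal(size=base.shape)
print(hits_at_k(emb1, emb2, k=10))  # close to 1.0 for nearly matching embeddings
```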
