# A distributed approach for graph mining in massive networks

Data Min. Knowl. Discov., Volume 30, Issue 5, 2016, Pages 1024-1052.

EI

Keywords:

Parallel graph miningDistributed graph miningSingle large graphFrequent subgraph miningHigh performance computing

Wei bo:

Abstract:

We propose a novel distributed algorithm for mining frequent subgraphs from a single, very large, labeled network. Our approach is the first distributed method to mine a massive input graph that is too large to fit in the memory of any individual compute node. The input graph thus has to be partitioned among the nodes, which can lead to p...More

Code:

Data:

Introduction

- Frequent graph mining is a well-studied problem with numerous applications in areas such as computational chemistry, bioinformatics, and social networks.
- Subgraph Isomorphism and Embedding: The authors say that the pattern P is subgraph isomorphic to G = (V, E), denoted as P ⊆ G, if there exists an injective function, φ : VP → V such that: 1) ∀v ∈ VP , L(v) = L(φ(v)), and 2) ∀ ∈ EP , (φ, φ) ∈ E and L = L(φ, φ)
- In this case, the isomorphic subgraph in G comprising the vertices φ(v1), φ(v2), · · · , φ is called an embedding of the pattern P in the input graph G.
- The authors use the terms isomorphism and embedding interchangeably, since given the isomorphism function φ, the authors can uniquely identify the corresponding embedding

Highlights

- Frequent graph mining is a well-studied problem with numerous applications in areas such as computational chemistry, bioinformatics, and social networks
- – We develop a distributed solution for mining a single massive graph that leverages effective collective communication primitives for scalable performance
- – We propose a hybrid solution for subgraph mining that utilizes both thread-based parallelism within each compute node and distributed computation across multiple compute nodes
- We propose the first distributed subgraph mining method for a partitioned input graph, and one that can handle webscale networks spanning over a billion vertices and edges
- We presented a novel distributed approach for mining frequent subgraph patterns from a single large graph
- We partition the input graph to distribute the workload across the system

Methods

- All the distributed experiments are performed on up to 2048 IBM Blue Gene/Q (BG/Q) nodes.
- DistGraph is implemented in C++, compiled using g++ (v.
- 3.0) library for thread-based parallelism within compute nodes, and the portable, open-source and freely available MPI implementation MPICH2 (v.
- 1.5) for distributed computation across nodes.
- The authors compared the sequential implementation with GraMi [5], which is written in Java, and is compiled using JDK v.
- The authors' DistGraph code is available for download at https://github.com/zakimjz/DistGraph

Conclusion

- The authors presented a novel distributed approach for mining frequent subgraph patterns from a single large graph.
- The authors partition the input graph to distribute the workload across the system.
- The authors' solution is very flexible, since it relies on collective communication primitives that are available on almost all distributed frameworks.
- The authors' hybrid approach leverages both shared memory and distributed systems.
- Based on the resource availability, one can consider different number of partitions, compute nodes and threads, for scalability

Summary

## Introduction:

Frequent graph mining is a well-studied problem with numerous applications in areas such as computational chemistry, bioinformatics, and social networks.- Subgraph Isomorphism and Embedding: The authors say that the pattern P is subgraph isomorphic to G = (V, E), denoted as P ⊆ G, if there exists an injective function, φ : VP → V such that: 1) ∀v ∈ VP , L(v) = L(φ(v)), and 2) ∀ ∈ EP , (φ, φ) ∈ E and L = L(φ, φ)
- In this case, the isomorphic subgraph in G comprising the vertices φ(v1), φ(v2), · · · , φ is called an embedding of the pattern P in the input graph G.
- The authors use the terms isomorphism and embedding interchangeably, since given the isomorphism function φ, the authors can uniquely identify the corresponding embedding
## Methods:

All the distributed experiments are performed on up to 2048 IBM Blue Gene/Q (BG/Q) nodes.- DistGraph is implemented in C++, compiled using g++ (v.
- 3.0) library for thread-based parallelism within compute nodes, and the portable, open-source and freely available MPI implementation MPICH2 (v.
- 1.5) for distributed computation across nodes.
- The authors compared the sequential implementation with GraMi [5], which is written in Java, and is compiled using JDK v.
- The authors' DistGraph code is available for download at https://github.com/zakimjz/DistGraph
## Conclusion:

The authors presented a novel distributed approach for mining frequent subgraph patterns from a single large graph.- The authors partition the input graph to distribute the workload across the system.
- The authors' solution is very flexible, since it relies on collective communication primitives that are available on almost all distributed frameworks.
- The authors' hybrid approach leverages both shared memory and distributed systems.
- Based on the resource availability, one can consider different number of partitions, compute nodes and threads, for scalability

- Table1: AllGather and AlltoAll examples
- Table2: Datasets and their Properties: Number of vertices |V |, edges |E|, labels |L|, avg. degree AD, clustering coefficient CC
- Table3: Partitions and External Neighbors dataset PDB1 PDB2 PDB3 Patent Youtube max comp. 4,477 7,856 11,568

Related work

- FSM is a well studied problem, in both the graph database [10, 14, 27] and the single graph [15, 5] setting.

Graph Database (multiple graphs): Initial FSM methods used systematic generation of candidate subgraph patterns [10, 14] and their level-wise growth via breadth-first search (BFS). Later methods like gSpan [27] and FFSM [9] use canonical ordering of the patterns, and a depth-first (DFS) exploration. Parallel graph mining algorithms targeting SMP systems were proposed in [4] and [19], where they explore different schemes for load balancing among the multiple processors. One of the first distributed methods was [6], which uses a message passing approach for mining molecular graphs. Recently, map-reduce based approaches [26, 18, 16, 2, 7] have attracted attention; GPUs have also been exploited [12]. It is important to note that all of these parallel/distributed methods are designed for multiple input graphs, and cannot be adapted for the single graph case due to the underlying algorithmic assumptions.

Funding

- This work was supported by NSF Award IIS-1302231

Reference

- F. N. Afrati, D. Fotakis, and J. D. Ullman. Enumerating subgraph instances using mapreduce. In IEEE Int’l Conference on Data Engineering, 2013.
- M. Bhuiyan and M. Al Hasan. An iterative mapreduce based frequent subgraph mining algorithm. IEEE Transactions on Knowledge and Data Engineering, 27(3):608–620, March 2015.
- B. Bringmann and S. Nijssen. What is frequent in a single graph? In Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, 2008.
- G. Buehrer, S. Parthasarathy, and Y.-K. Chen. Adaptive parallel graph mining for cmp architectures. In IEEE Int’l Conference on Data Mining, 2006.
- M. Elseidy, E. Abdelhamid, S. Skiadopoulos, and P. Kalnis. Grami: Frequent subgraph and pattern mining in a single large graph. Proceedings of the VLDB Endowment, 7:517–528, 2014.
- G. D. Fatta and M. R. Berthold. Dynamic load balancing for the distributed mining of molecular structures. IEEE Transactions on Parallel and Distributed Systems, 17(8):773– 785, 2006.
- S. Hill, B. Srichandan, and R. Sunderraman. An iterative mapreduce approach to frequent subgraph mining in biological datasets. In ACM Conference on Bioinformatics, Computational Biology and Biomedicine, 2012.
- L. B. Holder and D. J. Cook. Discovery of inexact concepts from structural data. IEEE Transactions on Knowledge and Data Engineering, 5(6):992–994, 1993.
- J. Huan, W. Wang, and J. Prins. Efficient mining of frequent subgraphs in the presence of isomorphism. In IEEE International Conference on Data Mining, 2003.
- A. Inokuchi, T. Washio, and H. Motoda. An apriori-based algorithm for mining frequent substructures from graph data. In Principles of Data Mining and Knowledge Discovery, LNCS Vol. 1910, Springer, pages 13–23, 2000.
- G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal of Scientific Computing, 20(1):359–392, 1998.
- R. Kessl, N. Talukder, P. Anchuri, and M. J. Zaki. Parallel graph mining with GPUs. Proceedings of the BigMine Workshop (ACM SIGKDD), Journal of Machine Learning Research: Conference and Workshop Proceedings, 36:1–16, 2014.
- B. Kimelfeld and P. G. Kolaitis. The complexity of mining maximal frequent subgraphs. ACM Transactions on Database Systems (TODS), 39(4):32, 2014.
- M. Kuramochi and G. Karypis. Frequent subgraph discovery. In IEEE International Conference on Data Mining, 2001.
- M. Kuramochi and G. Karypis. Finding frequent patterns in a large sparse graph. Data Mining and Knowledge Discovery, 11(3):243–271, 2005.
- W. Lin, X. Xiao, and G. Ghinita. Large-scale frequent subgraph mining in mapreduce. In IEEE International Conference on Data Engineering, 2014.
- Y. Liu, X. Jiang, H. Chen, J. Ma, and X. Zhang. Mapreduce-based pattern finding algorithm applied in motif detection for prescription compatibility network. In Advanced Parallel Processing Technologies, LNCS Vol. 5737, Springer, pages 341–355, 2009.
- W. Lu, G. Chen, A. Tung, and F. Zhao. Efficiently extracting frequent subgraphs using mapreduce. In IEEE International Conference on Big Data, 2013.
- T. Meinl, M. Wörlein, I. Fischer, and M. Philippsen. Mining molecular datasets on symmetric multiprocessor systems. In IEEE International Conference on Systems, Man and Cybernetics, Vol. 2, 2006.
- S. Reinhardt and G. Karypis. A multi-level parallel implementation of a program for finding frequent patterns in a large sparse graph. In IEEE International Parallel and Distributed Processing Symposium, 2007.
- S. Shahrivari and S. Jalili. Distributed discovery of frequent subgraphs of a network using mapreduce. Computing, 97(11):1101–1120, 2015.
- Y. Shao, B. Cui, L. Chen, L. Ma, J. Yao, and N. Xu. Parallel subgraph listing in a large-scale graph. In ACM SIGMOD International Conference on Management of Data, 2014.
- Z. Sun, H. Wang, H. Wang, B. Shao, and J. Li. Efficient subgraph matching on billion node graphs. Proceedings of VLDB Endowment, 5(9):788–799, 2012.
- C. H. C. Teixeira, A. J. Fonseca, M. Serafini, G. Siganos, M. J. Zaki, and A. Aboulnaga. Arabesque: A system for distributed graph pattern mining. In 25th ACM Symposium on Operating Systems Principles, 2015.
- D. Ucar, S. Asur, U. Catalyurek, and S. Parthasarathy. Improving functional modularity in protein-protein interactions graphs using hub-induced subgraphs. In Knowledge Discovery in Databases: PKDD 2006, pages 371–382. Springer, 2006.
- B. Wu and Y. Bai. An efficient distributed subgraph mining algorithm in extreme large graphs. In International Conference on Artificial Intelligence and Computational Intelligence: Part I, 2010.
- X. Yan and J. Han. gspan: Graph-based substructure pattern mining. In IEEE International Conference on Data Mining, 2002.
- G. Yang. The complexity of mining maximal frequent itemsets and maximal frequent patterns. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 344–353. ACM, 2004.

Tags

Comments