A distributed approach for graph mining in massive networks

    Data Min. Knowl. Discov., Volume 30, Issue 5, 2016, Pages 1024-1052.

    Cited by: 27|Bibtex|Views21|Links
    EI
    Keywords:
    Parallel graph miningDistributed graph miningSingle large graphFrequent subgraph miningHigh performance computing
    Wei bo:
    We presented a novel distributed approach for mining frequent subgraph patterns from a single large graph

    Abstract:

    We propose a novel distributed algorithm for mining frequent subgraphs from a single, very large, labeled network. Our approach is the first distributed method to mine a massive input graph that is too large to fit in the memory of any individual compute node. The input graph thus has to be partitioned among the nodes, which can lead to p...More

    Code:

    Data:

    0
    Introduction
    • Frequent graph mining is a well-studied problem with numerous applications in areas such as computational chemistry, bioinformatics, and social networks.
    • Subgraph Isomorphism and Embedding: The authors say that the pattern P is subgraph isomorphic to G = (V, E), denoted as P ⊆ G, if there exists an injective function, φ : VP → V such that: 1) ∀v ∈ VP , L(v) = L(φ(v)), and 2) ∀ ∈ EP , (φ, φ) ∈ E and L = L(φ, φ)
    • In this case, the isomorphic subgraph in G comprising the vertices φ(v1), φ(v2), · · · , φ is called an embedding of the pattern P in the input graph G.
    • The authors use the terms isomorphism and embedding interchangeably, since given the isomorphism function φ, the authors can uniquely identify the corresponding embedding
    Highlights
    • Frequent graph mining is a well-studied problem with numerous applications in areas such as computational chemistry, bioinformatics, and social networks
    • – We develop a distributed solution for mining a single massive graph that leverages effective collective communication primitives for scalable performance
    • – We propose a hybrid solution for subgraph mining that utilizes both thread-based parallelism within each compute node and distributed computation across multiple compute nodes
    • We propose the first distributed subgraph mining method for a partitioned input graph, and one that can handle webscale networks spanning over a billion vertices and edges
    • We presented a novel distributed approach for mining frequent subgraph patterns from a single large graph
    • We partition the input graph to distribute the workload across the system
    Methods
    • All the distributed experiments are performed on up to 2048 IBM Blue Gene/Q (BG/Q) nodes.
    • DistGraph is implemented in C++, compiled using g++ (v.
    • 3.0) library for thread-based parallelism within compute nodes, and the portable, open-source and freely available MPI implementation MPICH2 (v.
    • 1.5) for distributed computation across nodes.
    • The authors compared the sequential implementation with GraMi [5], which is written in Java, and is compiled using JDK v.
    • The authors' DistGraph code is available for download at https://github.com/zakimjz/DistGraph
    Conclusion
    • The authors presented a novel distributed approach for mining frequent subgraph patterns from a single large graph.
    • The authors partition the input graph to distribute the workload across the system.
    • The authors' solution is very flexible, since it relies on collective communication primitives that are available on almost all distributed frameworks.
    • The authors' hybrid approach leverages both shared memory and distributed systems.
    • Based on the resource availability, one can consider different number of partitions, compute nodes and threads, for scalability
    Summary
    • Introduction:

      Frequent graph mining is a well-studied problem with numerous applications in areas such as computational chemistry, bioinformatics, and social networks.
    • Subgraph Isomorphism and Embedding: The authors say that the pattern P is subgraph isomorphic to G = (V, E), denoted as P ⊆ G, if there exists an injective function, φ : VP → V such that: 1) ∀v ∈ VP , L(v) = L(φ(v)), and 2) ∀ ∈ EP , (φ, φ) ∈ E and L = L(φ, φ)
    • In this case, the isomorphic subgraph in G comprising the vertices φ(v1), φ(v2), · · · , φ is called an embedding of the pattern P in the input graph G.
    • The authors use the terms isomorphism and embedding interchangeably, since given the isomorphism function φ, the authors can uniquely identify the corresponding embedding
    • Methods:

      All the distributed experiments are performed on up to 2048 IBM Blue Gene/Q (BG/Q) nodes.
    • DistGraph is implemented in C++, compiled using g++ (v.
    • 3.0) library for thread-based parallelism within compute nodes, and the portable, open-source and freely available MPI implementation MPICH2 (v.
    • 1.5) for distributed computation across nodes.
    • The authors compared the sequential implementation with GraMi [5], which is written in Java, and is compiled using JDK v.
    • The authors' DistGraph code is available for download at https://github.com/zakimjz/DistGraph
    • Conclusion:

      The authors presented a novel distributed approach for mining frequent subgraph patterns from a single large graph.
    • The authors partition the input graph to distribute the workload across the system.
    • The authors' solution is very flexible, since it relies on collective communication primitives that are available on almost all distributed frameworks.
    • The authors' hybrid approach leverages both shared memory and distributed systems.
    • Based on the resource availability, one can consider different number of partitions, compute nodes and threads, for scalability
    Tables
    • Table1: AllGather and AlltoAll examples
    • Table2: Datasets and their Properties: Number of vertices |V |, edges |E|, labels |L|, avg. degree AD, clustering coefficient CC
    • Table3: Partitions and External Neighbors dataset PDB1 PDB2 PDB3 Patent Youtube max comp. 4,477 7,856 11,568
    Download tables as Excel
    Related work
    • FSM is a well studied problem, in both the graph database [10, 14, 27] and the single graph [15, 5] setting.

      Graph Database (multiple graphs): Initial FSM methods used systematic generation of candidate subgraph patterns [10, 14] and their level-wise growth via breadth-first search (BFS). Later methods like gSpan [27] and FFSM [9] use canonical ordering of the patterns, and a depth-first (DFS) exploration. Parallel graph mining algorithms targeting SMP systems were proposed in [4] and [19], where they explore different schemes for load balancing among the multiple processors. One of the first distributed methods was [6], which uses a message passing approach for mining molecular graphs. Recently, map-reduce based approaches [26, 18, 16, 2, 7] have attracted attention; GPUs have also been exploited [12]. It is important to note that all of these parallel/distributed methods are designed for multiple input graphs, and cannot be adapted for the single graph case due to the underlying algorithmic assumptions.
    Funding
    • This work was supported by NSF Award IIS-1302231
    Reference
    • F. N. Afrati, D. Fotakis, and J. D. Ullman. Enumerating subgraph instances using mapreduce. In IEEE Int’l Conference on Data Engineering, 2013.
      Google ScholarLocate open access versionFindings
    • M. Bhuiyan and M. Al Hasan. An iterative mapreduce based frequent subgraph mining algorithm. IEEE Transactions on Knowledge and Data Engineering, 27(3):608–620, March 2015.
      Google ScholarLocate open access versionFindings
    • B. Bringmann and S. Nijssen. What is frequent in a single graph? In Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, 2008.
      Google ScholarLocate open access versionFindings
    • G. Buehrer, S. Parthasarathy, and Y.-K. Chen. Adaptive parallel graph mining for cmp architectures. In IEEE Int’l Conference on Data Mining, 2006.
      Google ScholarLocate open access versionFindings
    • M. Elseidy, E. Abdelhamid, S. Skiadopoulos, and P. Kalnis. Grami: Frequent subgraph and pattern mining in a single large graph. Proceedings of the VLDB Endowment, 7:517–528, 2014.
      Google ScholarLocate open access versionFindings
    • G. D. Fatta and M. R. Berthold. Dynamic load balancing for the distributed mining of molecular structures. IEEE Transactions on Parallel and Distributed Systems, 17(8):773– 785, 2006.
      Google ScholarLocate open access versionFindings
    • S. Hill, B. Srichandan, and R. Sunderraman. An iterative mapreduce approach to frequent subgraph mining in biological datasets. In ACM Conference on Bioinformatics, Computational Biology and Biomedicine, 2012.
      Google ScholarLocate open access versionFindings
    • L. B. Holder and D. J. Cook. Discovery of inexact concepts from structural data. IEEE Transactions on Knowledge and Data Engineering, 5(6):992–994, 1993.
      Google ScholarLocate open access versionFindings
    • J. Huan, W. Wang, and J. Prins. Efficient mining of frequent subgraphs in the presence of isomorphism. In IEEE International Conference on Data Mining, 2003.
      Google ScholarLocate open access versionFindings
    • A. Inokuchi, T. Washio, and H. Motoda. An apriori-based algorithm for mining frequent substructures from graph data. In Principles of Data Mining and Knowledge Discovery, LNCS Vol. 1910, Springer, pages 13–23, 2000.
      Google ScholarFindings
    • G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal of Scientific Computing, 20(1):359–392, 1998.
      Google ScholarLocate open access versionFindings
    • R. Kessl, N. Talukder, P. Anchuri, and M. J. Zaki. Parallel graph mining with GPUs. Proceedings of the BigMine Workshop (ACM SIGKDD), Journal of Machine Learning Research: Conference and Workshop Proceedings, 36:1–16, 2014.
      Google ScholarLocate open access versionFindings
    • B. Kimelfeld and P. G. Kolaitis. The complexity of mining maximal frequent subgraphs. ACM Transactions on Database Systems (TODS), 39(4):32, 2014.
      Google ScholarLocate open access versionFindings
    • M. Kuramochi and G. Karypis. Frequent subgraph discovery. In IEEE International Conference on Data Mining, 2001.
      Google ScholarLocate open access versionFindings
    • M. Kuramochi and G. Karypis. Finding frequent patterns in a large sparse graph. Data Mining and Knowledge Discovery, 11(3):243–271, 2005.
      Google ScholarLocate open access versionFindings
    • W. Lin, X. Xiao, and G. Ghinita. Large-scale frequent subgraph mining in mapreduce. In IEEE International Conference on Data Engineering, 2014.
      Google ScholarLocate open access versionFindings
    • Y. Liu, X. Jiang, H. Chen, J. Ma, and X. Zhang. Mapreduce-based pattern finding algorithm applied in motif detection for prescription compatibility network. In Advanced Parallel Processing Technologies, LNCS Vol. 5737, Springer, pages 341–355, 2009.
      Google ScholarLocate open access versionFindings
    • W. Lu, G. Chen, A. Tung, and F. Zhao. Efficiently extracting frequent subgraphs using mapreduce. In IEEE International Conference on Big Data, 2013.
      Google ScholarLocate open access versionFindings
    • T. Meinl, M. Wörlein, I. Fischer, and M. Philippsen. Mining molecular datasets on symmetric multiprocessor systems. In IEEE International Conference on Systems, Man and Cybernetics, Vol. 2, 2006.
      Google ScholarLocate open access versionFindings
    • S. Reinhardt and G. Karypis. A multi-level parallel implementation of a program for finding frequent patterns in a large sparse graph. In IEEE International Parallel and Distributed Processing Symposium, 2007.
      Google ScholarLocate open access versionFindings
    • S. Shahrivari and S. Jalili. Distributed discovery of frequent subgraphs of a network using mapreduce. Computing, 97(11):1101–1120, 2015.
      Google ScholarLocate open access versionFindings
    • Y. Shao, B. Cui, L. Chen, L. Ma, J. Yao, and N. Xu. Parallel subgraph listing in a large-scale graph. In ACM SIGMOD International Conference on Management of Data, 2014.
      Google ScholarLocate open access versionFindings
    • Z. Sun, H. Wang, H. Wang, B. Shao, and J. Li. Efficient subgraph matching on billion node graphs. Proceedings of VLDB Endowment, 5(9):788–799, 2012.
      Google ScholarLocate open access versionFindings
    • C. H. C. Teixeira, A. J. Fonseca, M. Serafini, G. Siganos, M. J. Zaki, and A. Aboulnaga. Arabesque: A system for distributed graph pattern mining. In 25th ACM Symposium on Operating Systems Principles, 2015.
      Google ScholarLocate open access versionFindings
    • D. Ucar, S. Asur, U. Catalyurek, and S. Parthasarathy. Improving functional modularity in protein-protein interactions graphs using hub-induced subgraphs. In Knowledge Discovery in Databases: PKDD 2006, pages 371–382. Springer, 2006.
      Google ScholarLocate open access versionFindings
    • B. Wu and Y. Bai. An efficient distributed subgraph mining algorithm in extreme large graphs. In International Conference on Artificial Intelligence and Computational Intelligence: Part I, 2010.
      Google ScholarLocate open access versionFindings
    • X. Yan and J. Han. gspan: Graph-based substructure pattern mining. In IEEE International Conference on Data Mining, 2002.
      Google ScholarLocate open access versionFindings
    • G. Yang. The complexity of mining maximal frequent itemsets and maximal frequent patterns. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 344–353. ACM, 2004.
      Google ScholarLocate open access versionFindings
    Your rating :
    0

     

    Tags
    Comments