
PEGASUS: A Peta-Scale Graph Mining System Implementation and Observations

ICDM '09: Proceedings of the 2009 Ninth IEEE International Conference on Data Mining, pp. 229-238 (2009)


Abstract

In this paper, we describe PEGASUS, an open source peta graph mining library which performs typical graph mining tasks such as computing the diameter of the graph, computing the radius of each node, and finding the connected components. As the size of graphs reaches several giga-, tera- or peta-bytes, the necessity for such a library grows...

Introduction
  • Graphs are ubiquitous: computer networks, social networks, mobile call networks, the World Wide Web [1], and protein regulation networks, to name a few.

    The large volume of available data, the low cost of storage, and the stunning success of online social networks and Web 2.0 applications all lead to graphs of unprecedented size.
  • Based on HADOOP, here the authors describe PEGASUS, a graph mining package for handling graphs with billions of nodes and edges.
  • Several algorithms exist for finding connected components, using Breadth-First Search, Depth-First Search, “propagation” ([24], [25], [26]), or “contraction” [27].
  • These works rely on a shared memory model, which limits their ability to handle large, disk-resident graphs.
  • MAPREDUCE has two major advantages: (a) the programmer is oblivious
Highlights
  • Based on HADOOP, here we describe PEGASUS, a graph mining package for handling graphs with billions of nodes and edges
  • The careful implementation of GIM-V, with several optimizations, and of several graph mining operations (PageRank, Random Walk with Restart (RWR), diameter estimation, and connected components)
  • We show how we can customize GIM-V to handle important graph mining operations including PageRank, Random Walk with Restart, diameter estimation, and connected components
  • In this paper we proposed PEGASUS, a graph mining package for very large graphs using the HADOOP architecture
  • We identified the common, underlying primitive of several graph mining operations, and we showed that it is a generalized form of a matrix-vector multiplication
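
As an illustration of how one of these operations can be phrased as a repeated matrix-vector-like step, here is a minimal single-machine sketch of connected components by minimum-label propagation. It is not the HADOOP implementation: the function name and the edge-list input are hypothetical, and the mapping of the three steps to "combine", "reduce with MIN", and "assign" is an assumption made for illustration.

```python
from typing import List, Tuple


def connected_components(num_nodes: int, edges: List[Tuple[int, int]]) -> List[int]:
    """Label each node with the smallest node id in its component.

    Hypothetical helper: each pass plays the role of one generalized
    matrix-vector step over the (symmetric) adjacency matrix.
    """
    labels = list(range(num_nodes))  # initial vector: every node is its own label
    while True:
        new_labels = labels[:]
        for u, v in edges:
            # "Combine" an edge with the neighbor's current label, then keep
            # the minimum of the old label and all neighbor labels.
            new_labels[u] = min(new_labels[u], labels[v])
            new_labels[v] = min(new_labels[v], labels[u])
        if new_labels == labels:  # fixed point reached: no label changed
            return labels
        labels = new_labels


if __name__ == "__main__":
    print(connected_components(6, [(0, 1), (1, 2), (3, 4)]))
```

Each pass only reads the previous label vector and writes a new one, which is what makes the step expressible as a matrix-vector-style operation; the number of passes grows with the diameter of the largest component.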
Methods
  • How can the authors quickly find the connected components, diameter, PageRank, and node proximities of very large graphs?
  • Even though these operations seem unrelated, the authors unify them using the GIM-V primitive, which stands for Generalized Iterative Matrix-Vector multiplication and is described next.
  • GIM-V, or ‘Generalized Iterative Matrix-Vector multiplication’, is a generalization of normal matrix-vector multiplication.
  • Suppose the authors have an n by n matrix M and a vector v of size n.
  • The usual matrix-vector multiplication is M × v = v′, where v′_i = Σ_{j=1}^{n} m_{i,j} v_j.
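
To make the idea of generalizing matrix-vector multiplication concrete, the sketch below splits one multiplication step into three pluggable functions: one that combines a matrix element with a vector element, one that reduces the partial results of a row, and one that decides the new vector element. The names `gim_v`, `combine2`, `combine_all`, and `assign` are chosen for this illustration and are assumptions, not taken from this summary; the sparse matrix is assumed to be given as a list of (row, column, value) triples.

```python
from typing import Callable, Dict, List, Tuple

Edge = Tuple[int, int, float]  # (row i, column j, value m_ij); assumed input format


def gim_v(
    edges: List[Edge],
    v: List[float],
    combine2: Callable[[float, float], float],
    combine_all: Callable[[List[float]], float],
    assign: Callable[[float, float], float],
) -> List[float]:
    """One generalized matrix-vector step (illustrative, single machine)."""
    partial: Dict[int, List[float]] = {i: [] for i in range(len(v))}
    for i, j, m_ij in edges:
        # Combine each stored matrix element with the matching vector element.
        partial[i].append(combine2(m_ij, v[j]))
    # Reduce the partial results per row, then decide the new value of v[i].
    return [assign(v[i], combine_all(partial[i])) for i in range(len(v))]


def matvec(edges: List[Edge], v: List[float]) -> List[float]:
    """Ordinary multiplication: multiply, sum, and overwrite."""
    return gim_v(
        edges,
        v,
        combine2=lambda m, x: m * x,
        combine_all=sum,
        assign=lambda old, new: new,
    )


if __name__ == "__main__":
    # M = [[1, 2], [0, 3]], v = [1, 1]
    print(matvec([(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)], [1.0, 1.0]))
```

Swapping the three functions (for example, reducing with `min` instead of `sum`) changes which graph operation the same loop computes, which is the sense in which the primitive is "generalized".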
Results
  • In GIM-V BL the authors can specify each block using a block row id and a block column id with two 4-byte integers, and refer to elements inside the block using 2 × log b bits.
  • This is possible because the authors can use log b bits to refer to a row or column inside a block.
  • In the second spike at size 1101, more than 80% of the components are porn sites disconnected from the giant connected component.
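
The 2 × log b figure for addressing an element inside a block can be checked with a small sketch. It assumes the block width b is a power of two and that the logarithm is base 2; the helper names are hypothetical and the bit layout (row in the high bits, column in the low bits) is one possible choice, not necessarily the one used in PEGASUS.

```python
from typing import Tuple


def bits_per_element(b: int) -> int:
    """Bits needed to address one element inside a b x b block (b a power of two)."""
    if b < 1 or b & (b - 1):
        raise ValueError("block width must be a power of two")
    return 2 * (b.bit_length() - 1)  # log2(b) bits for the row plus log2(b) for the column


def pack(row: int, col: int, b: int) -> int:
    """Pack an in-block (row, col) pair into a single integer code."""
    k = b.bit_length() - 1
    return (row << k) | col


def unpack(code: int, b: int) -> Tuple[int, int]:
    """Recover the in-block (row, col) pair from a packed code."""
    k = b.bit_length() - 1
    return code >> k, code & (b - 1)


def locate(i: int, j: int, b: int) -> Tuple[Tuple[int, int], int]:
    """Map a global matrix position to (block row id, block column id) and an in-block code."""
    return (i // b, j // b), pack(i % b, j % b, b)


if __name__ == "__main__":
    print(bits_per_element(16))  # 8 bits for a 16 x 16 block
    print(locate(35, 18, 16))    # block (2, 1), in-block position (3, 2)
```

The block ids are stored once per block, so only the short in-block code is paid per element; that is where the space saving of a block encoding comes from.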
Conclusion
  • In this paper the authors proposed PEGASUS, a graph mining package for very large graphs using the HADOOP architecture.
  • Other open source libraries such as HAMA (Hadoop Matrix Algebra) [42] can benefit significantly from PEGASUS.
  • One major research direction is to add to PEGASUS an eigensolver, which will compute the top k eigenvectors and eigenvalues of a matrix.
  • Other directions include tensor analysis on HADOOP [43] and inference of graphical models at large scale.
Tables
  • Table 1: Order and size of networks
Funding
  • The authors would like to thank YAHOO! for providing them with the web graph and access to the M45. This material is based upon work supported by the National Science Foundation under Grants No. IIS-0705359 and IIS-0808661, and under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344 (LLNL-CONF-404625), subcontracts B579447 and B580840.
References
  • [1] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener, “Graph structure in the web,” Computer Networks 33, 2000.
  • [2] J. Dean and S. Ghemawat, “Mapreduce: Simplified data processing on large clusters,” OSDI, 2004.
  • [3] J. Chen, O. R. Zaiane, and R. Goebel, “Detecting communities in social networks using max-min modularity,” SDM, 2009.
  • [4] T. Falkowski, A. Barth, and M. Spiliopoulou, “Dengraph: A density-based community detection algorithm,” Web Intelligence, 2007.
  • [5] G. Karypis and V. Kumar, “Parallel multilevel k-way partitioning for irregular graphs,” SIAM Review, vol. 41, no. 2, 1999.
  • [6] S. Ranu and A. K. Singh, “Graphsig: A scalable approach to mining significant subgraphs in large graph databases,” ICDE, 2009.
  • [7] Y. Ke, J. Cheng, and J. X. Yu, “Top-k correlative graph mining,” SDM, 2009.
  • [8] P. Hintsanen and H. Toivonen, “Finding reliable subgraphs from large probabilistic graphs,” PKDD, 2008.
  • [9] J. Cheng, J. X. Yu, B. Ding, P. S. Yu, and H. Wang, “Fast graph pattern matching,” ICDE, 2008.
  • [10] F. Zhu, X. Yan, J. Han, and P. S. Yu, “gprune: A constraint pushing framework for graph pattern mining,” PAKDD, 2007.
  • [11] C. Chen, X. Yan, F. Zhu, and J. Han, “gapprox: Mining frequent approximate patterns from a massive network,” ICDM, 2007.
  • [12] X. Yan and J. Han, “gspan: Graph-based substructure pattern mining,” ICDM, 2002.
  • [13] N. S. Ketkar, L. B. Holder, and D. J. Cook, “Subdue: Compression-based frequent pattern discovery in graph data,” OSDM, August 2005.
  • [14] M. Kuramochi and G. Karypis, “Finding frequent patterns in a large sparse graph,” SIAM Data Mining Conference, 2004.
  • [15] C. Wang, W. Wang, J. Pei, Y. Zhu, and B. Shi, “Scalable mining of large disk-based graph databases,” KDD, 2004.
  • [16] N. Wang, S. Parthasarathy, K.-L. Tan, and A. K. H. Tung, “Csv: Visualizing and mining cohesive subgraph,” SIGMOD, 2008.
  • [17] S. Brin and L. Page, “The anatomy of a large-scale hypertextual (web) search engine,” in WWW, 1998.
  • [18] J. Kleinberg, “Authoritative sources in a hyperlinked environment,” in Proc. 9th ACM-SIAM SODA, 1998.
  • [19] C. E. Tsourakakis, U. Kang, G. L. Miller, and C. Faloutsos, “Doulion: Counting triangles in massive graphs with a coin,” KDD, 2009.
  • [20] C. E. Tsourakakis, M. N. Kolountzakis, and G. L. Miller, “Approximate triangle counting,” Apr 2009. [Online]. Available: http://arxiv.org/abs/0904.3761
  • [21] U. Kang, C. Tsourakakis, A. Appel, C. Faloutsos, and J. Leskovec, “Hadi: Fast diameter estimation and mining in massive graphs with hadoop,” CMU-ML-08-117, 2008.
  • [22] T. Qian, J. Srivastava, Z. Peng, and P. C. Sheu, “Simultaneously finding fundamental articles and new topics using a community tracking method,” PAKDD, 2009.
  • [23] N. Shrivastava, A. Majumder, and R. Rastogi, “Mining (social) network graphs to detect random link attacks,” ICDE, 2008.
  • [24] Y. Shiloach and U. Vishkin, “An O(log n) parallel connectivity algorithm,” Journal of Algorithms, pp. 57–67, 1982.
  • [25] B. Awerbuch and Y. Shiloach, “New connectivity and msf algorithms for ultracomputer and pram,” ICPP, 1983.
  • [26] D. Hirschberg, A. Chandra, and D. Sarwate, “Computing connected components on parallel computers,” Communications of the ACM, vol. 22, no. 8, pp. 461–464, 1979.
  • [27] J. Greiner, “A comparison of parallel algorithms for connected components,” Proceedings of the 6th ACM Symposium on Parallel Algorithms and Architectures, June 1994.
  • [28] G. Aggarwal, M. Datar, S. Rajagopalan, and M. Ruhl, “On the streaming model augmented with a sorting primitive,” Proceedings of FOCS, 2004.
  • [29] R. Lammel, “Google’s mapreduce programming model – revisited,” Science of Computer Programming, vol. 70, pp. 1–30, 2008.
  • [30] “Hadoop information,” http://hadoop.apache.org/.
  • [31] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, “Pig latin: a not-so-foreign language for data processing,” in SIGMOD ’08, 2008, pp. 1099–1110.
  • [32] S. Papadimitriou and J. Sun, “Disco: Distributed coclustering with map-reduce,” ICDM, 2008.
  • [33] R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou, “Scope: easy and efficient parallel processing of massive data sets,” VLDB, 2008.
  • [34] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan, “Interpreting the data: Parallel analysis with sawzall,” Scientific Programming Journal, 2005.
  • [35] R. L. Grossman and Y. Gu, “Data mining using high performance data clouds: experimental studies using sector and sphere,” KDD, 2008.
  • [36] J.-Y. Pan, H.-J. Yang, C. Faloutsos, and P. Duygulu, “Automatic multimedia cross-modal correlation discovery,” ACM SIGKDD, Aug. 2004.
  • [37] J. Leskovec, D. Chakrabarti, J. M. Kleinberg, and C. Faloutsos, “Realistic, mathematically tractable graph generation and evolution, using kronecker multiplication,” PKDD, 2005.
  • [38] M. E. J. Newman, “Power laws, pareto distributions and zipf’s law,” Contemporary Physics, no. 46, pp. 323–351, 2005.
  • [39] M. Mcglohon, L. Akoglu, and C. Faloutsos, “Weighted graphs and disconnected components: patterns and a generator,” KDD, pp. 524–532, 2008.
  • [40] R. Dunbar, “Grooming, gossip, and the evolution of language,” Harvard Univ Press, October 1998.
  • [41] G. Pandurangan, P. Raghavan, and E. Upfal, “Using pagerank to characterize web structure,” COCOON, August 2002.
  • [42] “Hama website,” http://incubator.apache.org/hama/.
  • [43] T. G. Kolda and J. Sun, “Scalable tensor decompositions for multi-aspect data mining,” ICDM, 2008.