# PEGASUS: A Peta-Scale Graph Mining System Implementation and Observations

ICDM '09 Proceedings of the 2009 Ninth IEEE International Conference on Data Mining, pp.229-238, (2009)

Abstract

In this paper, we describe PEGASUS, an open source peta graph mining library which performs typical graph mining tasks such as computing the diameter of the graph, computing the radius of each node, and finding the connected components. As the size of graphs reaches several giga-, tera- or peta-bytes, the necessity for such a library grows...

Introduction

- Graphs are ubiquitous: computer networks, social networks, mobile call networks, the World Wide Web [1], and protein regulation networks, to name a few. The large volume of available data, the low cost of storage, and the stunning success of online social networks and Web 2.0 applications all lead to graphs of unprecedented size.
- Based on HADOOP, the authors describe PEGASUS, a graph mining package for handling graphs with billions of nodes and edges.
- Several algorithms exist for finding connected components, using Breadth-First Search, Depth-First Search, “propagation” ([24], [25], [26]), or “contraction” [27].
- These works rely on a shared memory model, which limits their ability to handle large, disk-resident graphs.
- MAPREDUCE has two major advantages: (a) the programmer is oblivious

Highlights

- Graphs are ubiquitous: computer networks, social networks, mobile call networks, the World Wide Web [1], and protein regulation networks, to name a few. The large volume of available data, the low cost of storage, and the stunning success of online social networks and Web 2.0 applications all lead to graphs of unprecedented size.
- Based on HADOOP, here we describe PEGASUS, a graph mining package for handling graphs with billions of nodes and edges.
- The careful implementation of GIM-V, with several optimizations, and several graph mining operations (PageRank, Random Walk with Restart (RWR), diameter estimation, and connected components).
- We show how we can customize GIM-V to handle important graph mining operations including PageRank, Random Walk with Restart, diameter estimation, and connected components.
- In this paper we proposed PEGASUS, a graph mining package for very large graphs using the HADOOP architecture
- We identified the common, underlying primitive of several graph mining operations, and we showed that it is a generalized form of a matrix-vector multiplication
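
As an illustration of the PageRank customization mentioned in the highlights, the following is a minimal single-machine sketch, not the PEGASUS HADOOP implementation. The operation names `combine2`, `combineAll`, and `assign` follow the GIM-V terminology of the paper, which this summary does not spell out; the function names `build_matrix` and `pagerank_gim_v`, the damping factor `c = 0.85`, and the fixed iteration count are assumptions made for the example. The sketch also assumes every node has at least one outgoing link and that no link is listed twice.

```python
from collections import Counter


def build_matrix(links):
    """Turn directed links (src, dst) into sparse entries (i, j, m_ij) of the
    column-normalized, transposed adjacency matrix: m[dst][src] = 1 / outdeg(src)."""
    outdeg = Counter(src for src, _ in links)
    return [(dst, src, 1.0 / outdeg[src]) for src, dst in links]


def pagerank_gim_v(links, n, c=0.85, iters=100):
    """PageRank written as repeated GIM-V passes over a sparse matrix.

    Hypothetical helper for illustration; assumes no dangling nodes.
    """
    entries = build_matrix(links)
    v = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iters):
        partial = [[] for _ in range(n)]
        for i, j, m_ij in entries:
            # combine2: scale the neighbor's rank by the matrix entry and c
            partial[i].append(c * m_ij * v[j])
        # combineAll: add the random-jump term to the summed partial results;
        # assign: simply overwrite the old value with the new one
        v = [(1.0 - c) / n + sum(xs) for xs in partial]
    return v


if __name__ == "__main__":
    # Nodes 0 and 1 both link to node 2, which links back to node 0.
    print(pagerank_gim_v([(0, 2), (1, 2), (2, 0)], n=3))
```

In this toy graph node 2 receives the most incoming rank and node 1 receives none beyond the random-jump term, so the resulting scores are ordered accordingly.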

Methods

- How can the authors quickly find connected components, diameter, PageRank, and node proximities of very large graphs?
- Even if these tasks seem unrelated, the authors can unify them using the GIM-V primitive, standing for Generalized Iterative Matrix-Vector multiplication, which the authors describe next.
- GIM-V, or ‘Generalized Iterative Matrix-Vector multiplication’, is a generalization of normal matrix-vector multiplication.
- Suppose the authors have an n by n matrix M and a vector v of size n.
- The usual matrix-vector multiplication is $M \times v = v'$, where $v'_i = \sum_{j=1}^{n} m_{i,j} v_j$.
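
The generalization described above can be sketched in a few lines. This is a minimal in-memory illustration under stated assumptions, not the MapReduce implementation: the three pluggable operations are named after the paper's `combine2`, `combineAll`, and `assign` (not shown in this summary), while `gim_v`, `matvec`, and `connected_components` are hypothetical names chosen for the example. As a simplification, `combine_all` here does not receive the row index.

```python
def gim_v(entries, v, combine2, combine_all, assign):
    """One GIM-V pass over sparse entries (i, j, m_ij).

    combine2(m_ij, v_j)   -> partial result for one matrix entry
    combine_all(partials) -> merged result for row i
    assign(v_i, new)      -> value stored back into the vector
    """
    partial = [[] for _ in v]
    for i, j, m_ij in entries:
        partial[i].append(combine2(m_ij, v[j]))
    return [assign(v[i], combine_all(partial[i])) for i in range(len(v))]


def matvec(entries, v):
    """Ordinary matrix-vector product: multiply, sum, overwrite."""
    return gim_v(entries, v,
                 combine2=lambda m, vj: m * vj,
                 combine_all=sum,
                 assign=lambda old, new: new)


def connected_components(edges, n):
    """Label each node with the smallest node id in its component by
    iterating GIM-V with (multiply, min, min) until nothing changes."""
    # Symmetric 0/1 adjacency matrix stored as sparse entries.
    entries = [(a, b, 1) for a, b in edges] + [(b, a, 1) for a, b in edges]
    labels = list(range(n))
    while True:
        new = gim_v(entries, labels,
                    combine2=lambda m, vj: m * vj,
                    combine_all=lambda xs: min(xs, default=float("inf")),
                    assign=min)
        if new == labels:
            return labels
        labels = new


if __name__ == "__main__":
    print(matvec([(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)], [1, 1]))
    print(connected_components([(0, 1), (1, 2), (3, 4)], 6))
```

Swapping only the three operations turns the same loop from a plain product into a label-propagation routine, which is the sense in which seemingly unrelated graph mining tasks share one primitive.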

Results

- In GIM-V BL, each block can be specified by a block row id and a block column id stored as two 4-byte integers, and elements inside the block can be referenced using 2 × log b bits, where b is the block width.
- This is possible because log b bits suffice to refer to a row or column inside a block.
- In the second spike at size 1101, more than 80% of the components are porn sites disconnected from the giant connected component.
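
The block addressing described in the first two bullets above can be illustrated with a small sketch. The helper names `block_coords` and `in_block_bits` are hypothetical, and the sketch assumes that the block width `b` is a power of two and that the logarithm is base 2; the source only writes "log b".

```python
def block_coords(i, j, b):
    """Split a global position (i, j) into a block id and an in-block offset
    for square blocks of width b.

    Returns (block_row, block_col, row_in_block, col_in_block).
    """
    return i // b, j // b, i % b, j % b


def in_block_bits(b):
    """Bits needed to address one element inside a b x b block: 2 * log2(b).

    Assumes b is a power of two.
    """
    if b < 1 or b & (b - 1) != 0:
        raise ValueError("block width must be a power of two")
    log_b = (b - 1).bit_length()  # equals log2(b) when b is a power of two
    return 2 * log_b


if __name__ == "__main__":
    print(block_coords(37, 5, 16))
    print(in_block_bits(16))
```

With `b = 16`, for example, each in-block position needs 4 bits per coordinate, so the per-element addressing cost is far below that of storing a full row id and column id for every element.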

Conclusion

- In this paper the authors proposed PEGASUS, a graph mining package for very large graphs using the HADOOP architecture.
- Other open source libraries such as HAMA (Hadoop Matrix Algebra) [42] can benefit significantly from PEGASUS.
- One major research direction is to add to PEGASUS an eigensolver, which will compute the top k eigenvectors and eigenvalues of a matrix.
- Other directions include tensor analysis on HADOOP ([43]) and inference of graphical models at large scale.

- Table 1: Order and size of networks

Funding

- The authors would like to thank YAHOO! for providing the web graph and access to the M45. This material is based upon work supported by the National Science Foundation under Grants No. IIS-0705359 and IIS-0808661, and under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344 (LLNL-CONF-404625), subcontracts B579447 and B580840.

Reference

- [1] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener, “Graph structure in the web,” Computer Networks 33, 2000.
- [2] J. Dean and S. Ghemawat, “Mapreduce: Simplified data processing on large clusters,” OSDI, 2004.
- [3] J. Chen, O. R. Zaiane, and R. Goebel, “Detecting communities in social networks using max-min modularity,” SDM, 2009.
- [4] T. Falkowski, A. Barth, and M. Spiliopoulou, “Dengraph: A density-based community detection algorithm,” Web Intelligence, 2007.
- [5] G. Karypis and V. Kumar, “Parallel multilevel k-way partitioning for irregular graphs,” SIAM Review, vol. 41, no. 2, 1999.
- [6] S. Ranu and A. K. Singh, “Graphsig: A scalable approach to mining significant subgraphs in large graph databases,” ICDE, 2009.
- [7] Y. Ke, J. Cheng, and J. X. Yu, “Top-k correlative graph mining,” SDM, 2009.
- [8] P. Hintsanen and H. Toivonen, “Finding reliable subgraphs from large probabilistic graphs,” PKDD, 2008.
- [9] J. Cheng, J. X. Yu, B. Ding, P. S. Yu, and H. Wang, “Fast graph pattern matching,” ICDE, 2008.
- [10] F. Zhu, X. Yan, J. Han, and P. S. Yu, “gprune: A constraint pushing framework for graph pattern mining,” PAKDD, 2007.
- [11] C. Chen, X. Yan, F. Zhu, and J. Han, “gapprox: Mining frequent approximate patterns from a massive network,” ICDM, 2007.
- [12] X. Yan and J. Han, “gspan: Graph-based substructure pattern mining,” ICDM, 2002.
- [13] N. S. Ketkar, L. B. Holder, and D. J. Cook, “Subdue: Compression-based frequent pattern discovery in graph data,” OSDM, August 2005.
- [14] M. Kuramochi and G. Karypis, “Finding frequent patterns in a large sparse graph,” SIAM Data Mining Conference, 2004.
- [15] C. Wang, W. Wang, J. Pei, Y. Zhu, and B. Shi, “Scalable mining of large disk-based graph databases,” KDD, 2004.
- [16] N. Wang, S. Parthasarathy, K.-L. Tan, and A. K. H. Tung, “Csv: Visualizing and mining cohesive subgraph,” SIGMOD, 2008.
- [17] S. Brin and L. Page, “The anatomy of a large-scale hypertextual (web) search engine,” in WWW, 1998.
- [18] J. Kleinberg, “Authoritative sources in a hyperlinked environment,” in Proc. 9th ACM-SIAM SODA, 1998.
- [19] C. E. Tsourakakis, U. Kang, G. L. Miller, and C. Faloutsos, “Doulion: Counting triangles in massive graphs with a coin,” KDD, 2009.
- [20] C. E. Tsourakakis, M. N. Kolountzakis, and G. L. Miller, “Approximate triangle counting,” Apr 2009. [Online]. Available: http://arxiv.org/abs/0904.3761
- [21] U. Kang, C. Tsourakakis, A. Appel, C. Faloutsos, and J. Leskovec, “Hadi: Fast diameter estimation and mining in massive graphs with hadoop,” CMU-ML-08117, 2008.
- [22] T. Qian, J. Srivastava, Z. Peng, and P. C. Sheu, “Simultaneously finding fundamental articles and new topics using a community tracking method,” PAKDD, 2009.
- [23] N. Shrivastava, A. Majumder, and R. Rastogi, “Mining (social) network graphs to detect random link attacks,” ICDE, 2008.
- [24] Y. Shiloach and U. Vishkin, “An O(log n) parallel connectivity algorithm,” Journal of Algorithms, pp. 57–67, 1982.
- [25] B. Awerbuch and Y. Shiloach, “New connectivity and msf algorithms for ultracomputer and pram,” ICPP, 1983.
- [26] D. Hirschberg, A. Chandra, and D. Sarwate, “Computing connected components on parallel computers,” Communications of the ACM, vol. 22, no. 8, pp. 461–464, 1979.
- [27] J. Greiner, “A comparison of parallel algorithms for connected components,” Proceedings of the 6th ACM Symposium on Parallel Algorithms and Architectures, June 1994.
- [28] G. Aggarwal, M. Datar, S. Rajagopalan, and M. Ruhl, “On the streaming model augmented with a sorting primitive,” Proceedings of FOCS, 2004.
- [29] R. Lammel, “Google’s mapreduce programming model – revisited,” Science of Computer Programming, vol. 70, pp. 1–30, 2008.
- [30] “Hadoop information,” http://hadoop.apache.org/.
- [31] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, “Pig latin: a not-so-foreign language for data processing,” in SIGMOD ’08, 2008, pp. 1099–1110.
- [32] S. Papadimitriou and J. Sun, “Disco: Distributed coclustering with map-reduce,” ICDM, 2008.
- [33] R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou, “Scope: easy and efficient parallel processing of massive data sets,” VLDB, 2008.
- [34] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan, “Interpreting the data: Parallel analysis with sawzall,” Scientific Programming Journal, 2005.
- [35] R. L. Grossman and Y. Gu, “Data mining using high performance data clouds: experimental studies using sector and sphere,” KDD, 2008.
- [36] J.-Y. Pan, H.-J. Yang, C. Faloutsos, and P. Duygulu, “Automatic multimedia cross-modal correlation discovery,” ACM SIGKDD, Aug. 2004.
- [37] J. Leskovec, D. Chakrabarti, J. M. Kleinberg, and C. Faloutsos, “Realistic, mathematically tractable graph generation and evolution, using kronecker multiplication,” PKDD, 2005.
- [38] M. E. J. Newman, “Power laws, pareto distributions and zipf’s law,” Contemporary Physics, no. 46, pp. 323–351, 2005.
- [39] M. Mcglohon, L. Akoglu, and C. Faloutsos, “Weighted graphs and disconnected components: patterns and a generator,” KDD, pp. 524–532, 2008.
- [40] R. Dunbar, “Grooming, gossip, and the evolution of language,” Harvard Univ Press, October 1998.
- [41] G. Pandurangan, P. Raghavan, and E. Upfal, “Using pagerank to characterize web structure,” COCOON, August 2002.
- [42] “Hama website,” http://incubator.apache.org/hama/.
- [43] T. G. Kolda and J. Sun, “Scalable tensor decompositions for multi-aspect data mining,” ICDM, 2008.
