Dense subgraph maintenance under streaming edge weight updates for real-time story identification

The VLDB Journal, no. 2 (2014): 175-199


Abstract

Recent years have witnessed an unprecedented proliferation of social media. People around the globe author, every day, millions of blog posts, micro-blog posts, social network status updates, etc. This rich stream of information can be used to identify, on an ongoing basis, emerging stories, and events that capture popular attention. Stor...

Introduction
  • Recent years have witnessed an unprecedented proliferation of social media. Millions of people around the globe author on a daily basis millions of blog posts, micro-blog posts and social network status updates.
  • Consider the U.S. military strike in Abbottabad, Pakistan in early May 2011, which resulted in the death of Osama bin Laden.
  • This event was extensively covered on Twitter, the popular micro-blogging service, significantly in advance of traditional media, starting with the live coverage of the operation by a local witness and growing to millions of tweets around the world.
  • By piecing together these aspects, the overall event of interest can be inferred.
Highlights
  • Recent years have witnessed an unprecedented proliferation of social media
  • Whereas the focus of this work is to efficiently identify dense subgraphs in an incremental manner, we provide evidence of the effectiveness of our approach
  • We computed dense subgraphs of cardinality up to Nmax = 5, using AVGDEGREE to quantify density, so as to favor larger dense subgraphs; for presentation purposes these were subsequently reranked in a diversity-aware manner [2] (subgraph overlap was penalized by multiplying subgraph density by 1 − 0.8 · )
  • Motivated by the need to mine important stories and events from the social media collective, as they emerge, in this work we examine the problem of maintaining dense subgraphs under streaming edge weight updates
  • For a broad definition of graph density, we propose the first efficient algorithm, DYNDENS, which is based on novel theoretical results regarding the magnitude of change that a single edge weight update can have
  • While it is easy to see how ENGAGEMENT can be applied to this domain, its characteristics are somewhat different from those of real-time story identification, and it would be interesting to explore how to adapt DYNDENS to the diverse challenges this domain imposes
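
The highlights name two density measures, AVGDEGREE and AVGWEIGHT, without restating their definitions. The sketch below is one plausible reading, not the paper's exact formulation: it assumes AVGWEIGHT is the total edge weight divided by the number of node pairs, and AVGDEGREE is the average weighted degree (twice the total weight divided by the number of nodes). The function names and the `weights` dictionary layout are hypothetical.

```python
from itertools import combinations

def total_weight(weights, nodes):
    """Sum the weights of all edges whose endpoints both lie in `nodes`."""
    return sum(weights.get(frozenset(pair), 0.0)
               for pair in combinations(sorted(nodes), 2))

def avg_weight(weights, nodes):
    """Assumed AVGWEIGHT: total weight divided by the number of node pairs."""
    n = len(nodes)
    return total_weight(weights, nodes) / (n * (n - 1) / 2)

def avg_degree(weights, nodes):
    """Assumed AVGDEGREE: average weighted degree inside the subgraph."""
    return 2 * total_weight(weights, nodes) / len(nodes)

# Toy graph: a complete graph on four nodes, every edge with weight 1.
k4 = {frozenset(pair): 1.0 for pair in combinations("abcd", 2)}
triangle = {"a", "b", "c"}
whole = {"a", "b", "c", "d"}

# Under these assumptions AVGWEIGHT rates both subgraphs equally,
# while AVGDEGREE gives the larger subgraph the higher score.
same_avg_weight = avg_weight(k4, triangle) == avg_weight(k4, whole)
larger_wins = avg_degree(k4, whole) > avg_degree(k4, triangle)
```

On this toy graph the two measures disagree about which subgraph is preferable, which is consistent with the remark that AVGDEGREE was used "so as to favor larger dense subgraphs".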
Methods
  • All algorithms evaluated were implemented in Java and executed on a 64-bit HotSpot VM, on a machine with 8 Intel(R) Xeon(R) E5540 CPU cores clocked at 2.53 GHz.
  • In all performance experiments, the time reported is the median time of 3 identical runs.
  • Datasets: Unless otherwise noted, all the experiments were run using real-world datasets, based on a sample of all tweets for May 1st, 2011. The authors' dataset consisted of 13.8M tweets.
  • The sampling was performed by Twitter itself, as part of the restricted access provided to its data stream; for details cf.
  • The authors used an in-house entity extractor [3] to identify mentions of real-world entities.
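
The measurement protocol above (median time of 3 identical runs) can be sketched as a small timing harness. This is an illustration in Python rather than the authors' Java setup; `median_runtime` and `sample_task` are hypothetical names.

```python
import statistics
import time

def median_runtime(task, runs=3):
    """Run `task` `runs` times and return the median wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# Stand-in workload that also counts how often it was invoked.
calls = []

def sample_task():
    calls.append(sum(range(10_000)))

elapsed = median_runtime(sample_task)
```

Reporting the median rather than the mean keeps a single slow run (for example, one hit by a garbage-collection pause) from skewing the reported time.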
Results
  • Having introduced the proposed DYNDENS algorithm, the authors elaborate on its theoretical underpinnings.
  • Section 4.1 presents a general result, on when a single exploration iteration per stable-dense subgraph is sufficient.
  • Whereas the focus of this work is to efficiently identify dense subgraphs in an incremental manner, the authors provide evidence of the effectiveness of the approach.
  • The authors will present some sample results of utilizing dense subgraphs for story identification.
  • In order to present sample results, the authors chose to focus on stories at the granularity of a single day.
Conclusion
  • As witnessed from the above result, the magnitude of δ is directly correlated with the impact on dense subgraphs.
  • A useful analogy is that of an edge weight update as a perturbation: the greater its magnitude δ, the further away in the graph its effects can potentially be felt.
  • In this context, parameter δ_it offers a tunable space-time tradeoff.
  • While it is easy to see how ENGAGEMENT can be applied to this domain, its characteristics are somewhat different from those of real-time story identification, and it would be interesting to explore how to adapt DYNDENS to the diverse challenges this domain imposes.
  • Another interesting technical problem arises when considering the need to adjust the density threshold T during execution, e.g., in order to adapt to changes in the dataset.
  • The authors are actively exploring adapting the techniques used in DYNDENS to more efficiently perform this task.
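
To make the setting concrete, the sketch below shows a brute-force baseline for reacting to a single edge weight update of magnitude δ. It is not DYNDENS: it simply re-evaluates every node set of cardinality up to N_max that contains both endpoints of the updated edge, since only those sets can change density. The density function (total weight over the number of node pairs), the function names, and the data layout are assumptions made for illustration.

```python
from itertools import combinations

def density(weights, nodes):
    """Assumed density: total internal edge weight over the number of node pairs."""
    n = len(nodes)
    total = sum(weights.get(frozenset(pair), 0.0)
                for pair in combinations(sorted(nodes), 2))
    return total / (n * (n - 1) / 2)

def recheck_after_update(weights, all_nodes, u, v, delta, threshold, n_max):
    """Apply the update (u, v, delta), then list the dense sets containing u and v.

    Brute force: exponential in n_max, shown only as a reference baseline.
    """
    key = frozenset((u, v))
    weights[key] = weights.get(key, 0.0) + delta
    others = sorted(set(all_nodes) - {u, v})
    dense = []
    for extra in range(0, n_max - 1):  # candidate sizes 2 .. n_max
        for rest in combinations(others, extra):
            candidate = frozenset((u, v) + rest)
            if density(weights, candidate) >= threshold:
                dense.append(candidate)
    return dense

weights = {frozenset("ab"): 1.0, frozenset("bc"): 1.0, frozenset("ac"): 0.25}
before = density(weights, frozenset("abc"))  # 0.75, below the threshold of 0.8
result = recheck_after_update(weights, "abc", "a", "c",
                              delta=0.5, threshold=0.8, n_max=3)
```

In this toy run the update lifts the three-node set above the threshold while the updated edge on its own stays below it. An incremental algorithm would aim to reach the same answer without enumerating every candidate set.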
Tables
  • Table 1: Definitions of density-related properties
  • Table 2: Summary of main symbols used
  • Table 3: Top stories, May 1st 2011
      - Pres. Obama announces killing of Osama bin Laden. Involving: Barack Obama, U.S. House Permanent Select Committee on Intelligence, Osama bin Laden, NBC News
      - Commentary on death of bin Laden, comparison to famous athletes. Involving: Barack Obama, LeBron James, Delonte West, Osama bin Laden
      - Discussions on Lady Gaga’s activities. Involving: Lady Gaga, Galeria
      - Libya crisis: NATO airstrike results in death of 3 grandchildren of Gaddafi. Involving: NATO, Libya
      - Discussions on Harry Potter. Involving: Hermione Granger, Draco Malfoy, Bella Swan
      - News on Osama bin Laden’s death spreads on Twitter. Involving: Clint Eastwood, Barack Obama, U.S. House Permanent Select Committee on Intelligence, Osama bin Laden, CBS News
    [...] straightforward, as GRASP is geared towards identifying a few large dense subgraphs, as opposed to all dense subgraphs
Related work
  • While we are not aware of any work that addresses the maintenance of dense subgraphs in weighted graphs, under streaming edge weight updates, for a broad definition of density, there exists a rich literature of works dealing with related problems.

    [27] addresses incremental maximal clique maintenance, from a mostly theoretical perspective, and using a growth property. This is very closely related to a special case of ENGAGEMENT (namely, for unweighted graphs, AVGWEIGHT, and T = 1). An important difference is that our instantiation of ENGAGEMENT deals with all cliques, with cardinality constraints, as opposed to maximal cliques of unconstrained cardinality. As discussed in Section 5.2, while the former is better suited to real-time story identification, the latter may be preferable in other scenarios.

    [28] addresses near-clique identification, in an offline setting, again from a mostly theoretical perspective, and using a growth property; this corresponds to the offline version of ENGAGEMENT for unweighted graphs, and AVGWEIGHT. The techniques proposed therein cannot be efficiently dynamized in a straight-forward fashion, as the information they rely upon cannot be efficiently maintained across updates. Our DEGREEPRIORITIZE pruning condition is inspired by the parent degree-based criterion proposed in this work. [23] addresses the same problem, using a similar growth property, and with a focus on a parallel implementation. As with the other works, the techniques developed therein are not straightforward to efficiently dynamize.
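
The relationship between cliques and the special case mentioned above (unweighted graph, AVGWEIGHT, T = 1) can be illustrated with a small sketch. It assumes AVGWEIGHT is total edge weight over the number of node pairs, so that in an unweighted graph a density of 1 means every pair is connected. It also contrasts "all cliques with a cardinality constraint" against "maximal cliques only". All names are hypothetical, and the enumeration is brute force.

```python
from itertools import combinations

def is_clique(edges, nodes):
    """True when every pair in `nodes` is an edge, i.e. AVGWEIGHT equals 1."""
    return all(frozenset(pair) in edges for pair in combinations(nodes, 2))

def cliques_up_to(edges, nodes, n_max):
    """All cliques with between 2 and n_max nodes (brute-force enumeration)."""
    found = []
    for size in range(2, n_max + 1):
        for candidate in combinations(sorted(nodes), size):
            if is_clique(edges, candidate):
                found.append(frozenset(candidate))
    return found

def maximal_only(cliques):
    """Keep only cliques not strictly contained in another listed clique."""
    return [c for c in cliques if not any(c < other for other in cliques)]

# Toy unweighted graph: a triangle a-b-c plus a pendant edge c-d.
edges = {frozenset("ab"), frozenset("bc"), frozenset("ac"), frozenset("cd")}
all_cliques = cliques_up_to(edges, "abcd", n_max=3)
maximal = maximal_only(all_cliques)
```

Here the constrained enumeration reports five cliques (four edges and the triangle), whereas the maximal view keeps only the triangle and the pendant edge. Note that `maximal_only` judges maximality only among the cliques it is given, so it is exact here only because no clique in the toy graph exceeds the cardinality bound.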
References
  • [1] J. Abello, M. G. C. Resende, and S. Sudarsky. Massive quasi-clique detection. In LATIN, pages 598–612, 2002.
  • [2] A. Angel and N. Koudas. Efficient diversity-aware search. In SIGMOD, pages 781–792, 2011.
  • [3] A. Angel, N. Koudas, N. Sarkas, and D. Srivastava. What’s on the grapevine? In SIGMOD, pages 1047–1050, 2009.
  • [4] A. Angel, N. Koudas, N. Sarkas, and D. Srivastava. Dense subgraph maintenance under streaming edge weight updates for real-time story identification. Technical report, 2011. Available at http://tinyurl.com/dyndens.
  • [5] N. Bansal, F. Chiang, N. Koudas, and F. W. Tompa. Seeking stable clusters in the blogosphere. In VLDB, pages 806–817, 2007.
  • [6] Z. Bar-Yossef, R. Kumar, and D. Sivakumar. Reductions in streaming algorithms, with an application to counting triangles in graphs. In SODA, pages 623–632, 2002.
  • [7] D. Chakrabarti, R. Kumar, and A. Tomkins. Evolutionary clustering. In KDD, pages 554–560, 2006.
  • [8] M. Charikar, C. Chekuri, T. Feder, and R. Motwani. Incremental clustering and dynamic information retrieval. In STOC, pages 626–635, 1997.
  • [9] C. Cortes, D. Pregibon, and C. Volinsky. Computational methods for dynamic graphs. JCGS, 12(4):950–970, 2003.
  • [10] D. Eppstein, Z. Galil, and G. F. Italiano. Dynamic graph algorithms. In Algorithms and Theory of Computation Handbook, chapter 8. 1999.
  • [11] M. Ester, H.-P. Kriegel, J. Sander, M. Wimmer, and X. Xu. Incremental clustering for mining in a data warehousing environment. In VLDB, pages 323–333, 1998.
  • [12] G. W. Flake, S. Lawrence, and C. L. Giles. Efficient identification of web communities. In KDD, pages 150–160, 2000.
  • [13] D. Gibson, R. Kumar, and A. Tomkins. Discovering large dense subgraphs in massive graphs. In VLDB, pages 721–732, 2005.
  • [14] A. Goldberg. Finding a maximum density subgraph. Technical report, University of California at Berkeley, 1984.
  • [15] S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O’Callaghan. Clustering data streams: Theory and practice. TKDE, 15(3):515–528, 2003.
  • [16] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In SIGMOD, pages 1–12, 2000.
  • [17] J. Hartline and A. Sharp. An incremental model for combinatorial maximization problems. In WEA, pages 36–48, 2006.
  • [18] J. Hartline and A. Sharp. Incremental flow. Networks, 50(1):77–85, 2007.
  • [19] S. Hill, D. K. Agarwal, R. Bell, and C. Volinsky. Building an effective representation for dynamic networks. Journal of Computational and Graphical Statistics, 15(3):584–608, 2006.
  • [20] S. Khuller and B. Saha. On finding dense subgraphs. In ICALP, pages 597–608, 2009.
  • [21] M.-S. Kim and J. Han. Chronicle: A two-stage density-based clustering algorithm for dynamic networks. In Discovery Science, pages 152–167, 2009.
  • [22] S. Kumar and P. Gupta. An incremental algorithm for the maximum flow problem. JMMA, 2(1):1–16, 2003.
  • [23] J. Long and C. Hartman. ODES: an overlapping dense sub-graph algorithm. Bioinformatics, 26(21):2788–2789, 2010.
  • [24] M. Mathioudakis and N. Koudas. Twittermonitor: trend detection over the twitter stream. In SIGMOD, pages 1155–1158, 2010.
  • [25] P. M. Pardalos and J. Xue. The maximum clique problem. Journal of Global Optimization, 4(3):301–328, 1994.
  • [26] N. Sarkas, A. Angel, N. Koudas, and D. Srivastava. Efficient identification of coupled entities in document collections. In ICDE, pages 769–772, 2010.
  • [27] V. Stix. Finding all maximal cliques in dynamic graphs. Computational Optimization and Applications, 27(2):173–186, 2004.
  • [28] T. Uno. An efficient algorithm for solving pseudo clique enumeration problem. Algorithmica, 56(1):3–16, 2010.
  • [29] N. Wang, S. Parthasarathy, K.-L. Tan, and A. K. H. Tung. Csv: visualizing and mining cohesive subgraphs. In SIGMOD, pages 445–458, 2008.
  • [30] D. Yang, E. A. Rundensteiner, and M. O. Ward. Neighbor-based pattern detection for windows over streaming data. In EDBT, pages 529–540, 2009.