Ego-net Sketching for Streaming Graph Analytics

semanticscholar(2016)

Abstract
We propose a novel, scalable, and principled graph sketching technique based on min-wise local neighborhood sampling. For a graph with n nodes and e edges, we incrementally maintain an in-memory, min-wise neighbor-sampled subgraph bounded by a user-configurable memory limit. This sketch representation can keep up with real-time edge streaming rates and lowers the memory requirement from O(e) to O(n), which makes it particularly useful for streaming graphs, where commonly e ≫ n and both n and e may be unknown a priori. Symmetrization and similarity-based techniques can recover a significant portion of the original graph from these data structures. With bounded memory, the quality of results obtained from the sketch representation is competitive with baselines that use the full graph, and the computational performance is often significantly better. Our framework is flexible and configurable, so numerous other graph analytics algorithms can leverage it.

1. OUR FRAMEWORK

Min-wise independent permutation based hash functions have seen ubiquitous use in graph and network problems, in the context of graph sparsification [10], community detection [8, 9], dense subgraph detection [4], link prediction [11], and computing various measures of interest such as the local triangle count [1]. In this paper, we use minhash in a manner orthogonal to its traditional usage. To the best of our knowledge, its use has not been suggested as a fixed-size sketch with a low memory footprint for an edge-streamed graph. We additionally provide theoretical insights on the type of information retained by this representation.

Figure 1 shows a toy example of the min-wise neighborhood sampling, graph construction, and edge recovery steps of our graph sketching framework. Each row of $M_k$ is initialized with a self-loop and $C$ with zeros.
The edges of the source graph $G$ are processed iteratively by Algorithm 1 to construct the count vector $C$ and the sketch matrix $M_k$ using $k$ different linear min-wise independent hash functions $h_m$ [3, 2]. Each node $i$ of $G$ is represented by row $i$ of $M_k$, which is a min-wise sample of $i$'s egonet. Next, the unique neighbors of each node (row) in $M_k$ form a directed graph $G^*$, which is symmetrized to generate $G_m$. Additionally, using $M_k$ and $C$, $G_m$ is augmented with similarity-induced edges to generate $G_s$, which may be useful in scenarios where a substantial portion of the original graph is lost to sampling (such as Twitter data with its power-law degree distribution). The user can then run a myriad of existing algorithms directly on $G_m$ and $G_s$.

2. METHODOLOGY

2.1 Sketch Creation and Updating

Algorithm 1 Update Sketch Matrix
Parameter: Sketch Matrix $M_k$
Parameter: Count Vector $C$
Parameter: new edge $(i, j)$
    for m = 1 to k do
        if h_m(j) < h_m(M_k[i,m]) then
            M_k[i,m] = j
        end if
        if h_m(i) < h_m(M_k[j,m]) then
            M_k[j,m] = i
        end if
    end for
    C[i]++; C[j]++

2.2 Key Theoretical Insights

We analyze the per-edge retention probability under min-wise sampling and then use it to construct an unbiased estimator of the total number of edges to be retained in $G_m$. The proofs have been omitted due to lack of space.

Lemma 2.1. For any node $i$ with degree $d_i$, the probability of losing any edge $(i, j)$ of $G$ in $G^*$ with $k$ hashes is $\left(1 - \frac{1}{d_i}\right)^k$.

Lemma 2.2. The inclusion probability $p_{ij}$ of any edge $(i, j)$ of $G$ in $G_m$ is
$$p_{ij} = 1 - \left[\left(1 - \frac{1}{d_i}\right)\left(1 - \frac{1}{d_j}\right)\right]^k.$$

Lemma 2.3. From $G_m$, an unbiased estimator of the total number of edges of $G$ using the edges $E_m$ of $G_m$ is
$$\sum_{(i,j) \in E_m} \frac{1}{1 - \left[\left(1 - \frac{1}{d_i}\right)\left(1 - \frac{1}{d_j}\right)\right]^k}.$$

3. EXPERIMENTS

We implemented all code in C++ and ran the experiments on a 3.40GHz Intel(R) Core(TM) i7-2600 machine with 256 [...]

[Figure 1 caption] In this example, let the randomized $h_1$ permutation be 1, 5, 2, 4, 3, that is, $h_1(1) < h_1(5) < \dots < h_1(3)$, and let the $h_2$ permutation be 4, 1, 5, 2, 3.
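
As an illustration only, the update rule of Algorithm 1, the symmetrization step that yields G_m, and the edge-count estimator of Lemma 2.3 can be sketched in Python as follows. The class name `EgoNetSketch`, the helper `make_hashes`, the prime modulus, and the use of dictionaries in place of a fixed-size matrix are assumptions made for this sketch, not details from the paper (the authors report a C++ implementation). The estimator applies the lemma's formula as stated, with the k-th power on the bracketed term, and uses C[i] as a stand-in for the degree d_i, which is valid only if the stream contains no duplicate edges.

```python
import random

# Assumed modulus: a Mersenne prime larger than any node id used here.
PRIME = 2_147_483_647


def make_hashes(k, seed=0):
    """Return k linear hash functions h(x) = (a*x + b) mod PRIME (hypothetical helper)."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(k)]
    # Bind a and b as defaults so each lambda keeps its own coefficients.
    return [lambda x, a=a, b=b: (a * x + b) % PRIME for a, b in params]


class EgoNetSketch:
    """Illustrative min-wise ego-net sketch; node ids are non-negative ints below PRIME."""

    def __init__(self, k, seed=0):
        self.hashes = make_hashes(k, seed)
        self.M = {}  # node -> list of k sampled neighbors (one row of M_k)
        self.C = {}  # node -> number of incident edges seen so far

    def _row(self, i):
        # Rows start as self-loops and counts as zero, as in the paper's setup.
        if i not in self.M:
            self.M[i] = [i] * len(self.hashes)
            self.C[i] = 0
        return self.M[i]

    def update(self, i, j):
        """Process one streamed edge (i, j), following Algorithm 1."""
        row_i, row_j = self._row(i), self._row(j)
        for m, h in enumerate(self.hashes):
            if h(j) < h(row_i[m]):
                row_i[m] = j
            if h(i) < h(row_j[m]):
                row_j[m] = i
        self.C[i] += 1
        self.C[j] += 1

    def symmetrized_edges(self):
        """Unique neighbors per row form G*; dropping direction gives G_m."""
        edges = set()
        for i, row in self.M.items():
            for j in set(row):
                if j != i:  # skip the initial self-loop entries
                    edges.add((min(i, j), max(i, j)))
        return edges

    def estimate_edge_count(self):
        """Sum 1 / p_ij over retained edges, with p_ij as in Lemma 2.2."""
        k = len(self.hashes)
        total = 0.0
        for i, j in self.symmetrized_edges():
            miss = (1 - 1 / self.C[i]) * (1 - 1 / self.C[j])
            total += 1 / (1 - miss ** k)
        return total


if __name__ == "__main__":
    sketch = EgoNetSketch(k=2, seed=3)
    for u, v in [(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (4, 5)]:
        sketch.update(u, v)
    print(sorted(sketch.symmetrized_edges()))
    print(sketch.estimate_edge_count())
```

Each node stores exactly k entries regardless of its degree, which is where the O(n) memory bound for fixed k comes from. Which edges survive depends on the random hash coefficients, so the printed output varies with the seed.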