
Finding frequent items in data streams

PVLDB, no. 2 (2008): 1530-1541


Abstract

The frequent items problem is to process a stream of items and find all items occurring more than a given fraction of the time. It is one of the most heavily studied problems in data stream mining, dating back to the 1980s. Many applications rely directly or indirectly on finding the frequent items, and implementations are in use in large...


Introduction
  • One of the most basic problems on a data stream [HRR98,AMS99] is that of finding the most frequently occurring items in the stream.
  • The space bound depends on the distribution of the frequency of the items in the data stream.
  • Using a count sketch, the authors reliably estimate the frequencies of the most common items, which directly yields a 1-pass algorithm for solving FindApproxTop(S, k, ε).
  • Given two streams, the authors can compute the difference of their sketches, which leads directly to a 2-pass algorithm for computing the items whose frequency changes the most between the streams.
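The count sketch idea above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `CountSketch`, the helper `approx_top`, the parameters `t` (number of rows) and `b` (counters per row), and the use of Python's built-in `hash` with random salts in place of pairwise-independent hash functions are all assumptions made for illustration. Each row maps an item to one signed counter, and the estimate is the median of the per-row values.

```python
import random
import statistics


class CountSketch:
    """t rows of b signed counters (illustrative, not the paper's exact construction)."""

    def __init__(self, t, b, seed=0):
        rng = random.Random(seed)
        self.t, self.b = t, b
        self.tables = [[0] * b for _ in range(t)]
        # Assumption: salted built-in hash() stands in for the
        # pairwise-independent hash functions used in the analysis.
        self.salts = [(rng.getrandbits(64), rng.getrandbits(64)) for _ in range(t)]

    def _bucket(self, i, item):
        return hash((self.salts[i][0], item)) % self.b

    def _sign(self, i, item):
        return 1 if hash((self.salts[i][1], item)) & 1 else -1

    def add(self, item, count=1):
        # Add a signed increment to one counter in every row.
        for i in range(self.t):
            self.tables[i][self._bucket(i, item)] += self._sign(i, item) * count

    def estimate(self, item):
        # Median over rows of (counter * sign) damps the effect of collisions.
        return statistics.median(
            self.tables[i][self._bucket(i, item)] * self._sign(i, item)
            for i in range(self.t)
        )


def approx_top(stream, k, sketch):
    """One pass: keep the k items with the largest current estimates (simplified)."""
    top = {}  # item -> most recent estimate
    for item in stream:
        sketch.add(item)
        est = sketch.estimate(item)
        if item in top or len(top) < k:
            top[item] = est
        else:
            smallest = min(top, key=top.get)
            if est > top[smallest]:
                del top[smallest]
                top[item] = est
    return sorted(top, key=top.get, reverse=True)
```

For example, `approx_top(queries, 10, CountSketch(t=5, b=1024))` would return up to ten candidate frequent queries from an iterable `queries`; the values of `t` and `b` here are arbitrary and would need to be chosen from the space bounds discussed in the paper.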
Highlights
  • One of the most basic problems on a data stream [HRR98,AMS99] is that of finding the most frequently occurring items in the stream
  • We shall assume here that the stream is large enough that memory-intensive solutions such as sorting the stream or keeping a counter for each distinct element are infeasible, and that we can afford to make only one pass over the data. This problem comes up in the context of search engines, where the streams in question are streams of queries sent to the search engine and we are interested in finding the most frequent queries handled in some period of time
  • The space bound depends on the distribution of the frequency of the items in the data stream
  • A summary of our final results is as follows: We introduce a simple data structure called a count sketch, and give a 1-pass algorithm for computing the count sketch of a stream
  • The most straightforward solution to the FindCandidateTop(S, k, l) problem is to keep a uniform random sample of the elements stored as a list of items plus a counter for each item
  • If the space used by an object is Ψ, and we have a Zipfian with z = 1, the sampling algorithm uses O(k log …
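The sampling baseline described above (a uniform random sample stored as items plus a counter per item) might look like the following. This is a rough sketch under assumptions: the function name `sample_candidates`, the inclusion probability `p`, and the fixed `seed` are hypothetical and do not come from the paper.

```python
import random
from collections import Counter


def sample_candidates(stream, p, l, seed=0):
    """Sample each element independently with probability p and
    return the l items that occur most often in the sample."""
    rng = random.Random(seed)
    counts = Counter()  # one counter per distinct sampled item
    for item in stream:
        if rng.random() < p:
            counts[item] += 1
    return [item for item, _ in counts.most_common(l)]
```

The space used is governed by the number of distinct items that end up in `counts`, which is why the comparison in the paper measures the sampling algorithm by the expected number of distinct elements in the sample.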
Results
  • In Section 4.2, the authors show how the algorithm can be adapted to find elements with the largest change in frequency.
  • Fang et al [FSGM+96] consider the related problem of finding all items in a data stream which occur with frequency above some fixed threshold, which they call iceberg queries.
  • Alon, Matias and Szegedy [AMS99] give an Ω(n) lower bound on the space complexity of any algorithm for estimating the frequency of the largest item given an arbitrary data stream.
  • Recall that the authors would like a data structure that maintains the approximate counts of the high frequency elements in a stream and is compact.
  • The authors use Lemma 5, setting ε to be a constant so that, with high probability, the algorithm's list of l = O(k) elements is guaranteed to contain the most frequent k elements.
  • The authors compare these bounds with the space requirements for the random sampling algorithm.
  • The size of the random sample required to ensure that the k most frequent elements occur in the random sample with probability 1 − δ is (n/n_k) log(k/δ), where n_k is the number of occurrences of the k-th most frequent element.
  • The authors measure the space requirement of the random sampling algorithm by the expected number of distinct elements in the random sample.
  • It turns out that for Zipf parameter z ≤ 1, the expected number of distinct elements is within a constant factor of the sample size.
  • For Zipf parameter z < 1/2: random sampling requires m (k/m)^z log(k/δ); the Count Sketch algorithm requires m^(1−2z) k^(2z) log(n/δ).
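The sample-size bound can be checked numerically. Assuming the sample is drawn by x independent uniform draws from a stream of length n (an assumption made here for the arithmetic), an element with n_k occurrences is missed with probability (1 − n_k/n)^x ≤ exp(−x·n_k/n). A union bound over the k most frequent elements gives failure probability at most δ when x = (n/n_k) ln(k/δ). The helper names below are hypothetical.

```python
import math


def zipf_frequencies(m, z, n):
    """Expected counts of m items in a stream of length n under a Zipf law with parameter z."""
    weights = [1.0 / (i ** z) for i in range(1, m + 1)]
    total = sum(weights)
    return [n * w / total for w in weights]


def sample_size(n, n_k, k, delta):
    """Draws needed so that each of the top k items is sampled with probability >= 1 - delta."""
    return (n / n_k) * math.log(k / delta)


def miss_probability_bound(n, n_k, k, x):
    """Union bound on the chance that some top-k item is absent from x draws."""
    return k * (1.0 - n_k / n) ** x
```

With n = 1000, n_k = 10, k = 10 and δ = 0.1, `sample_size` gives 100·ln(100), roughly 460 draws, and `miss_probability_bound` at that x stays below 0.1.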
Conclusion
  • The authors can adapt the algorithm for finding most frequent elements to this problem of finding elements whose frequencies change the most.
  • Both algorithms need counters that require O(log n) bits, but the authors' algorithm keeps only k objects from the stream, while the sampling algorithm keeps a potentially much larger set of items from the stream.
  • If the space used by an object is Ψ, and the authors have a Zipfian with z = 1, the sampling algorithm uses O(k log …
  • As for the max-change problem, the authors note that there is still an open problem of finding the elements with the max-percent change, or other objective functions that somehow balance absolute and relative changes
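Because the sketch is a linear function of the item counts, subtracting one stream and adding the other yields a sketch of the frequency differences. The following is a minimal two-pass illustration of that max-change idea, not the authors' exact procedure: the function names, the defaults `t=5` and `b=256`, and the use of Python's built-in `hash` in place of independent hash functions are assumptions, and the second pass here tracks only estimated changes rather than exact counts.

```python
import statistics


def _cell(i, item, b):
    # Assumption: built-in hash() stands in for the row's bucket and sign hashes.
    bucket = hash(("bucket", i, item)) % b
    sign = 1 if hash(("sign", i, item)) & 1 else -1
    return bucket, sign


def update(sketch, item, delta):
    b = len(sketch[0])
    for i, row in enumerate(sketch):
        bucket, sign = _cell(i, item, b)
        row[bucket] += sign * delta


def estimate(sketch, item):
    b = len(sketch[0])
    values = []
    for i, row in enumerate(sketch):
        bucket, sign = _cell(i, item, b)
        values.append(row[bucket] * sign)
    return statistics.median(values)


def max_change(s1, s2, k, t=5, b=256):
    """Return up to k items whose estimated count differs most between s1 and s2."""
    diff = [[0] * b for _ in range(t)]
    # Pass 1: build the sketch of (counts in s2) - (counts in s1).
    for item in s1:
        update(diff, item, -1)
    for item in s2:
        update(diff, item, +1)
    # Pass 2: keep at most k items with the largest estimated absolute change.
    top = {}
    for stream in (s1, s2):
        for item in stream:
            if item in top:
                continue
            change = abs(estimate(diff, item))
            if len(top) < k:
                top[item] = change
            else:
                smallest = min(top, key=top.get)
                if change > top[smallest]:
                    del top[smallest]
                    top[item] = change
    return sorted(top, key=top.get, reverse=True)
```

Both `s1` and `s2` need to be re-iterable (for example lists), since each is read once per pass.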
Tables
  • Table 1: Comparison of space requirements for random sampling vs. our algorithm
Reference
  • ACM Symposium on Principles of Database Systems, pages 274–281, 2001.
  • [AMS99] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137–147, 1999.
  • Science, pages 501–511, 1999.
  • ACM-SIAM Symposium on Discrete Algorithms, pages 165–174, 2000.
  • 22nd International Conference on Very Large Data Bases, pages 307–317, 1996.
  • Theory of Computing, 2002.
  • International Conference on Management of Data, pages 331–342, 1998.
  • [GM99] Phillip Gibbons and Yossi Matias. Synopsis data structures for massive data sets. In Proc. 10th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 909–910, 1999.
  • Clustering data streams. In Proc. 41st IEEE Symposium on Foundations of Computer Science, pages 359–366, 2000.
  • Google. Google zeitgeist - search patterns, trends, and surprises according to google. http://www.google.com/press/zeitgeist.html.
  • [HRR98] Monika Henzinger, Prabhakar Raghavan, and Sridhar Rajagopalan. Computing on data streams. Technical Report SRC TR 1998-011, DEC, 1998.
  • [Ind00] Piotr Indyk. Stable distributions, pseudorandom generators, embeddings and data stream computation. In Proc. 41st IEEE Symposium on Foundations of Computer Science, pages 148–155, 2000.