Finding frequent items in data streams
PVLDB, no. 2 (2008): 1530-1541
The frequent items problem is to process a stream of items and find all items occurring more than a given fraction of the time. It is one of the most heavily studied problems in data stream mining, dating back to the 1980s. Many applications rely directly or indirectly on finding the frequent items, and implementations are in use in large...
- One of the most basic problems on a data stream [HRR98,AMS99] is that of finding the most frequently occurring items in the stream.
- The space bound depends on the distribution of the frequency of the items in the data stream.
- The authors show that a count sketch reliably estimates the frequencies of the most common items, which directly yields a 1-pass algorithm for solving FindApproxTop(S, k, ε).
- Given two streams, the authors can compute the difference of their sketches, which leads directly to a 2-pass algorithm for computing the items whose frequency changes the most between the streams.
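To make the count sketch idea concrete, here is a minimal Python sketch of such a structure: a few rows of signed counters, a per-row estimate, and the median across rows. The class name `CountSketch`, the `depth` and `width` parameters, and the salted BLAKE2b hashing are assumptions chosen for illustration and are not taken from the paper. The top-k bookkeeping that a full 1-pass algorithm would need is omitted. The `subtract` method illustrates why sketches of two streams can be differenced: every counter is a linear function of the item counts.

```python
import hashlib
from statistics import median


class CountSketch:
    """Illustrative count sketch: `depth` rows of `width` signed counters."""

    def __init__(self, depth=5, width=256):
        self.depth = depth
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]

    def _bucket_and_sign(self, item, row):
        # Hypothetical hashing scheme (an assumption, not the paper's):
        # one salted BLAKE2b digest per row, split into a bucket index
        # and a +1/-1 sign.
        digest = hashlib.blake2b(
            repr(item).encode("utf-8"),
            digest_size=16,
            salt=row.to_bytes(8, "little"),
        ).digest()
        bucket = int.from_bytes(digest[:8], "little") % self.width
        sign = 1 if digest[8] & 1 else -1
        return bucket, sign

    def add(self, item, count=1):
        # Update one signed counter in every row.
        for r in range(self.depth):
            bucket, sign = self._bucket_and_sign(item, r)
            self.rows[r][bucket] += sign * count

    def estimate(self, item):
        # Each row gives a noisy estimate; the median damps collisions.
        guesses = []
        for r in range(self.depth):
            bucket, sign = self._bucket_and_sign(item, r)
            guesses.append(sign * self.rows[r][bucket])
        return median(guesses)

    def subtract(self, other):
        """Return the sketch of this stream's counts minus the other's."""
        if (self.depth, self.width) != (other.depth, other.width):
            raise ValueError("sketches must share dimensions")
        diff = CountSketch(self.depth, self.width)
        diff.rows = [
            [a - b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(self.rows, other.rows)
        ]
        return diff
```

For example, after `today.add("q", 10)` and `yesterday.add("q", 4)` on two fresh sketches, `today.subtract(yesterday).estimate("q")` estimates the change in frequency of `"q"` between the two streams.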
- We shall assume here that the stream is large enough that memory-intensive solutions such as sorting the stream or keeping a counter for each distinct element are infeasible, and that we can afford to make only one pass over the data. This problem comes up in the context of search engines, where the streams in question are streams of queries sent to the search engine and we are interested in finding the most frequent queries handled in some period of time
- A summary of our final results is as follows: we introduce a simple data structure called a count sketch, and give a 1-pass algorithm for computing the count sketch of a stream.
- The most straightforward solution to the FindCandidateTop(S, k, l) problem is to keep a uniform random sample of the elements stored as a list of items plus a counter for each item
- If the space used by an object is Ψ, and we have a Zipfian with z = 1, the sampling algorithm uses O(k log uses
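The sampling baseline described above (a uniform random sample stored as items plus a counter per item) might look like the following minimal Python sketch. The function names `sample_counts` and `candidate_top`, the `keep_probability` parameter, and the fixed seed are assumptions made for illustration. The space cost of this approach is the number of distinct items that end up in the sample.

```python
import random
from collections import Counter


def sample_counts(stream, keep_probability, seed=0):
    """Keep each stream element independently with the given probability."""
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    sample = Counter()
    for item in stream:
        if rng.random() < keep_probability:
            sample[item] += 1  # one counter per distinct sampled item
    return sample


def candidate_top(stream, l, keep_probability, seed=0):
    """Return up to l items that are most frequent within the sample."""
    sample = sample_counts(stream, keep_probability, seed)
    return [item for item, _ in sample.most_common(l)]
```

With `keep_probability=1.0` the sample is the whole stream, so the counters are exact. Smaller probabilities trade accuracy for fewer stored items.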
- In Section 4.2, the authors show how the algorithm can be adapted to find elements with the largest change in frequency.
- Fang et al. [FSGM+96] consider the related problem of finding all items in a data stream which occur with frequency above some fixed threshold, which they call iceberg queries.
- Alon, Matias and Szegedy [AMS99] give an Ω(n) lower bound on the space complexity of any algorithm for estimating the frequency of the largest item given an arbitrary data stream.
- Recall that the authors would like a data structure that maintains the approximate counts of the high frequency elements in a stream and is compact.
- By Lemma 5, with ε set to a constant, the algorithm's list of l = O(k) elements is guaranteed, with high probability, to contain the k most frequent elements.
- The authors compare these bounds with the space requirements for the random sampling algorithm.
- The size of the random sample required to ensure that the k most frequent elements occur in the random sample with probability 1 − δ is (n/n_k) log(k/δ), where n_k is the number of occurrences of the k-th most frequent element.
- The authors measure the space requirement of the random sampling algorithm by the expected number of distinct elements in the random sample.
- It turns out that for Zipf parameter z ≤ 1, the expected number of distinct elements is within a constant factor of the sample size.
- Table 1, row for Zipf parameter z < 1/2: random sampling m (k/m)^z log(k/δ); Count Sketch algorithm m^(1−2z) k^(2z) log(n/δ).
- The authors can adapt the algorithm for finding most frequent elements to this problem of finding elements whose frequencies change the most.
- Both algorithms need counters that require O(log n) bits, but the Count Sketch algorithm keeps only k objects from the stream, while the sampling algorithm keeps a potentially much larger set of items from the stream.
- If the space used by an object is Ψ, and the distribution is Zipfian with z = 1, the sampling algorithm uses O(k log uses
- As for the max-change problem, the authors note that there is still an open problem of finding the elements with the max-percent change, or other objective functions that somehow balance absolute and relative changes
- Table 1: Comparison of space requirements for random sampling vs. our algorithm
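As a rough numerical check of the sampling side of this comparison, the Python sketch below evaluates the sample-size bound on idealised Zipfian counts. It assumes the bound reads (n/n_k) log(k/δ), with n the stream length and n_k the count of the k-th most frequent item, and that the small-z sampling entry reads m (k/m)^z log(k/δ), with m the number of distinct items. Both readings are reconstructions rather than quotations, and all function names are hypothetical.

```python
import math


def zipf_counts(m, z, scale=1.0):
    """Idealised Zipfian counts: the i-th of m items gets scale / i**z."""
    return [scale / (i ** z) for i in range(1, m + 1)]


def sample_size(counts, k, delta):
    """(n / n_k) * log(k / delta), for counts sorted in decreasing order."""
    n = sum(counts)
    n_k = counts[k - 1]
    return (n / n_k) * math.log(k / delta)


def closed_form(m, k, z, delta):
    """m * (k / m)**z * log(k / delta): the assumed small-z table entry."""
    return m * (k / m) ** z * math.log(k / delta)


m, k, z, delta = 10_000, 10, 0.25, 0.05
exact = sample_size(zipf_counts(m, z), k, delta)
approx = closed_form(m, k, z, delta)
ratio = exact / approx  # stays near the constant 1 / (1 - z) for large m
```

Under these assumptions the two quantities differ only by a constant factor close to 1/(1 − z), which is consistent with stating the table entry without constants.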
- ACM Symposium on Principles of Database Systems, pages 274–281, 2001.
- [AMS99] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137–147, 1999.
- Science, pages 501–511, 1999.
- ACM-SIAM Symposium on Discrete Algorithms, pages 165–174, 2000.
- 22nd International Conference on Very Large Data Bases, pages 307–317, 1996.
- Theory of Computing, 2002.
- International Conference on Management of Data, pages 331–342, 1998.
- [GM99] Phillip Gibbons and Yossi Matias. Synopsis data structures for massive data sets. In Proc. 10th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 909–910, 1999.
- Clustering data streams. In Proc. 41st IEEE Symposium on Foundations of Computer Science, pages 359–366, 2000.
- Google. Google zeitgeist - search patterns, trends, and surprises according to google. http://www.google.com/press/zeitgeist.html.
- [HRR98] Monika Henzinger, Prabhakar Raghavan, and Sridhar Rajagopalan. Computing on data streams. Technical Report SRC TR 1998-011, DEC, 1998.
- [Ind00] Piotr Indyk. Stable distributions, pseudorandom generators, embeddings and data stream computation. In Proc. 41st IEEE Symposium on Foundations of Computer Science, pages 148–155, 2000.