Finding Subcube Heavy Hitters in Data Streams.

International World Wide Web Conferences (2017)

Abstract
We address the problem of finding subcube heavy hitters in high-dimensional data streams. Formally, the data stream consists of d-dimensional items, and a subcube is a subset of coordinates. The goal is to report all heavy hitters of an arbitrary query subcube correctly with high probability. We show that the sampling approach uses space that matches the lower bound given by Liberty et al. up to polylogarithmic factors. This lower bound implies a quadratic dependency on the number of dimensions d in the worst case. Our main contribution is to circumvent this quadratic bottleneck via a model-based approach. In particular, we assume that the dimensions are related to each other via the Naive Bayes model. We present a new two-pass algorithm for our problem that uses space linear in the number of dimensions d. Furthermore, we exhibit a fast polynomial-time algorithm for reporting all heavy hitters of a query subcube. We also perform an empirical study with a synthetic dataset as well as real datasets from Adobe and Yandex. We show that, when space is small, our algorithm achieves the least error in finding subcube heavy hitters compared to a one-pass variant or the sampling approach. Our work shows the potential of a model-based approach to data stream analysis.
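To make the problem setting concrete, the following is a minimal sketch of a generic sampling baseline: keep a uniform reservoir sample of the stream, then answer a subcube query by projecting the sampled items onto the queried coordinates and reporting the projections whose sample frequency reaches a threshold. This is an illustration only, not the algorithm from the paper. The function names, the threshold parameter `phi`, and the sample size `k` are assumptions introduced here, and `k` is not chosen to match any space bound stated in the abstract.

```python
import random
from collections import Counter


def reservoir_sample(stream, k, rng):
    """Return a uniform random sample of at most k items from an iterable.

    Standard reservoir sampling (Algorithm R); `rng` is a random.Random.
    """
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


def subcube_heavy_hitters(sample, coords, phi):
    """Estimate the heavy hitters of the subcube given by `coords`.

    Each sampled item is projected onto `coords`; a projection is reported
    if it makes up at least a `phi` fraction of the sample.
    Returns a dict mapping projection -> estimated frequency.
    """
    n = len(sample)
    if n == 0:
        return {}
    counts = Counter(tuple(item[c] for c in coords) for item in sample)
    return {value: c / n for value, c in counts.items() if c >= phi * n}


if __name__ == "__main__":
    # Hypothetical 3-dimensional binary stream, for illustration only.
    rng = random.Random(0)
    stream = [(1, 0, 1)] * 600 + [(0, 1, 1)] * 300 + [(1, 1, 0)] * 100
    rng.shuffle(stream)
    sample = reservoir_sample(stream, 200, rng)
    # Query the subcube made of coordinates 0 and 2.
    print(subcube_heavy_hitters(sample, (0, 2), 0.25))
```

The sample is built once, before the query is known, which is what allows an arbitrary subcube to be queried afterwards. The estimates are only as accurate as the sample is large, which is where the space cost of this kind of baseline comes from.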