Efficient aggregation algorithms for probabilistic data

SODA(2007)

引用 129|浏览13
暂无评分
摘要
We study the problem of computing aggregation operators on probabilistic data in an I/O efficient manner. Algorithms for aggregation operators such as SUM, COUNT, AVG, and MIN/MAX are crucial to applications on probabilistic databases. We give a generalization of the classical data stream model to handle probabilistic data, called probabilistic streams, in order to analyze the I/O-requirements of our algorithms. Whereas the algorithms for SUM and COUNT turn out to be simple, the problem is harder for both AVG and MIN/MAX. Although data stream algorithms typically use randomness, all of the algorithms we present are deterministic. For MIN and MAX, we obtain efficient one-pass data stream algorithms for estimating each of these quantities with relative accuracy (1 + ε), using constant update time per element and O(1/ε lg R) space, where each element has a value between 1 and R. For AVG, we present a new data stream algorithm for estimating its value to a relative accuracy (1 + ε) in O(log n) passes over the data with O(1/ε log2 n) space and update time O(1/ε log n) per element. On the other hand, we prove a space lower bound of Ω(n) for any exact one-pass deterministic data stream algorithm. Complementing this result, we also present an O(n log2 n)-time exact deterministic algorithm which uses O(n) space (thus removing the data-streaming restriction), improving dramatically on the previous O(n3)-time algorithm. Our algorithms for AVG involve a novel technique based on generating functions and numerical integration, which may be of independent interest. Finally, we provide an experimental analysis and show that our algorithms, coupled with additional heuristics, have excellent performance over large data sets.
更多
查看译文
关键词
large data set,relative accuracy,aggregation operator,log2 n,efficient aggregation algorithm,probabilistic data,deterministic data stream algorithm,classical data stream model,efficient one-pass data stream,log n,data stream,probabilistic database
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要