Processing Online Aggregation on Skewed Data in MapReduce

CIKM '13: 22nd ACM International Conference on Information and Knowledge Management, San Francisco, California, USA, October 2013

Abstract
In online aggregation, a system continuously maintains an estimate of the final answer to an aggregate query throughout execution, along with statistically meaningful bounds on the estimate's accuracy. Given the popularity of ad hoc analytic query processing over enormous datasets, providing online aggregation in a large-scale MapReduce environment is an important emerging application need. However, existing work targets single-node, centralized environments and cannot be easily extended to fit the MapReduce paradigm. The main challenge is that, with many input blocks and prevalent data skew, the runtimes of upstream operators are uneven, so the set of intermediate results delivered to downstream operators at any particular point cannot be treated as a random sample, which leads to biased estimates. In this paper, we analyze how data skew breaks randomness in a distributed environment. To address this, we present a keep-order approach that accounts for the biases that can arise when estimating aggregates over skewed datasets in a distributed environment. Moreover, we provide a pre-computing method to ensure a fast result rate. A set of experiments indicates that our method can provide reasonably precise estimates early in the execution, with statistically valid confidence bounds, even when significant skew exists.
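The bias described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's keep-order method. It only shows, under assumed conditions, why a running estimate goes wrong when results do not arrive in random order. The exponential data, the `RunningMean` class, and the `estimate_after` helper are all hypothetical, and the confidence interval is a standard normal approximation rather than the bounds derived in the paper.

```python
import math
import random


class RunningMean:
    """Welford-style running mean with a normal-approximation interval."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def interval(self, z=1.96):
        """Return (estimate, half_width); z=1.96 is roughly a 95% interval."""
        if self.n < 2:
            return self.mean, float("inf")
        variance = self.m2 / (self.n - 1)
        return self.mean, z * math.sqrt(variance / self.n)


def estimate_after(values, k):
    """Running estimate after the first k values have been consumed."""
    est = RunningMean()
    for x in values[:k]:
        est.add(x)
    return est.interval()


# Assumed skewed data: many small values, a few large ones.
rng = random.Random(0)
data = [rng.expovariate(1.0) for _ in range(10_000)]
true_mean = sum(data) / len(data)

# Arrival order A: shuffled, so every prefix is a uniform random sample.
random_order = data[:]
rng.shuffle(random_order)

# Arrival order B: small values first, a crude stand-in for cheap work
# finishing before expensive work on skewed inputs.
skewed_order = sorted(data)

k = 1_000
print("true mean     :", round(true_mean, 3))
print("random prefix :", estimate_after(random_order, k))
print("skewed prefix :", estimate_after(skewed_order, k))
```

With the sorted arrival order, the first 1,000 results are the smallest 10% of the data, so the running mean falls well below the true mean while the interval stays narrow. The estimate looks precise but is biased, which is the failure mode the abstract attributes to uneven upstream runtimes.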