Groupwise Analytics via Adaptive MapReduce

Liping Peng, Kai Zeng, Andrey Balmin, Vuk Ercegovac, Peter J. Haas, Yannis Sismanis

2015 IEEE 31st International Conference on Data Engineering (ICDE), 2015

Abstract

Shared-nothing systems such as Hadoop vastly simplify parallel programming when processing disk-resident data whose size exceeds aggregate cluster memory. Such systems incur a significant performance penalty, however, on the important class of "groupwise set-valued analytics" (GSVA) queries in which the data is dynamically partitioned into groups and then a set-valued synopsis is computed for some or all of the groups. Key examples of synopses include top-k sets, bottom-k sets, and uniform random samples. Applications of GSVA queries include micro-marketing, root-cause analysis for problem diagnosis, and fraud detection. A naive approach to executing GSVA queries first reshuffles all of the data so that all records in a group are at the same node and then computes the synopsis for the group. This approach can be extremely inefficient when, as is typical, only a very small fraction of the records in each group actually contribute to the final groupwise synopsis, so that most of the shuffling effort is wasted. We show how to significantly speed up GSVA queries by slightly modifying the shared-nothing environment to allow tasks to occasionally access a small, common data structure; we focus on the Hadoop setting and use the "Adaptive MapReduce" infrastructure of Vernica et al. to implement the data structure. Our approach retains most of the advantages of a system such as Hadoop while significantly improving GSVA query performance, and also allows for incremental updating of query results. Experiments show speedups of up to 5x. Importantly, our new technique can potentially be applied to other shared-nothing systems with disk-resident data.
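The contrast between the naive full shuffle and the shared-structure idea can be illustrated for the top-k case with a small single-process simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `naive_topk` and `threshold_topk`, the dictionary `shared` standing in for the small common data structure, and the choice to publish thresholds only when a simulated map task finishes are all hypothetical. In a real cluster the map tasks would run concurrently and consult the structure only occasionally.

```python
import heapq
from collections import defaultdict


def naive_topk(partitions, k):
    """Shuffle every record to its group, then keep the k largest values per group."""
    groups = defaultdict(list)
    shuffled = 0
    for part in partitions:
        for key, value in part:
            groups[key].append(value)
            shuffled += 1
    result = {g: heapq.nlargest(k, vs) for g, vs in groups.items()}
    return result, shuffled


def threshold_topk(partitions, k):
    """Same result, but each simulated map task filters records against a
    shared per-group threshold before anything is shuffled."""
    # Hypothetical shared structure: group -> lower bound on the group's
    # global k-th largest value, learned from tasks that already finished.
    shared = {}
    shuffled_records = []
    for part in partitions:  # one partition = one simulated map task
        local = defaultdict(list)  # group -> min-heap holding at most k values
        for key, value in part:
            bound = shared.get(key)
            if bound is not None and value < bound:
                continue  # k values >= bound already exist, so this one cannot win
            heap = local[key]
            if len(heap) < k:
                heapq.heappush(heap, value)
            elif value > heap[0]:
                heapq.heapreplace(heap, value)
        # End of task: shuffle only the local candidates and publish thresholds.
        for key, heap in local.items():
            shuffled_records.extend((key, v) for v in heap)
            if len(heap) == k:
                # The heap holds k real values, so its minimum is a valid bound.
                shared[key] = max(shared.get(key, heap[0]), heap[0])
    # Simulated reduce side: merge the surviving candidates per group.
    groups = defaultdict(list)
    for key, value in shuffled_records:
        groups[key].append(value)
    result = {g: heapq.nlargest(k, vs) for g, vs in groups.items()}
    return result, len(shuffled_records)
```

Both functions return the per-group top-k values together with the number of records that were shuffled. The filtered version is safe because a threshold is published only when a task has actually seen k values at or above it, and those k values are themselves shuffled, so any record strictly below the threshold cannot belong to the final top-k set.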

Keywords

data handling, data structures, parallel programming, query processing, set theory, storage management, GSVA queries, Hadoop, adaptive MapReduce, aggregate cluster memory, bottom-k sets, disk-resident data, disk-resident data processing, fraud detection, groupwise set-valued analytics, micromarketing, problem diagnosis, root-cause analysis, set-valued synopsis, shared-nothing systems, top-k sets, uniform random samples
