Sharing across Multiple MapReduce Jobs

Tomasz Nykiel, Michalis Potamias, Chaitanya Mishra, George Kollios, Nick Koudas

ACM Trans. Database Syst. (2014)

Abstract

Large-scale data analysis lies at the core of modern enterprises and scientific research. With the emergence of cloud computing, the use of an analytical query processing infrastructure can be directly associated with monetary cost. MapReduce has been a popular framework in the context of cloud computing, designed to serve long-running queries (jobs) that can be processed in batch mode. Because different jobs often perform similar work, there are many opportunities for sharing. In principle, sharing similar work reduces the overall amount of work, which can in turn reduce the monetary charges for using the processing infrastructure. In this article we present MRShare, a sharing framework tailored to MapReduce. MRShare transforms a batch of queries into a new batch that will be executed more efficiently, by merging jobs into groups and evaluating each group as a single query. Based on our cost model for MapReduce, we define an optimization problem and provide a solution that derives the optimal grouping of queries. Given the query grouping, we merge jobs appropriately and submit them to MapReduce for processing. A key property of MRShare is that it is independent of the MapReduce implementation. Experiments with our prototype, built on top of Hadoop, demonstrate the overall effectiveness of our approach.

MRShare is primarily designed for handling I/O-intensive queries. However, with the development of high-level languages operating on top of MapReduce, user queries executed in this model have become more complex and CPU-intensive. Such queries can commonly be modeled as pipelines of CPU-expensive filters evaluated over the input stream. Examples of such filters include, but are not limited to, index probes and certain types of joins. In this article we adapt some of the standard techniques for filter ordering used in relational and stream databases, propose extensions to them, and implement them in MRAdaptiveFilter, an extension of MRShare for expensive filter ordering tailored to MapReduce that handles both single-query and batch-query execution modes. We present an experimental evaluation that demonstrates the additional benefits of MRAdaptiveFilter when executing CPU-intensive queries in MRShare.
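
To make the idea of expensive filter ordering concrete, the following is a minimal sketch of the classic rank-based heuristic from the relational literature, not the MRAdaptiveFilter algorithm itself. It assumes independent filters with known per-record cost and pass rate; the filter names, costs, and pass rates are hypothetical values chosen for illustration.

```python
from itertools import permutations


def expected_cost(filters):
    """Expected per-record cost of applying filters in the given order.

    Each filter is a (name, cost, pass_rate) tuple. A record reaches a
    filter only if it passed every earlier one (independence is assumed).
    """
    total, reach = 0.0, 1.0
    for _name, cost, pass_rate in filters:
        total += reach * cost   # pay the cost only for records that got here
        reach *= pass_rate      # fraction of records that survive this filter
    return total


def rank_order(filters):
    """Sort filters by rank = cost / (1 - pass_rate), cheapest rank first."""
    def rank(f):
        _name, cost, pass_rate = f
        drop_rate = 1.0 - pass_rate
        # A filter that drops nothing is never worth running early.
        return float("inf") if drop_rate == 0 else cost / drop_rate
    return sorted(filters, key=rank)


# Hypothetical filters: (name, cost per record, fraction of records passing).
filters = [
    ("index_probe", 5.0, 0.9),
    ("regex_match", 2.0, 0.5),
    ("join_lookup", 8.0, 0.1),
]

ordered = rank_order(filters)
best_by_brute_force = min(expected_cost(p) for p in permutations(filters))

print([name for name, _, _ in ordered])
print(expected_cost(ordered), best_by_brute_force)
```

Under the independence assumption, the rank ordering matches the cheapest ordering found by exhaustive search here. In practice, costs and pass rates would have to be estimated, for instance from samples of the input, which is where adaptive techniques come in.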

Keywords

algorithms, systems, MapReduce, parallel databases, sharing MapReduce jobs, query processing
