Same Queries, Different Data: Can We Predict Runtime Performance?

Data Engineering Workshops (2012)

Abstract
We consider MapReduce workloads that are produced by analytics applications. In contrast to ad hoc query workloads, analytics applications consist of fixed data flows that are run over newly arriving data sets or over different portions of an existing data set. Examples of such workloads include document analysis/indexing, social media analytics, and ETL (Extract, Transform, Load). Motivated by these workloads, we propose a technique that predicts the runtime performance of a fixed set of queries running over varying input data sets. Our prediction technique splits each query into several segments and estimates each segment's performance using machine learning models. These per-segment estimates are plugged into a global analytical model to predict the overall query runtime. Our approach uses minimal statistics about the input data sets (e.g., tuple size, cardinality), complemented with historical information about prior query executions (e.g., execution time). We analyze the accuracy of predictions for several segment granularities on both standard analytical benchmarks, such as TPC-DS [17], and several real workloads. Prediction error is below 25% for 90% of predictions.
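
The two-level idea in the abstract (per-segment learned estimates combined by a global model) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the per-segment model is a one-feature least-squares line on input bytes (cardinality times tuple size), the global model simply adds the segment estimates as if segments ran one after another, every segment is fed the same input statistics, and the names `SegmentModel`, `fit_segment`, and `predict_query` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SegmentModel:
    """Hypothetical per-segment model: seconds ~= intercept + slope * input_bytes."""
    intercept: float
    slope: float

    def predict(self, cardinality: int, tuple_size: int) -> float:
        # Input bytes are approximated as cardinality * tuple size (assumption).
        return self.intercept + self.slope * cardinality * tuple_size


def fit_segment(history):
    """Fit one segment by ordinary least squares.

    `history` holds (cardinality, tuple_size, seconds) records taken from
    prior executions of the same segment on other input data sets.
    """
    xs = [card * size for card, size, _ in history]
    ys = [secs for _, _, secs in history]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        # All runs had the same input size: fall back to the mean runtime.
        return SegmentModel(mean_y, 0.0)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov / var_x
    return SegmentModel(mean_y - slope * mean_x, slope)


def predict_query(models, cardinality, tuple_size):
    """Assumed global model: segments run sequentially, so estimates add up."""
    return sum(m.predict(cardinality, tuple_size) for m in models)


# Synthetic history for two segments of one fixed query (made-up numbers).
map_history = [(1000, 100, 12.0), (2000, 100, 22.0), (4000, 100, 42.0)]
reduce_history = [(1000, 100, 6.0), (2000, 100, 11.0), (4000, 100, 21.0)]

models = [fit_segment(map_history), fit_segment(reduce_history)]
estimate = predict_query(models, cardinality=3000, tuple_size=100)
print(f"predicted runtime: {estimate:.1f} s")
```

Because the synthetic history is exactly linear, the estimate for the unseen input lands at about 48 seconds. A real deployment would likely need richer features and a global model that accounts for segments overlapping in time, which this sketch deliberately leaves out.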
Keywords
real workloads, existing data, mapreduce workloads, prior query execution, varying input data set, input data set, predict runtime performance, overall query runtime, analytics application, fixed data flow, different data, learning artificial intelligence, computational modeling, data analysis, data models, predictive models, data sets, statistics, estimation