Automatically Configuring Parallelism for Hybrid Layouts

advances in databases and information systems(2019)

引用 1|浏览29
暂无评分
摘要
Distributed processing frameworks process data in parallel by dividing it into multiple partitions and each partition is processed in a separate task. The number of tasks is always created based on the total file size. However, this can lead to launch more tasks than needed in the case of hybrid layouts, because they help to read less data for certain operations (i.e., projection, selection). The over-provisioning of tasks may increase the job execution time and induce significant waste of computing resources. The latter due to the fact that each task introduces extra overhead (e.g., initialization, garbage collection, etc.). To allow a more efficient use of resources and reduce the job execution time, we propose a cost-based approach that decides the number of tasks based on the data being read. The proposed cost-model can be utilized in a multi-objective approach to decide both the number of tasks and number of machines for execution.
更多
查看译文
关键词
Big data, Hybrid storage layouts, Parallelism, Parquet, Spark
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要