CMU-PDL-17-107, November 2017

Abstract
The 3Sigma cluster scheduling system uses job runtime histories in a new way. Knowing how long each job will run allows a scheduler to more effectively pack jobs with diverse time concerns (e.g., deadline vs. the-sooner-the-better) and placement preferences on heterogeneous cluster resources. However, existing schedulers use single-point estimates (e.g., the mean or median of a relevant subset of historical runtimes), and we show that they are fragile in the face of real-world estimate error profiles. In particular, analysis of job traces from three different large-scale cluster environments shows that, while most job runtimes can be predicted well, even state-of-the-art predictors have wide error profiles, with 8–23% of predictions off by a factor of two or more. Instead of reducing relevant history to a single point, 3Sigma schedules jobs based on full distributions of relevant runtime history, and explicitly creates plans that mitigate the effects of anticipated runtime uncertainty. Experiments with workloads derived from the same traces show that 3Sigma approaches the end-to-end performance of a hypothetical perfect predictor, and greatly outperforms a state-of-the-art scheduler using point estimates from a state-of-the-art predictor. 3Sigma reduces SLO miss rate, increases cluster goodput, and improves or matches latency for best-effort jobs.

Acknowledgements: We thank the member companies of the PDL Consortium (Broadcom, Dell EMC, Facebook, Google, Hewlett-Packard Labs, Hitachi, Intel, Microsoft Research, MongoDB, NetApp, Oracle, Salesforce, Samsung, Seagate Technology, Two Sigma, Toshiba, Veritas, Western Digital) for their interest, insights …
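
The contrast the abstract draws between a single-point estimate and a full distribution of runtime history can be illustrated with a minimal sketch. This is not 3Sigma's algorithm; the function names, the runtime history, and the deadline below are all hypothetical and chosen only to show why a median can hide the risk of missing a deadline.

```python
from statistics import median

def point_estimate(history):
    """Single-point runtime estimate: the median of past runtimes."""
    return median(history)

def prob_finish_by(history, start, deadline):
    """Empirical probability that a job started at `start` finishes by
    `deadline`, treating the runtime history as the distribution."""
    budget = deadline - start
    return sum(1 for r in history if r <= budget) / len(history)

# Hypothetical runtimes (minutes) of similar past jobs; mostly short,
# with a heavy tail (illustrative data, not from the traces).
history = [10, 11, 12, 12, 13, 14, 30, 45]
deadline = 20  # hypothetical deadline, minutes from now

est = point_estimate(history)
meets_by_point = est <= deadline           # point estimate says "safe"
p = prob_finish_by(history, 0, deadline)   # distribution exposes the tail

print(f"median estimate: {est} min, meets deadline: {meets_by_point}")
print(f"empirical P(finish by deadline): {p:.2f}")
```

Under these assumed numbers, the point estimate suggests the deadline is comfortably met, while the empirical distribution shows that a quarter of comparable past jobs would have missed it. A scheduler that sees the distribution can weigh that tail risk when placing the job.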