Aladdin: Optimized Maximum Flow Management For Shared Production Clusters

2019 IEEE 33rd International Parallel and Distributed Processing Symposium (IPDPS 2019)

Abstract
The rise in popularity of long-lived applications (LLAs), such as deep learning and latency-sensitive online Web services, has brought new challenges for cluster schedulers in shared production environments. Scheduling LLAs requires support for complex placement constraints (e.g., running multiple containers of an application on different machines) and larger degrees of parallelism to provide global optimization. However, existing schedulers often suffer from severe constraint violations, high latency, and low resource efficiency. This paper describes Aladdin, a novel cluster scheduler that can maximize resource efficiency while avoiding constraint violations: (i) it proposes a multidimensional and nonlinear capacity function to support constraint expressions; (ii) it applies an optimized maximum flow algorithm to improve resource efficiency. Experiments with an Alibaba workload trace from a 10,000-machine cluster show that Aladdin can reduce violated constraints by as much as 20%. Meanwhile, it improves resource efficiency by 50% compared with state-of-the-art schedulers.
Keywords
Scheduling, Maximum flow, Long-lived applications, Resource efficiency
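
To make the flow-based view of placement in the abstract concrete, the following is a minimal sketch of how container placement with an anti-affinity constraint can be posed as a maximum flow problem. It is not Aladdin's algorithm: it uses plain Edmonds-Karp and simple integer capacities instead of the multidimensional, nonlinear capacity function the paper proposes, and it models each machine as a single count of free container slots. The names `max_flow`, `build_network`, `place`, and the example applications and machines are all hypothetical.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp: augment along shortest residual paths found by BFS."""
    # Residual graph: copy the capacities and add zero-capacity reverse edges.
    residual = {u: dict(edges) for u, edges in capacity.items()}
    residual.setdefault(source, {})
    residual.setdefault(sink, {})
    for u, edges in capacity.items():
        for v in edges:
            residual.setdefault(v, {}).setdefault(u, 0)

    total = 0
    while True:
        # BFS from the source over edges with remaining capacity.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            break  # no augmenting path left, so the flow is maximum

        # Find the bottleneck capacity along the path, then push flow.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck
    return total, residual


def build_network(apps, machines):
    """apps: {app: containers requested}; machines: {machine: free slots}."""
    capacity = {"source": {}, "sink": {}}
    for app, containers in apps.items():
        capacity["source"][("app", app)] = containers
        # Capacity 1 on each app -> machine edge encodes anti-affinity:
        # at most one container of an application per machine.
        capacity[("app", app)] = {("machine", m): 1 for m in machines}
    for m, slots in machines.items():
        capacity[("machine", m)] = {"sink": slots}
    return capacity


def place(apps, machines):
    """Return (containers placed, {app: machines hosting one container})."""
    placed, residual = max_flow(build_network(apps, machines), "source", "sink")
    placement = {
        app: [m for m in machines
              if residual[("app", app)][("machine", m)] == 0]  # saturated edge
        for app in apps
    }
    return placed, placement


# Hypothetical workload: two applications, three machines.
placed, placement = place({"web": 3, "trainer": 2}, {"m1": 2, "m2": 2, "m3": 1})
```

In this sketch, the flow value is the number of containers placed, and a saturated app-to-machine edge means one container of that application runs on that machine. Because the anti-affinity constraint is built into the edge capacities, any feasible flow respects it by construction; a request that cannot be satisfied simply yields a flow smaller than the number of requested containers.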