GODEL: Unified Large-Scale Resource Management and Scheduling at ByteDance

Wu Xiang, Yakun Li, Yuquan Ren, Fan Jiang, Chaohui Xin, Varun Gupta, Chao Xiang, Xinyi Song, Meng Liu, Bing Li, Kaiyang Shao, Chen Xu, Wei Shao, Yuqi Fu, Wilson Wang, Cong Xu, Wei Xu, Caixue Lin, Rui Shi, Yuming Liang

SoCC '23: Proceedings of the 2023 ACM Symposium on Cloud Computing (2023)

Abstract
Over the last few years, the scale of ByteDance's compute infrastructure has expanded significantly due to rapid business growth. To keep pace with this hyper-scale growth, some business groups resorted to managing their own compute infrastructure stacks running different scheduling systems, such as Kubernetes and YARN, which created two major pain points: increasing resource fragmentation across business groups and inadequate resource elasticity between workloads of different business priorities. Isolation across business groups (and their compute infrastructure management) leads to inefficient compute resource utilization and prevents us from serving business growth needs in the long run. To meet these challenges, we propose a resource management and scheduling system named GODEL, which provides a unified compute infrastructure for all business groups to run their diverse workloads under a unified resource pool. It co-locates various workloads on every machine to achieve better resource utilization and elasticity. GODEL is built upon Kubernetes, the de facto open-source container orchestration system, but with significant components replaced or enhanced to accommodate various workloads at a large scale. In production, it manages clusters with tens of thousands of machines, achieves high overall resource utilization of over 60%, and reaches a scheduling throughput of up to 5000 pods per second. This paper reports on the design and implementation of GODEL. It also discusses the lessons and best practices we learned in developing and operating it in production at ByteDance's scale.
Keywords
Cluster, Resource Management, Schedule