PPS: Fair and efficient black-box scheduling for multi-tenant GPU clusters

Kaihao Ma, Zhenkun Cai, Xiao Yan, Yang Zhang, Zhi Liu, Yihui Feng, Chao Li, Wei Lin, James Cheng

Parallel Computing (2024)

Abstract
Multi-tenant GPU clusters, in which users purchase GPU quota to run their neural network training jobs, are common. However, strict quota-based scheduling often leads to cluster under-utilization, while allowing quota groups to use excess GPUs improves utilization but causes fairness problems. We propose PPS, a probabilistic prediction based scheduler, which uses job history statistics to predict future cluster status and make good scheduling decisions. Unlike existing schedulers, which rely on deep learning frameworks to adjust bad scheduling decisions and/or require detailed job information, PPS treats jobs as black boxes: once a job is scheduled, PPS runs it to completion without adjustment, and it requires only aggregate job statistics. The black-box property is favorable for its generality, compatibility and security, and is made possible by the predictability of aggregate resource utilization statistics in large clusters. Extensive experiments show that PPS achieves high cluster utilization and good fairness simultaneously.
Keywords
Model training, GPU cluster, Cluster scheduling, Multi-tenant, Resource sharing, Cloud computing