ExplSched: Maximizing Deep Learning Cluster Efficiency for Exploratory Jobs

2023 IEEE International Conference on Cluster Computing (CLUSTER 2023)

Abstract
Resource management for Deep Learning (DL) clusters is essential for system efficiency and model training quality. Existing schedulers provided by DL frameworks are mostly adapted from traditional HPC clusters and usually optimize job makespan, assuming that DL training jobs run to completion. Unfortunately, reports indicate that a fair share of training jobs in production clusters are exploratory and often finish unsuccessfully (over 30%). This stems from a distinct characteristic of Deep Neural Network (DNN) training: it is an exploratory process with frequent user interventions, such as adjusting model structures, tuning hyperparameters, and exploring feature validity. Existing DL cluster schedulers based on offline algorithms are ill-suited to exploratory jobs, because unexpected early terminations can cause noticeable resource waste. Moreover, DL training jobs are iterative and usually yield diminishing returns as they progress. Allocating resources equally among training iterations is therefore inefficient, and with exploratory jobs it can further degrade system efficiency. The fundamental goal of a DL training job is to improve model quality, usually indicated by the loss reduction (job profit) of a DNN model. This paper introduces a novel scheduling problem for exploratory jobs that seeks to maximize the overall training profit of a DL cluster. We propose ExplSched, an online scheduling solution based on the primal-dual framework that achieves a competitive ratio of 2α, which is O(ln n). It uses a resource price function that emphasizes the ratio of job profit to resource consumption in order to make quick resource allocation decisions. Experimental results show that ExplSched achieved an average system utility improvement of 87.28% compared with other related work.
Keywords
Deep Learning, Distributed Computing, Exploratory Jobs, Resource Management, Scheduling
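
The abstract describes an online scheduler that admits jobs by comparing their profit with a resource price. The sketch below illustrates that general idea only; it is not ExplSched's actual algorithm. The exponential price curve, the bounds p_min and p_max, the Job fields, and the class name PriceBasedScheduler are all assumptions made for illustration, since the abstract does not give the price function.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    profit: float  # expected loss reduction (hypothetical units)
    demand: int    # number of GPUs requested (hypothetical resource unit)


class PriceBasedScheduler:
    """Toy price-based online admission; NOT the paper's algorithm."""

    def __init__(self, capacity: int, p_min: float, p_max: float):
        self.capacity = capacity
        self.used = 0
        # Assumed bounds on profit per unit of resource.
        self.p_min = p_min
        self.p_max = p_max

    def price(self) -> float:
        # Assumed exponential curve: price rises from p_min (idle cluster)
        # to p_max (full cluster) as utilization grows.
        utilization = self.used / self.capacity
        return self.p_min * (self.p_max / self.p_min) ** utilization

    def offer(self, job: Job) -> bool:
        # Reject if the job does not fit in the remaining capacity.
        if self.used + job.demand > self.capacity:
            return False
        # Admit only if profit exceeds the current cost of the resources,
        # i.e. the profit-to-consumption ratio beats the unit price.
        if job.profit <= self.price() * job.demand:
            return False
        self.used += job.demand
        return True


if __name__ == "__main__":
    sched = PriceBasedScheduler(capacity=10, p_min=1.0, p_max=100.0)
    for job in [Job("a", 8.0, 4), Job("b", 8.0, 4), Job("c", 40.0, 4)]:
        print(job.name, sched.offer(job), round(sched.price(), 2))
```

Under these assumed parameters, a low-profit job that is admitted on an idle cluster is rejected once utilization has pushed the price up, while a job with a higher profit per GPU is still accepted. Each decision uses only the current price, so it is made immediately on arrival without knowledge of future jobs.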