SCHED2 : Scheduling Deep Learning Training via Deep Reinforcement Learning

IEEE Global Communications Conference (2019)

Abstract
Today's companies and organizations build GPU clusters for efficient deep learning training (DLT). However, the inherent heterogeneity of DLT workloads makes it challenging to schedule the GPUs efficiently. On one hand, DLT jobs typically exhibit diverse performance sensitivity to GPU locality; the scheduler should allocate GPUs with an appropriate degree of locality for better performance and utilization. On the other hand, DLT jobs are also diverse in size and duration, which can lead to severe cluster fragmentation and reduced chances of finding GPUs with good locality. In this paper, we present SCHED2, a GPU cluster scheduler that leverages deep reinforcement learning (DRL) to perform smart locality-aware scheduling of DLT jobs. This is achieved by a novel design that captures both jobs' locality-sensitivity and the cluster's fragmentation condition throughout the learning stack, i.e., from the job and cluster state definitions to the neural network architecture. Through this awareness, the DRL model can adjust its scheduling decisions dynamically and adaptively, reacting to individual jobs' differing locality-sensitivity and changing cluster fragmentation levels. Experiments using realistic workloads demonstrate that SCHED2 reduces average JCT by 4.6x and makespan by 2.1x, compared to heuristic-based schedulers.
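To make the two tensions in the abstract concrete, the toy sketch below simulates a small cluster and a simple locality-aware placement rule. It is not SCHED2's actual state encoding or trained DRL policy; the fragmentation metric, the uniform node capacity, and the greedy allocation heuristic are all illustrative assumptions.

```python
# Toy illustration only: NOT SCHED2's real state encoding or DRL policy.
# Cluster state is a list of free-GPU counts per node; node size is assumed
# uniform (an assumption made for this sketch).
NODE_CAPACITY = 4

def fragmentation(free):
    """Toy metric: fraction of free GPUs stranded on partially used nodes."""
    total = sum(free)
    if total == 0:
        return 0.0
    stranded = sum(f for f in free if 0 < f < NODE_CAPACITY)
    return stranded / total

def allocate(free, k, locality_weight):
    """Greedily place a k-GPU job.

    A locality-sensitive job (locality_weight > 0.5) is packed onto as few
    nodes as possible; an insensitive job is steered onto fragmented nodes
    first, reducing stranded GPUs. Returns {node_index: gpus_taken} or None
    if the request cannot be satisfied.
    """
    if locality_weight > 0.5:
        # Fewest nodes: visit the emptiest (most-free) nodes first.
        order = sorted(range(len(free)), key=lambda i: -free[i])
    else:
        # Defragment: fill the smallest non-empty holes first.
        order = sorted(range(len(free)),
                       key=lambda i: free[i] if free[i] > 0 else NODE_CAPACITY + 1)
    plan, need = {}, k
    for i in order:
        if need == 0:
            break
        take = min(free[i], need)
        if take:
            plan[i] = take
            need -= take
    return plan if need == 0 else None
```

A DRL scheduler replaces the fixed `locality_weight > 0.5` rule with a learned policy that conditions on both the per-job sensitivity and the cluster-wide fragmentation signal, which is what lets it trade the two off adaptively rather than by a hand-set threshold.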
Keywords
deep learning training, DLT workloads, DLT jobs, diverse performance sensitivity, GPU locality, severe cluster fragmentation, GPU cluster scheduler, learning stack, cluster state definitions, scheduling decisions, locality-sensitivity, cluster fragmentation level, smart locality-aware scheduling, deep reinforcement learning