Poster Abstract: Deep Learning Workloads Scheduling With Reinforcement Learning On GPU Clusters

Zhaoyun Chen, Lei Luo, Wei Quan, Mei Wen, Chunyuan Zhang

IEEE Conference on Computer Communications Workshops (IEEE INFOCOM 2019 WKSHPS) (2019)

Abstract

With the recent widespread adoption of deep learning (DL) in academia and industry, DL platforms that support the research and development (R&D) of AI firms, institutes and universities are attracting more attention. For off-the-shelf distributed GPU clusters, prior work proposes prediction-based schedulers to allocate resources for diverse DL workloads. However, prediction-based schedulers suffer from limited prediction accuracy and high offline-profiling costs. In this paper, we propose a learning-based scheduler that models the scheduling problem as a reinforcement learning problem, with the goals of minimizing average job completion time and maximizing system utilization. The scheduler design comprises the state space, action space, reward function and update scheme. Furthermore, we will evaluate our proposed scheduler, implemented as a TensorFlow plugin, on a real cluster and in large-scale simulation.
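To make the four design elements named in the abstract concrete, the following is a minimal illustrative sketch and not the authors' scheduler. It assumes a toy setting with a single GPU that runs one job at a time and job durations known in advance; the state is the set of pending jobs, the action is the next job to run, the reward is the negative completion time of that job, and the update scheme is plain tabular Q-learning. The function names `train_scheduler` and `greedy_schedule` and all hyperparameters are hypothetical.

```python
import random


def train_scheduler(durations, episodes=2000, alpha=0.5, epsilon=0.2, seed=0):
    """Tabular Q-learning on a toy single-GPU scheduling problem (assumed setup)."""
    rng = random.Random(seed)
    total = sum(durations)
    q = {}  # (state, action) -> value; state is a frozenset of pending job ids

    def elapsed(state):
        # Time already spent equals the total duration of the finished jobs.
        return total - sum(durations[j] for j in state)

    for _ in range(episodes):
        state = frozenset(range(len(durations)))
        while state:
            actions = sorted(state)
            # Epsilon-greedy action selection over the pending jobs.
            if rng.random() < epsilon:
                action = rng.choice(actions)
            else:
                action = max(actions, key=lambda a: q.get((state, a), 0.0))
            # Reward: negative completion time of the chosen job, so the
            # episode return is minus the sum of job completion times.
            reward = -(elapsed(state) + durations[action])
            nxt = state - {action}
            best_next = max((q.get((nxt, a), 0.0) for a in nxt), default=0.0)
            old = q.get((state, action), 0.0)
            # Undiscounted Q-learning update.
            q[(state, action)] = old + alpha * (reward + best_next - old)
            state = nxt
    return q


def greedy_schedule(q, durations):
    """Run the learned policy greedily and return the resulting job order."""
    state = frozenset(range(len(durations)))
    order = []
    while state:
        action = max(sorted(state), key=lambda a: q.get((state, a), 0.0))
        order.append(action)
        state = state - {action}
    return order
```

Calling `greedy_schedule(train_scheduler([3, 1, 2]), [3, 1, 2])` should recover a shortest-job-first order in this toy setting, since that order minimizes average job completion time on a single resource. A real cluster scheduler would need a far richer state (for example GPU topology and job progress) and a function approximator in place of the table.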

Keywords

DL platform, Reinforcement Learning, Scheduling, GPU clusters
