RIFLING: A reinforcement learning-based GPU scheduler for deep learning research and development platforms

Software: Practice and Experience (2022)

Abstract
GPU platforms have been widely adopted in both academia and industry to support deep learning (DL) research and development (R&D). Whereas large companies favor custom-designed AI platforms, most small and medium-sized enterprises, institutes, and universities (EIUs) prefer to build or rent a cost-effective GPU cluster, usually of limited scale, to process diverse DL R&D workloads. Consequently, DL scheduling has attracted growing attention as a way to improve system efficiency and task performance. However, prior prediction-based schedulers are limited by their prediction accuracy and profiling overhead. In this article, we propose RIFLING, a reinforcement learning (RL)-based online GPU scheduler that models the scheduling problem as an online decision-making process. Scheduling decisions are made with Q-learning, a typical RL method. RIFLING achieves high scheduling efficiency by exploring and exploiting diverse scheduling strategies online for various DL workloads, without the need for expensive offline profiling or sophisticated prediction models. We implement RIFLING as a plugin for TensorFlow and deploy it on a distributed GPU cluster. Experiments demonstrate that, without any manual intervention, RIFLING achieves up to a 47.8% reduction in makespan and a 19.6% improvement in average normalized processing rate compared with the best available baseline.
Keywords
cost-effective, deep learning platform, Q-learning, reinforcement learning, scheduling
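
The abstract says scheduling decisions are made with Q-learning, exploring and exploiting strategies online. As a rough illustration of that general technique only, the sketch below shows a tabular Q-learning agent with epsilon-greedy action selection. It is not RIFLING's implementation: the class name, the state and action labels (such as "gpu0"), the reward, and all hyperparameter values are assumptions made for this example, since the abstract does not specify them.

```python
import random
from collections import defaultdict


class QLearningAgent:
    """Tabular Q-learning with epsilon-greedy selection (illustrative only)."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
        self.actions = list(actions)   # hypothetical scheduling choices
        self.alpha = alpha             # learning rate (assumed value)
        self.gamma = gamma             # discount factor (assumed value)
        self.epsilon = epsilon         # exploration probability (assumed value)
        self.q = defaultdict(float)    # (state, action) -> estimated value
        self.rng = random.Random(seed)

    def choose(self, state):
        # Explore with probability epsilon, otherwise exploit the best-known action.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        # Standard Q-learning update:
        # Q(s, a) += alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])


# Hypothetical usage: the state is a coarse label for the cluster/job situation,
# and the action is which GPU receives the next task.
agent = QLearningAgent(["gpu0", "gpu1"], alpha=0.5, gamma=0.9, epsilon=0.0)
agent.update("small_job", "gpu1", reward=1.0, next_state="idle")
chosen = agent.choose("small_job")
```

In an online scheduler of this kind, the reward would presumably be derived from observed task performance after each placement, which is what lets the agent avoid offline profiling; how RIFLING actually defines states, actions, and rewards is not stated in the abstract.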