Improving Cluster Utilization Through Adaptive Resource Management for Deep Neural Network and CPU Jobs Colocation

IEEE Transactions on Computers (2023)

Abstract
While deep neural network (DNN) models are mainly trained on GPUs, many companies and research institutions build shared GPU clusters. These clusters host DNN training jobs, DNN inference jobs, and CPU jobs (jobs from traditional domains). DNN training jobs require GPUs for the main computation and CPUs for auxiliary computation. Some DNN inference jobs can rely solely on CPUs, while others must utilize both CPUs and GPUs. Our investigation demonstrates that the number of CPU cores allocated to a training job significantly impacts its performance, and that DNN inference jobs can make use of the limited CPU cores on GPU nodes. Building on these observations, we characterize representative deep learning models in terms of the CPU core requirements of their training and inference jobs, and investigate their sensitivity to other CPU-side resource contention. Based on this characterization, we propose SODA, a scheduling system comprising an adaptive CPU allocator, a multi-array job scheduler, a hardware-aware inference job placer, and a real-time contention eliminator. The experimental results indicate that SODA increases GPU utilization by an average of 19.9%, while maintaining the quality-of-service target for all DNN inference jobs and the queuing performance of CPU jobs.
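To make the adaptive CPU allocation idea concrete, the following is a minimal, hypothetical sketch (not the paper's implementation): each training job is granted cores up to an assumed per-job saturation point, in order of assumed marginal speedup per core, while a reserve of cores is kept free for colocated inference jobs. All field names (`speedup_per_core`, `saturation_cores`) are illustrative assumptions, standing in for the sensitivity profiles the paper characterizes.

```python
def allocate_cores(training_jobs, total_cores, reserve_for_inference):
    """Greedily grant each training job CPU cores up to its saturation
    point, keeping a reserve free for colocated inference jobs.

    training_jobs: list of dicts with hypothetical keys
        'name', 'speedup_per_core', 'saturation_cores'.
    Returns a dict mapping job name -> granted core count.
    """
    available = total_cores - reserve_for_inference
    # Serve jobs with the highest marginal benefit per core first
    # (a stand-in for the paper's characterization-driven policy).
    jobs = sorted(training_jobs,
                  key=lambda j: j["speedup_per_core"],
                  reverse=True)
    allocation = {}
    for job in jobs:
        grant = min(job["saturation_cores"], max(available, 0))
        allocation[job["name"]] = grant
        available -= grant
    return allocation


if __name__ == "__main__":
    jobs = [
        {"name": "resnet_train", "speedup_per_core": 1.5, "saturation_cores": 8},
        {"name": "bert_train", "speedup_per_core": 0.5, "saturation_cores": 8},
    ]
    # 16 cores on the node, 4 held back for inference jobs.
    print(allocate_cores(jobs, total_cores=16, reserve_for_inference=4))
```

In this toy run, the more core-sensitive job saturates first (8 cores) and the remaining 4 non-reserved cores go to the second job, leaving 4 cores for inference.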
Keywords
Graphics processing units,Central Processing Unit,Training,Artificial neural networks,Resource management,Schedules,Bandwidth,DNN training,DNN inference,schedule