A Topology-Aware Performance Prediction Model for Distributed Deep Learning on GPU Clusters

2020 IEEE International Conference on Big Data (Big Data)

Abstract
Today, multi-GPU training has become common practice for deep learning workloads. The performance of a training job can be affected significantly by both the GPU connectivity in the system topology and the computation-communication pattern of the job, which makes awareness of jobs' performance characteristics essential for cluster schedulers aiming to improve both job and cluster efficiency. In this paper, we propose an online resource-performance model for deep learning training jobs on GPU clusters. The model estimates the training speed of a specific job as a function of any given resource setting (i.e., the number and locality of GPUs). It is based on systematic modeling of the system topology and the communication patterns of individual jobs, with online fitting on a sample set of profiled performance data. Experiments show that our performance model achieves 94% prediction accuracy on average (up to 99.9%). Additionally, a large-scale simulation on a real production trace demonstrates that our model helps a typical scheduling algorithm decrease average job completion time by 3.4x and makespan by 1.7x.
Keywords
systematic modeling, system topology, communication patterns, individual jobs, profiled performance data, topology-aware performance prediction model, distributed deep learning, GPU clusters, multi-GPU training, deep learning workloads, training job, GPU connectivity, computation-communication pattern, cluster schedulers, cluster efficiency, online resource-performance model, deep learning training jobs, training speed, given resource setting, specific job, typical scheduling algorithm
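The abstract does not give the model's functional form, so the following Python sketch only illustrates the general idea of online fitting of a topology-aware performance model: a hypothetical per-iteration time with a compute term plus a locality-weighted communication term, fitted to a handful of profiled samples. The function iter_time, the locality penalty, and all numbers are illustrative assumptions, not the paper's actual model.

import numpy as np
from scipy.optimize import curve_fit

def iter_time(x, t_comp, t_comm):
    # Hypothetical per-iteration time for a data-parallel training job.
    # x = (n_gpus, penalty): penalty >= 1 scales the communication cost for
    # worse placements (e.g., cross-socket or cross-node links).
    n_gpus, penalty = x
    compute = t_comp / n_gpus                        # compute shrinks with more GPUs
    comm = t_comm * penalty * (n_gpus - 1) / n_gpus  # ring-allreduce-style cost
    return compute + comm

# Profiled samples (made up for illustration): resource setting -> seconds/iter.
n_gpus  = np.array([1, 2, 4, 4, 8], dtype=float)
penalty = np.array([1.0, 1.0, 1.0, 2.0, 2.5])
seconds = np.array([1.00, 0.58, 0.37, 0.45, 0.41])

# Online fitting step: refit the parameters as new profiled samples arrive.
params, _ = curve_fit(iter_time, (n_gpus, penalty), seconds, p0=[1.0, 0.2])

# Predict training speed (iterations/sec) for an unseen resource setting.
speed = 1.0 / iter_time((np.array([8.0]), np.array([1.0])), *params)
print(f"predicted speed on 8 well-placed GPUs: {speed[0]:.2f} it/s")

A scheduler could then query such a fitted model for every candidate GPU placement of a job and pick the one with the highest predicted training speed, which is how the abstract's scheduling simulation uses the model.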