PipePar: Enabling fast DNN pipeline parallel training in heterogeneous GPU clusters

Neurocomputing (2023)

Abstract
Recently, pipeline parallelism for large-scale Deep Neural Network (DNN) training has been developed, which partitions the DNN model across multiple devices (e.g., GPUs) and improves training efficiency by processing minibatches of data as a pipeline. However, existing model partitioning algorithms are mostly designed for homogeneous clusters with identical GPU devices and network connections (e.g., bandwidths), while heterogeneous GPU clusters are widely used in mainstream computing infrastructures. In heterogeneous environments, devices are equipped with different GPUs and network connections, and the efficiency of previous approaches suffers because the load across pipeline stages becomes unbalanced. In this paper, we propose PipePar, a model partitioning and task placement algorithm for pipeline-parallel DNN training in heterogeneous GPU clusters. PipePar is based on dynamic programming with search-space pruning and takes into account both the heterogeneity of GPUs and network bandwidth. PipePar profiles the DNN model for each type of GPU and conducts model partitioning and task placement for the given GPUs and network connections, which optimizes pipeline load balancing in heterogeneous environments and thus improves training efficiency. We design and implement a pipeline-based distributed deep learning training system on a heterogeneous GPU cluster and show through extensive experiments that PipePar outperforms the baseline approaches in the speed of large-scale DNN training.
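
To make the idea concrete, the following is a minimal illustrative sketch, not the paper's published algorithm: a dynamic-programming partitioner that splits a profiled layer sequence into contiguous pipeline stages on a list of heterogeneous GPUs and minimizes the slowest stage's compute-plus-communication time (the pipeline bottleneck). All names and numbers below (layer_ms, act_mb, gpus, link_mb_per_ms, the GPU types and timings) are hypothetical placeholders chosen for the example, and the search-space pruning mentioned in the abstract is omitted for brevity.

from functools import lru_cache

# Hypothetical per-GPU profiles (the abstract states PipePar profiles the DNN
# for each GPU type): layer_ms[g][i] is the forward+backward time of layer i
# on GPU type g, in milliseconds.
layer_ms = {
    "V100": [4.0, 6.0, 5.0, 3.0, 2.0],
    "K80":  [9.0, 14.0, 11.0, 7.0, 5.0],
}
act_mb = [32.0, 64.0, 48.0, 16.0, 0.0]   # activation volume (MB) sent after each layer
gpus = ["V100", "K80", "K80"]            # GPU type of each pipeline stage, in order
link_mb_per_ms = [8.0, 4.0]              # bandwidth of the link from stage s to stage s+1

L, S = len(act_mb), len(gpus)

@lru_cache(maxsize=None)
def best_bottleneck(i: int, s: int) -> float:
    """Minimal achievable bottleneck (slowest-stage) time when layers i..L-1
    are assigned to stages s..S-1, each stage taking a contiguous block."""
    gpu = gpus[s]
    if s == S - 1:                        # last stage takes all remaining layers
        return sum(layer_ms[gpu][i:])
    best = float("inf")
    compute = 0.0
    # Stage s takes layers i..j; leave at least one layer for each later stage.
    for j in range(i, L - (S - 1 - s)):
        compute += layer_ms[gpu][j]
        comm = act_mb[j] / link_mb_per_ms[s]   # time to ship activations downstream
        stage_time = compute + comm
        best = min(best, max(stage_time, best_bottleneck(j + 1, s + 1)))
    return best

print(f"estimated bottleneck stage time: {best_bottleneck(0, 0):.1f} ms")

Because slower GPUs and slower links inflate a stage's time, the recursion naturally assigns fewer layers to weaker devices, which is the load-balancing effect the abstract describes.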
Keywords
Distributed deep learning, Heterogeneous GPU cluster, Pipeline parallelism