HPH: Hybrid Parallelism on Heterogeneous Clusters for Accelerating Large-scale DNNs Training

2022 IEEE International Conference on Cluster Computing (CLUSTER), 2022

Abstract
As deep learning models grow larger, training a model on a single computational resource becomes impractical. To address this, hybrid parallelism, which combines data and pipeline parallelism, has emerged to train large models across multiple GPUs. In practice, training large models on heterogeneous GPU clusters is a common need, since hardware is often upgraded only in part. However, existing hybrid parallelism approaches for heterogeneous environments fall short in communication efficiency, workload balance among GPUs, and utilization of memory-constrained GPUs. To address these problems, we present a parallel DNN training approach, Hybrid Parallelism on Heterogeneous clusters (HPH). In HPH, we propose a topology designer that minimizes communication time. Furthermore, HPH uses a partition algorithm that automatically partitions DNN layers among workers to maximize throughput. In addition, HPH adopts recomputation-aware scheduling to reduce memory consumption and further reschedules the pipeline to eliminate the extra time overhead of recomputation. Our experimental results on a 32-GPU heterogeneous cluster show that HPH achieves up to a 1.42x training speed-up compared with the state-of-the-art approach.
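To illustrate the kind of layer partitioning the abstract refers to, the sketch below shows a minimal, hypothetical heuristic that splits a model's layers into contiguous pipeline stages in proportion to each GPU's relative speed. This is not the paper's partition algorithm; the function and parameter names (partition_layers, layer_costs, gpu_speeds) and the cost model are illustrative assumptions only.

```python
# Hypothetical sketch (not HPH's algorithm): split per-layer compute costs into
# contiguous pipeline stages so that per-stage work, scaled by each GPU's
# relative speed, is roughly balanced across a heterogeneous cluster.

def partition_layers(layer_costs, gpu_speeds):
    """Assign contiguous layer ranges to GPUs in proportion to their speed.

    layer_costs: per-layer compute costs (arbitrary units).
    gpu_speeds:  relative GPU throughputs, e.g. [2.0, 1.0] means the first
                 GPU is twice as fast as the second.
    """
    total_cost = sum(layer_costs)
    total_speed = sum(gpu_speeds)
    stages, start = [], 0
    for i, speed in enumerate(gpu_speeds):
        # Target share of total work for this GPU, proportional to its speed.
        target = total_cost * speed / total_speed
        end, acc = start, 0.0
        if i == len(gpu_speeds) - 1:
            end = len(layer_costs)  # last GPU takes whatever layers remain
        else:
            while end < len(layer_costs):
                cost = layer_costs[end]
                # Stop once adding the next layer would overshoot the target
                # by more than the current deficit.
                if acc + cost - target > target - acc:
                    break
                acc += cost
                end += 1
        stages.append((start, end))
        start = end
    return stages

if __name__ == "__main__":
    # Hypothetical example: 8 layers, two fast GPUs and two slower ones.
    costs = [4, 4, 3, 3, 2, 2, 1, 1]
    speeds = [2.0, 2.0, 1.0, 1.0]
    print(partition_layers(costs, speeds))  # e.g. [(0, 2), (2, 4), (4, 6), (6, 8)]
```

In this toy example the faster GPUs receive heavier stages, so the speed-normalized stage times end up roughly equal; the paper's actual partitioner additionally accounts for communication cost and memory constraints, which this sketch omits.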
Keywords
heterogeneous training,deep learning,hybrid parallelism