Adaptive and Hierarchical Large Message All-to-all Communication Algorithms for Large-scale Dense GPU Systems

2021 IEEE/ACM 21st International Symposium on Cluster, Cloud and Internet Computing (CCGrid), 2021

Abstract
In recent years, GPU-enhanced clusters have become more prevalent in High-Performance Computing (HPC), leading to a demand for more efficient multi-GPU communication. This makes it increasingly important to explore performance enhancements that can be attained through communication middleware such as MPI, in order to fully take advantage of the GPUs available on these systems. In this paper, we propose locality-aware and adaptive schemes for hierarchical All-to-all collective communication on large-scale dense GPU systems. The proposed algorithms exploit the high bandwidth of the NVLink interconnect between GPUs to hide communication latency. We focus on personalized and non-personalized all-to-all collective communication. These operations are components of modern scientific computing applications that rely on matrix transposes and three-dimensional Fast Fourier Transforms (FFT), and they are becoming increasingly relevant for Deep Learning workloads that use model and hybrid parallelism. A performance evaluation with an application kernel performing a three-dimensional FFT indicates that the proposed schemes for personalized all-to-all can reduce execution time by 15-25% on 256 GPUs on the Lassen system. We also demonstrate approximately 8% improvement in training time for distributed K-FAC, used in Deep Learning training, on up to 128 GPUs, as well as approximately 22% and 30% improvements in the performance of non-personalized and personalized all-to-all benchmarks, respectively, compared to state-of-the-art MPI libraries on the Summit and Lassen systems.
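As a point of reference for the personalized all-to-all pattern the abstract describes, the sketch below shows a minimal CUDA-aware MPI_Alltoall micro-benchmark on GPU device buffers. It is not the authors' implementation: the message size, buffer names, and timing are illustrative assumptions, and it assumes an MPI library that accepts GPU device pointers directly (CUDA-aware MPI).

```c
/* Minimal sketch: personalized all-to-all (MPI_Alltoall) on GPU buffers.
 * Assumes a CUDA-aware MPI library; MSG_COUNT is an illustrative message size. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

#define MSG_COUNT (1 << 20)   /* floats exchanged with each peer (assumed) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* One contiguous block per destination rank, allocated on the GPU. */
    float *d_send, *d_recv;
    size_t bytes = (size_t)size * MSG_COUNT * sizeof(float);
    cudaMalloc((void **)&d_send, bytes);
    cudaMalloc((void **)&d_recv, bytes);
    cudaMemset(d_send, 0, bytes);

    /* Personalized all-to-all: rank i sends block j of d_send to rank j. */
    double t0 = MPI_Wtime();
    MPI_Alltoall(d_send, MSG_COUNT, MPI_FLOAT,
                 d_recv, MSG_COUNT, MPI_FLOAT, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("Alltoall of %d floats per pair took %f s\n", MSG_COUNT, t1 - t0);

    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}
```

A hierarchical scheme of the kind proposed in the paper would replace the single flat MPI_Alltoall call with intra-node exchanges over NVLink followed by inter-node exchanges, but the buffer layout and semantics seen by the application remain those of the call above.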
Keywords
Allgather,All-to-all,GPU,MPI,NVLink