Accelerating communication with multi-HCA aware collectives in MPI

Concurrency and Computation: Practice and Experience (2024)

Abstract
To accelerate communication between nodes, supercomputers are now equipped with multiple network adapters per node, also referred to as HCAs (Host Channel Adapters), resulting in a "multi-rail"/"multi-HCA" network. For example, the ThetaGPU system at Argonne National Laboratory (ANL) has eight adapters per node; with this many networking resources available, utilizing all of them becomes non-trivial. The Message Passing Interface (MPI) is a dominant programming model for high-performance computing clusters, yet not all MPI collectives utilize all available resources, and this shortfall becomes more apparent as bandwidth and adapter counts grow in a given cluster. In this work, we provide a thorough performance analysis of existing multi-rail solutions and their implications for collectives, and show the need for further enhancement. Specifically, we propose novel designs for a hierarchical, multi-HCA-aware Allgather. The proposed designs fully utilize all the available network adapters within a node and provide high overlap between inter-node and intra-node communication. At the micro-benchmark level, our designs improve inter-node performance by up to 62% and 61% over HPC-X and MVAPICH2-X, respectively, at 1024 processes. Because Allgather is a building block of Ring-Allreduce, our designs also improve its performance by 56% and 44% compared to HPC-X and MVAPICH2-X, respectively. At the application level, our enhanced Allgather shows 1.98x and 1.42x improvement in a matrix-vector multiplication kernel when compared to HPC-X and MVAPICH2-X, and Allreduce performs up to 7.83% better than MVAPICH2-X in deep-learning training.
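For context, the hierarchical pattern referenced in the abstract (intra-node gather to a leader, inter-node exchange among leaders, intra-node broadcast) can be sketched with standard MPI calls. The sketch below is only an illustration of that general pattern, not the paper's implementation: it assumes a single leader per node, an equal number of processes per node, and node-blocked rank ordering, whereas the proposed designs employ multiple leaders bound to distinct HCAs and overlap the inter-node and intra-node phases.

/* Illustrative hierarchical Allgather sketch (not the paper's design).
 * Assumptions: one leader per node, equal processes per node, and global
 * ranks laid out node by node so the leader-level Allgather preserves the
 * global ordering expected in recvbuf. */
#include <mpi.h>
#include <stdlib.h>

void hier_allgather(const void *sendbuf, int count, MPI_Datatype dtype,
                    void *recvbuf, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Per-node communicator via the shared-memory split type. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL,
                        &node_comm);

    int node_rank, node_size;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);

    /* One leader per node here; a multi-HCA design would elect one leader
     * per adapter and bind each to a different HCA. */
    int is_leader = (node_rank == 0);
    MPI_Comm leader_comm;
    MPI_Comm_split(comm, is_leader ? 0 : MPI_UNDEFINED, rank, &leader_comm);

    /* Step 1: gather this node's contributions onto its leader. */
    int typesize;
    MPI_Type_size(dtype, &typesize);
    void *node_buf = is_leader
        ? malloc((size_t)count * typesize * node_size) : NULL;
    MPI_Gather(sendbuf, count, dtype, node_buf, count, dtype, 0, node_comm);

    /* Step 2: inter-node Allgather among leaders; the paper's designs stripe
     * this traffic across all HCAs and overlap it with intra-node work. */
    if (is_leader) {
        MPI_Allgather(node_buf, count * node_size, dtype,
                      recvbuf, count * node_size, dtype, leader_comm);
        MPI_Comm_free(&leader_comm);
        free(node_buf);
    }

    /* Step 3: broadcast the fully assembled buffer within each node. */
    MPI_Bcast(recvbuf, count * size, dtype, 0, node_comm);
    MPI_Comm_free(&node_comm);
}

A single-leader scheme like this funnels all inter-node traffic through one process, and hence one HCA, per node; the multi-HCA-aware designs evaluated in the paper avoid exactly this bottleneck by spreading the inter-node phase across all available adapters.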
Keywords
Allgather, Allreduce, collectives, HCA-aware, MPI, network-aware