Configurable Algorithms for All-to-All Collectives

Ke Fan, Steve Petruzza, Thomas Gilray, Sidharth Kumar

ISC High Performance 2024 Research Paper Proceedings (39th International Conference) (2024)

Abstract
MPI_Alltoall is a commonly used collective that exchanges a fixed-size data block between every pair of processes. The function can be implemented through a logarithmic number of point-to-point communication rounds, where the exact number of rounds and the total data exchanged among processes depend on the log base (radix). This paper presents a mathematical foundation for studying all communication patterns for the all-to-all collective by developing parameterized formulas for total communication rounds and data exchanged. The model is used to narrow the choice down to a radix, $\sqrt{P}$ ($P$: process count), that effectively balances latency and bandwidth concerns, yielding optimal performance, as also confirmed by evaluation on the Theta and Polaris supercomputers at ANL. We also present a novel two-layer tunable-radix algorithm that takes advantage of the shared-memory parallelism offered by modern systems. The algorithm decouples communication rounds into two phases that can be individually optimized to exploit the shared memory and the high-speed interconnect separately. Our approach demonstrates improvements of up to 3.8× on Theta and 4.2× on Polaris over the vendor-optimized MPICH-based implementation of MPI_Alltoall for a fast Fourier transform application.
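The latency/bandwidth trade-off described in the abstract can be illustrated with a small counting sketch. It assumes a generic radix-r Bruck-style schedule, in which each block is forwarded once for every nonzero base-r digit of its destination offset; this is a common textbook formulation and not necessarily the exact parameterized formulas derived in the paper. The function name `radix_schedule_cost` is hypothetical.

```python
def radix_schedule_cost(P, r):
    """Count rounds and blocks sent per process for a radix-r,
    Bruck-style all-to-all over P processes (illustrative assumption,
    not the paper's model)."""
    if P < 2 or r < 2:
        raise ValueError("need P >= 2 and r >= 2")

    # Number of base-r digits needed to write offsets 0 .. P-1.
    digits, span = 0, 1
    while span < P:
        span *= r
        digits += 1

    rounds = 0   # communication rounds (latency term)
    blocks = 0   # data blocks sent by one process (bandwidth term)
    weight = 1   # r ** x for the current digit position x
    for _ in range(digits):
        for z in range(1, r):
            # Blocks whose offset has digit value z at this position
            # are sent together in one round.
            n = sum(1 for i in range(1, P) if (i // weight) % r == z)
            if n > 0:
                rounds += 1
                blocks += n
        weight *= r
    return rounds, blocks


if __name__ == "__main__":
    P = 16
    for r in (2, 4, 16):
        print(r, radix_schedule_cost(P, r))
```

Under these assumptions, for P = 16 the radix-2 schedule uses the fewest rounds but forwards the most blocks, radix 16 sends each block exactly once but needs P - 1 rounds, and radix 4 (that is, $\sqrt{P}$) sits between the two.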
Keywords
Fast Fourier Transform, Data Exchange, Modern Systems, Communication Patterns, Log Base, Argonne National Laboratory, Pair Of Processes, Communication Rounds, Shared Memory, Point-to-point Communication, Global Data, Red Circles, Discrete Fourier Transform, Performance In Cases, Short Message, Communication Cost, Data Packets, Location Shift, Local Phase, Global Communication, All-to-all Communication, Phase Rotation, Message Size