Exploring the All-to-All Collective Optimization Space with ConnectX CORE-Direct

Parallel Processing (2012)

Citations: 9 | Views: 0
Abstract
The all-to-all collective communication operation is used by many scientific applications and is one of the most time-consuming and challenging collective operations to optimize. Algorithms for all-to-all operations typically fall into two classes, logarithmic-scaling and linear-scaling algorithms, with Bruck's algorithm, a logarithmic-scaling algorithm, used in many small-data all-to-all implementations. The recent addition of InfiniBand CORE-Direct support for network management of collective communications offers new opportunities for optimizing the all-to-all operation, as well as for supporting truly asynchronous implementations of these operations. This paper presents several new enhancements to the Bruck small-data algorithm that leverage CORE-Direct and other InfiniBand network capabilities to produce efficient implementations of this collective operation. These include the RDMA, SR-RNR, and SR-RTR algorithms. Nonblocking implementations of these collective operations are also presented. Benchmark results show that the RDMA algorithm, which uses CORE-Direct capabilities to offload collective communication management to the Host Channel Adapter (HCA), hardware gather support for sending non-contiguous data, and low-latency RDMA semantics, performs best. For a 64-process, 128-byte-per-process all-to-all, the RDMA algorithm performs 27% better than the implementation of Bruck's algorithm in Open MPI and 136% better than the SR-RTR algorithm. In addition, the nonblocking versions of these algorithms have the same performance characteristics as the blocking algorithms. Finally, measurements of computation/communication overlap capacity show that all offloaded algorithms achieve about 98% overlap for large-data all-to-all, whereas implementations using host-based progress achieve only about 9.5% overlap.
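For context, the classic Bruck small-data all-to-all that the paper's RDMA, SR-RNR, and SR-RTR variants build on can be sketched as follows. This is a minimal host-based sketch using plain MPI point-to-point calls, not the paper's CORE-Direct offloaded implementation; the function name bruck_alltoall, the byte-granular block size, and the packing strategy are illustrative assumptions.

/*
 * Minimal host-based sketch of Bruck's logarithmic small-data all-to-all,
 * the baseline algorithm the paper enhances.  This is NOT the paper's
 * CORE-Direct / RDMA offloaded implementation: blocks are packed on the
 * CPU and communication is progressed from the host.
 * bsize is the per-destination block size in bytes; sendbuf/recvbuf hold
 * one block per rank.
 */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

static void bruck_alltoall(const char *sendbuf, char *recvbuf,
                           size_t bsize, MPI_Comm comm)
{
    int p, me;
    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &me);

    char *tmp  = malloc((size_t)p * bsize);   /* working buffer        */
    char *pack = malloc((size_t)p * bsize);   /* packed blocks to send */

    /* Phase 1: local rotation -- the block destined for rank (me+j)%p
     * is placed at position j of the working buffer. */
    for (int j = 0; j < p; j++)
        memcpy(tmp + (size_t)j * bsize,
               sendbuf + (size_t)((me + j) % p) * bsize, bsize);

    /* Phase 2: ceil(log2 p) exchange steps.  In the step with distance
     * 2^k, every block whose position index has bit k set is sent to
     * rank (me + 2^k) % p, and the matching blocks are received from
     * rank (me - 2^k) % p into the same positions. */
    for (int dist = 1; dist < p; dist <<= 1) {
        int to   = (me + dist) % p;
        int from = (me - dist + p) % p;

        int n = 0;
        for (int j = 0; j < p; j++)
            if (j & dist) {
                memcpy(pack + (size_t)n * bsize,
                       tmp + (size_t)j * bsize, bsize);
                n++;
            }

        /* The set of selected positions is the same on every rank, so
         * the send and receive counts match. */
        MPI_Sendrecv_replace(pack, (int)(n * bsize), MPI_BYTE,
                             to, 0, from, 0, comm, MPI_STATUS_IGNORE);

        n = 0;
        for (int j = 0; j < p; j++)
            if (j & dist) {
                memcpy(tmp + (size_t)j * bsize,
                       pack + (size_t)n * bsize, bsize);
                n++;
            }
    }

    /* Phase 3: the block now at position j originated at rank (me-j)%p;
     * place it in that rank's slot of the receive buffer. */
    for (int j = 0; j < p; j++)
        memcpy(recvbuf + (size_t)((me - j + p) % p) * bsize,
               tmp + (size_t)j * bsize, bsize);

    free(pack);
    free(tmp);
}

In the offloaded variants the paper describes, the phase-2 packing and send/receive sequence is instead handed to the HCA as a managed task list, with hardware gather used to transmit the non-contiguous blocks directly, so the host CPU is free to compute while the exchange progresses; this is the mechanism behind the roughly 98% overlap reported for the offloaded algorithms.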
Keywords
computer network management,computer network performance evaluation,optimisation,Bruck small-data algorithm,CORE-Direct capabilities,ConnectX CORE-direct,HCA,InfiniBand CORE-Direct support,InfiniBand network capabilities,Open MPI,RDMA algorithm,all-to-all collective communication operation,all-to-all collective optimization space,asynchronous implementations,blocking algorithms,collective communication management,computation-communication overlap capacity,host channel adapter,host-based progress,large data all-to-all,linear scaling algorithms,logarithmic scaling algorithms,low-latency RDMA semantics,network management,nonblocking algorithms,noncontinuous data,offloaded algorithms,performance characteristics,small data all-to-all implementations,Alltoall,Collective Operations,Communication,ConnectX Core-Direct,High Performance Computing,InfiniBand,MPI