Optimization of MPI collective operations on the IBM Blue Gene/Q supercomputer

International Journal of High Performance Computing Applications (2014)

Abstract
The Blue Gene/Q (BG/Q) machine is the latest in the line of IBM massively parallel supercomputers, designed to scale to 262,144 nodes and 16 million threads. Each BG/Q node has 68 hardware threads. Hybrid programming paradigms, which use message passing among nodes and multi-threading within nodes, enable applications to achieve high throughput on BG/Q. In this paper, we present scalable algorithms to optimize MPI collective operations by taking advantage of the various features of the BG/Q torus and collective networks. We achieve an 8-byte double-sum MPI_Allreduce latency of 10.25 μs on 1,572,864 MPI ranks. We accelerate the summing of network packets with local buffers by using the Quad Processing SIMD unit in the BG/Q cores and by executing the sums on multiple communication threads supported by the optimized communication libraries. The net gain is a peak throughput of 6.3 GB/s for double-sum allreduce. We also achieve over 90% of network peak for MPI_Alltoall with 65,536 MPI ranks.
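As a rough illustration of what a double-sum allreduce computes, and of how the local summation can be divided among several threads, the following is a minimal Python sketch. It is not the paper's implementation: it uses neither MPI, the BG/Q networks, nor SIMD instructions, and the names `allreduce_sum`, `sum_chunk`, and `num_threads` are hypothetical, introduced here only for illustration. Each simulated rank contributes a buffer of doubles, worker threads each sum one contiguous chunk of the buffer, and every rank receives a copy of the same element-wise total.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_chunk(contributions, start, stop):
    """Element-wise sum of positions start..stop-1 across all contributions."""
    return [sum(buf[i] for buf in contributions) for i in range(start, stop)]


def allreduce_sum(contributions, num_threads=4):
    """Simulate a sum allreduce: return one copy of the total per rank.

    `contributions` holds one equal-length list of floats per simulated rank.
    The thread split is an assumption made for illustration only.
    """
    length = len(contributions[0])
    if any(len(buf) != length for buf in contributions):
        raise ValueError("all contributions must have the same length")

    # Partition the index range into contiguous chunks, one per worker.
    step = max(1, -(-length // num_threads))  # ceiling division
    bounds = [(s, min(s + step, length)) for s in range(0, length, step)]

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        parts = pool.map(lambda b: sum_chunk(contributions, *b), bounds)
        reduced = [x for part in parts for x in part]

    # Every rank ends up with the same reduced buffer.
    return [list(reduced) for _ in contributions]


if __name__ == "__main__":
    ranks = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [100.0, 200.0, 300.0]]
    print(allreduce_sum(ranks, num_threads=2)[0])
```

The sketch only captures the semantics of the operation. In a real MPI program the equivalent call would be `MPI_Allreduce` with `MPI_DOUBLE` and `MPI_SUM`, and the data movement would be handled by the network rather than by shared memory.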
Keywords
Blue Gene/Q, MPI, collective optimization algorithms