Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)™ Streaming-Aggregation Hardware Design and Evaluation

Richard L. Graham, Lion Levi, Devendar Bureddy, Gil Bloch, Gilad Shainer, David Cho, George Elias, Daniel Klein, Joshua Ladd, Ophir Maor, Ami Marelli, Valentin Petrov, Evyatar Romlet, Yong Qin, Ido Zemah

ISC (2020)

Abstract
This paper describes the new hardware-based streaming-aggregation capability added to Mellanox's Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) in its HDR InfiniBand switches. For large messages, this capability is designed to achieve reduction bandwidths similar to those of point-to-point messages of the same size. It complements the existing low-latency aggregation capability, which targets small data reductions. MPI_Allreduce() bandwidth measured on an HDR InfiniBand-based system reaches about 95% of network bandwidth. For medium and large data reductions, streaming aggregation also improves reduction bandwidth by a factor of 2 to 5 relative to host-based (i.e., software-based) reduction algorithms. Using this capability increased DL-Poly and PyTorch application performance by as much as 4% and 18%, respectively. The paper presents the SHARP Streaming-Aggregation hardware architecture and a set of synthetic and application benchmarks used to study this new reduction capability, and identifies the range of data sizes for which Streaming-Aggregation performs better than the low-latency aggregation algorithm.
Keywords
scalable hierarchical aggregation, reduction protocol, SHARP, streaming-aggregation