Enhancing Collective Communication in MCM Accelerators for Deep Learning Training.

International Symposium on High-Performance Computer Architecture (2024)

Abstract
With the widespread adoption of Deep Learning (DL) models, demand for DL accelerator hardware has risen. At the same time, DL models are growing massive in size. To accommodate such models, the multi-chip module (MCM) has emerged as an effective approach for implementing large-scale DL accelerators. While MCMs have shown promising results for DL inference, their potential for DL training remains largely unexplored. Current approaches fail to fully utilize the available links in the mesh interconnection network of an MCM accelerator. To address this issue, we propose two novel AllReduce algorithms for mesh-based MCM accelerators: RingBiOdd and Three Tree Overlap (TTO). RingBiOdd is a ring-based algorithm that increases AllReduce bandwidth by building two unidirectional rings over the bidirectional interconnects. TTO, in contrast, is a tree-based algorithm that improves AllReduce performance by overlapping data chunks: it constructs three topology-aware disjoint trees and runs different steps of the AllReduce operation in parallel. We present a detailed design and implementation of both approaches. Experimental results over seven DL models show that RingBiOdd achieves 50% and 8% training-time reductions over unidirectional Ring AllReduce and MultiTree, respectively. Furthermore, TTO demonstrates 33% and 29% training-time reductions over the state-of-the-art MultiTree and Bidirectional Ring AllReduce, respectively.
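The paper's exact RingBiOdd construction on a mesh is not reproduced here; the sketch below is a minimal NumPy simulation of the generic bidirectional-ring AllReduce idea it builds on: split each gradient buffer in half and run two unidirectional rings in opposite directions, so both directions of every bidirectional link carry traffic at once. All function names and the simulation setup are illustrative assumptions, not the authors' code.

```python
import numpy as np

def ring_allreduce(bufs):
    """Reduce-scatter + all-gather around a unidirectional ring.

    bufs[i] is node i's local gradient vector; every node ends up with
    the elementwise sum. Each of the 2*(n-1) steps moves one chunk per
    node to its clockwise neighbour.
    """
    n = len(bufs)
    chunks = [np.array_split(b.astype(np.float64), n) for b in bufs]

    # Reduce-scatter: after n-1 steps, node i owns the full sum of chunk (i+1) % n.
    for t in range(n - 1):
        msgs = [(i, (i - t) % n, chunks[i][(i - t) % n].copy()) for i in range(n)]
        for src, c, payload in msgs:  # snapshot sends, then apply, as if in parallel
            chunks[(src + 1) % n][c] += payload

    # All-gather: circulate the reduced chunks until every node holds all of them.
    for t in range(n - 1):
        msgs = [(i, (i + 1 - t) % n, chunks[i][(i + 1 - t) % n].copy()) for i in range(n)]
        for src, c, payload in msgs:
            chunks[(src + 1) % n][c] = payload

    return [np.concatenate(c) for c in chunks]

def bidirectional_ring_allreduce(bufs):
    """Split each buffer in half and run two opposing unidirectional rings.

    Reversing the node list reverses the message direction, so the second
    ring uses the opposite direction of each bidirectional link.
    """
    halves = [np.array_split(b, 2) for b in bufs]
    cw = ring_allreduce([h[0] for h in halves])                # clockwise ring
    ccw = ring_allreduce([h[1] for h in halves][::-1])[::-1]   # counter-clockwise ring
    return [np.concatenate([a, b]) for a, b in zip(cw, ccw)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grads = [rng.standard_normal(12) for _ in range(4)]  # 4 nodes, 12-element gradients
    expected = sum(grads)
    for out in bidirectional_ring_allreduce(grads):
        assert np.allclose(out, expected)
    print("all nodes hold the global sum")
```

With n nodes, each unidirectional ring still takes 2(n-1) steps, but because each ring carries only half the data, the per-link traffic per byte reduced is roughly halved relative to a single unidirectional ring, which is the bandwidth gain the abstract attributes to using both directions of the interconnect.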