Modoru: Clos nanosecond optical switching for distributed deep training [Invited]

Journal of Optical Communications and Networking (2024)

Abstract
Distributed deep training has become a significant consumer of bandwidth across datacenter-scale networks. The diverse parallel strategies employed in deep training require different communication patterns, so the network topology must be adapted periodically. Because electrical switching is approaching its capacity limit at high bandwidths and has difficulty adapting topologies (i.e., its logical and physical topologies are isomorphic), optical switching has become an attractive option to address these bottlenecks. In this paper, we propose Modoru, a wavelength- and datarate-agnostic Clos architecture with a switching speed of O(10 ns). Modoru is a drop-in replacement solution that has no constraints on achieving a high radix. To verify its topological flexibility, we also develop topology-as-a-service, which provisions sequentially dynamic topologies for training jobs and guarantees high topology availability over the entire network. Large-scale simulations show a basic 7.9x acceleration in deep training jobs using Modoru. Additionally, experiments on the Modoru prototype demonstrate acceleration of deep training jobs through the provisioning of adaptive topologies. © 2023 Optica Publishing Group
Keywords
Optical switches, Topology, Switches, Training, Optical network units, Network topology, Scalability