Preemptive Switch Memory Usage to Accelerate Training Jobs with Shared In-Network Aggregation

2023 IEEE 31st International Conference on Network Protocols (ICNP 2023)

Abstract
Recent works introduce In-Network Aggregation (INA) for distributed training (DT), which moves gradient summation into programmable network switches. INA reduces traffic volume and accelerates communication in DT jobs. However, switch memory is a scarce resource that cannot support the massive number of DT jobs in data centers, and existing INA solutions do not utilize switch memory to the fullest extent. We propose DSA, an efficient Data-plane Switch memory scheduler for in-network Aggregation. DSA introduces preemption into switch memory management for INA jobs. In the data plane, DSA allows high-priority gradient tensors to preempt switch aggregators (the basic computation unit in INA) from low-priority tensors, which prevents aggregators from sitting idle. In the control plane, DSA devises a priority policy that assigns high priority to the gradient tensors that benefit overall job efficiency most, e.g., those of communication-intensive jobs. We prototype DSA, and experiments show that DSA improves average job completion time (JCT) by up to 1.35x compared with baseline solutions.
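The preemptive allocation idea in the abstract can be sketched in simplified form. The sketch below is an illustrative assumption, not the paper's actual data-plane implementation: it models a fixed pool of switch aggregators in which a higher-priority tensor may preempt the aggregator held by the lowest-priority tensor (class and method names such as `AggregatorPool` and `allocate` are hypothetical).

```python
# Hypothetical control-plane model of DSA-style preemptive aggregator
# allocation; all names here are illustrative, not the paper's API.
import heapq


class AggregatorPool:
    """Fixed pool of switch aggregators. A high-priority tensor may
    preempt the aggregator held by the lowest-priority tensor, so
    no aggregator sits idle serving a low-priority job."""

    def __init__(self, num_aggregators):
        self.free = list(range(num_aggregators))
        # Min-heap keyed by priority, so the lowest-priority holder
        # is always the candidate preemption victim.
        self.in_use = []  # entries: (priority, agg_id, tensor_id)

    def allocate(self, tensor_id, priority):
        """Return an aggregator id, or None if the request must wait
        (or fall back to server-side aggregation)."""
        if self.free:
            agg = self.free.pop()
            heapq.heappush(self.in_use, (priority, agg, tensor_id))
            return agg
        # No free aggregator: preempt the lowest-priority holder
        # only if the requester's priority is strictly higher.
        low_prio, agg, _victim = self.in_use[0]
        if priority > low_prio:
            heapq.heappop(self.in_use)
            heapq.heappush(self.in_use, (priority, agg, tensor_id))
            return agg
        return None

    def release(self, agg_id):
        """Return an aggregator to the free pool when its tensor finishes."""
        self.in_use = [e for e in self.in_use if e[1] != agg_id]
        heapq.heapify(self.in_use)
        self.free.append(agg_id)
```

In this toy model, a preempted tensor simply loses its aggregator; in a real INA system the victim's partial aggregation would have to be handled (e.g., redone at the parameter server), which is the kind of data-plane detail the paper addresses.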