Enabling Switch Memory Management for Distributed Training with In-Network Aggregation.
Distributed training (DT) in shared clusters usually deploys a scheduler for resource allocation to multiple concurrent jobs. Meanwhile, a recent acceleration primitive, In-Network Aggregation (INA), introduces switch memory as a new critical resource for DT jobs, out of the prior scheduler's management. Lacking switch memory management leads to inefficient cluster resource usage. We build INAlloc, a switch memory management system for DT job schedulers to improve INA-empowered DT jobs in shared clusters. INAlloc adds a switch memory management layer to organize the physical switch memory, allocate memory to jobs, and provide friendly interfaces to schedulers. INAlloc incorporates switch memory into modeling a job's completion time (JCT) and its resources, which assists the scheduler in deciding the switch memory allocation. INAlloc overcomes the challenges of consistent and nondisruptive runtime switch memory reallocation. Our prototype and evaluation on real-world traces show that INAlloc can reduce the jobs' deadline miss ratio by 75% and JCT by 27%.更多