Enabling Switch Memory Management for Distributed Training with In-Network Aggregation.

INFOCOM (2023)

Abstract
Distributed training (DT) in shared clusters usually relies on a scheduler to allocate resources among multiple concurrent jobs. Meanwhile, a recent acceleration primitive, In-Network Aggregation (INA), introduces switch memory as a new critical resource for DT jobs, one that lies outside the management of existing schedulers. The lack of switch memory management leads to inefficient use of cluster resources. We build INAlloc, a switch memory management system for DT job schedulers that improves INA-empowered DT jobs in shared clusters. INAlloc adds a switch memory management layer to organize the physical switch memory, allocate memory to jobs, and provide friendly interfaces to schedulers. INAlloc incorporates switch memory into its model of a job's completion time (JCT) and its resources, which helps the scheduler decide the switch memory allocation. INAlloc also overcomes the challenges of consistent and nondisruptive runtime switch memory reallocation. Our prototype and evaluation on real-world traces show that INAlloc can reduce the jobs' deadline miss ratio by 75% and JCT by 27%.
Keywords
distributed training, DT job schedulers, in-network aggregation, INA-empowered DT jobs, INAlloc, JCT, job completion time, nondisruptive runtime switch memory reallocation, physical switch memory, resource allocation, shared clusters, switch memory allocation, switch memory management layer