CEFS: compute-efficient flow scheduling for iterative synchronous applications

CoNEXT '20: The 16th International Conference on emerging Networking EXperiments and Technologies, Barcelona, Spain, December 2020

Abstract
Iterative Synchronous Applications (ISApps), represented by distributed deep learning (DL) training, are popular in today's data centers. In ISApps, multiple nodes carry out the computing task iteratively, globally synchronizing the results in each iteration. To increase the scaling efficiency of ISApps, in this paper we propose a new flow scheduling approach called CEFS. CEFS reduces the waiting time of computing nodes in two ways: for a single node, flows carrying data that can trigger earlier computation at the node are assigned higher priority; across nodes, flows towards slower nodes are assigned higher priority. To address the challenges of realizing CEFS in real systems, e.g., the limited number of priority queues on commodity switches, the combination of the two types of priorities, and the adaptation to different applications and hardware environments, we design an online Bayesian-optimization-based priority assignment algorithm that satisfies a two-dimensional order-preserving rule. We implement a CEFS prototype and evaluate it both on a 16-node GPU/RoCEv2 testbed by training typical DL models and through NS-3 simulations. Compared with TensorFlow and two representative scheduling solutions, TicTac and ByteScheduler, CEFS improves the training throughput by up to 253%, 252% and 47%, respectively. Moreover, the scaling efficiency of the 16-node system under TensorFlow, TicTac, ByteScheduler and CEFS is 26.6%~46.9%, 26.7%~47.0%, 63.9%~80.3%, and 92.9%~94.7%, respectively. The NS-3 simulation results show that CEFS can achieve similar scaling efficiency even at larger scales.
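The abstract's core mechanism is a two-dimensional priority (intra-node urgency of a flow's data and inter-node lag of its destination) that must be folded into the few priority queues available on commodity switches while preserving order in both dimensions. The Python sketch below illustrates this idea under stated assumptions: the class and function names (Flow, assign_queues, tune_weight, NUM_QUEUES), the weighted-sum scoring, and the toy cost model are all hypothetical, and a simple random search stands in for the paper's online Bayesian optimization. It is a minimal sketch of the concept, not the authors' implementation.

```python
import random
from dataclasses import dataclass

NUM_QUEUES = 8  # commodity switches expose only a handful of priority queues


@dataclass
class Flow:
    intra_rank: float  # 0 = this flow's data triggers computation earliest at its node
    inter_rank: float  # 0 = destined to the slowest (most lagging) node
    size: float        # bytes to transfer, used only in the toy cost model


def assign_queues(flows, weight):
    """Map each flow's 2-D priority to a queue index (0 = highest priority).

    A weighted sum with weight in [0, 1] is monotone in both ranks, so the
    quantized mapping is order-preserving: if flow A is no worse than flow B
    in both dimensions, A never lands in a lower-priority queue than B.
    """
    scores = [weight * f.intra_rank + (1.0 - weight) * f.inter_rank for f in flows]
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0
    return [min(NUM_QUEUES - 1, int((s - lo) / span * NUM_QUEUES)) for s in scores]


def toy_iteration_time(flows, queues):
    """Toy stand-in for a measured iteration time.

    Penalizes placing urgent flows (low ranks) in low-priority (high-index) queues.
    """
    return sum(q * f.size / (1.0 + f.intra_rank + f.inter_rank)
               for f, q in zip(flows, queues))


def tune_weight(flows, trials=50, seed=0):
    """Stand-in for the paper's online Bayesian optimization: random search
    over the mixing weight, keeping the weight with the best toy cost."""
    rng = random.Random(seed)
    best_w, best_cost = 0.5, float("inf")
    for _ in range(trials):
        w = rng.random()
        cost = toy_iteration_time(flows, assign_queues(flows, w))
        if cost < best_cost:
            best_w, best_cost = w, cost
    return best_w


if __name__ == "__main__":
    rng = random.Random(1)
    flows = [Flow(rng.random(), rng.random(), rng.uniform(1, 10)) for _ in range(32)]
    w = tune_weight(flows)
    print("chosen weight:", round(w, 3))
    print("queue assignment:", assign_queues(flows, w))
```

In the actual system, the tuning loop would be driven online by observed per-iteration feedback rather than a synthetic cost, but the order-preserving constraint on the queue mapping is the same in spirit.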