Janus: A Unified Distributed Training Framework for Sparse Mixture-of-Experts Models

ACM SIGCOMM '23: Proceedings of the ACM SIGCOMM 2023 Conference (2023)

Scaling models to large sizes to improve performance has become a trend in deep learning, and the sparsely activated Mixture-of-Experts (MoE) is a promising architecture for scaling models. However, training MoE models in existing systems is expensive, mainly due to the All-to-All communication between layers. All-to-All communication originates from the expert-centric paradigm: keeping experts in place and exchanging intermediate data to feed the experts. We propose a novel data-centric paradigm: keeping data in place and moving experts between GPUs. Since the experts can be smaller than the data, the data-centric paradigm can reduce the communication workload. Based on this insight, we develop Janus. First, Janus supports fine-grained asynchronous communication, which can overlap computation and communication. Janus also implements hierarchical communication to further reduce cross-node traffic by sharing fetched experts among GPUs in the same machine. Second, when scheduling the "fetching expert" requests, Janus implements a topology-aware priority strategy to utilize intra-node and inter-node links efficiently. Finally, Janus allows experts to be prefetched, so that downstream computation can start immediately once the previous step completes. Evaluated on a 32-A100 cluster, Janus can reduce traffic by up to 16x and achieves up to 2.06x speedup compared with a current MoE training system.
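The claim that moving experts can cost less than moving data can be illustrated with a back-of-the-envelope count. The sketch below is not the paper's cost model; it is a simplified estimate under stated assumptions: a single MoE layer, forward pass only, tokens routed uniformly across workers, experts evenly sharded across workers, and each expert a two-matrix feed-forward block without biases. All function names, parameters, and the example sizes are hypothetical.

```python
def expert_centric_elements(tokens_per_worker, top_k, hidden, workers):
    """Elements one worker exchanges per layer when tokens move to experts.

    Assumes uniform routing, so (workers - 1) / workers of the routed token
    copies go to remote workers. The factor 2 covers dispatch (send tokens)
    plus combine (receive expert outputs).
    """
    routed = tokens_per_worker * top_k * hidden
    return 2 * routed * (workers - 1) // workers


def data_centric_elements(num_experts, hidden, ffn_hidden, workers):
    """Elements one worker fetches per layer when experts move to data.

    Assumes the worker pulls every expert it does not already hold, and that
    each expert has two weight matrices (hidden x ffn_hidden, ffn_hidden x hidden).
    """
    local_experts = num_experts // workers
    params_per_expert = 2 * hidden * ffn_hidden
    return (num_experts - local_experts) * params_per_expert


if __name__ == "__main__":
    # Hypothetical sizes chosen for illustration only, not taken from the paper.
    cfg = dict(hidden=768, workers=32)
    ec = expert_centric_elements(tokens_per_worker=131072, top_k=2, **cfg)
    dc = data_centric_elements(num_experts=32, ffn_hidden=3072, **cfg)
    print(f"expert-centric: {ec} elements, data-centric: {dc} elements")
    print(f"ratio: {ec / dc:.2f}")
```

Under these assumptions the comparison reduces to whether `tokens_per_worker * top_k` exceeds `num_experts * ffn_hidden`: large per-worker batches favor moving experts, while small batches favor moving tokens. This matches the abstract's wording that experts "can be" smaller than the data, not that they always are.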
Distributed training, mixture of experts, deep learning