CO2: Efficient Distributed Training with Full Communication-Computation Overlap
ICLR 2024
Abstract
The fundamental success of large language models hinges upon the efficacious
implementation of large-scale distributed training techniques. Nevertheless,
building a vast, high-performance cluster featuring high-speed communication
interconnectivity is prohibitively costly, and accessible only to prominent
entities. In this work, we aim to lower this barrier and democratize
large-scale training with limited bandwidth clusters. We propose a new approach
called CO2 that introduces local-updating and asynchronous communication to the
distributed data-parallel training, thereby facilitating the full overlap of
COmmunication with COmputation. CO2 attains high scalability even on
extensive multi-node clusters constrained by very limited communication
bandwidth. We further propose the staleness gap penalty and outer momentum
clipping techniques together with CO2 to bolster its convergence and training
stability. In addition, CO2 integrates seamlessly with well-established
ZeRO-series optimizers, which mitigate the memory consumption of model states
in large-model training. We also provide a mathematical proof of convergence,
accompanied by the establishment of a stringent upper bound. Furthermore, we
validate our findings through an extensive set of practical experiments
encompassing a wide range of tasks in the fields of computer vision and natural
language processing. These experiments serve to demonstrate the capabilities of
CO2 in terms of convergence, generalization, and scalability when deployed
across configurations comprising up to 128 A100 GPUs. The results highlight
the capacity of CO2 to substantially improve scalability, whether on clusters
with 800Gbps RDMA or 80Gbps TCP/IP inter-node connections.
Keywords
Distributed Training, Data Parallelism, Local Updating, Asynchronous Communication