AMSP: Reducing Communication Overhead of ZeRO for Efficient LLM Training
arXiv (2023)
Abstract
Training large language models (LLMs) is constrained by GPU memory
consumption due to the high memory requirements of model states. The widely
used Zero Redundancy Optimizer (ZeRO) addresses this issue through strategic
sharding but introduces communication challenges at scale. To tackle this
problem, we propose AMSP, a system designed to optimize ZeRO for scalable LLM
training. AMSP incorporates three flexible sharding strategies: Full-Replica,
Full-Sharding, and Partial-Sharding, and allows each component of the model
states (parameters, gradients, and optimizer states) to independently choose a
sharding strategy as well as a device mesh. We conduct a thorough analysis of
communication costs and formulate an optimization problem to discover the
optimal sharding strategy. Additionally, AMSP optimizes distributed LLM
training by efficiently overlapping communication with computation. Evaluations
demonstrate up to 52% Model FLOPs Utilization (MFU) when training a
LLaMA-based model on 1024 GPUs, a 1.56× improvement in training
throughput over recently proposed systems such as MiCS and ZeRO++.
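To make the abstract's core idea concrete, below is a minimal sketch of per-component sharding choices and a brute-force search over a toy communication-cost model. All names here (ShardingStrategy, ShardingConfig, comm_cost, mem_per_rank, best_config) are hypothetical illustrations, not the actual AMSP API; the ring-collective traffic factors and the 6× Adam optimizer-state multiplier are standard back-of-envelope figures, not numbers from the paper.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product

# Hypothetical sketch only -- not the real AMSP implementation.

class ShardingStrategy(Enum):
    FULL_REPLICA = "full-replica"          # every rank holds a complete copy
    FULL_SHARDING = "full-sharding"        # split across all ranks (ZeRO-style)
    PARTIAL_SHARDING = "partial-sharding"  # split within a sub-group, replicated across groups

@dataclass(frozen=True)
class ShardingConfig:
    """Each model-state component independently picks a strategy; sharded
    components are split over a device sub-mesh of `group_size` ranks."""
    parameters: ShardingStrategy
    gradients: ShardingStrategy
    optimizer_states: ShardingStrategy
    group_size: int

def _ring_factor(n: int) -> float:
    # Per-rank traffic factor of a ring all-gather / reduce-scatter over n ranks.
    return (n - 1) / n if n > 1 else 0.0

def _span(s: ShardingStrategy, world: int, group: int) -> int:
    # How many ranks a component's collective spans under each strategy.
    return {ShardingStrategy.FULL_REPLICA: 1,
            ShardingStrategy.FULL_SHARDING: world,
            ShardingStrategy.PARTIAL_SHARDING: group}[s]

def comm_cost(cfg: ShardingConfig, world: int, nbytes: int) -> float:
    """Toy per-iteration, per-rank communication volume in bytes. A real cost
    model would also distinguish intra-node (NVLink) from inter-node bandwidth."""
    # Sharded parameters must be all-gathered for the forward and backward pass.
    cost = 2 * nbytes * _ring_factor(_span(cfg.parameters, world, cfg.group_size))
    if cfg.gradients is ShardingStrategy.FULL_REPLICA:
        cost += 2 * nbytes * _ring_factor(world)          # all-reduce
    else:
        g = _span(cfg.gradients, world, cfg.group_size)
        cost += nbytes * _ring_factor(g)                  # reduce-scatter
        if cfg.gradients is ShardingStrategy.PARTIAL_SHARDING:
            # Partially sharded gradients still need a cross-group reduction.
            cost += 2 * (nbytes / g) * _ring_factor(world // g)
    if cfg.optimizer_states is not ShardingStrategy.FULL_REPLICA:
        # Sharded optimizer states imply gathering updated params after the step.
        cost += nbytes * _ring_factor(_span(cfg.optimizer_states, world, cfg.group_size))
    return cost

def mem_per_rank(cfg: ShardingConfig, world: int, nbytes: int) -> float:
    """Rough per-rank bytes: fp16 params (1x), fp16 grads (1x), Adam states
    (fp32 master copy + momentum + variance, ~6x the fp16 parameter bytes)."""
    def div(s: ShardingStrategy) -> int:
        return _span(s, world, cfg.group_size)
    return (nbytes / div(cfg.parameters)
            + nbytes / div(cfg.gradients)
            + 6 * nbytes / div(cfg.optimizer_states))

def best_config(world: int, nbytes: int, budget: float) -> ShardingConfig:
    """The abstract's optimization problem reduced to brute force: minimize
    communication volume subject to a per-rank memory budget."""
    groups = [g for g in (2, 4, 8, 16, 32) if world % g == 0 and g < world] or [1]
    best, best_cost = None, float("inf")
    for p, g, o in product(ShardingStrategy, repeat=3):
        for gs in groups:
            cfg = ShardingConfig(p, g, o, gs)
            if mem_per_rank(cfg, world, nbytes) > budget:
                continue  # infeasible: model states would not fit
            c = comm_cost(cfg, world, nbytes)
            if c < best_cost:
                best, best_cost = cfg, c
    return best

if __name__ == "__main__":
    GB = 1 << 30
    # Example: ~7B params in fp16 (~14 GB of parameter bytes), 1024 GPUs,
    # 40 GB per GPU left for model states.
    print(best_config(world=1024, nbytes=14 * GB, budget=40 * GB))
```

The sketch shows why per-component flexibility can pay off: for instance, replicating parameters while fully sharding optimizer states removes the per-step all-gather of parameters at a modest memory cost, a trade-off a one-size-fits-all ZeRO stage cannot express.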