Optimizing Distributed Training on Frontier for Large Language Models
CoRR (2023)
Abstract
Large language models (LLMs) are showing tremendous success as foundation
models, and many downstream applications benefit from fine-tuning. Prior works
on loss scaling have demonstrated that larger LLMs perform better than
their smaller counterparts. However, training LLMs with billions of parameters
requires considerable computational resources; to train a one trillion
GPT-style model on 20 trillion tokens, we need to perform 120 million exaflops.
Frontier is the world's first and fastest exascale supercomputer for open
science and is equipped with 75264 MI250X GPUs. This work explores efficient
distributed strategies such as tensor parallelism, pipeline parallelism, and
sharded data parallelism to train a trillion-parameter model on the Frontier
exascale supercomputer. We analyze these distributed training techniques and
associated parameters individually to decide which techniques to use and what
associated parameters to select for a particular technique. We perform
hyperparameter tuning on these techniques to understand their complex
interplay. Combined with these two tuning efforts, we have found optimal
strategies to train three models of size 22B, 175B, and 1T parameters with
$38.38\%$, $36.14\%$, and $31.96\%$ achieved throughput. For training the
175B parameter model and the 1T model, we achieved $100\%$ weak scaling
efficiency and $89\%$ and $87\%$ strong scaling efficiency, respectively. Our
work presents a set of strategies for distributed training of LLMs through
experimental findings and hyperparameter tuning.
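The quoted total of 120 million exaflops is consistent with the common rule of thumb that training a GPT-style model costs roughly 6 FLOPs per parameter per token (the rule itself is an assumption here; the abstract states only the total). A minimal sketch of that arithmetic:

```python
# Back-of-the-envelope training cost for a GPT-style model, using the
# widely cited ~6 * parameters * tokens FLOP estimate (assumed, not
# stated explicitly in the abstract).
def training_exaflops(params: float, tokens: float) -> float:
    """Approximate total training FLOPs, expressed in exaflops (1e18 FLOPs)."""
    total_flops = 6 * params * tokens
    return total_flops / 1e18

# A 1-trillion-parameter model trained on 20 trillion tokens:
print(training_exaflops(1e12, 20e12))  # -> 120000000.0, i.e. 120 million exaflops
```

This matches the figure in the abstract: $6 \times 10^{12} \times 20 \times 10^{12} = 1.2 \times 10^{26}$ FLOPs, or $1.2 \times 10^{8}$ exaflops.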