Rethinking Memory and Communication Cost for Efficient Large Language Model Training

Cong Wu, Hanxiao Zhang, Jianzhong Lin, Junxian Huang, Yue Xiao, Zhaoxin Huan, Siyuan Liu, Fanhao Meng, Lei Liang, Xiaolu Zhang, Jie Zhou

arXiv (Cornell University), 2023

Abstract
Recently, various distributed strategies for large language model training have been proposed. However, these methods offer only limited solutions to the trade-off between memory consumption and communication cost. In this paper, we rethink the impact of memory consumption and communication cost on the training speed of large language models, and propose a memory-communication balanced strategy set, the Partial Redundancy Optimizer (PaRO). PaRO provides a comprehensive set of options that reduce the amount and frequency of inter-group communication at the cost of minor memory redundancy through fine-grained sharding strategies, thereby improving training efficiency in various training scenarios. Additionally, we propose a Hierarchical Overlapping Ring (HO-Ring) communication topology to enhance communication efficiency between nodes or across switches in large language model training. Our experiments demonstrate that PaRO significantly improves training throughput by 1.19x-2.50x compared to the SOTA method and achieves near-linear scalability. The HO-Ring algorithm improves communication efficiency by 36.5% compared to the traditional Ring algorithm.
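To illustrate the memory-communication trade-off that the abstract describes, below is a minimal, hypothetical Python sketch (not the authors' code) comparing two sharding layouts for optimizer states: fully sharding across all ranks versus sharding only within an intra-node group and replicating across groups, which is the general flavor of partial redundancy. The group sizes, state size, and per-rank byte counts are illustrative assumptions, not figures from the paper.

```python
# Hypothetical sketch of the memory vs. communication trade-off behind
# partial-redundancy sharding. All sizes and the cost model are assumptions
# for illustration only; this is not the PaRO implementation.

from dataclasses import dataclass


@dataclass
class ShardingPlan:
    memory_per_rank: float      # optimizer-state bytes stored on each rank
    intra_group_recv: float     # bytes a rank receives from its own group per gather
    inter_group_recv: float     # bytes a rank receives from other groups per gather


def plan(state_bytes: float, ranks_per_group: int, num_groups: int,
         shard_across_groups: bool) -> ShardingPlan:
    """Per-rank storage and receive volume needed to materialize the full state."""
    world_size = ranks_per_group * num_groups
    if shard_across_groups:
        # Fully sharded (ZeRO-style): minimal memory, but gathers must cross
        # node/group boundaries.
        return ShardingPlan(
            memory_per_rank=state_bytes / world_size,
            intra_group_recv=state_bytes * (ranks_per_group - 1) / world_size,
            inter_group_recv=state_bytes * (world_size - ranks_per_group) / world_size,
        )
    # Intra-group sharding with inter-group replication: each group keeps a full
    # replica, so per-rank memory grows, but cross-group gather traffic vanishes.
    return ShardingPlan(
        memory_per_rank=state_bytes / ranks_per_group,
        intra_group_recv=state_bytes * (ranks_per_group - 1) / ranks_per_group,
        inter_group_recv=0.0,
    )


if __name__ == "__main__":
    gib = float(1 << 30)
    for shard_all in (True, False):
        p = plan(state_bytes=96 * gib, ranks_per_group=8, num_groups=4,
                 shard_across_groups=shard_all)
        label = "fully sharded" if shard_all else "intra-group only"
        print(f"{label:16s} mem/rank={p.memory_per_rank / gib:.2f} GiB  "
              f"intra-recv={p.intra_group_recv / gib:.2f} GiB  "
              f"inter-recv={p.inter_group_recv / gib:.2f} GiB")
```

Running the sketch shows the trade-off in miniature: the intra-group layout stores more state per rank but removes the slower cross-node receive volume, which is the kind of balance point the paper's fine-grained sharding options are meant to expose.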
Keywords
efficient large language model training, memory, communication cost