AMPeD: An Analytical Model for Performance in Distributed Training of Transformers.

ISPASS 2023

Abstract
Transformers are a class of machine learning models that have attracted significant interest recently for a multitude of reasons: they can process multiple modalities efficiently and have excellent scalability. Despite these obvious advantages, training these large models is very time-consuming. Hence, there have been efforts to speed up the training process using efficient distributed implementations. Many different types of parallelism have been identified that can be employed standalone or in combination. However, naively combining different parallelization schemes can incur significant communication overheads, thereby potentially defeating the purpose of distributed training. Thus, it becomes vital to predict the right mapping of different parallelisms to the underlying system architecture. In this work, we propose AMPeD, an analytical model for performance in distributed training of transformers. It exposes all the transformer model parameters, the potential parallelism choices (along with their mapping onto the system), and the accelerator and system architecture specifications as tunable knobs, thereby enabling hardware-software co-design. With the help of three case studies, we show that the combinations of parallelisms predicted to be efficient by AMPeD conform with the results from the state-of-the-art literature. Using AMPeD, we also show that future distributed systems consisting of optical communication substrates can train large models up to 4x faster than the current state-of-the-art systems without modifying the peak computational power of the accelerators. Finally, we validate AMPeD with in-house experiments on real systems and via published literature; the maximum observed error is limited to 12%. The model is available here: https://github.com/CSA-infra/AMPeD
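To illustrate the kind of analytical model the abstract describes, the following is a minimal sketch of a performance estimate that exposes model, system, and parallelism parameters as tunable knobs. It is not AMPeD's actual interface: all knob names and the coarse roofline-style compute and all-reduce cost formulas below are assumptions made for illustration only; the real model in the linked repository is more detailed.

```python
# Minimal illustrative sketch of an analytical performance model for
# distributed transformer training. This is NOT AMPeD's actual API;
# knob names and cost formulas are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class ModelKnobs:
    """Transformer model parameters exposed as tunable knobs."""
    num_layers: int = 96
    hidden_size: int = 12288
    seq_length: int = 2048
    global_batch_size: int = 1536


@dataclass
class SystemKnobs:
    """Accelerator and system-architecture specifications."""
    peak_flops: float = 312e12       # peak FLOP/s per accelerator
    link_bandwidth: float = 300e9    # bytes/s per inter-accelerator link
    num_accelerators: int = 1024


@dataclass
class ParallelismKnobs:
    """Mapping of the different parallelism degrees onto the system."""
    tensor_parallel: int = 8
    pipeline_parallel: int = 16
    data_parallel: int = 8


def estimate_iteration_time(m: ModelKnobs, s: SystemKnobs, p: ParallelismKnobs) -> float:
    """Very coarse estimate of one training iteration (seconds)."""
    # Approximate FLOPs per iteration for a dense transformer (fwd + bwd),
    # using the common 6 * parameters * tokens rule of thumb.
    params = 12 * m.num_layers * m.hidden_size ** 2
    tokens = m.global_batch_size * m.seq_length
    total_flops = 6 * params * tokens

    # Compute time: FLOPs split evenly across all accelerators.
    compute_time = total_flops / (s.peak_flops * s.num_accelerators)

    # Communication time: gradient all-reduce over the data-parallel group
    # (2 bytes per parameter in the local shard, ring all-reduce approximation).
    shard_bytes = 2 * params / (p.tensor_parallel * p.pipeline_parallel)
    allreduce_bytes = 2 * shard_bytes * (p.data_parallel - 1) / p.data_parallel
    comm_time = allreduce_bytes / s.link_bandwidth

    return compute_time + comm_time


if __name__ == "__main__":
    t = estimate_iteration_time(ModelKnobs(), SystemKnobs(), ParallelismKnobs())
    print(f"Estimated time per iteration: {t:.2f} s")
```

Even such a simplified sketch shows why the choice of parallelism mapping matters: changing the tensor-, pipeline-, and data-parallel degrees shifts the balance between the compute and communication terms, which is the trade-off AMPeD models in much finer detail.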
Keywords
Analytical Modeling, Transformers, Distributed Training, Performance