What Language Model to Train if You Have One Million GPU Hours?

semanticscholar (2022)

Citations 40 | Views 167
Abstract
The crystallization of modeling methods around the Transformer architecture has been a boon for practitioners. Simple, well-motivated architectural variations may transfer across tasks and scale, increasing the impact and leverage of modeling research. However, with the emergence of state-of-the-art 100B+ parameter models, large language models are increasingly expensive to accurately design and train. Notably, it can be difficult to evaluate how modeling decisions may impact emergent capabilities, given that these capabilities arise mainly from sheer scale. Targeting a multilingual language model at the 100B+ parameter scale, our goal is to identify an architecture and training setup that makes the best use of our 1,000,000 A100-hours budget. Specifically, we perform an ablation study comparing different modeling practices and their impact on zero-shot generalization. We perform all our experiments on 1.3B models, providing a compromise between compute costs and the likelihood that our conclusions will hold for the target 100B+ model. In addition, we study the impact of various popular pretraining corpora on zero-shot generalization. We also study the performance of a multilingual model and how it compares to the English-only one. Finally, we consider the scaling behaviour of Transformers to choose the target model size, shape, and training setup. All our models and code are open-sourced at https://github.com/anonymous .
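As a rough illustration of the budget arithmetic the abstract alludes to (choosing a model size and training setup under a fixed 1,000,000 A100-hour budget), the sketch below applies the common C ≈ 6·N·D approximation for dense Transformer training FLOPs. The throughput, utilization, and model-size values are assumptions chosen for illustration, not figures reported in the paper.

# Back-of-the-envelope sketch (not from the paper): how many training tokens
# a 100B+ parameter model could see within a 1,000,000 A100-hour budget,
# using the common approximation C ~= 6 * N * D for dense Transformer training.
# All hardware and efficiency numbers below are assumptions for illustration.

GPU_HOURS = 1_000_000        # stated compute budget
PEAK_FLOPS = 312e12          # assumed A100 peak BF16 throughput, FLOP/s
MFU = 0.35                   # assumed model FLOPs utilization (30-45% is typical)
N_PARAMS = 150e9             # assumed target size, somewhere in the 100B+ range

# Total usable FLOPs over the whole budget.
total_flops = GPU_HOURS * 3600 * PEAK_FLOPS * MFU

# Solve C = 6 * N * D for the number of trainable tokens D.
tokens = total_flops / (6 * N_PARAMS)

print(f"Usable FLOPs: {total_flops:.2e}")
print(f"Trainable tokens at {N_PARAMS / 1e9:.0f}B params: {tokens / 1e9:.0f}B")

With these assumed values the budget works out to a few hundred billion tokens; varying the assumed model size or utilization shifts the answer, which is exactly the trade-off the paper's scaling analysis is meant to settle.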
Keywords
gpu hours,language model