Spike No More: Stabilizing the Pre-training of Large Language Models
CoRR (2023)
Abstract
Loss spikes often occur during the pre-training of large language models. These
spikes degrade the model's performance and sometimes ruin the pre-training run.
Since pre-training requires a vast computational budget, such spikes should be
avoided. To investigate the cause of loss spikes, this study focuses on the
gradients of internal layers. Through theoretical analyses, we identify two
causes of exploding gradients and provide requirements to prevent the
explosion. In addition, we introduce a combination of an initialization method
and a simple modification to embeddings as a way to satisfy these requirements.
We conduct various experiments to verify our theoretical analyses empirically.
The experimental results indicate that the combination is effective in
preventing spikes during pre-training.