Why Transformers Need Adam: A Hessian Perspective
CoRR (2024)
Abstract
SGD performs worse than Adam by a significant margin on Transformers, but the
reason remains unclear. In this work, we provide an explanation of SGD's
failure on Transformers through the lens of the Hessian: (i) Transformers are
“heterogeneous”: the Hessian spectrum varies dramatically across parameter
blocks, a phenomenon we call “block heterogeneity”; (ii) Heterogeneity
hampers SGD: SGD performs badly on problems with block heterogeneity. To
validate that heterogeneity hampers SGD, we check various Transformers, CNNs,
MLPs, and quadratic problems, and find that SGD works well on problems without
block heterogeneity but performs badly when the heterogeneity exists. Our
initial theoretical analysis indicates that SGD fails because it applies a
single learning rate to all blocks, which cannot handle the heterogeneity
among blocks. The failure could be rescued if we could assign different
learning rates across blocks, as designed in Adam.
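
The single-learning-rate argument can be illustrated with a toy quadratic. The sketch below is not the paper's experiment: it assumes a diagonal Hessian with two hypothetical blocks whose eigenvalue ranges differ by a factor of 100, uses full-batch gradient descent in place of SGD, and uses hand-picked per-block step sizes as a stand-in for Adam's adaptive scaling. The names `run_gd`, `block_a`, and `block_b` are illustrative only.

```python
import numpy as np


def run_gd(eigs, lrs, steps=100):
    """Gradient descent on f(x) = 0.5 * sum(eigs * x**2), starting from x = 1.

    `lrs` may be a scalar (one learning rate for every coordinate) or an
    array with one learning rate per coordinate.
    """
    x = np.ones_like(eigs)
    for _ in range(steps):
        x = x - lrs * (eigs * x)  # the gradient of the diagonal quadratic is eigs * x
    return 0.5 * np.sum(eigs * x**2)


# Assumed toy setup: two parameter blocks with very different Hessian spectra.
block_a = np.linspace(1.0, 2.0, 5)      # eigenvalues in [1, 2]
block_b = np.linspace(100.0, 200.0, 5)  # eigenvalues in [100, 200]
eigs = np.concatenate([block_a, block_b])

# One learning rate for all blocks: 2 / (lambda_min + lambda_max) over the
# whole spectrum, the classical best fixed step for a quadratic.
single_lr = 2.0 / (eigs.min() + eigs.max())
loss_single = run_gd(eigs, single_lr)

# One learning rate per block, each tuned to that block's own spectrum.
lr_a = 2.0 / (block_a.min() + block_a.max())
lr_b = 2.0 / (block_b.min() + block_b.max())
block_lrs = np.concatenate([np.full(block_a.size, lr_a),
                            np.full(block_b.size, lr_b)])
loss_block = run_gd(eigs, block_lrs)

print(f"single learning rate:     loss = {loss_single:.3e}")
print(f"per-block learning rates: loss = {loss_block:.3e}")
```

In this setup the single step size is capped by the largest eigenvalue (200), so the block with eigenvalues near 1 barely moves. With per-block step sizes, each block only has to cope with its own condition number of 2, and the loss after the same number of steps is many orders of magnitude smaller.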