Transformers Learn Nonlinear Features In Context: Nonconvex Mean-field Dynamics on the Attention Landscape
CoRR (2024)
Abstract
Large language models based on the Transformer architecture have demonstrated
impressive capabilities to learn in context. However, existing theoretical
studies on how this phenomenon arises are limited to the dynamics of a single
layer of attention trained on linear regression tasks. In this paper, we study
the optimization of a Transformer consisting of a fully connected layer
followed by a linear attention layer. The MLP acts as a common nonlinear
representation or feature map, greatly enhancing the power of in-context
learning. We prove in the mean-field and two-timescale limit that the
infinite-dimensional loss landscape for the distribution of parameters, while
highly nonconvex, becomes quite benign. We also analyze the second-order
stability of mean-field dynamics and show that Wasserstein gradient flow almost
always avoids saddle points. Furthermore, we establish novel methods for
obtaining concrete improvement rates both away from and near critical points.
This represents the first saddle point analysis of mean-field dynamics in
general and the techniques are of independent interest.
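
To make the architecture described above concrete, here is a minimal NumPy sketch of a shared fully connected feature map followed by a softmax-free (linear) attention readout. It is an illustration under stated assumptions, not the paper's exact parameterization: the tanh activation, the bilinear matrix `Gamma`, the 1/n normalization, and all function and variable names are hypothetical choices made for this example.

```python
import numpy as np

def feature_map(X, W, b):
    """Shared fully connected layer: maps (n, d) inputs to (n, k) features.
    The tanh activation is an assumption made for this sketch."""
    return np.tanh(X @ W + b)

def linear_attention_predict(X_ctx, y_ctx, x_query, W, b, Gamma):
    """In-context prediction for one query using linear attention on MLP features.

    The prompt (X_ctx, y_ctx) is summarized by the empirical feature-label
    correlation, which is then read out against the query features through
    the bilinear form Gamma (a hypothetical attention parameter).
    """
    H_ctx = feature_map(X_ctx, W, b)               # (n, k) context features
    h_q = feature_map(x_query[None, :], W, b)[0]   # (k,) query features
    n = X_ctx.shape[0]
    corr = (y_ctx @ H_ctx) / n                     # (k,) average of y_i * h(x_i)
    return float(corr @ Gamma @ h_q)

# Toy prompt with random data, only to exercise the shapes.
rng = np.random.default_rng(0)
d, k, n = 3, 8, 16
W = rng.normal(size=(d, k))
b = rng.normal(size=k)
Gamma = np.eye(k)
X_ctx = rng.normal(size=(n, d))
y_ctx = rng.normal(size=n)
x_query = rng.normal(size=d)

pred = linear_attention_predict(X_ctx, y_ctx, x_query, W, b, Gamma)
```

Because there is no softmax, the prediction is linear in the context labels `y_ctx`, while the dependence on the inputs is nonlinear through the shared feature map. That is the sense in which the MLP serves as a common nonlinear representation for the attention layer.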