$\sigma$Reparam: Stable Transformer Training with Spectral Reparametrization

ICLR 2023 (2023)

Abstract
Training stability is of great importance to Transformers. In this work, we investigate the training dynamics of Transformers by examining the evolution of the attention layers. In particular, we track the "attention entropy" of each attention head over the course of training as a proxy for the sharpness of the attention. We observe a common, non-monotonic evolution of attention entropy across different settings: attention entropy first drops quickly in the initial phase of training, then rises quickly, and finally enters a long stable phase. While the exact shape is affected by hyperparameters such as warmup, initialization, and learning rate, we find a close correlation between the minima of attention entropy and the model's training stability. Motivated by this, we propose a simple and efficient solution dubbed $\sigma$Reparam, in which we reparametrize all linear layers with Spectral Normalization and an additional learned scalar. We provide a lower bound on the attention entropy as a function of the spectral norms of the query and key projections, which suggests that small attention entropy can only be obtained with large spectral norms. $\sigma$Reparam decouples the growth rate of a weight matrix's spectral norm from its dimensionality, which we verify empirically. We conduct experiments with $\sigma$Reparam on image classification, image self-supervised learning, automatic speech recognition, and language modeling tasks, and show that $\sigma$Reparam provides strong stability and robustness with respect to the choice of hyperparameters.
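The abstract describes $\sigma$Reparam as reparametrizing every linear layer with Spectral Normalization plus an additional learned scalar. Below is a minimal PyTorch-style sketch of that idea; the class name `SigmaReparamLinear`, the power-iteration details, and the initialization of the scalar `gamma` are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SigmaReparamLinear(nn.Module):
    """Sketch of a linear layer reparametrized as W_hat = (gamma / sigma(W)) * W,
    where sigma(W) is the spectral norm (estimated here by power iteration)
    and gamma is a learned scalar."""

    def __init__(self, in_features: int, out_features: int, bias: bool = True,
                 n_power_iterations: int = 1):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.xavier_uniform_(self.weight)
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None
        # Learned scalar; initialized to 1 here (an assumption, not necessarily the paper's choice).
        self.gamma = nn.Parameter(torch.ones(()))
        self.n_power_iterations = n_power_iterations
        # Running estimates of the top left/right singular vectors of W.
        self.register_buffer("u", F.normalize(torch.randn(out_features), dim=0))
        self.register_buffer("v", F.normalize(torch.randn(in_features), dim=0))

    @torch.no_grad()
    def _update_singular_vectors(self) -> None:
        # A few power-iteration steps to track the leading singular vectors of W.
        for _ in range(self.n_power_iterations):
            self.v.copy_(F.normalize(self.weight.t() @ self.u, dim=0))
            self.u.copy_(F.normalize(self.weight @ self.v, dim=0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            self._update_singular_vectors()
        # Spectral-norm estimate sigma(W) ≈ u^T W v.
        sigma = torch.dot(self.u, self.weight @ self.v)
        w_hat = (self.gamma / sigma) * self.weight
        return F.linear(x, w_hat, self.bias)
```

A drop-in usage consistent with the abstract would replace `nn.Linear(d_in, d_out)` with `SigmaReparamLinear(d_in, d_out)` throughout the Transformer (attention projections and MLP blocks alike). In this sketch the singular-vector estimates are detached, while `sigma` remains differentiable with respect to `W`, mirroring common spectral-normalization implementations.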
Keywords
Transformers,self-attention,optimization,stability,spectral normalization,self-supervised learning,vision,speech,language,contrastive learning