The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers

ICLR 2023

Abstract
This paper studies a curious phenomenon: machine learning models with Transformer architectures have sparse activation maps. By activation map we refer to the intermediate output of the multi-layer perceptrons (MLPs) after the ReLU activation function, and by "sparse" we mean that on average very few entries (e.g., 3.0% for T5-Base and 6.3% for ViT-B16) are nonzero for each input to the MLP. Moreover, larger Transformers with more layers and wider MLP hidden dimensions are sparser as measured by the percentage of nonzero entries. Through extensive experiments we demonstrate that the emergence of sparsity is a prevalent phenomenon that occurs on both natural language processing and vision tasks, on both training and evaluation data, for Transformers of various configurations, and at layers of all depths. We discuss how sparsity immediately implies a way to significantly reduce the FLOP count and improve efficiency for Transformers. Moreover, we demonstrate, perhaps surprisingly, that enforcing an even sparser activation via Top-k thresholding with a small k brings a collection of desired properties: less sensitivity to noisy training data, more robustness to input corruptions, and better calibration of prediction confidence.
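The two objects the abstract centers on, the post-ReLU activation map and Top-k thresholding, are simple to state in code. Below is a minimal PyTorch sketch; the module name `TopKMLP`, the layer dimensions, and the choice k=64 are illustrative assumptions for a T5-Base-sized MLP, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class TopKMLP(nn.Module):
    """Transformer-style MLP block whose post-ReLU activation map can
    optionally be sparsified with Top-k thresholding.

    Hypothetical sketch: dimensions and k are illustrative defaults.
    """

    def __init__(self, d_model=768, d_hidden=3072, k=None):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_model)
        self.k = k  # if set, keep only the k largest activations per token

    def forward(self, x):
        h = torch.relu(self.fc1(x))  # the "activation map" studied in the paper
        if self.k is not None:
            # Top-k thresholding: zero every entry except the k largest
            # along the hidden dimension.
            _, topk_idx = torch.topk(h, self.k, dim=-1)
            mask = torch.zeros_like(h).scatter_(-1, topk_idx, 1.0)
            h = h * mask
        return self.fc2(h)


def activation_sparsity(h):
    """Fraction of nonzero entries in an activation map."""
    return (h != 0).float().mean().item()


# Usage: measure the nonzero fraction of the post-ReLU activations.
mlp = TopKMLP(k=64)
x = torch.randn(2, 16, 768)          # (batch, tokens, d_model)
h = torch.relu(mlp.fc1(x))
print(f"nonzero fraction without thresholding: {activation_sparsity(h):.3f}")
```

With the mask applied, the nonzero fraction is at most k / d_hidden (64/3072 ≈ 2.1% here), in the same range as the natural sparsity the paper reports for trained models.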
Keywords
Transformers, Sparse, Calibration, Robustness, Label Noise, Efficiency