Cross-token Modeling with Conditional Computation

semanticscholar(2021)

Citations: 4 | Views: 11
Abstract
Mixture-of-Experts (MoE), a conditional computation architecture, has achieved promising performance by scaling the local module (i.e., the feed-forward network) of the Transformer. However, scaling the cross-token module (i.e., self-attention) is challenging due to unstable training. This work proposes Sparse-MLP, an all-MLP model that applies sparsely activated MLPs to cross-token modeling. Specifically, in each Sparse block of our all-MLP model, we apply two stages of MoE layers: one with MLP experts mixing information within channels along the image patch dimension, the other with MLP experts mixing information within patches along the channel dimension. In addition, by proposing an importance-score routing strategy for MoE and redesigning the image representation shape, we further improve our model's computational efficiency. Experimentally, our models are more computation-efficient than Vision Transformers at comparable accuracy. Our models also outperform MLP-Mixer by 2.5% on ImageNet Top-1 accuracy with fewer parameters and lower computational cost. On downstream tasks, i.e., CIFAR-10 and CIFAR-100, our models still achieve better performance than the baselines.
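
To make the two-stage design concrete, below is a minimal PyTorch sketch of one Sparse block: a first MoE stage whose MLP experts mix information along the patch (token) dimension per channel, followed by a second MoE stage whose experts mix along the channel dimension per patch. The expert count, hidden widths, and the simple top-1 softmax routing are illustrative assumptions; the paper's importance-score routing and exact configuration are not reproduced here.

import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEMLP(nn.Module):
    # Mixture of MLP experts applied over the last dimension of the input,
    # with top-1 routing (an assumption; the paper uses importance-score routing).
    def __init__(self, dim, hidden_dim, num_experts=4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                          # x: (..., dim)
        flat = x.reshape(-1, x.shape[-1])          # route each vector independently
        scores = F.softmax(self.router(flat), dim=-1)
        top_score, top_idx = scores.max(dim=-1)    # top-1 expert per vector
        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                out[mask] = top_score[mask].unsqueeze(-1) * expert(flat[mask])
        return out.reshape(x.shape)


class SparseBlock(nn.Module):
    # One Sparse block: token-mixing MoE stage, then channel-mixing MoE stage,
    # each with a pre-norm and residual connection (Mixer-style layout).
    def __init__(self, num_patches, channels):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        self.norm2 = nn.LayerNorm(channels)
        # Stage 1: experts mix information along the patch dimension (per channel).
        self.moe_tokens = MoEMLP(num_patches, num_patches * 2)
        # Stage 2: experts mix information along the channel dimension (per patch).
        self.moe_channels = MoEMLP(channels, channels * 2)

    def forward(self, x):                          # x: (batch, patches, channels)
        y = self.norm1(x).transpose(1, 2)          # (batch, channels, patches)
        x = x + self.moe_tokens(y).transpose(1, 2)
        x = x + self.moe_channels(self.norm2(x))
        return x


if __name__ == "__main__":
    block = SparseBlock(num_patches=196, channels=256)
    print(block(torch.randn(2, 196, 256)).shape)   # torch.Size([2, 196, 256])

In a full model, several such blocks would be stacked after a patch-embedding layer, with a classification head on top; the per-vector routing loop above is written for clarity rather than speed.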
Keywords
cross-token modeling