Fully Attentional Networks with Self-emerging Token Labeling
Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2024
Abstract
Recent studies indicate that Vision Transformers (ViTs) are robust against
out-of-distribution scenarios. In particular, the Fully Attentional Network
(FAN) - a family of ViT backbones, has achieved state-of-the-art robustness. In
this paper, we revisit the FAN models and improve their pre-training with a
self-emerging token labeling (STL) framework. Our method contains a two-stage
training framework. Specifically, we first train a FAN token labeler (FAN-TL)
to generate semantically meaningful patch token labels, followed by a FAN
student model training stage that uses both the token labels and the original
class label. With the proposed STL framework, our best model based on
FAN-L-Hybrid (77.3M parameters) achieves 84.8% Top-1 accuracy on ImageNet-1K
and strong robustness on ImageNet-C, and sets a new state-of-the-art for
ImageNet-A (46.1%), outperforming the original FAN counterpart by significant
margins. The proposed framework also
demonstrates significantly enhanced performance on downstream tasks such as
semantic segmentation, with up to 1.7% higher mIoU than the counterpart
model. Code is available at https://github.com/NVlabs/STL.
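The two-stage recipe described above — a FAN token labeler (FAN-TL) emitting per-patch labels, and a student trained on both those token labels and the image-level class label — can be sketched as a combined loss. This is a minimal numpy illustration, not the authors' implementation: the function names, the soft cross-entropy form of the token loss, and the `alpha` weighting are assumptions for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def stl_student_loss(cls_logits, class_label, student_patch_logits,
                     teacher_patch_logits, alpha=0.5):
    """Combine the ordinary class loss with a token-labeling loss.

    cls_logits:           (num_classes,) student class-token logits
    class_label:          int, ground-truth image-level label
    student_patch_logits: (num_patches, num_classes) student patch logits
    teacher_patch_logits: (num_patches, num_classes) FAN-TL patch logits
    alpha:                token-loss weight (illustrative, not from the paper)
    """
    # stage-2 class loss: standard cross-entropy on the original class label
    cls_loss = -np.log(softmax(cls_logits)[class_label])

    # token-label loss: soft cross-entropy of each student patch against
    # the teacher's self-emerging per-patch label distribution
    teacher = softmax(teacher_patch_logits, axis=-1)
    log_student = np.log(softmax(student_patch_logits, axis=-1))
    token_loss = -(teacher * log_student).sum(axis=-1).mean()

    return cls_loss + alpha * token_loss
```

Training the stage-1 labeler itself uses only the class label; the sketch covers the stage-2 student objective, where the teacher's patch distributions supervise every token rather than the class token alone.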