Attention-guided contrastive masked image modeling for transformer-based self-supervised learning

2023 IEEE International Conference on Image Processing (ICIP)

Abstract
Self-supervised learning with vision transformers (ViT) has gained much attention recently. Most existing methods rely on either contrastive learning or masked image modeling. The former is well suited to global feature extraction but underperforms on fine-grained tasks; the latter exploits the internal structure of images but ignores their high information sparsity and unbalanced information distribution. In this paper, we propose a new approach called Attention-guided Contrastive Masked Image Modeling (ACoMIM), which integrates the merits of both paradigms and leverages the attention mechanism of ViT for effective representation learning. Specifically, it has two pretext tasks: predicting the features of masked regions under the guidance of attention, and contrasting the global features of masked and unmasked images. We show that these two pretext tasks complement each other and improve our method's performance. Experiments demonstrate that our model transfers well to various downstream tasks such as classification and object detection. Code is available at https://github.com/yczhan/ACoMIM.
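
To make the two pretext tasks concrete, below is a minimal PyTorch sketch of how attention-guided masking and the combined objective could look. This is an illustrative reading of the abstract, not the authors' released code: the choice to mask the most-attended patches, the smooth-L1 feature-prediction loss, the InfoNCE-style contrastive loss, and all names and hyperparameters (attention_guided_mask, acomim_losses, mask_ratio, tau) are assumptions; see the linked repository for the actual implementation.

```python
# Illustrative sketch of ACoMIM's two pretext tasks (assumptions, not the
# authors' implementation). Assumes a ViT backbone elsewhere produces:
#   cls_attn:      (B, N) [CLS]-to-patch attention scores
#   student_feats: (B, N, D) patch features predicted from the masked view
#   teacher_feats: (B, N, D) target patch features from the full view
#   z_masked, z_full: (B, D) global [CLS] embeddings of the two views
import torch
import torch.nn.functional as F

def attention_guided_mask(cls_attn, mask_ratio=0.5):
    """Select patches to mask using [CLS]-token attention scores.

    Masking the most-attended patches is an assumed guidance strategy;
    the abstract only states that masking is attention-guided.
    Returns a boolean mask of shape (B, N), True = masked patch.
    """
    B, N = cls_attn.shape
    num_mask = int(N * mask_ratio)
    idx = cls_attn.argsort(dim=1, descending=True)[:, :num_mask]
    mask = torch.zeros(B, N, dtype=torch.bool, device=cls_attn.device)
    mask.scatter_(1, idx, True)
    return mask

def acomim_losses(student_feats, teacher_feats, mask, z_masked, z_full, tau=0.2):
    """Combine the two pretext losses described in the abstract.

    1) Masked feature prediction: regress target patch features at the
       masked positions (smooth-L1 is an illustrative choice).
    2) Global contrast between the masked and unmasked views' [CLS]
       embeddings (InfoNCE over the batch).
    """
    # (1) feature prediction only at masked positions
    pred_loss = F.smooth_l1_loss(student_feats[mask], teacher_feats[mask])

    # (2) contrastive alignment of global features
    z1 = F.normalize(z_masked, dim=1)   # (B, D) from the masked image
    z2 = F.normalize(z_full, dim=1)     # (B, D) from the unmasked image
    logits = z1 @ z2.t() / tau          # (B, B) pairwise similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    con_loss = F.cross_entropy(logits, targets)  # diagonal = positives
    return pred_loss + con_loss
```

In such a setup, teacher_feats and z_full would typically come from a momentum (EMA) encoder over the unmasked image, in the style of BYOL/DINO; that design choice is likewise an assumption here.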
Keywords
Self-Supervised Learning, Vision Transformer, Masked Image Modeling