CASTing Your Model: Learning to Localize Improves Self-Supervised Representations

2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021)

Abstract
Recent advances in self-supervised learning (SSL) have largely closed the gap with supervised ImageNet pretraining. Despite their success, these methods have been primarily applied to unlabeled ImageNet images, and show marginal gains when trained on larger sets of uncurated images. We hypothesize that current SSL methods perform best on iconic images, and struggle on complex scene images with many objects. Analyzing contrastive SSL methods shows that they have poor visual grounding and receive a poor supervisory signal when trained on scene images. We propose Contrastive Attention-Supervised Tuning (CAST) to overcome these limitations. CAST uses unsupervised saliency maps to intelligently sample crops, and to provide grounding supervision via a Grad-CAM attention loss. Experiments on COCO show that CAST significantly improves the features learned by SSL methods on scene images, and further experiments show that CAST-trained models are more robust to changes in backgrounds. Our code is available at https://github.com/salesforce/CAST/.
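The two ingredients named in the abstract, saliency-guided crop sampling and an attention loss against the saliency map, can be illustrated with a minimal NumPy sketch. The abstract does not give the exact sampling rule or loss form, so everything here is an assumption made for illustration: the function names `sample_salient_crop` and `attention_loss`, the `min_coverage` threshold, the rejection-sampling loop, and the choice of cosine distance. It is not the authors' implementation, which is in the repository linked above.

```python
import numpy as np


def sample_salient_crop(saliency, crop_hw, min_coverage=0.2, max_tries=100, rng=None):
    """Rejection-sample a crop that covers enough of the salient region.

    ASSUMPTION: the acceptance rule (fraction of total saliency mass inside
    the crop >= min_coverage) is illustrative, not taken from the paper.

    saliency: 2-D array of non-negative saliency values.
    Returns ((top, left, height, width), coverage) for the accepted crop,
    or for the best crop seen if no candidate met the threshold.
    """
    rng = np.random.default_rng() if rng is None else rng
    img_h, img_w = saliency.shape
    h, w = crop_hw
    total = saliency.sum()
    best, best_cov = None, -1.0
    for _ in range(max_tries):
        top = int(rng.integers(0, img_h - h + 1))
        left = int(rng.integers(0, img_w - w + 1))
        if total > 0:
            cov = float(saliency[top:top + h, left:left + w].sum() / total)
        else:
            cov = 1.0  # no salient pixels at all: accept any crop
        if cov > best_cov:
            best, best_cov = (top, left, h, w), cov
        if cov >= min_coverage:
            break
    return best, best_cov


def attention_loss(attention, saliency, eps=1e-8):
    """Cosine distance between a model attention map and a saliency map.

    ASSUMPTION: cosine distance is used here as one plausible way to
    penalize attention (e.g. a Grad-CAM map) that falls off the salient
    region. Returns ~0 when the maps align and 1 when they do not overlap.
    """
    a = np.asarray(attention, dtype=float).ravel()
    s = np.asarray(saliency, dtype=float).ravel()
    cos = float(a @ s) / (np.linalg.norm(a) * np.linalg.norm(s) + eps)
    return 1.0 - cos


# Toy scene: one 8x8 salient object in a 64x64 image.
saliency = np.zeros((64, 64))
saliency[28:36, 28:36] = 1.0

crop, coverage = sample_salient_crop(
    saliency, crop_hw=(32, 32), min_coverage=0.5, rng=np.random.default_rng(0)
)

# Attention on the background only is penalized; attention on the object is not.
background_attention = np.zeros((64, 64))
background_attention[:8, :8] = 1.0
loss_aligned = attention_loss(saliency, saliency)
loss_misaligned = attention_loss(background_attention, saliency)
```

In this toy setup a crop is kept only if it contains at least half of the object's saliency mass, and the loss separates attention that lands on the object from attention that lands on the background. In a real training loop the attention map would come from the network rather than being constructed by hand.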
Keywords
contrastive SSL methods, poor visual grounding, poor supervisory signal, Contrastive Attention-Supervised Tuning, unsupervised saliency maps, grounding supervision, Grad-CAM attention loss, CAST-trained models, improves self-supervised representations, self-supervised learning, supervised ImageNet pretraining, success these methods, unlabeled ImageNet images, marginal gains, uncurated images, current SSL methods, iconic images, complex scene images