Unified Contrastive Learning in Image-Text-Label Space

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022

Abstract
Visual recognition has recently been learned via either supervised learning on human-annotated image-label data or language-image contrastive learning on web-crawled image-text pairs. While supervised learning may result in a more discriminative representation, language-image pretraining shows unprecedented zero-shot recognition capability, largely due to the different properties of the data sources and learning objectives. In this work, we introduce a new formulation that combines the two data sources into a common image-text-label space. In this space, we propose a new learning paradigm, called Unified Contrastive Learning (UniCL), with a single learning objective that seamlessly exploits the synergy of the two data types. Extensive experiments show that UniCL is an effective way of learning semantically rich yet discriminative representations, universally for image recognition in zero-shot, linear-probing, full-finetuning, and transfer-learning scenarios. In particular, it attains average gains of up to 9.2% and 14.5% on zero-shot recognition benchmarks over language-image contrastive learning and supervised learning methods, respectively. In the linear-probe setting, it also boosts performance over the two methods by 7.3% and 3.4%, respectively. Our study further indicates that UniCL on its own is a good learner on pure image-label data, rivaling supervised learning methods across three image classification datasets and two types of vision backbones, ResNet and Swin Transformer. Code is available at: https://github.com/microsoft/UniCL.
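From the abstract's description, the single UniCL objective can be read as a bidirectional image-text contrastive loss in which two batch entries count as positives whenever their labels match, with each web-crawled image-text pair assigned its own unique label so only the matched pair is positive. Below is a minimal PyTorch sketch under those assumptions; the function name `unicl_loss`, the default temperature, and the soft-target construction are illustrative choices, not taken from the released code.

```python
import torch
import torch.nn.functional as F

def unicl_loss(u: torch.Tensor, v: torch.Tensor, y: torch.Tensor,
               temperature: float = 0.07) -> torch.Tensor:
    """Hypothetical sketch of a unified image-text-label contrastive loss.

    u: L2-normalized image features, shape [B, D]
    v: L2-normalized text features,  shape [B, D]
    y: integer labels, shape [B]; image-text pairs from the web would
       each carry a unique label, image-label data uses class indices.
    """
    # Pairwise similarities between all images and all texts in the batch.
    logits = u @ v.t() / temperature                      # [B, B]

    # Two entries are positives iff their labels match, so images of the
    # same class treat each other's texts (e.g. prompted class names) as
    # valid targets; rows are normalized into a soft target distribution.
    pos = (y.unsqueeze(0) == y.unsqueeze(1)).float()      # [B, B]
    targets = pos / pos.sum(dim=1, keepdim=True)

    # Image-to-text direction: cross-entropy against the soft targets.
    i2t = -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    # Text-to-image direction: same construction on the transpose.
    t2i = -(targets.t() * F.log_softmax(logits.t(), dim=1)).sum(dim=1).mean()
    return i2t + t2i
```

With all labels distinct, this reduces to the standard pairwise language-image contrastive loss; with shared labels it recovers a supervised, multi-positive objective, which is how a single loss can serve both data sources.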
Keywords
Representation learning