Improving weakly supervised phrase grounding via visual representation contextualization with contrastive learning

Applied Intelligence (2022)

Abstract
Weakly supervised phrase grounding aims to map the phrases in an image caption to the objects appearing in the image, using only image-caption correspondence as supervision. We observe that existing studies do not adequately model the complicated interactions among the visual components (i.e., the visual regions) and between the visual and textual components (i.e., the phrases). Therefore, this paper presents a novel weakly supervised learning approach to phrase grounding in which we systematically model the contextualized visual representation with three modules: (1) object proposals pooling (OPP), (2) visual self-attention (VSA) and (3) visual-textual cross-modal attention (VTCA). OPP alleviates the suppression of the object proposals and benefits the visual representation by trading off the richness of the visual components against computational efficiency. VSA captures the correlation among the object proposals and generates a representation of each proposal by incorporating the visual information of the others. To measure the cross-modal compatibility in terms of topics, we introduce the VTCA module, which represents the visual topic corresponding to each textual component in a cross-modal common vector space. For training, we build a mixed contrastive loss function that considers both the cross-modal compatibility and the differences among the visual representations in the VSA module. Compared with the state-of-the-art methods, the proposed approach improves R@1 by 3.88 and 1.24 percentage points, and Pt_Acc by 2.23 and 0.26 percentage points, when trained on the MS COCO and Flickr30K Entities training sets, respectively. We have made our code available for follow-up research.
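The abstract gives no equations, so the following is only a minimal NumPy sketch of how self-attention over proposals, phrase-to-region cross-modal attention, and a contrastive objective over image-caption pairs could fit together. It is not the authors' implementation. All function names, the residual connection, the cosine compatibility score, the InfoNCE form of the loss, and the temperature value are assumptions made for illustration. OPP and the second term of the mixed loss (the one on differences among visual representations) are not modeled here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

def visual_self_attention(regions):
    # Assumed form: scaled dot-product self-attention with a residual
    # connection, so each proposal absorbs information from the others.
    d = regions.shape[-1]
    weights = softmax(regions @ regions.T / np.sqrt(d))
    return regions + weights @ regions

def cross_modal_attention(phrases, regions):
    # Assumed form: each phrase attends over the regions; the weighted sum
    # plays the role of the "visual topic" for that phrase.
    d = regions.shape[-1]
    weights = softmax(phrases @ regions.T / np.sqrt(d))  # (num_phrases, num_regions)
    return weights @ regions, weights

def compatibility(phrases, regions):
    # Mean cosine similarity between each phrase and its visual topic.
    context = visual_self_attention(regions)
    topics, _ = cross_modal_attention(phrases, context)
    return float(np.mean(np.sum(l2_normalize(phrases) * l2_normalize(topics), axis=-1)))

def contrastive_loss(phrase_sets, region_sets, temperature=0.1):
    # InfoNCE-style loss over a batch: caption i should be most compatible
    # with image i (the diagonal of the score matrix).
    n = len(phrase_sets)
    scores = np.array([[compatibility(phrase_sets[i], region_sets[j])
                        for j in range(n)] for i in range(n)])
    log_probs = np.log(softmax(scores / temperature, axis=1))
    return float(-np.mean(np.diag(log_probs)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two toy image-caption pairs; phrases and regions are assumed to be
    # already projected into a common 8-dimensional space.
    region_sets = [rng.normal(size=(5, 8)), rng.normal(size=(4, 8))]
    phrase_sets = [rng.normal(size=(3, 8)), rng.normal(size=(2, 8))]
    print("loss:", contrastive_loss(phrase_sets, region_sets))
```

Under these assumptions, a phrase could be grounded at inference time by taking the region with the largest attention weight in the matrix returned by `cross_modal_attention`.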
Keywords
Visual representation, Phrase grounding, Contrastive learning, Weakly supervised learning