Object Prior Embedded Network for Query-Agnostic Image Retrieval

IEEE Conference on Computer Vision and Pattern Recognition (2022)

Abstract
Text-to-image retrieval plays an important role in bridging the gap between the vision and language modalities. The task is challenging and far from solved because of the large visual-semantic discrepancy between language and vision. Recent studies on vision-language contrastive learning have shown that it can effectively learn good representations from massive collections of image-text pairs. However, most existing methods simply concatenate image and text features as input and rely on a deep network to learn the visual-semantic relationship between image and text in a brute-force manner. Because explicit alignment information is missing, this becomes a challenging weakly supervised learning task, which limits the accuracy of previous methods. Motivated by the observation that the salient objects in an image can be accurately detected and are often mentioned in the paired text, we propose a novel cross-attention transformer that uses the objects detected in an image as anchor points and priors to significantly ease the learning of image-text alignments, and thus boost text-to-image search accuracy. In addition, unlike the query-dependent architectures adopted by most previous methods, the proposed method is query-agnostic and is therefore significantly faster at inference. Extensive experiments on the Flickr30K and MSCOCO captions datasets demonstrate that the proposed method can outperform the state-of-the-art (SOTA) method while preserving inference efficiency.
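The efficiency argument for a query-agnostic design can be illustrated with a minimal sketch. It assumes that image embeddings are computed once, offline, independently of any query, so that a search only needs to encode the text and compare it against a stored index. The function names (`build_index`, `search`) and the three-dimensional toy vectors are hypothetical and stand in for real encoder outputs. The sketch does not implement the paper's object-prior cross-attention transformer, only the retrieval pattern that a query-agnostic image representation makes possible.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def build_index(image_embeddings):
    """Offline step: normalize each image embedding once, with no query involved."""
    return {image_id: l2_normalize(vec) for image_id, vec in image_embeddings.items()}


def search(index, text_embedding, top_k=3):
    """Online step: score one text embedding against every stored image embedding."""
    query = l2_normalize(text_embedding)
    scored = [
        (sum(q * v for q, v in zip(query, vec)), image_id)
        for image_id, vec in index.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(image_id, score) for score, image_id in scored[:top_k]]


# Toy embeddings (assumed values, not produced by any real model).
toy_images = {
    "dog.jpg": [0.9, 0.1, 0.0],
    "car.jpg": [0.0, 0.2, 0.9],
    "park.jpg": [0.6, 0.6, 0.1],
}
index = build_index(toy_images)
results = search(index, [1.0, 0.2, 0.0], top_k=2)
for image_id, score in results:
    print(image_id, round(score, 3))
```

In a query-dependent architecture, by contrast, each image would have to be passed through the network again together with every new query, so the per-query cost grows with the cost of the full model rather than with a single dot product per image.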
Keywords
paired text, image-text alignments, text-to-image search accuracy, query-dependent architectures, SOTA method, object prior embedded network, query-agnostic image retrieval, text-to-image retrieval task, language modalities, visual-semantic discrepancy, vision-language contrastive learning, good representations, massive image-text pairs, text features, deep network, visual-semantic relationship, brute force manner, insufficient alignments information, weakly-supervised learning task, salient objects