Cross-Modal Metric Learning And Local Attention For Referring Relationships In Images

DEVELOPMENTS OF ARTIFICIAL INTELLIGENCE TECHNOLOGIES IN COMPUTATION AND ROBOTICS (2020)

Abstract
The task of referring relationships in images aims to locate the entities (subject and object) described by a relationship triple <subject - relationship - object>, and can be viewed as a retrieval problem between structured text and images. However, existing works extract features of the input text and image separately, and therefore capture the correlations between the two modalities insufficiently. Moreover, the attention mechanisms used in cross-modal retrieval tasks do not consider local correlation in images. To address these issues, a cross-modal similarity attention network is proposed in this work, consisting of a cross-modal metric learning module and a cross-modal local attention module. The cross-modal metric learning module adaptively models the similarity between the query text and the input image, and refines the image features to obtain cross-modal features. The cross-modal local attention module then concentrates on the query entity within the cross-modal features, attending over both image channels and spatial local regions. Experiments on two challenging benchmark datasets, Visual Genome and VRD, demonstrate the superiority of the proposed approach over current strong frameworks.
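The abstract describes two components: a metric learning module that scores the similarity between the query text and each image location to refine image features, and a local attention module that attends over channels and spatial regions. The paper's implementation is not given here, so the following PyTorch sketch is only an illustrative assumption: all module names, layer choices, and feature dimensions (`txt_dim`, `img_dim`, `emb_dim`) are hypothetical.

```python
# Hypothetical sketch of the two modules described in the abstract.
# Layer choices and dimensions are assumptions, not the authors' design.
import torch
import torch.nn as nn

class CrossModalMetric(nn.Module):
    """Projects text and image features into a shared embedding space and
    refines image features by their similarity to the query text."""
    def __init__(self, txt_dim, img_dim, emb_dim):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, emb_dim)
        self.img_proj = nn.Conv2d(img_dim, emb_dim, kernel_size=1)

    def forward(self, txt, img):
        # txt: (B, txt_dim), one embedding per query triple
        # img: (B, img_dim, H, W), a CNN feature map
        t = self.txt_proj(txt)                      # (B, emb_dim)
        v = self.img_proj(img)                      # (B, emb_dim, H, W)
        # cosine similarity between the text and every spatial location
        sim = torch.einsum('bc,bchw->bhw',
                           nn.functional.normalize(t, dim=1),
                           nn.functional.normalize(v, dim=1))
        # refine image features with the learned similarity map
        return v * sim.unsqueeze(1), sim

class CrossModalLocalAttention(nn.Module):
    """Channel attention followed by spatial attention, producing a
    localization heatmap for the query entity."""
    def __init__(self, emb_dim):
        super().__init__()
        self.channel = nn.Sequential(nn.Linear(emb_dim, emb_dim), nn.Sigmoid())
        self.spatial = nn.Conv2d(emb_dim, 1, kernel_size=3, padding=1)

    def forward(self, feat):
        # feat: (B, emb_dim, H, W), the refined cross-modal features
        c = self.channel(feat.mean(dim=(2, 3)))     # (B, emb_dim) channel gates
        feat = feat * c[:, :, None, None]           # re-weight channels
        return torch.sigmoid(self.spatial(feat))    # (B, 1, H, W) heatmap
```

In this reading, the metric module's similarity map ties the two modalities together before attention is applied, which is how the sketch realizes the abstract's claim of modeling cross-modal correlation jointly rather than extracting features separately.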
Keywords
Referring relationships, Cross modal, Metric learning, Attention