Relation Enhanced Vision Language Pre-Training

Ju-Hee Lee, Je-Won Kang

ICIP (2022)

Abstract
In this paper, we propose a relation-enhanced vision-language pre-training (VLP) method for a transformer model (TM) to improve performance in vision-language (V+L) tasks. Current VLP studies attempt to generate a multimodal representation with individual objects as input and rely on self-attention to learn semantic representations in a brute-force manner. However, the relations among objects in an image are largely ignored. To address this problem, we generate a paired visual feature (PVF) that is organized to express the relations between objects. Prior knowledge reflecting the co-occurrence of paired objects and a pair-wise distance matrix adjust the relations, and a triplet is used for sentence embedding. Experimental results demonstrate that the proposed method is efficiently used for VLP by bridging relations between objects, and thus improves performance on V+L downstream tasks.
Keywords
vision-language pre-training
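
To illustrate the idea described in the abstract, below is a minimal sketch (not the authors' code) of how a paired visual feature could be built from detected object regions, with a co-occurrence prior and a pair-wise distance matrix modulating the relation weights. The tensor shapes, the concatenation scheme, and the exact way the prior and distances combine are assumptions for illustration only.

```python
# Hypothetical sketch of a paired visual feature (PVF) module.
import torch
import torch.nn as nn

class PairedVisualFeature(nn.Module):
    def __init__(self, obj_dim=2048, out_dim=768, num_classes=1600):
        super().__init__()
        # Projects a concatenated object pair into the transformer embedding space.
        self.pair_proj = nn.Linear(2 * obj_dim, out_dim)
        # Class-pair co-occurrence counts (assumed precomputed from the training data).
        self.register_buffer("cooccur", torch.ones(num_classes, num_classes))

    def forward(self, feats, boxes, labels):
        """
        feats:  (N, obj_dim) region features from an object detector
        boxes:  (N, 4) normalized [x, y, w, h] per region
        labels: (N,) predicted class index per region
        returns (N*N, out_dim) paired features and an (N, N) relation-weight matrix
        """
        n = feats.size(0)
        # All ordered object pairs (i, j), concatenated along the feature axis.
        fi = feats.unsqueeze(1).expand(n, n, -1)
        fj = feats.unsqueeze(0).expand(n, n, -1)
        pvf = self.pair_proj(torch.cat([fi, fj], dim=-1))   # (N, N, out_dim)

        # Pair-wise distance matrix between box centers; nearer pairs weigh more.
        centers = boxes[:, :2]
        dist_w = torch.exp(-torch.cdist(centers, centers))  # (N, N)

        # Co-occurrence prior for each class pair (i, j), normalized per row.
        prior = self.cooccur[labels][:, labels]              # (N, N)
        prior = prior / prior.sum(dim=-1, keepdim=True)

        weight = dist_w * prior                               # adjusted relation weights
        return pvf.reshape(n * n, -1), weight

# Usage with random stand-in detector outputs.
m = PairedVisualFeature()
feats, boxes = torch.randn(5, 2048), torch.rand(5, 4)
labels = torch.randint(0, 1600, (5,))
pairs, rel = m(feats, boxes, labels)
print(pairs.shape, rel.shape)  # torch.Size([25, 768]) torch.Size([5, 5])
```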