Relation Enhanced Vision Language Pre-Training

Ju-Hee Lee, Je-Won Kang

ICIP (2022)

Abstract
In this paper, we propose a relation-enhanced vision-language pre-training (VLP) method for a transformer model (TM) to improve performance in vision-language (V+L) tasks. Current VLP studies attempt to generate a multimodal representation with individual objects as input and rely on self-attention to learn semantic representations in a brute-force manner. However, the relations among objects in an image are largely ignored. To address this problem, we generate a paired visual feature (PVF) that is organized to express the relations between objects. Prior knowledge reflecting the co-occurrence of paired objects and a pair-wise distance matrix adjust the relations, and a triplet is used for sentence embedding. Experimental results demonstrate that the proposed method is efficiently used for VLP by bridging relations between objects, and thus improves performance on V+L downstream tasks.
Keywords
vision-language pre-training
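
To illustrate the idea described in the abstract, below is a minimal sketch (not the authors' code) of how a paired visual feature could be built from detected object regions, with a co-occurrence prior and a pair-wise distance matrix modulating the relation weights. The tensor shapes, the concatenation scheme, and the exact way the prior and distances combine are assumptions for illustration only.

```python
# Hypothetical sketch of a paired visual feature (PVF) module.
import torch
import torch.nn as nn

class PairedVisualFeature(nn.Module):
    def __init__(self, obj_dim=2048, out_dim=768, num_classes=1600):
        super().__init__()
        # Projects a concatenated object pair into the transformer embedding space.
        self.pair_proj = nn.Linear(2 * obj_dim, out_dim)
        # Class-pair co-occurrence counts (assumed precomputed from the training data).
        self.register_buffer("cooccur", torch.ones(num_classes, num_classes))

    def forward(self, feats, boxes, labels):
        """
        feats:  (N, obj_dim) region features from an object detector
        boxes:  (N, 4) normalized [x, y, w, h] per region
        labels: (N,) predicted class index per region
        returns (N*N, out_dim) paired features and an (N, N) relation-weight matrix
        """
        n = feats.size(0)
        # All ordered object pairs (i, j), concatenated along the feature axis.
        fi = feats.unsqueeze(1).expand(n, n, -1)
        fj = feats.unsqueeze(0).expand(n, n, -1)
        pvf = self.pair_proj(torch.cat([fi, fj], dim=-1))   # (N, N, out_dim)

        # Pair-wise distance matrix between box centers; nearer pairs weigh more.
        centers = boxes[:, :2]
        dist_w = torch.exp(-torch.cdist(centers, centers))  # (N, N)

        # Co-occurrence prior for each class pair (i, j), normalized per row.
        prior = self.cooccur[labels][:, labels]              # (N, N)
        prior = prior / prior.sum(dim=-1, keepdim=True)

        weight = dist_w * prior                               # adjusted relation weights
        return pvf.reshape(n * n, -1), weight

# Usage with random stand-in detector outputs.
m = PairedVisualFeature()
feats, boxes = torch.randn(5, 2048), torch.rand(5, 4)
labels = torch.randint(0, 1600, (5,))
pairs, rel = m(feats, boxes, labels)
print(pairs.shape, rel.shape)  # torch.Size([25, 768]) torch.Size([5, 5])
```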