A Text-Image Pair Is Not Enough: Language-Vision Relation Inference with Auxiliary Modality Translation

NLPCC (2) (2023)

Abstract
The semantic relations between the language and vision modalities have become increasingly important, since they can effectively facilitate downstream multi-modal tasks. Although several approaches have been proposed for language-vision relation inference (LVRI), they typically rely on the limited information in the posted text-image pair. In this paper, to broaden the information available beyond the original input, we introduce the concept of modality translation and propose an auxiliary modality translation framework (AMT) for LVRI. Specifically, the original text-image pair and the text pair (original and generated) are passed into two separate multi-layer bidirectional transformer structures. The resulting linguistic and visual hybrid features are then fed into a feature fusion module followed by a classifier. Systematic experiments and extensive analysis demonstrate the effectiveness of our approach with auxiliary modality translation.
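For readers who want a concrete picture of the two-branch design the abstract describes, the sketch below shows one plausible reading in PyTorch: one bidirectional transformer encoder over the original text-image pair, a second over the text pair, then concatenation-based fusion and a classifier. All module names, dimensions, the mean-pool/concat fusion, and the number of relation classes are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AMTSketch(nn.Module):
    """Minimal sketch of the AMT idea described in the abstract.

    Branch 1 encodes the original text together with image features;
    branch 2 encodes the original text together with the generated text
    (the "modality translation"). Hyperparameters are assumptions.
    """

    def __init__(self, d_model=768, n_heads=8, n_layers=4, n_classes=4):
        super().__init__()

        def make_encoder():
            # Multi-layer bidirectional (fully self-attending) transformer.
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=n_layers)

        self.pair_encoder = make_encoder()  # original text + image features
        self.text_encoder = make_encoder()  # original text + generated text
        self.fuse = nn.Linear(2 * d_model, d_model)  # assumed concat fusion
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, text_emb, image_emb, generated_emb):
        # Concatenate along the sequence axis so each encoder attends
        # bidirectionally across both of its inputs.
        h_pair = self.pair_encoder(torch.cat([text_emb, image_emb], dim=1))
        h_text = self.text_encoder(torch.cat([text_emb, generated_emb], dim=1))
        # Mean-pool each branch, fuse, and classify the relation.
        pooled = torch.cat([h_pair.mean(dim=1), h_text.mean(dim=1)], dim=-1)
        return self.classifier(torch.relu(self.fuse(pooled)))

# Toy usage with random embeddings (batch of 2; sequence lengths 16/36/20).
model = AMTSketch()
logits = model(torch.randn(2, 16, 768),   # text tokens
               torch.randn(2, 36, 768),   # image region features
               torch.randn(2, 20, 768))   # generated-text tokens
print(logits.shape)  # torch.Size([2, 4])
```

In practice the three inputs would come from pretrained text and vision embedders, with the generated text produced by an image-to-text model; the random tensors here only exercise the shapes.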
Keywords
relation, text-image, language-vision