A Text-Image Pair Is Not Enough: Language-Vision Relation Inference with Auxiliary Modality Translation

NLPCC (2) (2023)

Abstract
The semantic relations between the language and vision modalities have become increasingly important, since they can effectively facilitate downstream multi-modal tasks. Although several approaches have been proposed for language-vision relation inference (LVRI), they typically rely on the limited information in the posted text-image pair. In this paper, to broaden the information available beyond the original input, we introduce the concept of modality translation and propose an auxiliary modality translation framework (AMT) for LVRI. Specifically, the original text-image pair and the text pair (original and generated) are passed into two separate multi-layer bidirectional transformer structures. The resulting linguistic and visual hybrid features are then fed into a feature fusion module followed by a classifier. Systematic experiments and extensive analysis demonstrate the effectiveness of our approach with auxiliary modality translation.
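For readers who want a concrete picture of the two-branch design the abstract describes, the sketch below shows one plausible reading in PyTorch: one bidirectional transformer encoder over the original text-image pair, a second over the text pair, then concatenation-based fusion and a classifier. All module names, dimensions, the mean-pool/concat fusion, and the number of relation classes are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AMTSketch(nn.Module):
    """Minimal sketch of the AMT idea described in the abstract.

    Branch 1 encodes the original text together with image features;
    branch 2 encodes the original text together with the generated text
    (the "modality translation"). Hyperparameters are assumptions.
    """

    def __init__(self, d_model=768, n_heads=8, n_layers=4, n_classes=4):
        super().__init__()

        def make_encoder():
            # Multi-layer bidirectional (fully self-attending) transformer.
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=n_layers)

        self.pair_encoder = make_encoder()  # original text + image features
        self.text_encoder = make_encoder()  # original text + generated text
        self.fuse = nn.Linear(2 * d_model, d_model)  # assumed concat fusion
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, text_emb, image_emb, generated_emb):
        # Concatenate along the sequence axis so each encoder attends
        # bidirectionally across both of its inputs.
        h_pair = self.pair_encoder(torch.cat([text_emb, image_emb], dim=1))
        h_text = self.text_encoder(torch.cat([text_emb, generated_emb], dim=1))
        # Mean-pool each branch, fuse, and classify the relation.
        pooled = torch.cat([h_pair.mean(dim=1), h_text.mean(dim=1)], dim=-1)
        return self.classifier(torch.relu(self.fuse(pooled)))

# Toy usage with random embeddings (batch of 2; sequence lengths 16/36/20).
model = AMTSketch()
logits = model(torch.randn(2, 16, 768),   # text tokens
               torch.randn(2, 36, 768),   # image region features
               torch.randn(2, 20, 768))   # generated-text tokens
print(logits.shape)  # torch.Size([2, 4])
```

In practice the three inputs would come from pretrained text and vision embedders, with the generated text produced by an image-to-text model; the random tensors here only exercise the shapes.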
Keywords
relation, text-image, language-vision