Modality Eigen-Encodings Are Keys to Open Modality Informative Containers

International Multimedia Conference (2022)

Abstract
Vision-Language fusion relies heavily on precise cross-modal information synergy. Nevertheless, modality divergence makes it extremely difficult for one modality to describe the other. Despite various attempts to tap into the semantic unity of vision and language, most existing approaches treat high-dimensional modality-specific feature tensors as the smallest unit of information, which limits fine-grained multi-modal fusion. Furthermore, in previous works, cross-modal interaction is commonly modeled as the similarity between semantically insufficient global features. In contrast, we propose a novel scheme for multi-modal fusion named Vision Language Interaction (VLI). To represent modality information in a more fine-grained and flexible way, we regard high-dimensional features as containers of modality-specific information, while the homogeneous semantic information shared between heterogeneous modalities is the key stored in those containers. We first construct information containers via multi-scale alignment, then use modality eigen-encodings to extract the homogeneous semantics at the vector level. Finally, we iteratively embed the eigen-encodings of one modality into the eigen-encodings of the other modality to perform cross-modal semantic interaction. After this embedding interaction, vision and language information can break through the existing representation bottleneck at a level of granularity not achieved in previous work. Extensive experimental results on vision-language tasks validate the effectiveness of VLI. On the three benchmarks of Referring Expression Comprehension (REC), Referring Expression Segmentation (RES), and Visual Question Answering (VQA), VLI significantly outperforms existing state-of-the-art methods.
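The abstract does not give implementation details, but the interaction it describes (projecting each modality's feature "container" into a shared eigen-encoding space, then iteratively embedding one modality's encodings into the other's) can be illustrated with a minimal PyTorch-style sketch. Everything here is an assumption for illustration: the class names (`EigenEncoder`, `VLIInteraction`), the use of cross-attention as the embedding mechanism, the dimensions, and the iteration count are all hypothetical and not taken from the paper.

```python
# Hypothetical sketch of the VLI interaction idea described in the abstract.
# All names and design choices below are illustrative assumptions.
import torch
import torch.nn as nn


class EigenEncoder(nn.Module):
    """Projects a modality-specific feature 'container' into a shared
    eigen-encoding space where homogeneous semantics can be compared."""
    def __init__(self, in_dim: int, eigen_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, eigen_dim)
        self.norm = nn.LayerNorm(eigen_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, in_dim) -> (batch, tokens, eigen_dim)
        return self.norm(self.proj(x))


class VLIInteraction(nn.Module):
    """Iteratively embeds the eigen-encodings of one modality into the
    other via cross-attention, as a stand-in for the paper's
    'embeddings interaction' step."""
    def __init__(self, vis_dim: int, lang_dim: int, eigen_dim: int = 256,
                 num_iters: int = 3, num_heads: int = 8):
        super().__init__()
        self.vis_enc = EigenEncoder(vis_dim, eigen_dim)
        self.lang_enc = EigenEncoder(lang_dim, eigen_dim)
        self.v2l = nn.MultiheadAttention(eigen_dim, num_heads, batch_first=True)
        self.l2v = nn.MultiheadAttention(eigen_dim, num_heads, batch_first=True)
        self.num_iters = num_iters

    def forward(self, vis_feats: torch.Tensor, lang_feats: torch.Tensor):
        # vis_feats: (B, Nv, vis_dim), lang_feats: (B, Nl, lang_dim)
        v = self.vis_enc(vis_feats)
        l = self.lang_enc(lang_feats)
        for _ in range(self.num_iters):
            # Embed language semantics into vision, then vision into language.
            v = v + self.l2v(query=v, key=l, value=l)[0]
            l = l + self.v2l(query=l, key=v, value=v)[0]
        return v, l


if __name__ == "__main__":
    model = VLIInteraction(vis_dim=1024, lang_dim=768)
    vis = torch.randn(2, 49, 1024)   # e.g. a 7x7 grid of visual features
    txt = torch.randn(2, 20, 768)    # e.g. 20 word embeddings
    v_out, l_out = model(vis, txt)
    print(v_out.shape, l_out.shape)  # (2, 49, 256) and (2, 20, 256)
```

The residual cross-attention updates are only one plausible reading of "iteratively embed the eigen-encodings of one modality into the other"; the paper may realize this interaction differently.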