Referring Expression Comprehension Based on Cross Modal Feature Fusion and Iterative Reasoning.

Image and Graphics: 12th International Conference, ICIG 2023, Nanjing, China, September 22–24, 2023, Proceedings, Part IV (2023)

Abstract
Referring Expression Comprehension is a multimodal task that spans two fields: Computer Vision and Natural Language Processing. Given an image and a natural language expression, the task is to locate the image region that corresponds to the description. This paper addresses two shortcomings of current approaches: visual and textual features are not effectively fused in the multimodal alignment stage, and visual and textual information is not effectively utilized in the prediction stage. Two improvements are proposed: multimodal feature fusion and iterative reasoning based on a multimodal attention mechanism. In the multimodal feature fusion stage, three feature fusion modules fuse visual and textual features from different perspectives to obtain rich visual and textual information; in the iterative reasoning stage, visual and textual features are accessed several times to gradually refine the predicted target region. To verify the performance of the proposed method, extensive experiments were conducted on three public datasets.
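The abstract does not give implementation details. The following is a minimal PyTorch sketch of the general idea only: text-conditioned attention fusion followed by an iterative loop that re-accesses both modalities to refine a box prediction. All module names, dimensions, the number of reasoning steps, and the single-attention fusion block are assumptions for illustration, not the authors' architecture.

import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Illustrative fusion block: visual tokens attend to textual tokens."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual, textual):
        # Query = visual features, key/value = textual features (cross-modal attention).
        fused, _ = self.attn(query=visual, key=textual, value=textual)
        return self.norm(visual + fused)

class IterativeGrounder(nn.Module):
    """Illustrative iterative reasoning head: refines the predicted region over several steps."""
    def __init__(self, dim=256, steps=3):
        super().__init__()
        self.steps = steps
        self.fusion = CrossModalFusion(dim)
        self.box_head = nn.Linear(dim, 4)  # predicts (cx, cy, w, h), normalized by sigmoid

    def forward(self, visual, textual):
        boxes = []
        for _ in range(self.steps):
            # Re-access visual and textual features at every step, as the abstract describes.
            visual = self.fusion(visual, textual)
            boxes.append(self.box_head(visual.mean(dim=1)).sigmoid())
        return boxes  # intermediate predictions; the last one is the final target region

# Toy usage with random features (batch=2, 196 visual tokens, 20 text tokens, dim=256)
vis = torch.randn(2, 196, 256)
txt = torch.randn(2, 20, 256)
preds = IterativeGrounder()(vis, txt)
print(preds[-1].shape)  # torch.Size([2, 4])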
Keywords
expression comprehension, cross modal feature fusion