Towards Language-Guided Visual Recognition via Dynamic Convolutions

INTERNATIONAL JOURNAL OF COMPUTER VISION (2023)

Abstract
In this paper, we aim to establish a unified, end-to-end multi-modal network by exploring language-guided visual recognition. To this end, we first propose a novel multi-modal convolution module called Language-guided Dynamic Convolution (LaConv). Its convolution kernels are dynamically generated from natural language information, which helps extract differentiated visual features for different multi-modal examples. Based on the LaConv module, we further build a fully language-driven convolution network, termed LaConvNet, which unifies visual recognition and multi-modal reasoning in a single forward structure. To validate LaConv and LaConvNet, we conduct extensive experiments on seven benchmark datasets across three vision-and-language tasks, i.e., visual question answering, referring expression comprehension, and referring expression segmentation. The experimental results not only show that LaConvNet performs competitively with or better than existing multi-modal networks, but also demonstrate its merits as a unified structure, including a compact network, low computational cost, and strong generalization ability. Our source code is released in the SimREC project: .
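To make the core idea concrete, below is a minimal PyTorch-style sketch of a convolution whose per-example depthwise kernels are predicted from a pooled sentence embedding. The module name, dimensions, and kernel-generation scheme here are illustrative assumptions, not the paper's exact LaConv formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageGuidedDynamicConv(nn.Module):
    """Hypothetical sketch of a language-guided dynamic convolution.

    A per-example depthwise kernel is predicted from a sentence embedding
    and applied to the visual feature map, so different expressions yield
    different visual filters. Sizes and layer names are assumptions.
    """

    def __init__(self, channels: int, lang_dim: int, kernel_size: int = 3):
        super().__init__()
        self.channels = channels
        self.kernel_size = kernel_size
        # Predict one depthwise kernel per channel from the language feature.
        self.kernel_gen = nn.Linear(lang_dim, channels * kernel_size * kernel_size)

    def forward(self, visual: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        # visual: (B, C, H, W) feature map; lang: (B, lang_dim) sentence embedding.
        b, c, h, w = visual.shape
        # (B*C, 1, k, k): one set of depthwise kernels per example.
        kernels = self.kernel_gen(lang).view(b * c, 1, self.kernel_size, self.kernel_size)
        # Fold the batch into the channel axis so each example is convolved
        # with its own language-conditioned kernels via grouped convolution.
        out = F.conv2d(
            visual.view(1, b * c, h, w),
            kernels,
            padding=self.kernel_size // 2,
            groups=b * c,
        )
        return out.view(b, c, h, w)


if __name__ == "__main__":
    layer = LanguageGuidedDynamicConv(channels=64, lang_dim=256)
    feats = torch.randn(2, 64, 32, 32)   # visual features
    sentence = torch.randn(2, 256)       # pooled language embedding
    print(layer(feats, sentence).shape)  # torch.Size([2, 64, 32, 32])
```

Stacking such modules in place of standard convolutions is, in spirit, how a fully language-driven backbone like LaConvNet could unify feature extraction and multi-modal reasoning in one forward pass.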