Dynamic Multi-modal Prompting for Efficient Visual Grounding

Pattern Recognition and Computer Vision, PRCV 2023, Part VII (2024)

Abstract
Prompt tuning has emerged as a flexible approach for adapting pre-trained models by learning only additional inputs while keeping the model parameters frozen. However, simplistic prompts are insufficient to effectively address the challenges posed by complex multi-modal tasks such as visual grounding. In this paper, we propose a novel prompting architecture called Dynamic Multi-modAl Prompting (DMAP) for visual grounding. DMAP incorporates input-dependent prompting to tailor instance-level prompts for more accurate representation and dynamic multi-modal prompting to capture the relationship between the textual and visual inputs. To this end, we design a Dynamic Prompt Network (DPN) to generate multi-modal prompts based on the specific inputs, enhancing both adaptive prompt generation and multi-modal feature fusion. Extensive experimental results demonstrate the superiority of DMAP over competing methods in parameter-efficient settings. Furthermore, DMAP consistently outperforms state-of-the-art visual grounding methods even when fine-tuning all parameters.
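The abstract does not give implementation details of the DPN. The following PyTorch sketch is only a rough illustration of the stated idea (instance-level prompts conditioned on both the visual and textual inputs, with the backbone frozen); the module name, dimensions, pooling, and fusion choices are assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class DynamicPromptNetwork(nn.Module):
    """Hypothetical sketch of an input-dependent multi-modal prompt generator."""

    def __init__(self, dim: int = 256, num_prompts: int = 8):
        super().__init__()
        self.num_prompts = num_prompts
        # Fuse pooled visual and textual features into one joint vector.
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(inplace=True))
        # Map the joint vector to prompt tokens for each modality.
        self.to_visual_prompts = nn.Linear(dim, num_prompts * dim)
        self.to_text_prompts = nn.Linear(dim, num_prompts * dim)

    def forward(self, visual_tokens: torch.Tensor, text_tokens: torch.Tensor):
        # visual_tokens: (B, Nv, D); text_tokens: (B, Nt, D), from a frozen backbone.
        b, _, d = visual_tokens.shape
        joint = self.fuse(torch.cat(
            [visual_tokens.mean(dim=1), text_tokens.mean(dim=1)], dim=-1))
        vis_prompts = self.to_visual_prompts(joint).view(b, self.num_prompts, d)
        txt_prompts = self.to_text_prompts(joint).view(b, self.num_prompts, d)
        # Prepend the instance-level prompts; only the DPN parameters are trained.
        return (torch.cat([vis_prompts, visual_tokens], dim=1),
                torch.cat([txt_prompts, text_tokens], dim=1))
```

In this reading, the prompt tokens differ per input pair rather than being a single shared learned vector, which is what distinguishes input-dependent prompting from standard prompt tuning.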
Keywords
Visual Grounding, Prompt Tuning, Vision and Language