Dynamic Multi-modal Prompting for Efficient Visual Grounding

Pattern Recognition and Computer Vision, PRCV 2023, Part VII (2024)

Abstract
Prompt tuning has emerged as a flexible approach for adapting pre-trained models by learning only additional inputs while keeping the model parameters frozen. However, simplistic prompts are insufficient to effectively address the challenges posed by complex multi-modal tasks such as visual grounding. In this paper, we propose a novel prompting architecture called Dynamic Multi-modAl Prompting (DMAP) for visual grounding. DMAP incorporates input-dependent prompting to tailor instance-level prompts for more accurate representation and dynamic multi-modal prompting to capture the relationship between the textual and visual inputs. To this end, we design a Dynamic Prompt Network (DPN) to generate multi-modal prompts based on the specific inputs, enhancing both adaptive prompt generation and multi-modal feature fusion. Extensive experimental results demonstrate the superiority of DMAP over competing methods in parameter-efficient settings. Furthermore, DMAP consistently outperforms state-of-the-art visual grounding methods even when fine-tuning all parameters.
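The abstract does not give implementation details of the DPN. The following PyTorch sketch is only a rough illustration of the stated idea (instance-level prompts conditioned on both the visual and textual inputs, with the backbone frozen); the module name, dimensions, pooling, and fusion choices are assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class DynamicPromptNetwork(nn.Module):
    """Hypothetical sketch of an input-dependent multi-modal prompt generator."""

    def __init__(self, dim: int = 256, num_prompts: int = 8):
        super().__init__()
        self.num_prompts = num_prompts
        # Fuse pooled visual and textual features into one joint vector.
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(inplace=True))
        # Map the joint vector to prompt tokens for each modality.
        self.to_visual_prompts = nn.Linear(dim, num_prompts * dim)
        self.to_text_prompts = nn.Linear(dim, num_prompts * dim)

    def forward(self, visual_tokens: torch.Tensor, text_tokens: torch.Tensor):
        # visual_tokens: (B, Nv, D); text_tokens: (B, Nt, D), from a frozen backbone.
        b, _, d = visual_tokens.shape
        joint = self.fuse(torch.cat(
            [visual_tokens.mean(dim=1), text_tokens.mean(dim=1)], dim=-1))
        vis_prompts = self.to_visual_prompts(joint).view(b, self.num_prompts, d)
        txt_prompts = self.to_text_prompts(joint).view(b, self.num_prompts, d)
        # Prepend the instance-level prompts; only the DPN parameters are trained.
        return (torch.cat([vis_prompts, visual_tokens], dim=1),
                torch.cat([txt_prompts, text_tokens], dim=1))
```

In this reading, the prompt tokens differ per input pair rather than being a single shared learned vector, which is what distinguishes input-dependent prompting from standard prompt tuning.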
Keywords
Visual Grounding, Prompt Tuning, Vision and Language