Visual Prompt Tuning for Weakly Supervised Phrase Grounding

ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2024)

Abstract
Previous work on weakly supervised phrase grounding (WSG) relies heavily on object detectors to provide regions of interest (RoIs) for localization. However, such methods cannot be applied effectively to real-world scenarios, largely because the detectors are trained on a limited set of categories. In this paper, we propose a refinement-based approach to WSG that fine-tunes a detector-free phrase grounding model with a visual prompt. This visual prompt is extracted from the text-related representations in CLIP. Furthermore, we combine the visual prompt with learnable features and then fine-tune the grounding network. Experimental results show that our method significantly outperforms state-of-the-art methods on the WSG task, demonstrating its effectiveness.
Keywords
Weakly supervised, Phrase grounding, Visual prompt tuning, CLIP, Detector-free