Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding

IEEE Conference on Computer Vision and Pattern Recognition (2022)

Abstract
Visual grounding, i.e., localizing objects in images according to natural language queries, is an important topic in visual language understanding. The most effective approaches for this task are based on deep learning, which generally requires expensive manually labeled image-query or patch-query pairs. To eliminate the heavy dependence on human annotations, we present a novel method, named Pseudo-Q, to automatically generate pseudo language queries for supervised training. Our method leverages an off-the-shelf object detector to identify visual objects in unlabeled images, and then language queries for these objects are obtained in an unsupervised fashion with a pseudo-query generation module. We then design a task-related query prompt module to specifically tailor the generated pseudo language queries for visual grounding tasks. Further, in order to fully capture the contextual relationships between images and language queries, we develop a visual-language model equipped with a multi-level cross-modality attention mechanism. Extensive experimental results demonstrate that our method has two notable benefits: (1) it can significantly reduce human annotation costs, e.g., by 31% on RefCOCO [65], without degrading the original model's performance under the fully supervised setting, and (2) without bells and whistles, it achieves superior or comparable performance to state-of-the-art weakly supervised visual grounding methods on all five datasets we experimented on. Code is available at https://github.com/LeapLabTHU/Pseudo-Q.
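To make the described pipeline concrete, below is a minimal, hypothetical Python sketch of the pseudo-query generation idea: detected objects (a category, an optional attribute, and a box position from an off-the-shelf detector) are composed into a template-based query, which is then wrapped in a task-related prompt. All names, templates, and the spatial-word heuristic here are illustrative assumptions, not the paper's actual implementation; see the official repository for that.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DetectedObject:
    # Illustrative stand-ins for off-the-shelf detector outputs.
    category: str             # e.g. "dog"
    attribute: Optional[str]  # e.g. "black", from an attribute classifier
    cx: float                 # normalized horizontal box center in [0, 1]

def spatial_word(cx: float) -> str:
    """Map a horizontal box position to a coarse spatial relation word."""
    if cx < 1.0 / 3.0:
        return "left"
    if cx > 2.0 / 3.0:
        return "right"
    return "middle"

def generate_pseudo_query(obj: DetectedObject) -> str:
    """Compose a template-based pseudo query: [spatial] [attribute] [category]."""
    parts = [spatial_word(obj.cx)]
    if obj.attribute:
        parts.append(obj.attribute)
    parts.append(obj.category)
    return " ".join(parts)

def apply_task_prompt(query: str) -> str:
    """Wrap a pseudo query in a (hypothetical) task-related prompt template."""
    return f"find the region that corresponds to: {query}"

if __name__ == "__main__":
    det = DetectedObject(category="dog", attribute="black", cx=0.2)
    print(apply_task_prompt(generate_pseudo_query(det)))
    # -> "find the region that corresponds to: left black dog"
```

In the actual method, image regions paired with such pseudo queries would then serve as supervision for training the visual-language grounding model.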
Keywords
Vision + language, Visual reasoning