Leveraging visual prompts to guide language modeling for referring video object segmentation

2023 IEEE International Conference on Image Processing, ICIP (2023)

Abstract
Referring Video Object Segmentation (R-VOS) aims to segment the masks of an object in a target video, given a language query describing that object. It is a challenging task that requires modeling both the semantics of the natural language query and its correspondence to the target video. Previous work uses visual-agnostic language features taken directly from uni-modal language models, and these features interact with visual features only in the late decoding stages. We propose to encode visual-enriched language features by using visual prompts as guidance in the early encoding stage. The proposed visual prompt is constructed by modulating the visual features of key frames with their alignment scores to the text input. The alignment scores are computed with a pre-trained visual-language contrastive model. We concatenate the visual prompts with the text input to encode visual-enriched language features, which serve as queries for target object segmentation in a Transformer-based decoder. Our method outperforms the previous state-of-the-art method (+2.3) on the Refer-Youtube-VOS benchmark.
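The prompt construction described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how the modulation is done, so the sketch assumes a softmax-normalised cosine similarity as the alignment score and plain multiplicative scaling of each key frame's features. The function names (`build_visual_prompts`, `encoder_input`) and the toy vectors are hypothetical, and the embeddings stand in for the outputs of a pre-trained contrastive model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def build_visual_prompts(frame_embs, text_emb, frame_feats):
    """Scale each key frame's features by its alignment score to the text.

    Assumption: alignment score = softmax over frames of cosine similarity,
    and modulation = element-wise scaling. The paper may use a different form.
    """
    scores = softmax([cosine(f, text_emb) for f in frame_embs])
    prompts = [[s * x for x in feat] for s, feat in zip(scores, frame_feats)]
    return prompts, scores


def encoder_input(prompts, text_tokens):
    """Concatenate visual prompts with text token embeddings along the sequence axis."""
    return prompts + text_tokens


# Toy data: two key frames, a 2-d embedding space, two text tokens (all hypothetical).
frame_embs = [[1.0, 0.0], [0.0, 1.0]]   # contrastive image embeddings of key frames
text_emb = [1.0, 0.0]                   # contrastive text embedding of the query
frame_feats = [[2.0, 2.0], [4.0, 4.0]]  # visual features to be modulated
text_tokens = [[0.1, 0.2], [0.3, 0.4]]  # token embeddings of the query

prompts, scores = build_visual_prompts(frame_embs, text_emb, frame_feats)
sequence = encoder_input(prompts, text_tokens)
```

In this sketch the first frame is better aligned with the query, so its features are scaled down less than those of the second frame, and the resulting sequence (prompts followed by text tokens) is what a language encoder would consume to produce visual-enriched language features.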
Keywords
multi-modal, referring video object segmentation, prompt