SLVP: Self-Supervised Language-Video Pre-Training for Referring Video Object Segmentation

2024 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), 2024

Abstract
The referring video object segmentation (R-VOS) task requires a model to understand both the referring expression and the video input. Most recent works adopt an encoder-decoder architecture. Although their text and visual encoders can benefit from separately pre-trained backbones, their decoder is trained from scratch on a combination of image/video segmentation datasets. However, pixel-wise annotation with referring expressions is extremely expensive, which makes it challenging to further improve performance. For the same reason, current vision-language pre-training works mainly focus on learning general feature representations for image-level or object-level tasks, which may not be optimal for the downstream pixel-level segmentation task. To bridge this gap, we present a general self-supervised language-video pre-training (SLVP) architecture. Using relatively cheap video caption datasets, SLVP learns pixel-level features by introducing optical flow as an intermediate target during pre-training. Correspondingly, we propose simple transfer learning models that reuse pre-trained modules for the downstream R-VOS task. Furthermore, the proposed general SLVP architecture supports either 'language as query' fusion or 'vision as query' fusion. Experiments show the superiority of the under-studied 'vision as query' method, which outperforms state-of-the-art methods on the Ref-DAVIS17 and Ref-YouTube-VOS benchmarks even with fewer model parameters. We further adapt the challenging VISOR benchmark to the R-VOS task, on which our SLVP serves as the first strong baseline.
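The abstract contrasts 'language as query' fusion with 'vision as query' fusion but does not spell out the mechanism. A minimal sketch of cross-attention fusion in the 'vision as query' direction, assuming per-frame visual patch tokens attend to text tokens from the referring expression (token counts and dimensions below are illustrative, not taken from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key, value):
    # query: (Nq, d); key, value: (Nk, d)
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)          # (Nq, Nk) similarity
    return softmax(scores, axis=-1) @ value      # (Nq, d) fused features

rng = np.random.default_rng(0)
# hypothetical shapes: 14x14 = 196 visual patch tokens, 12 text tokens, dim 64
visual_tokens = rng.standard_normal((196, 64))
text_tokens = rng.standard_normal((12, 64))

# 'vision as query': each visual token queries the referring expression,
# so the output keeps the spatial resolution needed for pixel-level decoding
fused = cross_attention(visual_tokens, text_tokens, text_tokens)
print(fused.shape)  # (196, 64)
```

'Language as query' would swap the roles (text tokens as queries over visual tokens), yielding one feature per text token instead; the 'vision as query' direction preserves a feature per spatial location, which aligns with the pixel-level segmentation output.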
Keywords
Video Object Segmentation, Benchmark, Transfer Learning, Optical Flow, Segmentation Task, General Architecture, Segmentation Dataset, Intermediate Target, Text Encoder, Sequence Features, Bounding Box, Temporal Information, Target Object, Fusion Method, Sequence Of Frames, Textual Features, Self-supervised Learning, Binary Cross Entropy, Binary Cross-entropy Loss, L1 Loss, Raw Video, Video Object, Frame Features, Description Language, Original Frame, Forward Network, Pre-training Dataset, Pixel-level Annotations, Pre-trained Encoder, Dice Loss