What is Point Supervision Worth in Video Instance Segmentation?
CoRR (2024)
Abstract
Video instance segmentation (VIS) is a challenging vision task that aims to
detect, segment, and track objects in videos. Conventional VIS methods rely on
densely annotated object masks, which are expensive to obtain. We reduce the human
annotation to a single point per object in each video frame during training,
and obtain high-quality mask predictions close to those of fully supervised models. Our
proposed training method consists of a class-agnostic proposal generation
module to provide rich negative samples and a spatio-temporal point-based
matcher to match the object queries with the provided point annotations.
Comprehensive experiments on three VIS benchmarks demonstrate competitive
performance of the proposed framework, nearly matching fully supervised
methods.
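
To make the idea of matching object queries to point annotations across frames concrete, the following is a minimal sketch and not the paper's implementation. The abstract does not specify the cost function or the assignment algorithm, so both are assumptions here: the cost is one minus the mean predicted mask probability at the annotated points over all frames where the object has a point, and the assignment is found by brute force over permutations (practical only for a handful of queries). The names `point_cost` and `match_queries_to_points` are hypothetical.

```python
from itertools import permutations


def point_cost(mask_probs, points):
    """Cost of assigning one query to one annotated object (assumed form).

    mask_probs: list over frames of H x W nested lists with values in [0, 1].
    points: list over frames of (row, col), or None if the object has no
            point annotation in that frame.
    """
    scores = []
    for t, point in enumerate(points):
        if point is None:
            continue  # object not annotated in this frame
        row, col = point
        scores.append(mask_probs[t][row][col])
    if not scores:
        return 1.0  # no evidence at all: maximal cost
    return 1.0 - sum(scores) / len(scores)


def match_queries_to_points(query_masks, object_points):
    """Return {object_index: query_index} with minimal total cost.

    Brute-force search over one-to-one assignments; assumes there are at
    least as many queries as annotated objects.
    """
    n_queries, n_objects = len(query_masks), len(object_points)
    assert n_queries >= n_objects
    cost = [[point_cost(q, o) for o in object_points] for q in query_masks]
    best, best_total = (), float("inf")
    for perm in permutations(range(n_queries), n_objects):
        total = sum(cost[q][o] for o, q in enumerate(perm))
        if total < best_total:
            best, best_total = perm, total
    return {o: q for o, q in enumerate(best)}


# Toy example: 2 queries, 2 frames, 2 x 2 masks.
query_masks = [
    [[[0.9, 0.1], [0.1, 0.1]], [[0.8, 0.1], [0.1, 0.2]]],  # query 0
    [[[0.1, 0.1], [0.1, 0.9]], [[0.2, 0.1], [0.1, 0.7]]],  # query 1
]
object_points = [
    [(1, 1), (1, 1)],  # object 0: bottom-right point in both frames
    [(0, 0), None],    # object 1: top-left point, annotated in frame 0 only
]
matching = match_queries_to_points(query_masks, object_points)
```

In this toy setup, queries left unmatched would serve as negatives during training; a practical version would replace the brute-force search with a polynomial-time assignment solver.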