Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision
CoRR (2024)
Abstract
Contemporary cutting-edge open-vocabulary segmentation approaches commonly
rely on image-mask-text triplets, yet this restrictive annotation is
labour-intensive and faces scalability hurdles in complex real-world
scenarios. Although some methods have been proposed to reduce the annotation
cost by using only text supervision, the incompleteness of this supervision
severely limits versatility and performance. In this paper, we liberate the
strict correspondence between masks and texts by using independent image-mask
and image-text pairs, each of which can be collected easily. With this
unpaired mask-text supervision, we propose a new weakly-supervised
open-vocabulary segmentation framework (Uni-OVSeg) that leverages confident
pairs of mask predictions and entities in text descriptions. Using the
independent image-mask and image-text pairs, we predict a set of binary masks
and associate them with entities by resorting to the CLIP embedding space.
However, the inherent noise in the correspondence between masks and entities
poses a significant challenge when obtaining reliable pairs. In light of this,
we advocate using a large vision-language model (LVLM) to refine text
descriptions and devise a multi-scale ensemble to stabilise the matching between
masks and entities. Compared to text-only weakly-supervised methods, our
Uni-OVSeg achieves substantial improvements of 15.5% mIoU on the ADE20K
datasets, and even surpasses fully-supervised methods on the challenging PASCAL
Context-459 dataset.
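
The mask-entity association described in the abstract can be pictured as cosine-similarity matching in the CLIP embedding space, ensembled over several image scales. Below is a minimal sketch under assumed interfaces, not the authors' implementation: the function name, the temperature tau, the confidence threshold, and the idea that mask embeddings come from pooling CLIP visual features per scale are all illustrative assumptions.

# Minimal sketch (assumptions labelled): match predicted masks to text
# entities via CLIP-space cosine similarity, averaging over scales to
# stabilise the assignment, then keep only confident pairs.
import torch
import torch.nn.functional as F

def match_masks_to_entities(mask_embeds_per_scale, entity_embeds,
                            tau=0.1, conf_thresh=0.5):
    # mask_embeds_per_scale: list of (num_masks, dim) tensors, one per
    #   image scale, assumed to be CLIP-space embeddings of each mask.
    # entity_embeds: (num_entities, dim) CLIP text embeddings of entities
    #   extracted from the (LVLM-refined) description.
    # tau and conf_thresh are hypothetical values, not from the paper.
    sims = []
    for mask_embeds in mask_embeds_per_scale:
        m = F.normalize(mask_embeds, dim=-1)
        e = F.normalize(entity_embeds, dim=-1)
        sims.append(m @ e.t())  # cosine similarity, (num_masks, num_entities)
    sim = torch.stack(sims).mean(dim=0)   # multi-scale ensemble: average
    probs = (sim / tau).softmax(dim=-1)   # per-mask distribution over entities
    conf, entity_idx = probs.max(dim=-1)
    keep = conf > conf_thresh             # retain only confident pairs
    mask_idx = torch.nonzero(keep, as_tuple=True)[0]
    return mask_idx, entity_idx[keep]

Averaging similarities before the softmax means a mask-entity pair survives only if it scores consistently across scales, which is one plausible way to damp the noise in any single-scale matching.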