Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic Segmentation
NeurIPS(2023)
摘要
This paper studies the problem of weakly open-vocabulary semantic
segmentation (WOVSS), which learns to segment objects of arbitrary classes
using mere image-text pairs. Existing works turn to enhance the vanilla vision
transformer by introducing explicit grouping recognition, i.e., employing
several group tokens/centroids to cluster the image tokens and perform the
group-text alignment. Nevertheless, these methods suffer from a granularity
inconsistency regarding the usage of group tokens, which are aligned in the
all-to-one v.s. one-to-one manners during the training and inference phases,
respectively. We argue that this discrepancy arises from the lack of elaborate
supervision for each group token. To bridge this granularity gap, this paper
explores explicit supervision for the group tokens from the prototypical
knowledge. To this end, this paper proposes the non-learnable prototypical
regularization (NPR) where non-learnable prototypes are estimated from source
features to serve as supervision and enable contrastive matching of the group
tokens. This regularization encourages the group tokens to segment objects with
less redundancy and capture more comprehensive semantic regions, leading to
increased compactness and richness. Based on NPR, we propose the prototypical
guidance segmentation network (PGSeg) that incorporates multi-modal
regularization by leveraging prototypical sources from both images and texts at
different levels, progressively enhancing the segmentation capability with
diverse prototypical patterns. Experimental results show that our proposed
method achieves state-of-the-art performance on several benchmark datasets. The
source code is available at https://github.com/Ferenas/PGSeg.
更多查看译文
关键词
semantic
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要