Audio-Visual Segmentation via Unlabeled Frame Exploitation
CVPR 2024
Abstract
Audio-visual segmentation (AVS) aims to segment the sounding objects in video
frames. Although great progress has been made, we experimentally show that
current methods gain only marginally from using the unlabeled frames,
revealing an underutilization issue. To fully explore the
potential of the unlabeled frames for AVS, we explicitly divide them into two
categories based on their temporal characteristics, i.e., neighboring frame
(NF) and distant frame (DF). NFs, temporally adjacent to the labeled frame,
often contain rich motion information that assists in the accurate localization
of sounding objects. In contrast, DFs are temporally distant from the labeled
frame; they share semantically similar objects with it but exhibit appearance variations.
Considering their unique characteristics, we propose a versatile framework that
effectively leverages them to tackle AVS. Specifically, for NFs, we exploit the
motion cues as dynamic guidance to improve objectness localization. In
addition, we exploit the semantic cues in DFs by treating them as valid
augmentations to the labeled frames, which are then used to enrich data
diversity in a self-training manner. Extensive experimental results demonstrate
the versatility and superiority of our method, unleashing the power of the
abundant unlabeled frames.
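The NF/DF categorization described above can be illustrated with a minimal sketch. This is a hypothetical implementation, not the authors' code: the function name and the `nf_radius` threshold are assumptions, standing in for whatever temporal criterion the paper actually uses.

```python
# Hypothetical sketch: partition a clip's unlabeled frames into
# neighboring frames (NFs) and distant frames (DFs) by their temporal
# distance from the single labeled frame, as the abstract describes.

def split_unlabeled_frames(num_frames, labeled_idx, nf_radius=2):
    """Classify each unlabeled frame index as NF or DF.

    nf_radius is an assumed hyperparameter: frames within this many
    timesteps of the labeled frame count as neighboring.
    """
    nfs, dfs = [], []
    for t in range(num_frames):
        if t == labeled_idx:
            continue  # the labeled frame itself is excluded
        if abs(t - labeled_idx) <= nf_radius:
            nfs.append(t)  # temporally adjacent: rich motion cues
        else:
            dfs.append(t)  # far away: similar semantics, varied appearance
    return nfs, dfs

# Example: a 10-frame clip with frame 5 labeled
nfs, dfs = split_unlabeled_frames(10, 5)
```

Under this sketch, NFs would feed a motion-based localization branch, while DFs would serve as semantic augmentations of the labeled frame for self-training.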