Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues
CoRR (2024)
Abstract
How to make audio and vision interact effectively has garnered considerable
interest within the multi-modality research field. Recently, a novel
audio-visual segmentation (AVS) task has been proposed, aiming to segment the
sounding objects in video frames under the guidance of audio cues. However,
most existing AVS methods are hindered by a modality imbalance where the visual
features tend to dominate those of the audio modality, due to a unidirectional
and insufficient integration of audio cues. This imbalance skews the feature
representation towards the visual aspect, impeding the learning of joint
audio-visual representations and potentially causing segmentation inaccuracies.
To address this issue, we propose AVSAC. Our approach features a Bidirectional
Audio-Visual Decoder (BAVD) with integrated bidirectional bridges, enhancing
audio cues and fostering continuous interplay between audio and visual
modalities. This bidirectional interaction reduces the modality imbalance,
facilitating more effective learning of integrated audio-visual
representations. Additionally, we present a strategy for audio-visual
frame-wise synchrony as fine-grained guidance for BAVD. This strategy increases
the share of auditory components in visual features, contributing to more
balanced audio-visual representation learning. Extensive experiments show that
our method sets a new state of the art in AVS performance.
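
The bidirectional interaction described above can be illustrated with a minimal sketch. This is not the AVSAC implementation: it assumes single-head, residual cross-attention with no learned projections, and the names `cross_attend`, `bidirectional_layer`, and `bidirectional_decoder`, as well as the update order within a layer, are hypothetical choices made only for illustration. The point it shows is that both streams are refreshed in every layer, whereas a unidirectional design would update only the visual features.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Residual cross-attention: each query row attends to all key/value rows.

    Assumption: keys and values are the same vectors and no learned
    projections are applied, to keep the sketch short.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in keys_values
        ]
        weights = softmax(scores)
        attended = [
            sum(w * kv[c] for w, kv in zip(weights, keys_values))
            for c in range(d)
        ]
        out.append([qi + ai for qi, ai in zip(q, attended)])
    return out


def bidirectional_layer(audio, visual):
    """One layer with a bridge in each direction (hypothetical ordering)."""
    audio = cross_attend(audio, visual)   # visual -> audio: strengthen audio cues
    visual = cross_attend(visual, audio)  # audio -> visual: inject the refreshed cues
    return audio, visual


def bidirectional_decoder(audio, visual, num_layers=2):
    """Stack layers so the two modalities keep interacting at every depth."""
    for _ in range(num_layers):
        audio, visual = bidirectional_layer(audio, visual)
    return audio, visual


if __name__ == "__main__":
    audio_tokens = [[1.0, 0.0]]                            # one audio token, dim 2
    visual_tokens = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]   # three visual tokens
    audio_out, visual_out = bidirectional_decoder(audio_tokens, visual_tokens)
    print(len(audio_out), len(visual_out))
```

In this toy setup the audio token is updated before it is used to guide the visual tokens, so the audio stream is not a fixed input that the visual features can simply override. A real decoder would add learned query/key/value projections, multiple heads, and normalization.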