Blind Audio-Visual Localization and Separation via Low-Rank and Sparsity.
IEEE Transactions on Cybernetics (2020)
Abstract
The ability to localize visual objects that are associated with an audio source and, at the same time, to separate the audio signal is a cornerstone of audio-visual signal-processing applications. However, available methods mainly focus on localizing the visual objects only and offer no audio separation capability. Moreover, these methods often rely either on laborious preprocessing steps that segment video frames into semantic regions or on additional supervision to guide localization. In this paper, we address the problem of visual source localization and audio separation in an unsupervised manner, avoiding all preprocessing and post-processing steps. To this end, we devise a novel structured matrix decomposition method that decomposes the data matrix of each modality as a superposition of three terms: 1) a low-rank matrix capturing the background information; 2) a sparse matrix capturing the components correlated between the two modalities and, hence, uncovering the sound source in the visual modality and the associated sound in the audio modality; and 3) a second sparse matrix accounting for uncorrelated components, such as distracting objects in the visual modality and irrelevant sound in the audio modality. The generality of the proposed method is demonstrated by applying it to three tasks: 1) visual localization of a sound source; 2) visually assisted audio separation; and 3) active speaker detection. Experimental results indicate the effectiveness of the proposed method in these application domains.
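The three-term decomposition described in the abstract can be written compactly as follows. This is only a sketch: the symbols X_m, L_m, S_m, E_m and the weights \lambda and \gamma are assumed here for illustration and are not taken from the paper. The nuclear norm and the l1 norm are the usual convex surrogates for rank and sparsity, and their use is likewise an assumption. The abstract does not say how the correlation between the two modalities' sparse terms is enforced, so that coupling is left out.

```latex
% Notation assumed for illustration; not the paper's own.
% X_m : data matrix of modality m (v = visual, a = audio)
% L_m : low-rank term (background information)
% S_m : sparse term for components correlated across the two modalities
% E_m : sparse term for uncorrelated components (distractors, irrelevant sound)
\[
  X_m = L_m + S_m + E_m, \qquad m \in \{v, a\}
\]
% A generic convex surrogate for a model of this shape
% (an assumption, not the objective stated by the authors):
\[
  \min_{\{L_m, S_m, E_m\}} \;
  \sum_{m \in \{v, a\}}
    \bigl( \|L_m\|_* + \lambda \|S_m\|_1 + \gamma \|E_m\|_1 \bigr)
  \quad \text{s.t.} \quad X_m = L_m + S_m + E_m
\]
```

Under this reading, the nonzero entries of S_v would indicate where the sounding object is in the video, and S_a would carry the separated sound associated with it.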
Keywords
Visualization, Feature extraction, Sparse matrices, Matrix decomposition, Task analysis, Microphones, Spectrogram