Exploring Audio-Visual Information Fusion for Sound Event Localization and Detection in Low-Resource Realistic Scenarios
IEEE International Conference on Multimedia and Expo (2024)
Abstract
This study presents an audio-visual information fusion approach to sound event localization and detection (SELD) in low-resource scenarios. We aim at utilizing audio and video modality information through cross-modal learning and multi-modal fusion. First, we propose a cross-modal teacher-student learning (TSL) framework to transfer information from an audio-only teacher model, trained on a rich collection of audio data with multiple data augmentation techniques, to an audio-visual student model trained with only a limited set of multi-modal data. Next, we propose a two-stage audio-visual fusion strategy, consisting of an early feature fusion and a late video-guided decision fusion, to exploit synergies between audio and video modalities. Finally, we introduce an innovative video pixel swapping (VPS) technique that extends an audio channel swapping (ACS) method into an audio-visual joint augmentation. Evaluation results on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 Challenge data set demonstrate significant improvements in SELD performance. Furthermore, our submission to the SELD task of the DCASE 2023 Challenge ranks first place by effectively integrating the proposed techniques into a model ensemble.
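The cross-modal TSL framework described above transfers knowledge from an audio-only teacher to an audio-visual student. The abstract does not give the exact loss; a minimal sketch, assuming a standard distillation objective (hard cross-entropy plus a temperature-softened KL term, with hypothetical weight `alpha` and temperature `tau`), might look like this:

```python
import numpy as np

def softmax(x, tau=1.0):
    # Numerically stable softmax of x / tau along the last axis.
    z = np.exp((x - x.max(axis=-1, keepdims=True)) / tau)
    return z / z.sum(axis=-1, keepdims=True)

def tsl_loss(student_logits, teacher_logits, targets, alpha=0.5, tau=2.0):
    """Illustrative teacher-student loss (not the paper's exact formulation).

    student_logits: outputs of the audio-visual student, shape (batch, classes)
    teacher_logits: outputs of the frozen audio-only teacher, same shape
    targets:        integer ground-truth class labels, shape (batch,)
    alpha, tau:     hypothetical distillation weight and temperature
    """
    # Hard-label term: cross-entropy against the ground truth.
    p_s = softmax(student_logits)
    hard = -np.log(p_s[np.arange(len(targets)), targets]).mean()
    # Soft-label term: KL divergence from the softened teacher
    # distribution to the softened student distribution.
    p_t = softmax(teacher_logits, tau)
    q_s = softmax(student_logits, tau)
    soft = (p_t * (np.log(p_t) - np.log(q_s))).sum(axis=-1).mean() * tau**2
    return alpha * hard + (1 - alpha) * soft
```

When student and teacher logits coincide, the KL term vanishes and only the weighted hard-label loss remains, so the teacher only steers the student where their predictions disagree.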
Key words
DCASE, sound event localization and detection, cross-modal teacher-student learning, multi-modal fusion, audio channel swapping, video pixel swapping
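The ACS/VPS joint augmentation named in the abstract and key words pairs a spatial transform of the audio channels with a matching transform of the video frames and direction-of-arrival labels. As a simplified sketch (assuming, for illustration only, a 2-channel left/right layout and a pure left-right mirroring; the actual method operates on multichannel spatial formats with a richer set of swaps):

```python
import numpy as np

def acs_vps_augment(audio, frames, azimuth):
    """Hypothetical sketch of one ACS + VPS pair: mirror the scene left-right.

    audio:   (channels, samples) recording; assumed 2-channel (L, R) here
    frames:  (T, H, W, C) video frames time-aligned with the audio
    azimuth: source azimuth in degrees, positive toward the left
    """
    # ACS: swapping the left/right channels mirrors the acoustic scene.
    audio_aug = audio[::-1].copy()
    # VPS: flipping frames along the width axis mirrors the visual scene,
    # keeping the video consistent with the swapped audio channels.
    frames_aug = frames[:, :, ::-1, :].copy()
    # The DOA label must be transformed with the same mirroring.
    azimuth_aug = -azimuth
    return audio_aug, frames_aug, azimuth_aug
```

The point of the joint transform is that audio, video, and spatial labels stay mutually consistent, so each augmented sample is a physically plausible new scene rather than a corrupted one.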