Exploring Audio-Visual Information Fusion for Sound Event Localization and Detection in Low-Resource Realistic Scenarios
IEEE International Conference on Multimedia and Expo (2024)
Abstract
This study presents an audio-visual information fusion approach to sound event localization and detection (SELD) in low-resource scenarios. We aim at utilizing audio and video modality information through cross-modal learning and multi-modal fusion. First, we propose a cross-modal teacher-student learning (TSL) framework to transfer information from an audio-only teacher model, trained on a rich collection of audio data with multiple data augmentation techniques, to an audio-visual student model trained with only a limited set of multi-modal data. Next, we propose a two-stage audio-visual fusion strategy, consisting of an early feature fusion and a late video-guided decision fusion, to exploit synergies between audio and video modalities. Finally, we introduce an innovative video pixel swapping (VPS) technique that extends an audio channel swapping (ACS) method into an audio-visual joint augmentation. Evaluation results on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 Challenge data set demonstrate significant improvements in SELD performance. Furthermore, our submission to the SELD task of the DCASE 2023 Challenge ranks first place by effectively integrating the proposed techniques into a model ensemble.
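The cross-modal TSL framework described above transfers knowledge from an audio-only teacher to an audio-visual student. The abstract does not give the exact loss; a minimal sketch, assuming a standard distillation objective (hard cross-entropy plus a temperature-softened KL term, with hypothetical weight `alpha` and temperature `tau`), might look like this:

```python
import numpy as np

def softmax(x, tau=1.0):
    # Numerically stable softmax of x / tau along the last axis.
    z = np.exp((x - x.max(axis=-1, keepdims=True)) / tau)
    return z / z.sum(axis=-1, keepdims=True)

def tsl_loss(student_logits, teacher_logits, targets, alpha=0.5, tau=2.0):
    """Illustrative teacher-student loss (not the paper's exact formulation).

    student_logits: outputs of the audio-visual student, shape (batch, classes)
    teacher_logits: outputs of the frozen audio-only teacher, same shape
    targets:        integer ground-truth class labels, shape (batch,)
    alpha, tau:     hypothetical distillation weight and temperature
    """
    # Hard-label term: cross-entropy against the ground truth.
    p_s = softmax(student_logits)
    hard = -np.log(p_s[np.arange(len(targets)), targets]).mean()
    # Soft-label term: KL divergence from the softened teacher
    # distribution to the softened student distribution.
    p_t = softmax(teacher_logits, tau)
    q_s = softmax(student_logits, tau)
    soft = (p_t * (np.log(p_t) - np.log(q_s))).sum(axis=-1).mean() * tau**2
    return alpha * hard + (1 - alpha) * soft
```

When student and teacher logits coincide, the KL term vanishes and only the weighted hard-label loss remains, so the teacher only steers the student where their predictions disagree.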
Key words
DCASE, sound event localization and detection, cross-modal teacher-student learning, multi-modal fusion, audio channel swapping, video pixel swapping
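The ACS/VPS joint augmentation named in the abstract and key words pairs a spatial transform of the audio channels with a matching transform of the video frames and direction-of-arrival labels. As a simplified sketch (assuming, for illustration only, a 2-channel left/right layout and a pure left-right mirroring; the actual method operates on multichannel spatial formats with a richer set of swaps):

```python
import numpy as np

def acs_vps_augment(audio, frames, azimuth):
    """Hypothetical sketch of one ACS + VPS pair: mirror the scene left-right.

    audio:   (channels, samples) recording; assumed 2-channel (L, R) here
    frames:  (T, H, W, C) video frames time-aligned with the audio
    azimuth: source azimuth in degrees, positive toward the left
    """
    # ACS: swapping the left/right channels mirrors the acoustic scene.
    audio_aug = audio[::-1].copy()
    # VPS: flipping frames along the width axis mirrors the visual scene,
    # keeping the video consistent with the swapped audio channels.
    frames_aug = frames[:, :, ::-1, :].copy()
    # The DOA label must be transformed with the same mirroring.
    azimuth_aug = -azimuth
    return audio_aug, frames_aug, azimuth_aug
```

The point of the joint transform is that audio, video, and spatial labels stay mutually consistent, so each augmented sample is a physically plausible new scene rather than a corrupted one.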