Fusion of Audio and Visual Embeddings for Sound Event Localization and Detection
CoRR (2023)
Abstract
Sound event localization and detection (SELD) combines two subtasks: sound
event detection (SED) and direction of arrival (DOA) estimation. SELD is
usually tackled as an audio-only problem, but visual information has been
recently included. Few audio-visual (AV) SELD works have been published, and
most employ vision via face/object bounding boxes or human pose keypoints. In
contrast, we explore the integration of audio and visual feature embeddings
extracted with pre-trained deep networks. For the visual modality, we tested
ResNet50 and Inflated 3D ConvNet (I3D). Our comparison of AV fusion methods
includes the AV-Conformer and Cross-Modal Attentive Fusion (CMAF) model. Our
best models outperform the DCASE 2023 Task 3 audio-only and AV baselines by a
wide margin on the development set of the STARSS23 dataset, making them
competitive amongst state-of-the-art results of the AV challenge, without model
ensembling, heavy data augmentation, or prediction post-processing. Such
techniques and further pre-training could be applied as next steps to improve
performance.
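The core fusion idea named in the abstract, cross-modal attention between audio and visual embedding sequences, can be sketched as follows. This is a simplified, hypothetical illustration of generic bidirectional cross-attention, not the authors' exact CMAF or AV-Conformer architecture; the embedding dimensions and frame counts are made up for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key_value):
    # Scaled dot-product attention where queries come from one modality
    # and keys/values from the other -- the essence of cross-modal fusion.
    d_k = key_value.shape[-1]
    scores = query @ key_value.T / np.sqrt(d_k)   # (Tq, Tkv)
    return softmax(scores, axis=-1) @ key_value   # (Tq, d)

def fuse_av(audio_emb, visual_emb):
    # Audio attends to video and video attends to audio; the two
    # attended streams are concatenated per time frame.
    a2v = cross_attention(audio_emb, visual_emb)
    v2a = cross_attention(visual_emb, audio_emb)
    return np.concatenate([a2v, v2a], axis=-1)

rng = np.random.default_rng(0)
audio = rng.standard_normal((10, 64))  # e.g. 10 time frames of audio embeddings
video = rng.standard_normal((10, 64))  # matching visual embeddings (e.g. ResNet50/I3D)
fused = fuse_av(audio, video)
print(fused.shape)  # (10, 128)
```

In practice such fusion layers operate on learned projections (queries, keys, values) with multiple heads and are trained end-to-end; the sketch only shows how the two modalities exchange information per time frame.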
Keywords
microphone array, 360° video, sound event localization and detection, audio-visual fusion, cross-modal attention