Sound3DVDet: 3D Sound Source Detection using Multiview Microphone Array and RGB Images.

IEEE/CVF Winter Conference on Applications of Computer Vision (2024)

Abstract
Spatial localization of 3D sound sources is an important problem in many real-world scenarios, especially when the sources may not have any visually distinguishable characteristic; e.g., finding a gas leak, a malfunctioning motor, etc. In this paper, we cast this task in a novel audio-visual setting, by introducing an acoustic-camera rig consisting of a centered pinhole RGB camera and a uniform circular array of four coplanar microphones. Using this setup, we propose Sound3DVDet, a 3D sound source localization Transformer model that treats this task as a set prediction problem. It first learns a set of initial sound source locations (dubbed queries) from a single view of the microphone array signal, then feeds the query set to a sequence of Transformer-like layers for refinement. Each query arising from each layer repeatedly aggregates sound source cues from other views. We deeply supervise the initial sound source queries, intermediate layer queries, and the final output by measuring their respective discrepancy against ground truth queries via bipartite matching. To evaluate our method, we introduce a new dataset, the Sound3DVDet Dataset, consisting of nearly 6k scenes produced using the SoundSpaces simulator. We conduct extensive experiments on our dataset and show the efficacy of our approach against closely related methods, demonstrating significant improvements in localization accuracy. Code is available at https://github.com/yuhanghe01/Sound3DVDet.
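The bipartite matching step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `match_queries`, the plain Euclidean matching cost, and the brute-force search over assignments are all assumptions made for illustration (the abstract does not specify the cost, and a practical system would typically use the Hungarian algorithm). The sketch pairs each ground-truth 3D source location with a distinct predicted query so that the total distance is minimal.

```python
import itertools
import math


def match_queries(pred, gt):
    """Minimum-cost one-to-one assignment of ground-truth points to predictions.

    pred, gt: lists of (x, y, z) tuples, with len(pred) >= len(gt).
    Returns (total_cost, pairs), where pairs holds (pred_index, gt_index).
    Cost is Euclidean distance (an assumption, not taken from the paper).
    Brute force over assignments, so only suitable for small sets.
    """
    if len(pred) < len(gt):
        raise ValueError("need at least as many predictions as ground-truth points")
    best_cost, best_pairs = math.inf, []
    # Each permutation picks one distinct prediction per ground-truth point.
    for perm in itertools.permutations(range(len(pred)), len(gt)):
        cost = sum(math.dist(pred[p], gt[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(p, g) for g, p in enumerate(perm)]
    return best_cost, best_pairs


# Hypothetical example: three predicted queries, two ground-truth sources.
pred = [(0.0, 0.0, 0.0), (2.0, 1.0, 0.5), (5.0, 5.0, 5.0)]
gt = [(2.1, 1.0, 0.5), (0.0, 0.1, 0.0)]
cost, pairs = match_queries(pred, gt)
print(pairs, round(cost, 3))
```

In a set-prediction setup of this kind, unmatched predictions (here, the third query) would usually be treated as "no source"; how the paper handles them is not stated in the abstract.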
Keywords
Algorithms: Vision + language and/or other modalities; Algorithms: 3D computer vision; Algorithms: Machine learning architectures, formulations, and algorithms