GLMB 3D Speaker Tracking with Video-Assisted Multi-Channel Audio Optimization Functions

ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2024)

Abstract
Speaker tracking plays a significant role in many real-world human-robot interaction (HRI) applications. In recent years, there has been growing interest in using multi-sensory information, such as complementary audio and visual signals, to address the challenges of speaker tracking. Despite promising results, existing approaches still struggle to determine the speaker's true location under adverse conditions such as speech pauses, reverberation, or visual occlusions, which lead to missed detections or spurious estimates. In this paper, we propose a novel speaker tracking method based on the Generalized Labelled Multi-Bernoulli (GLMB) filter. Our method operates in 3D space using audio captured by a microphone array and video streams obtained from a monocular camera. The GLMB-based tracker effectively handles outliers in location estimates and maintains tracks during periods of missed detections. Experiments on the publicly available AV16.3 dataset show that the proposed method outperforms other competitive methods.
Keywords
human-robot interaction, audio-visual fusion, speaker localization and tracking, GLMB filter