Simultaneous-Speaker Voice Activity Detection and Localization Using Mid-Fusion of SVM and HMMs

IEEE Transactions on Multimedia (2014)

Abstract
Humans can extract the speech signals they need from a mixture of background noise, interfering sound sources, and reverberation for effective communication. Voice Activity Detection (VAD) and Sound Source Localization (SSL) are the key signal-processing functions that humans perform by processing the sound received at both ears, sometimes aided by visual cues from locating and observing the speaker's lip movements. Both VAD and SSL are crucial design elements for applications involving human speech. For example, systems with microphone arrays can benefit from them for robust speech capture in video conferencing, or for speaker identification and speech recognition in Human-Computer Interfaces (HCIs). Designing and implementing robust VAD and SSL algorithms for practical acoustic environments remains challenging, particularly when multiple simultaneous speakers are present in the same audiovisual scene. In this work we propose a multimodal approach that uses Support Vector Machines (SVMs) and Hidden Markov Models (HMMs) to assess the video and audio modalities captured by an RGB camera and a microphone array. By analyzing each speaker's spatio-temporal activity and mouth movements, we propose a mid-fusion approach that performs both VAD and SSL for multiple active and inactive speakers. We tested the proposed algorithm in scenarios with up to three simultaneous speakers, obtaining an average VAD accuracy of 95.06% and an average error of 10.9 cm when estimating the speakers' three-dimensional locations.
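To illustrate the mid-fusion idea at a high level, the following sketch combines a per-speaker video score (e.g., an SVM decision value computed from mouth-movement features) with an audio score (e.g., an HMM speech/non-speech log-likelihood ratio). The fusion rule shown here (sigmoid mapping, convex combination, 0.5 threshold) and all names (`fuse_vad`, `svm_margin`, `hmm_llr`, `w_video`) are illustrative assumptions, not the paper's exact formulation:

```python
import math

def sigmoid(x):
    # Squash an unbounded score into a pseudo-probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def fuse_vad(svm_margin, hmm_llr, w_video=0.5):
    """Mid-fusion sketch (assumed form, not the paper's exact rule):
    map each modality's score to a pseudo-probability, then take a
    convex combination and threshold it to decide speech activity."""
    p_video = sigmoid(svm_margin)  # video SVM decision value for this speaker
    p_audio = sigmoid(hmm_llr)     # audio HMM speech vs. non-speech log-likelihood ratio
    p = w_video * p_video + (1.0 - w_video) * p_audio
    return p > 0.5, p

# Three hypothetical speakers: (svm_margin, hmm_llr) per speaker.
speakers = [(2.1, 1.5), (-1.8, -0.4), (0.3, -2.0)]
for i, (margin, llr) in enumerate(speakers):
    active, p = fuse_vad(margin, llr)
    print(f"speaker {i}: active={active}, fused score={p:.2f}")
```

In a real system the two scores would be produced per video frame and per audio segment, and the fused VAD decision would then gate the SSL estimate for each detected speaker.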
Keywords
Speech, Array signal processing, Microphone arrays, Visualization, Human computer interaction, Accuracy