Voice activity detection and speaker localization using audiovisual cues

Pattern Recognition Letters (2012)

Abstract
This paper proposes a multimodal approach to distinguish silence from speech situations, and to identify the location of the active speaker in the latter case. In our approach, a video camera is used to track the faces of the participants, and a microphone array is used to estimate the Sound Source Location (SSL) using the Steered Response Power with the phase transform (SRP-PHAT) method. The audiovisual cues are combined, and two competing Hidden Markov Models (HMMs) are used to detect silence or the presence of a person speaking. If speech is detected, the corresponding HMM also provides the spatio-temporally coherent location of the speaker. Experimental results show that incorporating the HMM improves the results over the unimodal SRP-PHAT, and the inclusion of video cues provides even further improvements.
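As an illustration of the acoustic cue only, the sketch below estimates the time delay between two microphone signals with the generalized cross-correlation with phase transform (GCC-PHAT). SRP-PHAT can be computed by summing such PHAT-weighted correlations over all microphone pairs, evaluated at the delays implied by each candidate source location. This is a simplified two-microphone sketch, not the authors' implementation: the function name `gcc_phat_delay`, its parameters, and the synthetic test signal are all assumptions made for the example.

```python
import numpy as np


def gcc_phat_delay(sig, ref, fs, max_tau=None):
    """Estimate the delay (in seconds) of `sig` relative to `ref` using GCC-PHAT.

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Zero-pad so the FFT-based correlation is linear rather than circular.
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)

    # Cross-power spectrum, then the phase transform: keep only the phase.
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12  # small constant avoids division by zero
    cc = np.fft.irfft(cross, n=n)

    # Restrict the search to physically plausible lags if max_tau is given.
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)

    # Reorder so that index 0 corresponds to lag -max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(cc)) - max_shift
    return shift / fs


# Synthetic usage example: white noise delayed by 5 samples (assumed data).
fs = 16000
rng = np.random.default_rng(0)
ref = rng.standard_normal(1024)
sig = np.concatenate((np.zeros(5), ref[:-5]))
tau = gcc_phat_delay(sig, ref, fs)
```

In a full localizer, delays like `tau` would not be used directly. The PHAT-weighted correlations would instead be accumulated over a grid of candidate locations, and the resulting power map would serve as the audio observation that is combined with the face-tracking cue.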
Keywords
hidden Markov models, active speaker, sound source, video camera, audiovisual cue, spatio-temporally coherent location, video cue, speech situation, unimodal SRP-PHAT, speaker localization, multimodal approach, voice activity detection, corresponding HMM, user interfaces