Head-pose and illumination invariant three-dimensional audio-visual speech recognition (2007)

Cited by 23
Abstract
Speech perception is bimodal, drawing not only on the acoustic signal but also on visual cues. Audio-visual speech recognition aims to improve the performance of conventional automated speech recognition by incorporating visual information. Because they rely on a fundamentally limited two-dimensional representation, current approaches to visual feature extraction are not invariant to the speaker's pose or to the illumination of the environment. The research presented in this thesis develops three-dimensional methods for visual feature extraction that alleviate this limitation.

Following the concepts of Grenander's General Pattern Theory, prior knowledge of the speaker's face is described by a prototype consisting of a 3-D surface and a texture. The variability in observed video images of a speaker associated with pose, articulatory facial motion, and illumination is represented by transformations acting on the prototype, which form the group of geometric and photometric variability. Facial motion is described as smooth deformations of the prototype surface and is learned from motion-capture data. The effects of illumination are accommodated by analytically constructing surface scalar fields that express relative changes in the face-surface irradiance.

We derive a multi-resolution tracking algorithm for estimating the speaker's pose, articulatory facial motion, and illumination from uncalibrated monocular video sequences. The inferred facial motion parameters are used as visual features in audio-visual speech recognition, and an application of our approach to large-vocabulary audio-visual speech recognition is presented. Speaker-independent speech recognition combines audio and visual models at the utterance level. We demonstrate that the visual features derived with our 3-D approach significantly improve speech recognition performance across a wide range of acoustic signal-to-noise ratios.
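The utterance-level combination of audio and visual models described above can be sketched as a weighted log-likelihood fusion. This is a minimal illustrative sketch, not the thesis's actual decoder: the function name, the linear weighting scheme, and the fixed weight `lam` are all assumptions; in practice the weight would be tuned to the acoustic signal-to-noise ratio.

```python
# Hedged sketch of utterance-level audio-visual score fusion.
# Assumption: each recognition hypothesis carries separate audio and
# visual log-likelihoods, combined by a linear weight lam in [0, 1].
def fuse_and_decode(hypotheses, lam=0.7):
    """Return the label of the hypothesis maximizing the fused score.

    hypotheses: list of (label, audio_loglik, visual_loglik) tuples
    lam: audio weight; a lower lam gives the visual stream more
         influence, as one would want at low acoustic SNR.
    """
    return max(hypotheses, key=lambda h: lam * h[1] + (1 - lam) * h[2])[0]


hyps = [("yes", -10.0, -5.0), ("no", -8.0, -9.0)]
print(fuse_and_decode(hyps, lam=0.7))  # audio dominates -> "no"
print(fuse_and_decode(hyps, lam=0.2))  # visual dominates -> "yes"
```

The point of the sketch is that fusion happens once per utterance, after each stream has scored the whole hypothesis, rather than frame by frame.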
Keywords
visual feature extraction,speech recognition performance,conventional automated speech recognition,speaker-independent speech recognition,visual cues,speech perception,articulatory facial motion,audio-visual speech recognition