End-To-End Generation Of Talking Faces From Noisy Speech

2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (2020)

Abstract
Acoustic cues are not the only component of speech communication; when the visual counterpart is present, it has been shown to benefit speech comprehension. In this work, we propose an end-to-end system (no pre- or post-processing) that can generate talking faces from arbitrarily long noisy speech. We propose a mouth region mask to encourage the network to focus on mouth movements rather than speech-irrelevant movements. In addition, we use generative adversarial network (GAN) training to improve image quality and mouth-speech synchronization. Furthermore, we employ noise-resilient training to make our network robust to unseen non-stationary noise. We evaluate our system with image quality and mouth shape (landmark) measures on noisy speech utterances corrupted by five types of unseen non-stationary noise at signal-to-noise ratios (SNRs) from -10 dB to 30 dB in increments of 1 dB. Results show that our system significantly outperforms a state-of-the-art baseline system, and that our noise-resilient training improves performance on noisy speech over a wide range of SNRs.
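To make the noise-resilient training and the mouth region mask described above more concrete, the sketch below is a minimal, hypothetical illustration (the helper names `mix_at_snr` and `masked_l1_loss` are assumptions, not the paper's actual implementation): it mixes a noise clip into clean speech at a random SNR drawn from the stated -10 dB to 30 dB range, and weights an L1 reconstruction loss by a per-pixel mouth-region mask so mouth pixels dominate the error.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Mix a noise clip into clean speech at a target SNR (in dB).

    Illustrates the noise-resilient training idea: during training, clean
    utterances are corrupted with non-stationary noise at a randomly drawn
    SNR so the model sees a wide range of noise conditions.
    """
    # Tile or trim the noise to match the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12

    # Scale the noise so 10*log10(speech_power / scaled_noise_power) == snr_db.
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    noise = noise * np.sqrt(target_noise_power / noise_power)
    return speech + noise


def masked_l1_loss(generated, target, mouth_mask):
    """L1 reconstruction loss weighted by a mouth-region mask.

    `mouth_mask` is a per-pixel weight map (e.g. 1 inside the mouth region,
    a smaller value elsewhere) that pushes the generator to focus on mouth
    movements rather than speech-irrelevant motion.
    """
    return np.sum(np.abs(generated - target) * mouth_mask) / (np.sum(mouth_mask) + 1e-12)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.standard_normal(16000)      # 1 s of "speech" at 16 kHz
    noise = rng.standard_normal(4000)       # short non-stationary noise clip
    snr = rng.uniform(-10.0, 30.0)          # SNR drawn from the training range
    noisy = mix_at_snr(clean, noise, snr)

    fake = rng.random((96, 96, 3))          # generated face frame
    real = rng.random((96, 96, 3))          # ground-truth face frame
    mask = np.full((96, 96, 3), 0.2)        # small weight on the rest of the face
    mask[48:, 24:72, :] = 1.0               # emphasize the lower-face / mouth area
    print(f"SNR: {snr:.1f} dB, masked L1: {masked_l1_loss(fake, real, mask):.4f}")
```

In practice the masked L1 term would be combined with the GAN losses mentioned in the abstract; the sketch only shows how the mask reweights the reconstruction term.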
Keywords
generative models, talking face from speech, speech animation, generative adversarial networks