Sound to Visual: Hierarchical Cross-Modal Talking Face Video Generation

IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (2019)

Cited by 4 | Views 51
Abstract
Modeling the dynamics of a moving human face/body conditioned on another modality is a fundamental problem in computer vision, with applications ranging from audio-to-video generation [3] to text-to-video generation and skeleton-to-image/video generation [7]. This paper considers the following task: given a target face image and an arbitrary speech audio recording, generate a photo-realistic talking face of the target subject saying that speech with natural lip synchronization, while maintaining a smooth transition of facial images over time (see Fig. 1). The model should generalize robustly to different types of faces (e.g., cartoon faces, animal faces) and to noisy speech conditions. Solving this task enables many applications, e.g., lip-reading from over-the-phone audio for hearing-impaired people, and generating virtual characters with facial movements synchronized to speech audio for movies and games.

The main difference between still-image generation and video generation is temporal-dependency modeling. It imposes additional challenges for two reasons: viewers are sensitive to any pixel jittering (e.g., temporal discontinuities and subtle artifacts) in a video, and they are also sensitive to slight misalignment between facial movements and speech audio. However, recent work [3, 2] has tended to formulate video generation as a temporally independent image generation problem. In this paper, we propose a novel temporal GAN structure, which consists of a multi-modal convolutional-RNN-based (MMCRNN) generator and a novel regression-based discriminator structure. By modeling temporal dependencies, our MMCRNN-based generator yields smoother transitions between adjacent frames. Our regression-based discriminator combines sequence-level (temporal) information and frame-level (pixel-variation) information to evaluate the generated video.

Another challenge of talking face generation is handling visual dynamics (e.g., camera angles, head movements) that are not relevant to, and hence cannot be inferred from, speech audio. Such complicated dynamics, if modeled in pixel space, result in low-quality videos. For example, in web videos (e.g., the LRW and VoxCeleb datasets), speakers move significantly while talking. Nonetheless, recent photo-realistic talking face generation methods [3, 9] fail to consider this problem. In this paper, we propose a hierarchical structure …
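The abstract names two components: an MMCRNN generator that carries a recurrent state across frames so adjacent outputs stay temporally smooth, and a regression-based discriminator that evaluates a clip by combining sequence-level and frame-level information. The following is a minimal PyTorch sketch of that idea only; the class names, layer sizes, and the way the two scores are combined are illustrative assumptions, not the authors' released implementation.

# Hypothetical sketch of the two components described in the abstract.
# All names, shapes, and score-combination choices are assumptions.
import torch
import torch.nn as nn

class MMCRNNGenerator(nn.Module):
    """Multi-modal convolutional-RNN generator: fuses per-frame audio features
    with an identity encoding of the target face and carries a recurrent state
    across frames to model temporal dependencies."""
    def __init__(self, audio_dim=128, img_dim=256, hidden_dim=256):
        super().__init__()
        self.img_enc = nn.Sequential(          # encode the target face image once
            nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(64, img_dim, 4, 2, 1),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.rnn = nn.GRUCell(audio_dim + img_dim, hidden_dim)  # temporal state
        self.dec = nn.Sequential(              # decode hidden state to a frame
            nn.Linear(hidden_dim, 3 * 64 * 64), nn.Tanh())

    def forward(self, face_img, audio_feats):
        # face_img: (B, 3, 64, 64); audio_feats: (B, T, audio_dim)
        B, T, _ = audio_feats.shape
        id_code = self.img_enc(face_img)
        h = audio_feats.new_zeros(B, self.rnn.hidden_size)
        frames = []
        for t in range(T):                     # recurrent rollout over time
            h = self.rnn(torch.cat([audio_feats[:, t], id_code], dim=1), h)
            frames.append(self.dec(h).view(B, 3, 64, 64))
        return torch.stack(frames, dim=1)      # (B, T, 3, 64, 64)

class RegressionDiscriminator(nn.Module):
    """Regression-based discriminator: regresses a realism score for a whole
    clip by combining frame-level scores with a sequence-level (temporal) score."""
    def __init__(self, hidden_dim=256):
        super().__init__()
        self.frame_enc = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, 2, 1),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.frame_score = nn.Linear(128, 1)                  # per-frame realism
        self.temporal = nn.GRU(128, hidden_dim, batch_first=True)
        self.seq_score = nn.Linear(hidden_dim, 1)             # clip-level realism

    def forward(self, video):
        # video: (B, T, 3, 64, 64)
        B, T = video.shape[:2]
        feats = self.frame_enc(video.flatten(0, 1)).view(B, T, -1)
        frame_scores = self.frame_score(feats).mean(dim=1)    # frame-level term
        _, h_last = self.temporal(feats)
        seq_scores = self.seq_score(h_last[-1])               # sequence-level term
        return frame_scores + seq_scores                      # combined score

In a standard GAN training loop, the discriminator's combined score would be regressed toward a high target for real clips and a low target for generated ones, so both pixel-level artifacts and temporal misalignment are penalized.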