DualLip: A System for Joint Lip Reading and Generation

MM '20: The 28th ACM International Conference on Multimedia, Seattle, WA, USA, October 2020

Lip reading aims to recognize text from talking lips, while lip generation aims to synthesize talking lips from text; lip generation is a key component of talking face generation and the dual task of lip reading. Both tasks require a large amount of paired lip video and text training data, and both perform poorly in low-resource scenarios where paired training data is limited. In this paper, we develop DualLip, a system that jointly improves lip reading and generation by leveraging the task duality and using unlabeled text and lip video data. The key ideas of DualLip are: 1) generate lip video from unlabeled text using a lip generation model, and use the resulting pseudo data pairs to improve lip reading; 2) generate text from unlabeled lip video using a lip reading model, and use the resulting pseudo data pairs to improve lip generation. To exploit the benefit of DualLip for lip generation, we further extend DualLip to talking face generation with two additional components: lip to face generation and text to speech generation, which share the same duration for synchronization. Experiments on the GRID and TCD-TIMIT datasets demonstrate the effectiveness of DualLip in improving lip reading, lip generation and talking face generation by utilizing unlabeled data, especially in low-resource scenarios. Specifically, on the GRID dataset, the lip generation model in our DualLip system trained with only 10% paired data and 90% unpaired data outperforms the same model trained on the full paired data, and our lip reading model achieves a 1.16% character error rate and a 2.71% word error rate, outperforming state-of-the-art models that use the same amount of paired data.
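The two key ideas can be illustrated with a minimal Python sketch of how pseudo data pairs might be built from unlabeled data. This is not the authors' implementation: the function `build_pseudo_pairs`, the `Video` representation, and the toy models `toy_generator` and `toy_reader` are all hypothetical stand-ins, and real lip reading and lip generation models would be trained neural networks.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in: a "lip video" is just a list of floats here.
Video = List[float]
Pair = Tuple[Video, str]  # (lip video, text)


def build_pseudo_pairs(
    unlabeled_text: Sequence[str],
    unlabeled_video: Sequence[Video],
    lip_generator: Callable[[str], Video],
    lip_reader: Callable[[Video], str],
) -> Tuple[List[Pair], List[Pair]]:
    """Return (pairs for training the reader, pairs for training the generator)."""
    # Idea 1: synthesize video from unlabeled text; the real text is the label,
    # so these pairs serve as extra training data for the lip reading model.
    reader_pairs = [(lip_generator(t), t) for t in unlabeled_text]
    # Idea 2: predict text from unlabeled video; the real video is the target,
    # so these pairs serve as extra training data for the lip generation model.
    generator_pairs = [(v, lip_reader(v)) for v in unlabeled_video]
    return reader_pairs, generator_pairs


# Toy models (assumptions for illustration only): encode characters as floats.
def toy_generator(text: str) -> Video:
    return [float(ord(c)) for c in text]


def toy_reader(video: Video) -> str:
    return "".join(chr(int(x)) for x in video)


# A small amount of genuinely paired data plus unlabeled text and video.
paired = [(toy_generator("bin blue"), "bin blue")]
reader_pairs, generator_pairs = build_pseudo_pairs(
    unlabeled_text=["set red"],
    unlabeled_video=[toy_generator("lay green")],
    lip_generator=toy_generator,
    lip_reader=toy_reader,
)

# Each model would then be trained on paired data plus its pseudo pairs.
reader_train = paired + reader_pairs
generator_train = paired + generator_pairs
```

In each pseudo pair, the real (unlabeled) sample sits on the side the model must produce: real text is the target for the reader, and real video is the target for the generator. Only the model input is synthetic, which is one plausible reading of why the duality helps in low-resource settings.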
Keywords: lip reading, lip generation, task duality, talking face generation, lip to face, text to speech