Large-scale multilingual audio visual dubbing

Yi Yang
Yannis Assael
Miaosen Wang
Wendi Liu
Eren Sezener
Luis C. Cobo
Keywords:
online education, fine tuning, target language, audiovisual translation, visual content

Abstract:

We describe a system for large-scale audiovisual translation and dubbing, which translates videos from one language to another. The source language's speech content is transcribed to text, translated, and automatically synthesized into target language speech using the original speaker's voice. The visual content is translated by synthesizing lip movements for the speaker to match the translated audio.
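Below is a minimal sketch, in Python, of how the pipeline described in the abstract composes its stages. Every function here (transcribe, translate, synthesize_speech, lipsync) is a hypothetical placeholder standing in for a real component (ASR, machine translation, multilingual TTS with voice cloning, and the lipsync model); none of them is the authors' actual implementation.

```python
# Hypothetical stage stubs; each stands in for a real component and is not
# the authors' implementation.
def transcribe(audio, lang):
    """Speech-to-text in the source language (ASR)."""
    raise NotImplementedError

def translate(text, src, tgt):
    """Text-to-text machine translation."""
    raise NotImplementedError

def synthesize_speech(text, speaker_reference_audio, lang):
    """TTS in the target language, cloning the original speaker's voice (cf. [7, 27])."""
    raise NotImplementedError

def lipsync(frames, audio):
    """Re-render the mouth region so the lip movements match the new audio."""
    raise NotImplementedError

def dub_video(frames, audio, src_lang, tgt_lang):
    """Compose the stages: transcribe, translate, re-synthesize, re-render."""
    text = transcribe(audio, lang=src_lang)
    translated = translate(text, src=src_lang, tgt=tgt_lang)
    new_audio = synthesize_speech(translated, speaker_reference_audio=audio, lang=tgt_lang)
    new_frames = lipsync(frames, new_audio)
    return new_frames, new_audio
```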

Introduction
  • Audio dubbing is better for beginner readers, but it still loses important aspects of the acting and reduces user engagement.
  • This happens because mouth motions play a crucial role in speech understanding [6].
Highlights
  • Automatic translation of educational videos offers an important avenue for improving online education and diversity in many fields of technology
  • This work focuses on translating audiovisual content, where today more than 500 hours of video are uploaded to the Internet per minute [3]
  • We extend audio-only dubbing to include a visual dubbing component that translates the lip movements of speakers to match the phonemes of the translated audio
  • Spatiotemporal discriminators: we introduce a dual discriminator setup D to improve the quality of the generated images from G and the naturalness of motion, since the reconstruction loss treats the outputs as independent frames (see the sketch after this list)
  • In this report we have described a large-scale system for audiovisual translation and dubbing
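The dual discriminator setup mentioned above can be pictured as one discriminator scoring individual frames and one scoring short frame sequences, trained with, e.g., a hinge objective as in [23, 24]. The sketch below is a minimal illustration under assumed layer sizes; it is not the paper's architecture (Tables 5 to 7 give the actual lipsync encoder/decoder configurations).

```python
# Minimal sketch of a dual (per-frame + temporal) discriminator pair.
# Layer widths, kernel sizes, and the hinge formulation are illustrative assumptions.
import tensorflow as tf

def frame_discriminator():
    # Scores individual frames [batch, H, W, C] for per-frame realism.
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(64, 4, strides=2, activation="relu"),
        tf.keras.layers.Conv2D(128, 4, strides=2, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(1),
    ])

def temporal_discriminator():
    # Scores frame sequences [batch, T, H, W, C] for motion naturalness.
    return tf.keras.Sequential([
        tf.keras.layers.Conv3D(64, 4, strides=(1, 2, 2), activation="relu"),
        tf.keras.layers.Conv3D(128, 4, strides=(1, 2, 2), activation="relu"),
        tf.keras.layers.GlobalAveragePooling3D(),
        tf.keras.layers.Dense(1),
    ])

def discriminator_hinge_loss(real_scores, fake_scores):
    # Hinge loss commonly used for GAN discriminators [23, 24].
    return (tf.reduce_mean(tf.nn.relu(1.0 - real_scores))
            + tf.reduce_mean(tf.nn.relu(1.0 + fake_scores)))
```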
Results
  • Example of the system's text translation: "TensorFlow is an open source machine learning platform for everyone." is translated to "TensorFlow es una plataforma de aprendizaje automático de código abierto para todos."
  • The authors aim to demonstrate, with quantitative metrics, the impact of a) the architectural decisions, b) multilingual datasets, and c) fine-tuning on a single speaker.
  • Table 2 demonstrates the performance impact of each carefully chosen architectural decision of the model on the dataset described in section 3.1.1.
  • The comparisons cover reference-frame selection strategies (Random 1, Random 10, KMeans 10) and the re-implemented baselines You Said That? [30], Towards Automatic Face-to-Face Translation [34], and Realistic Speech-Driven Facial Animation with GANs [40], evaluated with FID↓ (a hedged sketch of the FID computation follows below).
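For reference, FID (Fréchet Inception Distance) [50] compares the statistics of Inception features of generated and real frames; lower is better. The sketch below is a standard computation from precomputed feature arrays and is not tied to the authors' evaluation code.

```python
# FID between two sets of precomputed Inception activations [50];
# feature extraction itself is not shown.
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats: np.ndarray, fake_feats: np.ndarray) -> float:
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerical noise
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```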
Conclusion
  • In this report the authors have described a large-scale system for audiovisual translation and dubbing.
  • The system translates both the audio and visual content of a target video, creating a seamless viewing experience in the target language.
  • The key challenge in audiovisual translation is to modify the lip movements of the speaker to match the translated audio
  • To tackle this problem the authors collected a large multilingual dataset and used it to train a large multilingual multi-speaker lipsync model.
  • The authors' quantitative and qualitative evaluations justify the choices of architecture and procedure
Tables
  • Table1: A comparison of our dataset to existing audio-visual video dubbing datasets. The identity count for MLVD is a rough estimate that assumes all utterances from the same video are spoken by the same person
  • Table2: Performance of various model configurations
  • Table3: Dataset ablations
  • Table4: Human-evaluated naturalness (no audio; scale 1 to 5) and synchronization with audio (scale 1 to 3)
  • Table5: Lip Sync image encoder and reference encoder architecture
  • Table6: Lip Sync audio encoder architecture
  • Table7: Lip Sync image decoder architecture
Related work
  • Research in face synthesis conditioned on facial structure, audio, or even text has gained significant attention in recent years. While early literature commonly predicts mouth shapes from audio [28, 29], many recent approaches generate full talking heads [30,31,32]. Other works only synthesize the mouth region in a given video [14, 33, 34]. Below we discuss approaches related to our work.

    Talking heads Generating talking faces or heads by conditioning on audio [30,31,32, 35,36,37,38,39,40,41] or on facial structure extracted from other videos (e.g. 3D meshes or landmarks) [42, 43] has been widely studied in recent years. These approaches aim to generate a complete video of a person from scratch, often given just a single reference image. Our work instead focuses on generating only the mouth region for the purpose of audiovisual dubbing.
Funding
  • 59% of Internet content is published in English [1], but only a quarter of Internet users speak English as their first language [2]
  • After all of these filtering steps, approximately 5% of the original speech segments remain
Study subjects and analysis
milestone papers: 3
Our model utilizes 10 reference frames selected with K-Means, an attention mechanism, a temporal network, and four training losses (L1, MS-SSIM, landmarks, and GAN); a hedged sketch of this combined objective is given below. For a fair comparison with earlier literature, we re-implemented the models from three milestone papers [30, 34, 40] and trained them on our dataset. The works were re-implemented as closely as possible in our own experimental setup.
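A minimal sketch of the four-term training objective named above (L1, MS-SSIM, landmark, and GAN terms) follows. The loss weights and the assumption that landmarks are provided as coordinate tensors are illustrative; they are not the paper's tuned values.

```python
# Combined generator objective: L1 + (1 - MS-SSIM) + landmark L2 + hinge GAN term.
# Weights w_* are illustrative assumptions, not the paper's values.
import tensorflow as tf

def generator_loss(pred_frames, target_frames,       # [batch, H, W, C] images in [0, 1]
                   pred_landmarks, target_landmarks, # [batch, num_points, 2] coordinates
                   fake_scores,                      # discriminator scores on generated frames
                   w_l1=1.0, w_ssim=1.0, w_lmk=1.0, w_gan=0.1):
    l1 = tf.reduce_mean(tf.abs(pred_frames - target_frames))
    ms_ssim = tf.reduce_mean(
        tf.image.ssim_multiscale(pred_frames, target_frames, max_val=1.0))  # [20, 21]
    lmk = tf.reduce_mean(tf.square(pred_landmarks - target_landmarks))
    gan = -tf.reduce_mean(fake_scores)  # generator side of the hinge objective [23, 24]
    return w_l1 * l1 + w_ssim * (1.0 - ms_ssim) + w_lmk * lmk + w_gan * gan
```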

Reference
  • [1] W3Techs. Usage statistics of content languages for websites. https://w3techs.com/technologies/overview/content_language, July 2020.
  • [2] Interworldstats. Internet world users by language. https://www.internetworldstats.com/stats7.htm, March 2020.
  • [3] J. Clement. Hours of video uploaded to YouTube every minute as of May 2019. https://www.statista.com/statistics/259477/hours-of-video-uploaded-to-youtube-every-minute/, August 2019.
  • [4] C. M. Koolstra, A. L. Peeters, and H. Spinhof. The pros and cons of dubbing and subtitling. European Journal of Communication, 17(3):325–354, 2002.
  • [5] B. Wissmath, D. Weibel, and R. Groner. Dubbing or subtitling? Effects on spatial presence, transportation, flow, and enjoyment. Journal of Media Psychology, 21(3):114–125, 2009.
  • [6] H. McGurk and J. MacDonald. Hearing lips and seeing voices. Nature, 264(5588):746–748, 1976.
  • [7] Y. Chen, Y. Assael, B. Shillingford, D. Budden, S. Reed, H. Zen, Q. Wang, L. C. Cobo, A. Trask, B. Laurie, et al. Sample efficient adaptive text-to-speech. arXiv preprint arXiv:1809.10460, 2018.
  • [8] H. Liao, E. McDermott, and A. Senior. Large scale deep neural network acoustic modeling with semi-supervised training data for YouTube video transcription. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, pages 368–373. IEEE, 2013.
  • [9] B. Shillingford, Y. Assael, M. W. Hoffman, T. Paine, C. Hughes, U. Prabhu, H. Liao, H. Sak, K. Rao, L. Bennett, M. Mulville, M. Denil, B. Coppin, B. Laurie, A. Senior, and N. de Freitas. Large-scale visual speech recognition. In INTERSPEECH, pages 4135–4139, 2019.
  • [10] M. Cooke, J. Barker, S. Cunningham, and X. Shao. An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America, 120(5):2421–2424, 2006.
  • [11] C. Richie, S. Warburton, and M. Carter. Audiovisual database of spoken American English. Linguistic Data Consortium, 2009.
  • [12] N. Harte and E. Gillen. TCD-TIMIT: An audio-visual corpus of continuous speech. IEEE Transactions on Multimedia, 17(5):603–615, 2015.
  • [13] J. S. Chung and A. Zisserman. Lip reading in the wild. In Asian Conference on Computer Vision, pages 87–103, 2016.
  • [14] S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-Shlizerman. Synthesizing Obama: Learning lip sync from audio. ACM Transactions on Graphics (TOG), 36(4):95, 2017.
  • [15] A. Nagrani, J. S. Chung, and A. Zisserman. VoxCeleb: A large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612, 2017.
  • [16] J. S. Chung, A. Nagrani, and A. Zisserman. VoxCeleb2: Deep speaker recognition. arXiv preprint arXiv:1806.05622, 2018.
  • [17] A. Duarte, F. Roldan, M. Tubau, J. Escur, S. Pascual, A. Salvador, E. Mohedano, K. McGuinness, J. Torres, and X. Giro-i Nieto. Wav2Pix: Speech-conditioned face generation using generative adversarial networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume 3, 2019.
  • [18] J. L. Pech-Pacheco, G. Cristóbal, J. Chamorro-Martinez, and J. Fernández-Valdivia. Diatom autofocusing in brightfield microscopy: A comparative study. In Proceedings 15th International Conference on Pattern Recognition (ICPR-2000), volume 3, pages 314–317. IEEE, 2000.
  • [19] J. Roth, S. Chaudhuri, O. Klejch, R. Marvin, A. Gallagher, L. Kaver, S. Ramaswamy, A. Stopczynski, C. Schmid, Z. Xi, et al. AVA active speaker: An audio-visual dataset for active speaker detection. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4492–4496. IEEE, 2020.
  • [20] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
  • [21] H. Zhao, O. Gallo, I. Frosio, and J. Kautz. Loss functions for image restoration with neural networks. IEEE Transactions on Computational Imaging, 3(1):47–57, 2016.
  • [22] A. Clark, J. Donahue, and K. Simonyan. Efficient video generation on complex datasets. arXiv preprint arXiv:1907.06571, 2019.
  • [23] J. H. Lim and J. C. Ye. Geometric GAN. arXiv preprint arXiv:1705.02894, 2017.
  • [24] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
  • [25] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4779–4783. IEEE, 2018.
  • [26] Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Z. Chen, R. Skerry-Ryan, Y. Jia, A. Rosenberg, and B. Ramabhadran. Learning to speak fluently in a foreign language: Multilingual speech synthesis and cross-language voice cloning. Proc. Interspeech 2019, pages 2080–2084, 2019.
  • [27] Y. Chen, Y. M. Assael, B. Shillingford, D. Budden, S. E. Reed, H. Zen, Q. Wang, L. C. Cobo, A. Trask, B. Laurie, Ç. Gülçehre, A. van den Oord, O. Vinyals, and N. de Freitas. Sample efficient adaptive text-to-speech. In Proceedings of the International Conference on Learning Representations, 2019.
  • [28] A. Simons. Generation of mouthshape for a synthetic talking head. Proc. of the Institute of Acoustics, 1990.
  • [29] E. Yamamoto, S. Nakamura, and K. Shikano. Lip movement synthesis from speech based on hidden Markov models. Speech Communication, 26(1-2):105–115, 1998.
  • [30] J. S. Chung, A. Jamaludin, and A. Zisserman. You said that? arXiv preprint arXiv:1705.02966, 2017.
  • [31] L. Chen, Z. Li, R. K. Maddox, Z. Duan, and C. Xu. Lip movements generation at a glance. In Proceedings of the European Conference on Computer Vision (ECCV), pages 520–535, 2018.
  • [32] H. Zhou, Y. Liu, Z. Liu, P. Luo, and X. Wang. Talking face generation by adversarially disentangled audio-visual representation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 9299–9306, 2019.
  • [33] R. Kumar, J. Sotelo, K. Kumar, A. de Brébisson, and Y. Bengio. ObamaNet: Photo-realistic lip-sync from text. arXiv preprint arXiv:1801.01442, 2017.
  • [34] P. KR, R. Mukhopadhyay, J. Philip, A. Jha, V. Namboodiri, and C. Jawahar. Towards automatic face-to-face translation. In Proceedings of the 27th ACM International Conference on Multimedia, pages 1428–1436, 2019.
  • [35] O. Wiles, A. Sophia Koepke, and A. Zisserman. X2Face: A network for controlling face generation using images, audio, and pose codes. In Proceedings of the European Conference on Computer Vision (ECCV), pages 670–686, 2018.
  • [36] H. Zhu, A. Zheng, H. Huang, and R. He. High-resolution talking face generation via mutual information approximation. arXiv preprint arXiv:1812.06589, 2018.
  • [37] Y. Song, J. Zhu, X. Wang, and H. Qi. Talking face generation by conditional recurrent adversarial network. arXiv preprint arXiv:1804.04786, 2018.
  • [38] A. Jamaludin, J. S. Chung, and A. Zisserman. You said that?: Synthesising talking faces from audio. International Journal of Computer Vision, 127(11-12):1767–1779, 2019.
  • [39] L. Chen, R. K. Maddox, Z. Duan, and C. Xu. Hierarchical cross-modal talking face generation with dynamic pixel-wise loss. arXiv preprint arXiv:1905.03820, 2019.
  • [40] K. Vougioukas, S. Petridis, and M. Pantic. Realistic speech-driven facial animation with GANs. arXiv preprint arXiv:1906.06337, 2019.
  • [41] R. Yi, Z. Ye, J. Zhang, H. Bao, and Y.-J. Liu. Audio-driven talking face video generation with natural head pose. arXiv preprint arXiv:2002.10137, 2020.
  • [42] E. Zakharov, A. Shysheya, E. Burkov, and V. Lempitsky. Few-shot adversarial learning of realistic neural talking head models. arXiv preprint arXiv:1905.08233, 2019.
  • [43] J. Thies, M. Elgharib, A. Tewari, C. Theobalt, and M. Nießner. Neural voice puppetry: Audio-driven facial reenactment. arXiv preprint arXiv:1912.05566, 2019.
  • [44] L. Song, W. Wu, C. Qian, R. He, and C. C. Loy. Everybody's talkin': Let me talk as you want. arXiv preprint arXiv:2001.05201, 2020.
  • [45] O. Fried, A. Tewari, M. Zollhöfer, A. Finkelstein, E. Shechtman, D. B. Goldman, K. Genova, Z. Jin, C. Theobalt, and M. Agrawala. Text-based editing of talking-head video. arXiv preprint arXiv:1906.01524, 2019.
  • [46] H. Kim, M. Elgharib, M. Zollhöfer, H.-P. Seidel, T. Beeler, C. Richardt, and C. Theobalt. Neural style-preserving visual dubbing. ACM Transactions on Graphics (TOG), 38(6):1–13, 2019.
  • [47] A. Jha, V. Voleti, V. Namboodiri, and C. Jawahar. Cross-language speech dependent lip-synchronization. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7140–7144. IEEE, 2019.
  • [48] T. Karras, T. Aila, S. Laine, A. Herva, and J. Lehtinen. Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Transactions on Graphics (TOG), 36(4):94, 2017.
  • [49] A. Nagrani, J. S. Chung, W. Xie, and A. Zisserman. VoxCeleb: Large-scale speaker verification in the wild. Computer Speech & Language, 60:101027, 2020.
  • [50] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
  • [51] G. Van Rossum and F. L. Drake. Python 3 Reference Manual. CreateSpace, Scotts Valley, CA, 2009. ISBN 1441412697.
  • [52] S. van der Walt, S. C. Colbert, and G. Varoquaux. The NumPy array: A structure for efficient numerical computation. Computing in Science & Engineering, 13(2):22–30, 2011.
  • [53] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
  • [54] G. Bradski. The OpenCV Library. Dr. Dobb's Journal of Software Tools, 2000.
  • [55] C. Lugaresi, J. Tang, H. Nash, C. McClanahan, E. Uboweja, M. Hays, F. Zhang, C. Chang, M. G. Yong, J. Lee, et al. MediaPipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172, 2019.