StyleDubber: Towards Multi-Scale Style Learning for Movie Dubbing
CoRR (2024)
Abstract
Given a script, the challenge in Movie Dubbing (Visual Voice Cloning, V2C) is
to generate speech that aligns well with the video in both time and emotion,
based on the tone of a reference audio track. Existing state-of-the-art V2C
models break the phonemes in the script according to the divisions between
video frames, which solves the temporal alignment problem but leads to
incomplete phoneme pronunciation and poor identity stability. To address this
problem, we propose StyleDubber, which switches dubbing learning from the frame
level to the phoneme level. It contains three main components: (1) a multimodal
style adaptor that operates at the phoneme level, learning pronunciation style
from the reference audio and generating intermediate representations informed by
the facial emotion presented in the video; (2) an utterance-level style learning
module, which guides both the mel-spectrogram decoding and the refining
processes from the intermediate embeddings to improve the overall style
expression; and (3) a phoneme-guided lip aligner to maintain lip sync.
Extensive experiments on two primary benchmarks, V2C and Grid, demonstrate that
the proposed method performs favorably against the current state of the art.
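To make the three-component design concrete, the sketch below wires a phoneme-level multimodal style adaptor, an utterance-level style module, and a phoneme-guided lip aligner into one model. This is an illustrative sketch only, not the authors' implementation: every module name, dimension, and interface here is an assumption, and plain cross-attention stands in for whatever mechanisms the paper actually uses.

```python
# Hypothetical sketch of a StyleDubber-like architecture; all names,
# shapes, and mechanisms are assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class MultimodalStyleAdaptor(nn.Module):
    """Phoneme-level style: each phoneme attends to the reference audio
    (pronunciation style) and to facial-emotion features from the video."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.audio_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.face_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, phoneme_emb, ref_audio_emb, face_emb):
        style, _ = self.audio_attn(phoneme_emb, ref_audio_emb, ref_audio_emb)
        emotion, _ = self.face_attn(phoneme_emb, face_emb, face_emb)
        return phoneme_emb + style + emotion  # intermediate representations


class UtteranceStyleModule(nn.Module):
    """Utterance-level style vector that conditions mel decoding/refining."""

    def __init__(self, d_model: int = 256):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, ref_audio_emb):
        # Pool the whole reference utterance into one global style vector.
        return self.proj(ref_audio_emb.mean(dim=1, keepdim=True))


class PhonemeGuidedLipAligner(nn.Module):
    """Map phoneme-level features to the video frame rate; cross-attention
    from lip features is a stand-in for the paper's aligner."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, phoneme_repr, lip_emb):
        aligned, _ = self.attn(lip_emb, phoneme_repr, phoneme_repr)
        return aligned  # one feature per video frame


class StyleDubberSketch(nn.Module):
    def __init__(self, d_model: int = 256, n_mels: int = 80):
        super().__init__()
        self.adaptor = MultimodalStyleAdaptor(d_model)
        self.utt_style = UtteranceStyleModule(d_model)
        self.aligner = PhonemeGuidedLipAligner(d_model)
        self.mel_decoder = nn.Linear(d_model, n_mels)  # toy placeholder decoder

    def forward(self, phoneme_emb, ref_audio_emb, face_emb, lip_emb):
        inter = self.adaptor(phoneme_emb, ref_audio_emb, face_emb)
        frames = self.aligner(inter, lip_emb)
        # Utterance-level style conditions the (toy) mel decoding step.
        return self.mel_decoder(frames + self.utt_style(ref_audio_emb))


if __name__ == "__main__":
    B, P, A, V, D = 2, 20, 100, 75, 256  # batch, phonemes, audio/video frames, dim
    model = StyleDubberSketch(D)
    mel = model(torch.randn(B, P, D), torch.randn(B, A, D),
                torch.randn(B, V, D), torch.randn(B, V, D))
    print(mel.shape)  # torch.Size([2, 75, 80]): one mel frame per video frame
```

Note the design point the abstract emphasizes: style is attached to phonemes (queries in the adaptor) rather than to fixed video-frame slices, while the aligner alone handles the mapping to frame timing, so pronunciation units are never split by frame boundaries.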