DiffDub: Person-generic Visual Dubbing Using Inpainting Renderer with Diffusion Auto-encoder
ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2023)
Abstract
Generating high-quality and person-generic visual dubbing remains a
challenge. Recent work has introduced a two-stage paradigm that decouples
rendering from lip synchronization, using an intermediate representation as
the conduit between the two stages. Still, previous methods rely on
rough landmarks or are confined to a single speaker, thus limiting their
performance. In this paper, we propose DiffDub: Diffusion-based dubbing. We
first build an inpainting renderer with a diffusion auto-encoder, incorporating
a mask that delineates editable zones and unaltered regions. This allows for
seamless filling of the lower-face region while preserving the remaining parts.
In our experiments, we encountered several challenges. First, the
semantic encoder lacks robustness, which limits its ability to capture
high-level features. Second, the modeling ignores facial positioning, causing
mouth or nose jitter across frames. To tackle these issues, we employ
several strategies, including data augmentation and supplementary eye
guidance. Moreover, we incorporate a conformer-based reference encoder and
motion generator strengthened by a cross-attention mechanism. This enables our
model to learn person-specific textures with varying references and reduces
reliance on paired audio-visual data. Our experiments show that our approach
outperforms existing methods by considerable margins and delivers seamless,
intelligible videos in person-generic and multilingual scenarios.
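
The mask-based inpainting idea in the abstract can be illustrated with a minimal sketch. It is not the authors' renderer: the abstract does not specify the mask shape, so the half-frame split, the function names `lower_face_mask` and `composite`, and the grayscale nested-list image format are all assumptions made for illustration. The `generated` frame stands in for whatever the diffusion renderer would produce.

```python
from typing import List

Image = List[List[float]]  # grayscale frame as rows of pixel values (assumed format)


def lower_face_mask(height: int, width: int, split: float = 0.5) -> Image:
    """Return a mask with 1.0 in the editable lower rows and 0.0 elsewhere.

    The horizontal split at `split` is a hypothetical choice, not the paper's mask.
    """
    first_editable_row = int(height * split)
    return [
        [1.0 if row >= first_editable_row else 0.0 for _ in range(width)]
        for row in range(height)
    ]


def composite(original: Image, generated: Image, mask: Image) -> Image:
    """Keep original pixels where the mask is 0 and generated pixels where it is 1."""
    return [
        [m * g + (1.0 - m) * o for o, g, m in zip(o_row, g_row, m_row)]
        for o_row, g_row, m_row in zip(original, generated, mask)
    ]
```

Under these assumptions, only the masked lower region is ever replaced, so the pixels outside the mask are carried over from the original frame unchanged.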
Keywords
auto-encoder, person-generic