Unsupervised Audiovisual Synthesis via Exemplar Autoencoders
ICLR (2021)
Abstract
We present an unsupervised approach that converts the input speech of any individual into audiovisual streams of potentially infinitely many output speakers. Our approach builds on simple autoencoders that project out-of-sample data onto the distribution of the training set. We use exemplar autoencoders to learn the voice, stylistic prosody (emotions and ambiance), and visual appearance of a specific target exemplar speech. In contrast to existing methods, the proposed approach can be easily extended to an arbitrarily large number of speakers and styles using only 3 minutes of target audio-video data, without requiring any training data for the input speaker. To the best of our knowledge, this is the first work to demonstrate audiovisual synthesis from an audio signal. To do so, we learn audiovisual bottleneck representations that capture the structured linguistic content of speech. We outperform prior approaches on both audio and video synthesis, and present extensive qualitative analysis in the supplementary material.
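The claim that an autoencoder projects out-of-sample data onto the distribution of its training set can be illustrated with a toy sketch. This is not the paper's model: it assumes a linear autoencoder fitted in closed form with an SVD, and synthetic 3-D points standing in for speech features. The names `fit_linear_autoencoder` and `reconstruct` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def fit_linear_autoencoder(train, bottleneck_dim):
    """Fit a linear autoencoder in closed form (assumption: linear, not neural).

    The best linear encoder/decoder pair under squared error spans the top
    principal directions of the centered training data.
    """
    mean = train.mean(axis=0)
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    basis = vt[:bottleneck_dim]  # shape: (bottleneck_dim, feature_dim)
    return mean, basis

def reconstruct(x, mean, basis):
    """Encode x into the bottleneck, then decode back to feature space."""
    code = (x - mean) @ basis.T   # encoder
    return mean + code @ basis    # decoder

# Toy "target speaker": every training point lies on one line in 3-D space.
rng = np.random.default_rng(0)
direction = np.array([1.0, 2.0, 2.0]) / 3.0      # unit vector
train = rng.normal(size=(200, 1)) * direction    # shape: (200, 3)

mean, basis = fit_linear_autoencoder(train, bottleneck_dim=1)

# An input that does not lie on the training line (an "unseen speaker").
out_of_sample = np.array([3.0, 0.0, 0.0])
projected = reconstruct(out_of_sample, mean, basis)

# The reconstruction lands on the training line, while training points
# themselves pass through the autoencoder essentially unchanged.
print(projected)
print(np.abs(reconstruct(train, mean, basis) - train).max())
```

In this toy setting the bottleneck keeps only the coordinate along the training line, so any input is mapped to its nearest point on that line. The paper's exemplar autoencoders are learned on audio-video data of a target speaker; the sketch only conveys the projection intuition.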
Keywords
unsupervised audiovisual synthesis, autoencoders, exemplar