VoxGenesis: Unsupervised Discovery of Latent Speaker Manifold for Speech Synthesis
arXiv (2024)
Abstract
Achieving nuanced and accurate emulation of human voice has been a
longstanding goal in artificial intelligence. Although significant progress has
been made in recent years, the mainstream of speech synthesis models still
relies on supervised speaker modeling and explicit reference utterances.
However, there are many aspects of human voice, such as emotion, intonation,
and speaking style, for which it is hard to obtain accurate labels. In this
paper, we propose VoxGenesis, a novel unsupervised speech synthesis framework
that can discover a latent speaker manifold and meaningful voice editing
directions without supervision. VoxGenesis is conceptually simple. Instead of
mapping speech features to waveforms deterministically, VoxGenesis transforms a
Gaussian distribution into speech distributions conditioned and aligned by
semantic tokens. This forces the model to learn a speaker distribution
disentangled from the semantic content. During inference, sampling from the
Gaussian distribution enables the creation of novel speakers with distinct
characteristics. More importantly, the exploration of latent space uncovers
human-interpretable directions associated with specific speaker characteristics
such as gender attributes, pitch, tone, and emotion, allowing for voice editing
by manipulating the latent codes along these identified directions. We conduct
extensive experiments to evaluate the proposed VoxGenesis using both subjective
and objective metrics, finding that it produces significantly more diverse and
realistic speakers with distinct characteristics than previous approaches.
We also show that latent space manipulation produces consistent and
human-identifiable effects that are not detrimental to the speech quality,
which was not possible with previous approaches. Audio samples of VoxGenesis
can be found at: .
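The abstract describes sampling novel speaker latents from a Gaussian prior and editing them along discovered interpretable directions. A minimal sketch of that idea, with an assumed latent dimensionality and a hypothetical (randomly chosen) editing direction standing in for one discovered by the method:

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim = 256  # assumed latent dimensionality, not from the paper

# Sample a novel speaker latent from the Gaussian prior.
z = rng.standard_normal(latent_dim)

# Hypothetical unit-norm editing direction (e.g., a "pitch" axis).
# VoxGenesis discovers such directions by exploring the latent
# space; here one is mocked with random values for illustration.
d = rng.standard_normal(latent_dim)
d /= np.linalg.norm(d)

def edit_latent(z: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Move a speaker latent along an interpretable direction by step alpha."""
    return z + alpha * direction

# The edited latent would then condition the generator, together with
# semantic tokens, to synthesize the modified voice.
z_edited = edit_latent(z, d, alpha=2.0)
```

The key property is that the edit changes only the chosen attribute's coordinate in the latent manifold, leaving the semantic content (carried by the tokens) untouched.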