Predicting Expressive Speaking Style from Text in End-to-End Speech Synthesis

2018 IEEE Spoken Language Technology Workshop (SLT)

Cited by 73
Abstract
Global Style Tokens (GSTs) are a recently-proposed method to learn latent disentangled representations of high-dimensional data. GSTs can be used within Tacotron, a state-of-the-art end-to-end text-to-speech synthesis system, to uncover expressive factors of variation in speaking style. In this work, we introduce the Text-Predicting Global Style Token (TP-GST) architecture, which treats GST combination weights or style embeddings as “virtual” speaking style labels within Tacotron. TP-GST learns to predict stylistic renderings from text alone, requiring neither explicit labels during training, nor auxiliary inputs for inference. We show that, when trained on an expressive speech dataset, our system can render text with more pitch and energy variation than two state-of-the-art baseline models. We further demonstrate that TP-GSTs can synthesize speech with background noise removed, and corroborate these analyses with positive results on human-rated listener preference audiobook tasks. Finally, we demonstrate that multi-speaker TP-GST models successfully factorize speaker identity and speaking style. We provide a website with audio samples for each of our findings.
Keywords
Predictive models,Training,Speech synthesis,Spectrogram,Rendering (computer graphics),Computational modeling,Data models