CAMNet: A controllable acoustic model for efficient, expressive, high-quality text-to-speech

Applied Acoustics (2022)

Abstract
Spoken language is becoming one of the key components of human–machine interaction, both to send information to the machine (e.g. voice control) and to receive information from it (e.g. virtual assistants). In this scenario, text-to-speech (TTS) models have become an essential artificial intelligence capability. Even though this interaction can be based on neutral-style speech, generating speech with different styles, pitches and speaking rates may improve the user experience. With this in view, this paper presents CAMNet, a controllable acoustic model for efficient, expressive, high-quality TTS. CAMNet is based on deep convolutional TTS (DCTTS), a state-of-the-art acoustic model that is efficient and produces neutral speech. DCTTS was first adapted to generate Bark cepstrum acoustic features, both to integrate well with the LPCNet (linear prediction coefficient) neural vocoder and to remove the reduction factor, which demanded an upsampling network before the vocoder; i.e. the CAMNet output can be fed directly into LPCNet. Next, style transfer functionality was added by means of a novel characterisation of the prosodic information from the Bark cepstrum acoustic features and a new approach to injecting this information into the convolutional layers. Finally, controllability is provided via a variational auto-encoder module that creates a smoothed, disentangled latent space. This space allows interpolation and extrapolation of reference styles as well as independent and simultaneous control of two generative factors: pitch and speaking rate. Moreover, this controllability is implemented using a simple offset-based approach. To sum up, CAMNet is an efficient acoustic model that provides simple but consistent control over coarse-grained expression, pitch and speaking rate while still producing high-quality synthesised speech.
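The latent-space operations mentioned in the abstract (interpolating or extrapolating between reference styles, then shifting pitch and speaking rate by an offset) can be illustrated with a toy sketch. This is not the CAMNet implementation: the 4-dimensional latent vectors, the function names `blend_styles` and `apply_offsets`, and the assumption that each generative factor corresponds to one fixed direction in the latent space are all hypothetical, chosen only to make the arithmetic concrete.

```python
from typing import Dict, List

Vector = List[float]


def blend_styles(z_a: Vector, z_b: Vector, alpha: float) -> Vector:
    """Linear blend of two style latents (hypothetical helper).

    alpha in [0, 1] interpolates between z_a and z_b;
    alpha outside that range extrapolates beyond them.
    """
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(z_a, z_b)]


def apply_offsets(z: Vector,
                  directions: Dict[str, Vector],
                  offsets: Dict[str, float]) -> Vector:
    """Shift a latent along one direction per factor (hypothetical helper).

    Assumes, for illustration only, that each factor maps to a fixed
    direction in a disentangled latent space.
    """
    shifted = list(z)
    for factor, delta in offsets.items():
        direction = directions[factor]
        shifted = [v + delta * d for v, d in zip(shifted, direction)]
    return shifted


# Toy assumption: pitch lives on axis 0, speaking rate on axis 1.
DIRECTIONS = {
    "pitch": [1.0, 0.0, 0.0, 0.0],
    "rate": [0.0, 1.0, 0.0, 0.0],
}

# Made-up latents standing in for two encoded reference styles.
neutral = [0.0, 0.0, 0.5, -0.5]
happy = [0.2, 0.1, 1.5, 0.5]

halfway = blend_styles(neutral, happy, 0.5)      # interpolation
exaggerated = blend_styles(neutral, happy, 1.5)  # extrapolation
controlled = apply_offsets(halfway, DIRECTIONS,
                           {"pitch": 0.5, "rate": -0.25})
```

In this sketch the offsets touch only the axes assigned to pitch and rate, so the remaining style coordinates of `halfway` are left unchanged, which is the sense in which independent and simultaneous control is meant here.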
Keywords
Text-to-speech, Expressive TTS, Acoustic model, VAE, Disentanglement, Speech synthesis