TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models
arXiv (2023)
Abstract
Recently, there has been a growing interest in the field of controllable
Text-to-Speech (TTS). While previous studies have relied on users providing
specific style factor values based on acoustic knowledge or selecting reference
speeches that meet certain requirements, generating speech solely from natural
text prompts has emerged as a new challenge for researchers. This challenge
arises due to the scarcity of high-quality speech datasets with natural text
style prompts and the absence of advanced text-controllable TTS models. In light
of this, 1) we propose TextrolSpeech, which is the first large-scale speech
emotion dataset annotated with rich text attributes. The dataset comprises
236,220 pairs of natural-text style prompts, each describing five style
factors, and corresponding speech samples. Through iterative experimentation, we
introduce a multi-stage prompt programming approach that effectively utilizes
the GPT model for generating natural style descriptions in large volumes. 2)
Furthermore, to address the need for generating audio with greater style
diversity, we propose an efficient architecture called Salle. This architecture
treats text-controllable TTS as a language model task, utilizing audio codec
codes as an intermediate representation to replace the conventional
mel-spectrogram. Finally, we demonstrate the capability of the
proposed model by showing comparable performance on the controllable TTS
task. Audio samples are available at https://sall-e.github.io/
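
To make the "TTS as a language model task" framing concrete, the following is a minimal sketch of how a style prompt, the content text, and audio codec codes might be laid out as one token stream for next-token prediction. It is not Salle's actual implementation: the vocabulary sizes, the `SEP` token, the id offset, and the helper names `build_sequence` and `training_pairs` are all assumptions made for illustration, and the abstract does not specify how the real model arranges its inputs.

```python
from typing import List, Tuple

# Hypothetical vocabulary layout; none of these values come from the paper.
TEXT_VOCAB = 1000               # ids 0..999 for style-prompt and content tokens
SEP = TEXT_VOCAB                # id 1000 separates the segments
CODEC_OFFSET = TEXT_VOCAB + 1   # codec codes are shifted past the text ids


def build_sequence(style_ids: List[int],
                   content_ids: List[int],
                   codec_codes: List[int]) -> List[int]:
    """Concatenate style prompt, content text, and codec codes into one stream."""
    shifted = [c + CODEC_OFFSET for c in codec_codes]
    return style_ids + [SEP] + content_ids + [SEP] + shifted


def training_pairs(seq: List[int], n_prefix: int) -> List[Tuple[List[int], int]]:
    """Next-token (context, target) pairs, restricted to the codec segment."""
    return [(seq[:i], seq[i]) for i in range(n_prefix, len(seq))]


# Toy ids standing in for a tokenized style prompt, content text, and codec codes.
style, content, codes = [1, 2], [3], [0, 5]
seq = build_sequence(style, content, codes)
pairs = training_pairs(seq, n_prefix=len(style) + len(content) + 2)
```

In this layout the model would condition on the style prompt and content text and be trained only to predict the codec tokens, which a codec decoder would then turn back into a waveform in place of a mel-spectrogram vocoder stage.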