Improving Phonetic Realizations in TTS by Using Phoneme-Aligned Graphemes

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022

Abstract
Most text-to-speech (TTS) acoustic models, such as WaveNet, Tacotron, and ClariNet, use either a phoneme sequence or a letter sequence as the fundamental unit of speech. Although the letter (or grapheme) sequence closely matches the actual runtime input of the TTS system, it often fails to represent fine-grained phonetic variations. A purely phonemic input seems to perform better in practice, but it depends heavily on a meticulously crafted phonology and lexicon. This reliance causes quality and consistency issues, which can force a trade-off between quality and scalability. To overcome this, we propose combining the two inputs by providing phoneme-aligned graphemes to the model. In this paper, we show that this approach can help the model learn to disambiguate some of the more subtle phonemic variations (such as the realization of reduced vowels), and that this effect improves fidelity to the accent of the original voice talent. For evaluation, we present a way of generating an unbiased targeted test using phoneme spectral diffs and, using that test, show improvement over the baseline approach for multiple voice technologies and multiple locales.
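As a rough illustration of what "phoneme-aligned graphemes" could look like as model input, the sketch below pairs each phoneme with the letters that spell it. The abstract does not say how the pairing is encoded, so the `phoneme|grapheme` token format, the function names, and the hand-written ARPAbet-style alignments are all assumptions made for this example, not the authors' method.

```python
from typing import List, Tuple


def pair_phonemes_with_graphemes(
    phonemes: List[str], grapheme_chunks: List[str]
) -> List[Tuple[str, str]]:
    """Pair each phoneme with the grapheme chunk aligned to it.

    Assumes a one-to-one alignment has already been computed elsewhere
    (hypothetical; the alignment procedure is not described in the abstract).
    """
    if len(phonemes) != len(grapheme_chunks):
        raise ValueError("alignment must give exactly one grapheme chunk per phoneme")
    return list(zip(phonemes, grapheme_chunks))


def to_model_tokens(pairs: List[Tuple[str, str]]) -> List[str]:
    """Render each (phoneme, grapheme) pair as one input token (assumed format)."""
    return [f"{phoneme}|{grapheme}" for phoneme, grapheme in pairs]


# Hand-written illustrative alignments; AH0 is the ARPAbet reduced vowel (schwa).
about = pair_phonemes_with_graphemes(
    ["AH0", "B", "AW1", "T"], ["a", "b", "ou", "t"]
)
lemon = pair_phonemes_with_graphemes(
    ["L", "EH1", "M", "AH0", "N"], ["l", "e", "m", "o", "n"]
)

print(to_model_tokens(about))
print(to_model_tokens(lemon))
```

In a purely phonemic input, the reduced vowel in both words is the same symbol. With the paired tokens, the model sees `AH0|a` in one word and `AH0|o` in the other, which is the kind of extra signal that could let it learn different realizations of a reduced vowel.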
Keywords
Graphemes, Phonology, Schwa, Vowels, Accent