CONCATENATIVE SPEECH SYNTHESIS FOR EUROPEAN PORTUGUESE

SSW(1998)

Abstract
This paper describes our ongoing work in the area of text-to-speech synthesis, specifically on concatenative techniques. Our preliminary work consisted of investigating the current trends in concatenative synthesis and the problems that could arise when existing state-of-the-art solutions are applied to the specific case of European Portuguese. Our ultimate goal is to develop a text-to-speech system that could be trained for any speaker's voice in a fully automatic way, i.e., we would like to develop a customized text-to-speech synthesizer for any voice reading a predetermined text. Our first steps in this direction involved such issues as automatic segmentation and alignment of recorded speech, optimized inventory design for concatenative synthesis, unit selection and optimal coupling of the selected units.

1. INTRODUCTION

This paper presents our latest progress concerning text-to-speech synthesis in European Portuguese. The joint effort of the two complementary teams (linguists and engineers) involved in this project started at the beginning of this decade with the development of a rule-based formant synthesizer (DIXI) (1). Several versions of this synthesizer were implemented in the following years, in particular to meet the needs of the handicapped community in Portugal. In parallel with the development of these special-purpose applications, we have been investing in different synthesis models based on concatenative techniques. This includes not only the development of classic PSOLA diphone-based techniques (12), but also the development of CHATR-like systems (5)(11), where larger units are selected and concatenated based on prosodic criteria.

Concatenative text-to-speech systems can, in theory, produce very natural-sounding synthetic speech, since they simply join pre-recorded segments or units to form any sentence. In practice, several factors contribute to a less than perfect speech output quality. For instance, choosing the best set of pre-recorded speech units to use as building blocks is a difficult task. Moreover, the concatenation of units recorded with different intonation or in different phonetic contexts may produce suboptimal results even if the set is reasonably complete and some prosodic transformations are performed during the concatenation phase. Time-domain discontinuities and spectral mismatch may also arise and need to be dealt with in the concatenation process.

We have tried to address these problems in the context of the development of a customized text-to-speech synthesizer, i.e., a system that could be trained in a fully automatic way for any user's voice reading a predetermined text. The fully automatic restriction implies that some trade-offs must be accepted, namely concerning the construction of an inventory of acoustic units and the determination of the optimal coupling of inventory
Keywords
text to speech,speech synthesis,rule based,time domain