Electrolaryngeal Speech Intelligibility Enhancement Through Robust Linguistic Encoders
arXiv (Cornell University), 2023
Abstract
We propose a novel framework for electrolaryngeal speech intelligibility
enhancement through the use of robust linguistic encoders. Pretraining and
fine-tuning approaches have proven to work well in this task, but in most
cases, various mismatches, such as the speech type mismatch (electrolaryngeal
vs. typical) or a speaker mismatch between the datasets used in each stage, can
deteriorate the conversion performance of this framework. To resolve this
issue, we propose a linguistic encoder robust enough to project both
electrolaryngeal (EL) and typical speech into the same latent space while still
extracting accurate linguistic information, creating a unified representation
that reduces the speech type mismatch. Furthermore, we introduce HuBERT output features to
the proposed framework for reducing the speaker mismatch, making it possible to
effectively use a large-scale parallel dataset during pretraining. We show that
compared to the conventional framework using mel-spectrogram input and output
features, the proposed framework enables the model to synthesize more
intelligible and natural-sounding speech, as shown by a significant 16%
improvement in character error rate and a 0.83 improvement in naturalness score.
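
For readers unfamiliar with the intelligibility metric mentioned above, the following is a minimal sketch of how character error rate (CER) is commonly computed: the character-level Levenshtein distance between a reference transcript and a recognized transcript, divided by the reference length. The function names `edit_distance` and `character_error_rate` are illustrative assumptions and are not taken from the paper, which does not describe its evaluation code here.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between ref and hyp."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / number of reference characters (hypothetical helper)."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted character out of five reference characters.
    print(character_error_rate("hello", "hallo"))
```

In practice the hypothesis transcript would come from a speech recognizer run on the converted speech, so a lower CER suggests the output is easier to understand.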
Keywords
electrolaryngeal speech intelligibility enhancement