Joint Speech-Text Embeddings with Disentangled Speaker Features

2023 34th Irish Signals and Systems Conference (ISSC)

Abstract
This paper presents a novel model architecture for speech processing that takes advantage of a joint speech-text embedding space and disentangled speaker features. Here, unsupervised representation learning extracts latent features from the input without labels, which yields task-agnostic but information-entangled embeddings. A unified embedding space for speech and text, on the other hand, aims to leverage acoustic and semantic knowledge from the two modalities, respectively. The model was trained on 4 speakers from the CMU Arctic dataset and evaluated on three downstream tasks: speaker recognition, automatic speech recognition (ASR), and text-to-speech (TTS). Results show 96.87% speaker classification accuracy, an 11.57% word error rate (WER), and a mean Mel Cepstral Distortion (MCD) of 8.91 on the evaluation set.
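The abstract names the components but gives no implementation details, so the following is only a minimal PyTorch sketch of the idea as stated: speech and text encoders projecting into a shared content embedding space, a separate speaker encoder carrying the disentangled speaker features, and a contrastive objective aligning paired utterances with their transcripts. The module shapes, the GRU encoders, and the InfoNCE-style loss are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointSpeechTextModel(nn.Module):
    """Hypothetical sketch: speech and text encoders share a content
    embedding space; a separate speaker encoder captures identity so the
    content embedding can stay speaker-invariant (disentangled)."""

    def __init__(self, n_mels=80, vocab_size=40, content_dim=256, speaker_dim=64):
        super().__init__()
        # Speech branch: mel-spectrogram frames -> content embedding.
        self.speech_encoder = nn.GRU(n_mels, content_dim, batch_first=True)
        # Text branch: character/phoneme ids -> content embedding.
        self.text_embed = nn.Embedding(vocab_size, content_dim)
        self.text_encoder = nn.GRU(content_dim, content_dim, batch_first=True)
        # Speaker branch: utterance-level statistics -> speaker embedding.
        self.speaker_encoder = nn.Sequential(
            nn.Linear(n_mels, 128), nn.ReLU(), nn.Linear(128, speaker_dim)
        )

    def forward(self, mels, token_ids):
        # mels: (batch, frames, n_mels); token_ids: (batch, seq_len)
        _, h_speech = self.speech_encoder(mels)
        speech_emb = F.normalize(h_speech[-1], dim=-1)
        _, h_text = self.text_encoder(self.text_embed(token_ids))
        text_emb = F.normalize(h_text[-1], dim=-1)
        speaker_emb = self.speaker_encoder(mels.mean(dim=1))
        return speech_emb, text_emb, speaker_emb


def contrastive_alignment_loss(speech_emb, text_emb, temperature=0.07):
    """InfoNCE-style loss pulling paired speech/text embeddings together
    while pushing apart mismatched pairs within the batch."""
    logits = speech_emb @ text_emb.t() / temperature
    targets = torch.arange(speech_emb.size(0))
    return F.cross_entropy(logits, targets)


# Toy usage: random tensors standing in for CMU Arctic features.
model = JointSpeechTextModel()
mels = torch.randn(4, 120, 80)          # 4 utterances, 120 frames each
tokens = torch.randint(0, 40, (4, 32))  # matching transcripts
s_emb, t_emb, spk_emb = model(mels, tokens)
print(contrastive_alignment_loss(s_emb, t_emb).item())
```

Normalizing both embeddings before the dot product keeps the contrastive logits on a comparable scale. In this sketch the speaker branch would be trained with its own objective (e.g., a classification loss over the 4 Arctic speakers), so the shared content space has no pressure to encode identity; that separation is one plausible reading of the disentanglement the abstract describes.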
Keywords
speech-text, speech recognition, speech synthesis