Joint Speech-Text Embeddings with Disentangled Speaker Features

2023 34th Irish Signals and Systems Conference (ISSC)

Abstract
This paper presents a novel model architecture for speech processing that takes advantage of a joint speech-text embedding space and disentangled speaker features. Here, unsupervised representation learning extracts latent features from the input without labels, which yields task-agnostic but information-entangled embeddings. A unified embedding space for speech and text, on the other hand, aims to leverage acoustic and semantic knowledge from the two modalities, respectively. The model was trained on 4 speakers from the CMU Arctic dataset and evaluated on three downstream tasks: speaker recognition, automatic speech recognition (ASR), and text-to-speech (TTS). Results show 96.87% speaker classification accuracy, an 11.57% word error rate (WER), and a mean Mel Cepstral Distortion (MCD) of 8.91 on the evaluation set.
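The abstract names the components but gives no implementation details, so the following is only a minimal PyTorch sketch of the idea as stated: speech and text encoders projecting into a shared content embedding space, a separate speaker encoder carrying the disentangled speaker features, and a contrastive objective aligning paired utterances with their transcripts. The module shapes, the GRU encoders, and the InfoNCE-style loss are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointSpeechTextModel(nn.Module):
    """Hypothetical sketch: speech and text encoders share a content
    embedding space; a separate speaker encoder captures identity so the
    content embedding can stay speaker-invariant (disentangled)."""

    def __init__(self, n_mels=80, vocab_size=40, content_dim=256, speaker_dim=64):
        super().__init__()
        # Speech branch: mel-spectrogram frames -> content embedding.
        self.speech_encoder = nn.GRU(n_mels, content_dim, batch_first=True)
        # Text branch: character/phoneme ids -> content embedding.
        self.text_embed = nn.Embedding(vocab_size, content_dim)
        self.text_encoder = nn.GRU(content_dim, content_dim, batch_first=True)
        # Speaker branch: utterance-level statistics -> speaker embedding.
        self.speaker_encoder = nn.Sequential(
            nn.Linear(n_mels, 128), nn.ReLU(), nn.Linear(128, speaker_dim)
        )

    def forward(self, mels, token_ids):
        # mels: (batch, frames, n_mels); token_ids: (batch, seq_len)
        _, h_speech = self.speech_encoder(mels)
        speech_emb = F.normalize(h_speech[-1], dim=-1)
        _, h_text = self.text_encoder(self.text_embed(token_ids))
        text_emb = F.normalize(h_text[-1], dim=-1)
        speaker_emb = self.speaker_encoder(mels.mean(dim=1))
        return speech_emb, text_emb, speaker_emb


def contrastive_alignment_loss(speech_emb, text_emb, temperature=0.07):
    """InfoNCE-style loss pulling paired speech/text embeddings together
    while pushing apart mismatched pairs within the batch."""
    logits = speech_emb @ text_emb.t() / temperature
    targets = torch.arange(speech_emb.size(0))
    return F.cross_entropy(logits, targets)


# Toy usage: random tensors standing in for CMU Arctic features.
model = JointSpeechTextModel()
mels = torch.randn(4, 120, 80)          # 4 utterances, 120 frames each
tokens = torch.randint(0, 40, (4, 32))  # matching transcripts
s_emb, t_emb, spk_emb = model(mels, tokens)
print(contrastive_alignment_loss(s_emb, t_emb).item())
```

Normalizing both embeddings before the dot product keeps the contrastive logits on a comparable scale. In this sketch the speaker branch would be trained with its own objective (e.g., a classification loss over the 4 Arctic speakers), so the shared content space has no pressure to encode identity; that separation is one plausible reading of the disentanglement the abstract describes.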
Keywords
speech-text, speech recognition, speech synthesis