End-To-End Multi-Talker Overlapping Speech Recognition

ICASSP (2020)

Abstract
In this paper we present an end-to-end speech recognition system that can recognize single-channel speech in which multiple talkers speak at the same time (overlapping speech), using a neural network model based on the Recurrent Neural Network Transducer (RNN-T) architecture. We augment the conventional RNN-T architecture with a masking model that separates the encoded audio features, and with multiple label encoders that encode the transcripts of the different speakers. We use a masking L2 loss to prevent transcripts from aligning to the wrong speaker's audio, and a speaker embedding loss to facilitate speaker tracking. We show that with these additional training objectives, the proposed augmented RNN-T model can be trained on simulated overlapping speech data and achieves a WER of 32% on words in overlapping speech segments from real-life telephone conversations. Our analysis of a manual transcription task on the same test set shows that transcribing overlapping speech is hard even for humans, who reach a WER of 37% against the ground truth.
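For concreteness, below is a minimal PyTorch sketch (not from the paper) of how such an augmented RNN-T forward pass could be organized for two speakers. The module names, layer sizes, and the use of soft sigmoid masks over the encoder output are assumptions made for illustration; the abstract does not specify these details.

```python
import torch
import torch.nn as nn

class MaskedTwoSpeakerRNNT(nn.Module):
    """Sketch of an RNN-T augmented with a masking model and two label
    encoders (one per speaker), loosely following the abstract. All
    dimensions and module choices here are illustrative assumptions."""

    def __init__(self, feat_dim=80, enc_dim=512, vocab_size=1000, spk_dim=128):
        super().__init__()
        # Shared audio encoder over the single-channel overlapping mixture.
        self.audio_encoder = nn.LSTM(feat_dim, enc_dim, num_layers=4, batch_first=True)
        # Masking model: predicts one soft mask per speaker over encoded features.
        self.mask_net = nn.Sequential(nn.Linear(enc_dim, enc_dim * 2), nn.Sigmoid())
        # Separate label encoders (prediction networks), one per speaker.
        self.label_enc = nn.ModuleList(
            [nn.LSTM(vocab_size, enc_dim, batch_first=True) for _ in range(2)]
        )
        # Joint network shared across speakers.
        self.joint = nn.Linear(enc_dim * 2, vocab_size)
        # Projection used for an auxiliary speaker-embedding objective.
        self.spk_proj = nn.Linear(enc_dim, spk_dim)

    def forward(self, feats, label_onehots):
        # feats: (B, T, feat_dim) mixed audio;
        # label_onehots: list of two (B, U, vocab_size) one-hot label tensors.
        enc, _ = self.audio_encoder(feats)                 # (B, T, enc_dim)
        masks = self.mask_net(enc).chunk(2, dim=-1)        # two (B, T, enc_dim) masks
        logits, spk_embs = [], []
        for s in range(2):
            masked_enc = enc * masks[s]                    # "separated" encoded features
            pred, _ = self.label_enc[s](label_onehots[s])  # (B, U, enc_dim)
            # Combine every (t, u) pair for the RNN-T joint network.
            j = torch.cat(
                [masked_enc.unsqueeze(2).expand(-1, -1, pred.size(1), -1),
                 pred.unsqueeze(1).expand(-1, masked_enc.size(1), -1, -1)],
                dim=-1)
            logits.append(self.joint(j))                   # (B, T, U, vocab_size)
            spk_embs.append(self.spk_proj(masked_enc.mean(dim=1)))
        return logits, masks, spk_embs
```

Training would then attach one RNN-T loss per speaker to the corresponding entry of `logits` (e.g., via `torchaudio.functional.rnnt_loss`), an L2 penalty on `masks` playing the role of the masking loss that discourages a transcript from aligning to the other speaker's audio, and a loss on `spk_embs` for speaker tracking; the exact loss formulations and weighting used by the authors are not given in the abstract.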
Keywords
multi-talker, multi-speaker, overlapping speech, end-to-end