Multi-Speaker Data Augmentation for Improved end-to-end Automatic Speech Recognition

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2023)

Abstract
Publicly available datasets traditionally used to train E2E ASR models for conversational telephone speech recognition are based on clean, short-duration, single-speaker utterances collected on separate channels. While E2E ASR models achieve state-of-the-art performance on recognition tasks that match such training data well, they are observed to fail on test recordings that contain multiple speakers, contain significant channel or background noise, or span longer durations than the training utterances. To mitigate these issues, we propose an on-the-fly data augmentation strategy that transforms single-speaker training data into multi-speaker data by appending multiple single-speaker utterances to one another. The proposed technique encourages the E2E model to become robust to speaker changes and to process longer utterances effectively. During training, the model is also guided by a teacher model trained on single-speaker utterances, so that it learns to map its multi-speaker encoder embeddings to better-performing single-speaker representations. With the proposed technique we obtain a 7-14% relative improvement on various single-speaker and multi-speaker test sets. We also show that this technique improves recognition performance by up to 14% by capturing useful information from preceding spoken utterances used as dialog history.
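The augmentation step described in the abstract (appending several single-speaker utterances to form one longer multi-speaker training example) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Utterance` container, the `concat_utterances` function, the `max_utts` limit, and the uniform random sampling are all assumptions made for illustration, and the teacher-guided embedding loss is not shown.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Utterance:
    """Hypothetical container for one training example."""
    speaker: str
    samples: List[float]  # raw waveform samples
    transcript: str


def concat_utterances(
    pool: Sequence[Utterance],
    max_utts: int = 3,
    rng: Optional[random.Random] = None,
) -> Utterance:
    """Build one augmented example by appending randomly chosen utterances.

    Assumption: between 1 and `max_utts` utterances are drawn uniformly
    without replacement; the abstract does not specify the sampling scheme.
    """
    if not pool:
        raise ValueError("pool must contain at least one utterance")
    rng = rng or random.Random()
    k = rng.randint(1, min(max_utts, len(pool)))
    chosen = rng.sample(list(pool), k)

    # Append the waveforms back to back, in the sampled order.
    samples: List[float] = []
    for utt in chosen:
        samples.extend(utt.samples)

    # The target transcript is the concatenation of the source transcripts.
    transcript = " ".join(utt.transcript for utt in chosen)
    speaker = "+".join(utt.speaker for utt in chosen)
    return Utterance(speaker=speaker, samples=samples, transcript=transcript)
```

Calling such a function inside the data loader, once per training example, would make the augmentation on-the-fly in the sense that no augmented audio needs to be stored on disk.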
Keywords
Automatic speech recognition, end-to-end, multi-speaker, dialog history