Contextual Learning for Missing Speech Automatic Speech Recognition.

Yeona Hong, Miseul Kim, Woo-Jin Chung, Hong-Goo Kang

International Conference on Electronics, Information and Communications (2024)

Abstract
In this paper, we present an automatic speech recognition (ASR) system that is capable of decoding complete transcriptions from speech even when segments of the audio are missing. To predict complete transcriptions from such speech, we adopt a contextual learning approach inspired by recent language model training methods, in which our model leverages the surrounding speech segments as cues for the prediction. Our model consists of two modules: a contextual feature extractor designed with the structure of wav2vec 2.0, and a projection layer. We further explore various masking lengths during model training to maximize the benefit to the ASR system without compromising its performance. Our proposed methodology demonstrates high-quality ASR performance on missing speech segments of various lengths, ranging from a word error rate (WER) of 4.7% on 0.25-second segments to 18.5% on 1-second segments.
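As an illustration of the two-module design described in the abstract, the sketch below pairs a pretrained wav2vec 2.0 encoder (the contextual feature extractor) with a linear projection layer producing frame-level token logits, and shows how a contiguous audio segment might be zeroed out to simulate missing speech during training. This is not the authors' implementation; the checkpoint name, vocabulary size, and masking parameters are assumptions for illustration.

```python
# Minimal sketch (assumed details, not the paper's code): wav2vec 2.0 encoder
# + projection layer for CTC-style ASR over audio with missing segments.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model


class MaskedSpeechASR(nn.Module):
    def __init__(self, vocab_size: int = 32,
                 pretrained: str = "facebook/wav2vec2-base"):
        super().__init__()
        # Contextual feature extractor: pretrained wav2vec 2.0 encoder.
        self.encoder = Wav2Vec2Model.from_pretrained(pretrained)
        hidden = self.encoder.config.hidden_size
        # Projection layer: maps contextual frame features to token logits.
        self.projection = nn.Linear(hidden, vocab_size)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) raw 16 kHz audio, possibly containing
        # zeroed (missing) regions; the transformer's self-attention lets
        # surrounding frames act as cues for predictions over the gap.
        features = self.encoder(waveform).last_hidden_state  # (B, T, hidden)
        return self.projection(features)                      # (B, T, vocab)


def mask_segment(waveform: torch.Tensor, sample_rate: int = 16000,
                 mask_seconds: float = 0.5,
                 start_sec: float = 1.0) -> torch.Tensor:
    """Zero out one contiguous span to simulate a missing speech segment
    (mask length and position here are illustrative)."""
    start = int(start_sec * sample_rate)
    length = int(mask_seconds * sample_rate)
    masked = waveform.clone()
    masked[:, start:start + length] = 0.0
    return masked
```

In a training loop along these lines, the masked waveform would be fed to the model and the logits scored with a CTC loss against the full transcription, so the network learns to reconstruct the missing words from context; the masking length (e.g., 0.25 s to 1 s) is the hyperparameter the abstract reports sweeping.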
Keywords
wav2vec 2.0, Automatic Speech Recognition, Language Model