ScoutWav: Two-Step Fine-Tuning on Self-Supervised Automatic Speech Recognition for Low-Resource Environments

Conference of the International Speech Communication Association (INTERSPEECH), 2022

Abstract
Recent improvements in Automatic Speech Recognition (ASR) systems have produced extraordinary results. However, there are specific domains, known as Low-Resource Environments (LRE), where training data can be limited or not representative enough. In this paper, we present ScoutWav, a network that integrates context-based word boundaries with self-supervised learning, wav2vec 2.0, to build a low-resource ASR model. First, we pre-train a model on High-Resource Environment (HRE) datasets and then fine-tune it on the LRE datasets to obtain context-based word boundaries. The resulting word boundaries are used for fine-tuning a pre-trained and iteratively refined wav2vec 2.0 to learn appropriate representations for the downstream ASR task. Our refinement strategy for wav2vec 2.0 uses canonical correlation analysis (CCA) to detect which layers need updating. This dynamic refinement allows wav2vec 2.0 to learn more descriptive LRE-based representations. Finally, the representations learned by the two-step fine-tuned wav2vec 2.0 framework are fed back to the Scout Network for the downstream task. We carried out experiments with two different LRE datasets: I-CUBE and UASpeech. Our experiments demonstrate that, by using target-domain word boundaries after pre-training together with automatic layer analysis, ScoutWav achieves up to a 12% relative WER reduction on the LRE data.
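
The CCA-driven layer selection can be illustrated with a minimal sketch. The abstract does not give implementation details, so everything below is an assumption: the checkpoint names, the idea of comparing a pre-trained model against an LRE-adapted copy, the CCA component count, and the 0.9 similarity threshold are all hypothetical; the paper only states that CCA is used to detect which wav2vec 2.0 layers need updating.

```python
# Hedged sketch: CCA-based layer analysis for deciding which wav2vec 2.0
# layers to refine in the second fine-tuning step. The comparison criterion,
# thresholds, and checkpoints are assumptions, not the authors' exact recipe.
import torch
import numpy as np
from sklearn.cross_decomposition import CCA
from transformers import Wav2Vec2Model


def layer_hidden_states(model, waveforms):
    """Return each layer's hidden states as a (frames, dim) numpy array."""
    model.eval()
    with torch.no_grad():
        out = model(waveforms, output_hidden_states=True)
    # out.hidden_states: one (batch, frames, dim) tensor per layer (plus embeddings)
    return [h.reshape(-1, h.shape[-1]).cpu().numpy() for h in out.hidden_states]


def cca_similarity(x, y, n_components=32):
    """Mean canonical correlation between two representation matrices."""
    k = min(n_components, x.shape[1], y.shape[1])
    cca = CCA(n_components=k, max_iter=1000)
    x_c, y_c = cca.fit_transform(x, y)
    corrs = [np.corrcoef(x_c[:, i], y_c[:, i])[0, 1] for i in range(k)]
    return float(np.mean(corrs))


# Pre-trained (HRE) model vs. a copy adapted on the LRE data.
# The second checkpoint is a stand-in; in practice it would be an LRE-adapted model.
base = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
adapted = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

dummy_batch = torch.randn(2, 160000)  # placeholder for two 10 s utterances at 16 kHz
base_layers = layer_hidden_states(base, dummy_batch)
adapted_layers = layer_hidden_states(adapted, dummy_batch)

# Layers whose representations already agree across the two models stay frozen;
# low-similarity layers are marked for updating in the second fine-tuning step.
similarities = [cca_similarity(b, a) for b, a in zip(base_layers, adapted_layers)]
layers_to_update = [i for i, s in enumerate(similarities) if s < 0.9]  # threshold is an assumption
print("CCA similarity per layer:", [round(s, 3) for s in similarities])
print("Layers selected for refinement:", layers_to_update)
```

Under this reading, layers whose representations already agree across the two models contribute little new domain information, so keeping them frozen would concentrate the second fine-tuning step on the layers flagged by the CCA analysis.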
Keywords
Automatic Speech Recognition, Self-Supervised Learning, Fine-tuning, Low-resource Environment