Internal Language Model Personalization of E2E Automatic Speech Recognition Using Random Encoder Features

2022 IEEE Spoken Language Technology Workshop (SLT) (2023)

Cited by 1
Abstract
End-to-end (E2E) speech-to-text models generally require transcribed audio for training and personalization. We introduce the use of random audio encoder features, rather than speech, to fine-tune the final model layers and acquire new vocabulary from text-only data. This technique enables on-device personalization before the user has provided any speech data. In simulated user experiments on hybrid autoregressive transducer (HAT) models with Conformer-based encoders and simple text embeddings for label processing, we show improvements in the recall of new vocabulary and in word error rate (WER) on held-out test sets. We compare this approach to the use of synthetic audio and find random encoder features to be more beneficial at lower computational cost. Experiments show that the maximum benefit is gained by updating specific network components that form a subset of those expressing the internal language model.
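The core idea lends itself to a short sketch: pair text-only data with random tensors shaped like encoder outputs, and back-propagate a transducer loss through only the label-side components that express the internal language model. The sketch below is illustrative, not the authors' code: it uses PyTorch with torchaudio's standard RNN-T loss as a stand-in for the HAT loss, and all module names, dimensions, and the frames-per-label ratio are assumptions.

```python
# Minimal sketch (assumptions: PyTorch + torchaudio; standard RNN-T loss
# stands in for the paper's HAT loss; names and sizes are illustrative).
import torch
import torchaudio

torch.manual_seed(0)

VOCAB, BLANK = 64, 0
ENC_DIM, PRED_DIM, JOINT_DIM = 256, 256, 256

class PredictionNetwork(torch.nn.Module):
    """Label-side network: text embedding + LSTM (the internal LM)."""
    def __init__(self):
        super().__init__()
        self.embed = torch.nn.Embedding(VOCAB, PRED_DIM)
        self.rnn = torch.nn.LSTM(PRED_DIM, PRED_DIM, batch_first=True)

    def forward(self, labels):
        return self.rnn(self.embed(labels))[0]

class JointNetwork(torch.nn.Module):
    """Combines encoder (acoustic) and prediction (label) features."""
    def __init__(self):
        super().__init__()
        self.enc_proj = torch.nn.Linear(ENC_DIM, JOINT_DIM)
        self.pred_proj = torch.nn.Linear(PRED_DIM, JOINT_DIM)
        self.out = torch.nn.Linear(JOINT_DIM, VOCAB)

    def forward(self, enc, pred):
        # enc: (B, T, ENC_DIM), pred: (B, U+1, PRED_DIM) -> (B, T, U+1, VOCAB)
        joint = torch.tanh(self.enc_proj(enc)[:, :, None] +
                           self.pred_proj(pred)[:, None, :])
        return self.out(joint)

pred_net, joint_net = PredictionNetwork(), JointNetwork()
# Only label-side components (a subset expressing the internal LM) are
# updated; the audio encoder is never run, so no audio is needed at all.
opt = torch.optim.SGD(list(pred_net.parameters()) +
                      list(joint_net.parameters()), lr=1e-3)

def personalize_step(text_batch, text_lens):
    """One fine-tuning step from text-only data with random encoder features."""
    B, U = text_batch.shape
    T = 2 * U  # assumed frames-per-label ratio for the fake acoustic sequence
    enc = torch.randn(B, T, ENC_DIM)  # random features replace real speech
    # Prepend the blank label as start-of-sequence for the prediction network.
    pred_in = torch.nn.functional.pad(text_batch, (1, 0), value=BLANK)
    logits = joint_net(enc, pred_net(pred_in))
    loss = torchaudio.functional.rnnt_loss(
        logits, text_batch.int(),
        torch.full((B,), T, dtype=torch.int32),
        text_lens.int(), blank=BLANK)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage: "new vocabulary" drawn from the user's text, e.g. contact names.
text = torch.randint(1, VOCAB, (4, 6))
print(personalize_step(text, torch.full((4,), 6)))
```

Because the encoder is bypassed entirely, each step costs only a forward/backward pass through the prediction and joint networks, which is consistent with the abstract's claim of lower computational cost than synthesizing audio.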
Keywords
On-device personalization, end-to-end, automatic speech recognition, fine-tuning, rare vocabulary, data augmentation