End-to-End Speech Recognition Technology Based on Multi-Stream CNN.
TrustCom (2022)
Abstract
As end-to-end speech recognition becomes increasingly popular, we survey a range of end-to-end speech technologies. Studying a Transformer-based speech framework, we find that its multi-head attention is not effective at capturing local features, and that under the noisy conditions of real-world scenes its training converges too slowly. To address these problems, we propose a new speech recognition framework based on the MCNN-Transformer-CTC method. In the pre-acoustic unit, an MCNN (multi-stream convolutional neural network) extracts local features through multiple parallel channels that differ in temporal width and spectral capability, which compensates for the weakness of the self-attention mechanism in local feature extraction. A CTC structure is added through multi-task learning to counter the slow training convergence. On the AISHELL-1 dataset the model reaches a CER of 6.23%, a further improvement over the Transformer model.
Keywords
Speech Recognition, MCNN, Transformer, CTC
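The two ideas in the abstract, parallel convolutional streams for local features and a CTC term added through multi-task learning, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the use of kernel width and dilation to vary temporal context between streams, and the interpolation weight `ctc_weight` are all assumptions made for illustration, since the abstract does not give the actual stream configuration or loss weighting.

```python
from typing import List, Sequence, Tuple


def conv1d_same(signal: Sequence[float], kernel: Sequence[float],
                dilation: int = 1) -> List[float]:
    """Zero-padded 1-D cross-correlation over time; output length equals input length.

    Assumes an odd-length kernel so that it is centred on each frame.
    """
    centre = (len(kernel) - 1) // 2
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            idx = t + (j - centre) * dilation  # dilation widens the temporal context
            if 0 <= idx < len(signal):         # frames outside the signal count as zero
                acc += weight * signal[idx]
        out.append(acc)
    return out


def multi_stream_features(signal: Sequence[float],
                          streams: Sequence[Tuple[Sequence[float], int]]
                          ) -> List[List[float]]:
    """Run every (kernel, dilation) stream on the same input and stack the
    per-frame outputs, one value per stream (hypothetical stream layout)."""
    outputs = [conv1d_same(signal, kernel, dilation) for kernel, dilation in streams]
    return [list(frame) for frame in zip(*outputs)]


def joint_loss(ctc_loss: float, attention_loss: float,
               ctc_weight: float = 0.3) -> float:
    """Multi-task objective: weighted sum of a CTC loss and an attention-decoder loss.

    The default weight is an assumed value, not one reported in the abstract.
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss


if __name__ == "__main__":
    frames = [1.0, 2.0, 3.0, 4.0]
    # Two streams with the same kernel but different dilation (assumed setup).
    streams = [([1.0, 1.0, 1.0], 1), ([1.0, 1.0, 1.0], 2)]
    print(multi_stream_features(frames, streams))
    print(joint_loss(2.0, 4.0, ctc_weight=0.5))
```

Each stream sees the same frames with a different receptive field, and the stacked outputs would be passed on to the Transformer encoder; in a real system the streams would be learned 2-D convolutions over time and frequency rather than fixed 1-D kernels.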