E2E-SincNet: Toward Fully End-to-End Speech Recognition

2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2020)

Abstract
Modern end-to-end (E2E) Automatic Speech Recognition (ASR) systems rely on Deep Neural Networks (DNN) that are mostly trained on handcrafted, pre-computed acoustic features such as Mel-filter-banks or Mel-frequency cepstral coefficients. Nonetheless, and despite their currently lower performance, E2E ASR models that process raw waveforms remain an active research field due to the lossless nature of the input signal. In this paper, we propose E2E-SincNet, a novel fully E2E ASR model that goes from the raw waveform to the text transcript by merging two recent and powerful paradigms: SincNet and the joint CTC-attention training scheme. Experiments conducted on two different speech recognition tasks show that our approach outperforms previously investigated E2E systems relying on either the raw waveform or pre-computed acoustic features, with a reported top-of-the-line Word Error Rate (WER) of 4.7% on the Wall Street Journal (WSJ) dataset.
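The SincNet front-end referenced in the abstract replaces a standard first convolutional layer with band-pass filters whose shape is fixed to a parameterized sinc function, so only each filter's low and high cutoff frequencies are learned directly from the raw waveform. Below is a minimal PyTorch-style sketch of such a layer; the class name SincConv, the filter count, kernel size, sample rate, and initialization are illustrative assumptions, not details taken from the paper.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SincConv(nn.Module):
    """1-D convolution whose kernels are sinc-parameterized band-pass filters:
    only the low cutoff and bandwidth of each filter are learnable."""

    def __init__(self, out_channels=64, kernel_size=251, sample_rate=16000):
        super().__init__()
        assert kernel_size % 2 == 1, "odd kernel size keeps the filters symmetric"
        self.kernel_size = kernel_size
        self.sample_rate = sample_rate

        # Cutoffs initialized on a linear grid between 30 Hz and near Nyquist
        # (a simplification; mel-spaced initialization is also common).
        hz = torch.linspace(30.0, sample_rate / 2 - 100.0, out_channels + 1)
        self.low_hz = nn.Parameter(hz[:-1].unsqueeze(1))                 # (F, 1)
        self.band_hz = nn.Parameter((hz[1:] - hz[:-1]).unsqueeze(1))     # (F, 1)

        # Fixed pieces: time axis (in seconds) and a Hamming window.
        n = torch.arange(-(kernel_size // 2), kernel_size // 2 + 1).float()
        self.register_buffer("n", n / sample_rate)
        self.register_buffer("window", torch.hamming_window(kernel_size, periodic=False))

    def forward(self, waveform):                                         # (B, 1, T)
        low = torch.abs(self.low_hz)
        high = torch.clamp(low + torch.abs(self.band_hz), max=self.sample_rate / 2)

        # Ideal low-pass impulse response with cutoff f_c: 2*f_c*sinc(2*f_c*t).
        def low_pass(cutoff):                                            # (F, K)
            return 2 * cutoff * torch.sinc(2 * cutoff * self.n)

        # Band-pass = difference of two low-pass filters, windowed and normalized.
        band_pass = (low_pass(high) - low_pass(low)) * self.window
        band_pass = band_pass / (2 * (high - low))
        filters = band_pass.unsqueeze(1)                                 # (F, 1, K)
        return F.conv1d(waveform, filters, padding=self.kernel_size // 2)

if __name__ == "__main__":
    x = torch.randn(2, 1, 16000)        # two one-second raw waveforms
    feats = SincConv()(x)
    print(feats.shape)                  # torch.Size([2, 64, 16000])

In the full E2E-SincNet pipeline described by the paper, the output of such a waveform front-end would feed an encoder-decoder ASR model trained with the joint CTC-attention objective.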
Keywords
End-to-end speech recognition, SincNet