Streaming End-to-End Speech Recognition for Hybrid RNN-T/Attention Architecture.

Interspeech (2021)

Abstract
We present a novel architecture and an accompanying decoding approach for improving recurrent neural network-transducer (RNN-T) performance. RNN-T is promising for building time-synchronous automatic speech recognition (ASR) systems and thus for enhancing streaming ASR applications. Encoder-decoder-based sequence-to-sequence (S2S) models have also been used successfully by the ASR community. In this paper, we integrate these two popular models in an RNN-T+S2S approach, which achieves higher recognition performance than either model alone. However, S2S is generally deemed complicated to use in streaming systems, because the attention mechanism can use arbitrarily long past and future contexts during decoding. Our RNN-T+S2S is composed of a shared encoder, an RNN-T decoder, and a triggered attention-based decoder that uses time-restricted encoder outputs for attention weight computation. Using trigger points generated from the RNN-T outputs, the S2S branch of RNN-T+S2S activates only when a trigger is detected, which makes streaming ASR practical. Experiments on public and private datasets created to research various tasks demonstrate that our proposal yields superior recognition performance.
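The triggering idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a simplified greedy RNN-T path with one label per encoder frame (a real RNN-T can emit several labels per frame), a blank id of 0, plain dot-product attention, and hypothetical `lookback` and `lookahead` parameters that define the time-restricted window. The names `trigger_points` and `restricted_attention` are invented for illustration.

```python
import math

BLANK = 0  # assumed blank label id (hypothetical)


def trigger_points(frame_labels, blank=BLANK):
    """Frame indices where the simplified RNN-T path emits a non-blank label."""
    return [t for t, label in enumerate(frame_labels) if label != blank]


def restricted_attention(query, enc_out, trigger, lookahead=2, lookback=None):
    """Dot-product attention over a time-restricted window around a trigger.

    The window spans frames [trigger - lookback, trigger + lookahead],
    clipped to the available encoder outputs. With lookback=None the window
    starts at frame 0. Both parameters are assumptions of this sketch.
    Returns (window_start, attention_weights, context_vector).
    """
    start = 0 if lookback is None else max(0, trigger - lookback)
    end = min(len(enc_out), trigger + lookahead + 1)  # exclusive upper bound
    window = enc_out[start:end]

    # Unnormalized scores: dot product between the query and each frame.
    scores = [sum(q * h for q, h in zip(query, frame)) for frame in window]

    # Numerically stable softmax over the window only.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: attention-weighted sum of the windowed frames.
    dim = len(query)
    context = [sum(w * frame[d] for w, frame in zip(weights, window))
               for d in range(dim)]
    return start, weights, context


if __name__ == "__main__":
    labels = [0, 5, 0, 0, 7, 0]             # toy per-frame labels
    enc = [[0.1 * t, 1.0] for t in range(6)]  # toy 2-dim encoder outputs
    for trig in trigger_points(labels):
        # The attention branch runs only at trigger frames.
        start, weights, _ = restricted_attention([1.0, 0.0], enc, trig, lookahead=1)
        print(trig, start, len(weights))
```

Because the attention branch runs only at trigger frames and never looks beyond `trigger + lookahead`, its latency in this sketch is bounded by the look-ahead rather than by the utterance length.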
Keywords
speech recognition,end-to-end,recurrent neural network-transducer,attention-based encoder-decoder