Monotonic Recurrent Neural Network Transducer And Decoding Strategies

2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019)

Cited by 28 | Viewed 74
Abstract
The Recurrent Neural Network Transducer (RNNT) is an end-to-end model which transduces discrete input sequences to output sequences by learning alignments between the sequences. In speech recognition tasks we generally have a strictly monotonic alignment between time frames and the label sequence. However, the standard RNNT loss does not enforce this constraint, which can cause anomalies in alignments, such as the model outputting a sequence of labels at a single time frame; there is also no bound on the number of decoding steps. To address these problems, we introduce a monotonic version of the RNNT loss. Under the assumption that the output sequence is not longer than the input sequence, this loss can be used with the forward-backward algorithm to learn strictly monotonic alignments between the sequences. We present experimental studies showing that speech recognition accuracy for monotonic RNNT is equivalent to that of standard RNNT. We also explore best-first and breadth-first decoding strategies for both monotonic and standard RNNT models. Our experiments show that breadth-first search is effective in exploring and combining alternative alignments. It also allows batching of hypotheses during label expansion in the search, which improves resource utilization and speeds up decoding.
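
The monotonic loss described above restricts the RNNT lattice so that every time frame emits exactly one symbol (a label or blank), which is why the output sequence must be no longer than the input and why decoding is bounded by the number of frames. The sketch below illustrates the corresponding forward recursion in NumPy; it is a minimal illustration under that assumption, not the authors' implementation, and the tensor layout of `log_probs` is assumed.

```python
import numpy as np

def monotonic_rnnt_forward(log_probs, labels, blank=0):
    """Forward algorithm over the monotonic RNNT lattice (sketch).

    log_probs: (T, U+1, V) joint-network log-probabilities, where
               log_probs[t, u] conditions on t frames consumed and
               u labels already emitted (assumed layout).
    labels:    length-U target label sequence.
    Assumes U <= T: each frame emits exactly one symbol (label or
    blank), so the time index advances on every emission.
    """
    T, U1, _ = log_probs.shape
    U = len(labels)
    assert U1 == U + 1 and U <= T
    alpha = np.full((T + 1, U + 1), -np.inf)
    alpha[0, 0] = 0.0
    for t in range(1, T + 1):
        for u in range(0, min(t, U) + 1):
            # Emit blank at frame t-1: keep u, advance t.
            stay = alpha[t - 1, u] + log_probs[t - 1, u, blank]
            # Emit label u at frame t-1: advance both t and u.
            if u > 0:
                emit = alpha[t - 1, u - 1] + log_probs[t - 1, u - 1, labels[u - 1]]
            else:
                emit = -np.inf
            alpha[t, u] = np.logaddexp(stay, emit)
    # Negative log-likelihood summed over all monotonic alignments.
    return -alpha[T, U]
```

Unlike the standard RNNT recursion, there is no horizontal move at a fixed time frame, so no path can emit two labels on the same frame.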
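
Breadth-first (frame-synchronous) decoding, as contrasted with best-first search in the abstract, advances all hypotheses one frame at a time, so alternative alignments that reach the same label prefix can be merged and the joint network can be evaluated on the whole beam in one batch. A minimal sketch follows, assuming a hypothetical batched scorer `log_prob_fn(prefixes, t)`; it is illustrative rather than the paper's exact search.

```python
import numpy as np

def breadth_first_decode(log_prob_fn, T, beam=8, blank=0):
    """Frame-synchronous beam search sketch for monotonic RNNT.

    log_prob_fn(prefixes, t) -> (N, V) log-probabilities over the
    vocabulary (including blank) for each label prefix at frame t.
    This batched call is hypothetical and stands in for running the
    prediction + joint network on all hypotheses at once, which is
    what enables the decoding speedup.
    """
    # Each hypothesis: label prefix (tuple) -> log score.
    hyps = {(): 0.0}
    for t in range(T):
        prefixes = list(hyps.keys())
        scores = log_prob_fn(prefixes, t)  # one batched evaluation per frame
        new_hyps = {}
        for prefix, row in zip(prefixes, scores):
            base = hyps[prefix]
            for v, lp in enumerate(row):
                # Blank keeps the prefix; a label extends it. Either way
                # the time index advances, so hypotheses stay in sync.
                new_prefix = prefix if v == blank else prefix + (v,)
                s = base + lp
                # Merge alternative alignments of the same label prefix.
                if new_prefix in new_hyps:
                    new_hyps[new_prefix] = np.logaddexp(new_hyps[new_prefix], s)
                else:
                    new_hyps[new_prefix] = s
        # Prune to the top-scoring hypotheses.
        hyps = dict(sorted(new_hyps.items(), key=lambda kv: -kv[1])[:beam])
    return max(hyps.items(), key=lambda kv: kv[1])[0]
```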
Keywords
output sequence,monotonic alignment,speech recognition accuracy,monotonic RNNT,breadth-first decoding strategies,search label expansion,decoding speedup,monotonic recurrent neural network transducer,end-to-end model,discrete input sequences,speech recognition tasks,standard RNNT loss,single time frame,decoding time steps