Joint Endpointing And Decoding With End-To-End Models

2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP)(2019)

Cited 39 | Viewed 159

Abstract
The tradeoff between word error rate (WER) and latency is very important for streaming automatic speech recognition (ASR) applications. We want the system to endpoint and close the microphone as quickly as possible, without degrading WER. Conventional ASR systems rely on a separately trained endpointing module, which interacts with the acoustic, pronunciation and language model (AM, PM, and LM) components, and can result in a higher WER or a larger latency. In keeping with the all-neural spirit of end-to-end (E2E) models, which fold the AM, PM and LM into a single neural network, in this work we look at folding the endpointer into this E2E model to assist with the endpointing task. We refer to this jointly optimized model which performs both recognition and endpointing as an E2E endpointer. On a large vocabulary Voice Search task, we show that the combination of such an E2E endpointer with a conventional endpointer results in no quality degradation, while reducing latency by more than a factor of 2 compared to using a separate endpointer with the E2E model.
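The combined decision described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation; the function name, the end-of-query posterior input, and all thresholds are illustrative assumptions. The idea is that the microphone closes as soon as either the E2E decoder is confident the query has ended or a conventional silence-based endpointer fires:

```python
# Hypothetical sketch (not the paper's code): fusing an E2E end-of-query
# posterior with a conventional voice-activity-based endpointer.
# eoq_threshold and max_trailing_silence are illustrative values.

def should_endpoint(eoq_probs, silence_flags,
                    eoq_threshold=0.9, max_trailing_silence=3):
    """Return the first frame index at which to close the microphone,
    or None if no endpoint is declared.

    eoq_probs     -- per-frame probability of an end-of-query token
                     emitted by the E2E decoder
    silence_flags -- per-frame booleans from a conventional
                     silence-based endpointer
    """
    trailing_silence = 0
    for t, (p_eoq, is_silence) in enumerate(zip(eoq_probs, silence_flags)):
        # E2E endpointer: decoder is confident the utterance has ended.
        if p_eoq >= eoq_threshold:
            return t
        # Conventional endpointer: enough consecutive silence frames.
        trailing_silence = trailing_silence + 1 if is_silence else 0
        if trailing_silence >= max_trailing_silence:
            return t
    return None
```

Under this sketch, the latency gain comes from the first branch: the E2E signal can fire as soon as the decoder predicts end-of-query, without waiting for the fixed run of trailing silence the conventional endpointer requires.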
Keywords
joint endpointing,word error rate,automatic speech recognition applications,language model,all-neural spirit,E2E,single neural network,endpointing task,jointly optimized model,conventional endpointer results,ASR systems,decoding,vocabulary voice search task,endpointing module