Multistate Encoding With End-To-End Speech Rnn Transducer Network

2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING(2020)

Cited 9|Views206
No score
Abstract
Recurrent Neural Network Transducer (RNN-T) models [1] for automatic speech recognition (ASR) provide high accuracy speech recognition. Such end-to-end (E2E) models combine acoustic, pronunciation and language models (AM, PM, LM) of a conventional ASR system into a single neural network, dramatically reducing complexity and model size.In this paper, we propose a technique for incorporating contextual signals, such as intelligent assistant device state or dialog state, directly into RNN-T models. We explore different encoding methods and demonstrate that RNN-T models can effectively utilize such context. Our technique results in reduction in Word Error Rate (WER) of up to 10.4% relative on a variety of contextual recognition tasks. We also demonstrate that proper regularization can be used to model context independently for improved overall quality.
More
Translated text
Key words
E2E ASR, contextual ASR, sequence to sequence
AI Read Science
Must-Reading Tree
Example
Generate MRT to find the research sequence of this paper
Chat Paper
Summary is being generated by the instructions you defined