Recurrent Neural Network Transducer for Audio-Visual Speech Recognition
ASRU, pp. 905-912, 2019.
This work presents a large-scale audio-visual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture. To support the development of such a system, we built a large audio-visual (A/V) dataset of segmented utterances extracted from public YouTube videos, leading to 31k hours of audio-visual training co...