Deep Audio-visual System for Closed-set Word-level Speech Recognition.

ICMI (2019)

Abstract
Audio-visual understanding is often hampered by the difficulty of bridging the complementary information carried by the audio and visual modalities. Motivated by recent audio-visual studies, this work proposes a closed-set word-level speech recognition scheme for the Mandarin Audio-Visual Speech Recognition (MAVSR) Challenge. To initialize the audio and visual encoders more effectively, a 3-dimensional convolutional neural network (CNN) and an attention-based bi-directional long short-term memory (Bi-LSTM) network are trained. For audio-visual joint training, two fully connected layers are added on top of the concatenated encoder outputs. The proposed scheme won first place, with a relative word accuracy improvement of 7.9% over the audio-only system. Experiments on the LRW-1000 dataset demonstrate that joint audio-visual training improves recognition of relatively short-duration samples, revealing the complementarity of the two modalities.
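
The fusion step described in the abstract (concatenating the two encoder outputs and passing them through two fully connected layers) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the encoder outputs are stand-in random vectors, and the dimensions, the ReLU activation, the random weights, and all function names are assumptions made for illustration. Only the 1000-way output (the closed word set of LRW-1000) follows from the dataset.

```python
import random


def linear(x, weights, bias):
    """Fully connected layer: y = W x + b (weights has shape n_out x n_in)."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    """Element-wise ReLU; the choice of activation is an assumption."""
    return [max(0.0, v) for v in x]


def init_layer(n_in, n_out, rng):
    """Small random weights as placeholders; a real model would learn these."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias


def fuse_and_classify(audio_vec, visual_vec, layer1, layer2):
    """Concatenate both encoder outputs, then apply two FC layers."""
    joint = audio_vec + visual_vec          # list concatenation
    hidden = relu(linear(joint, *layer1))   # first FC layer
    logits = linear(hidden, *layer2)        # second FC layer: one score per word
    predicted = max(range(len(logits)), key=logits.__getitem__)
    return predicted, logits


# Hypothetical sizes; only NUM_WORDS reflects the 1000-class LRW-1000 task.
AUDIO_DIM, VISUAL_DIM, HIDDEN_DIM, NUM_WORDS = 8, 8, 16, 1000

rng = random.Random(0)
layer1 = init_layer(AUDIO_DIM + VISUAL_DIM, HIDDEN_DIM, rng)
layer2 = init_layer(HIDDEN_DIM, NUM_WORDS, rng)

# Stand-ins for the outputs of the pretrained audio and visual encoders.
audio_vec = [rng.uniform(-1.0, 1.0) for _ in range(AUDIO_DIM)]
visual_vec = [rng.uniform(-1.0, 1.0) for _ in range(VISUAL_DIM)]

predicted, logits = fuse_and_classify(audio_vec, visual_vec, layer1, layer2)
print("predicted word index:", predicted)
```

Because the classifier sees both vectors at once, a sample whose audio evidence is weak can still be resolved by the visual half of the joint vector, which is the kind of complementarity the abstract reports for short-duration samples.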
Keywords
Audio-visual, convolutional neural network, long short-term memory, multi-modal