Multimodal Learning of Audio-Visual Speech Recognition with Liquid State Machine.

ICONIP (6) (2022)

Abstract
Audio-visual speech recognition tackles the multimodal lip-reading task by combining audio and visual information, and is an important way to improve speech recognition performance in noisy conditions. Deep learning methods have achieved promising results in this regard, but they rely on complex network architectures and are computationally intensive. Recently, Spiking Neural Networks (SNNs) have attracted attention because they are event-driven and enable low-power computing. SNNs can capture richer motion information and have been successful in tasks such as gesture recognition, yet they have not been widely applied to lip reading. Among SNNs, Liquid State Machines (LSMs) are valued for their low training cost and are well suited to spatiotemporal sequence problems over event streams; multimodal lip reading based on Dynamic Vision Sensors (DVS) is exactly such a problem. Hence, we propose a soft fusion framework based on the LSM. The framework fuses visual and audio information to achieve more reliable lip recognition. On the well-known public LRW dataset, our fusion network achieves a recognition accuracy of 86.8%. Compared with single-modality recognition, the accuracy of the fusion method is improved by 5% to 6%. In addition, we add extra noise to the raw data, and the experimental results show that the fusion model significantly outperforms the audio-only model, demonstrating the robustness of our model.
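To make the fusion idea concrete, the following is a minimal, self-contained sketch of the general approach the abstract describes: each modality's event stream is driven through a fixed random reservoir (the LSM's "liquid"; only a readout on top would be trained), and the two reservoir states are combined by soft fusion, i.e., a softmax-weighted convex combination rather than a hard modality switch. Everything here is a hypothetical illustration, not the authors' implementation: the reservoir sizes, weight scales, leak factor, `reservoir_states` and `soft_fusion` names, and the placeholder fusion weights are all assumptions.

```python
# Hypothetical sketch of LSM-style soft fusion (not the paper's code).
import numpy as np

rng = np.random.default_rng(0)

def reservoir_states(spikes, n_res=256, leak=0.9):
    """Drive an input spike train (T x n_in) through a fixed random
    leaky reservoir and return its final state vector (n_res,).
    In an LSM, these weights stay fixed; only a readout is trained."""
    n_in = spikes.shape[1]
    W_in = rng.normal(0.0, 0.5, (n_in, n_res))                    # fixed input weights
    W_rec = rng.normal(0.0, 1.0 / np.sqrt(n_res), (n_res, n_res)) # fixed recurrent weights
    x = np.zeros(n_res)
    for s_t in spikes:
        x = leak * x + np.tanh(s_t @ W_in + x @ W_rec)
    return x

def soft_fusion(x_audio, x_visual, w_logits):
    """Soft fusion: a softmax-weighted combination of the two modality
    states, so neither modality is discarded outright."""
    w = np.exp(w_logits) / np.exp(w_logits).sum()
    return w[0] * x_audio + w[1] * x_visual

# Toy event streams: 100 time steps, 64 audio channels / 128 DVS pixels.
audio_spikes = (rng.random((100, 64)) < 0.05).astype(float)
visual_spikes = (rng.random((100, 128)) < 0.05).astype(float)

fused = soft_fusion(reservoir_states(audio_spikes),
                    reservoir_states(visual_spikes),
                    w_logits=np.array([0.0, 0.0]))  # placeholder weights; learned in practice
print(fused.shape)  # (256,) -> would feed a trained linear readout classifier
```

In this reading, robustness under noise comes from the fusion weights shifting mass toward the more reliable modality when the other degrades; how the actual framework parameterizes and learns those weights is described in the paper itself.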
Keywords
Liquid State Machine, Multimodal Fusion, Audio-visual Speech Recognition