The SJTU System for Multimodal Information Based Speech Processing Challenge 2021

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022

Abstract
This paper describes the SJTU system for the ICASSP Multi-modal Information based Speech Processing (MISP) Challenge 2021. To address speech recognition in real, complex environments, where time-synchronized near- and far-field signals are available for training an enhancement frontend, we build a joint system with a speech enhancement frontend and a speech recognition backend. The two modules are optimized jointly under both ASR and enhancement criteria. Audio-visual fusion is explored to further improve ASR performance. ROVER and test-time augmentation are used to combine recognition results from multiple systems. The final system achieves Chinese character error rates (CCER) of 34.9% on the dev set and 34.0% on the test set, placing third in the MISP challenge. Compared with the official baseline system, the absolute CCER reduction is 26.9% on the dev set and 28.7% on the test set.
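For readers unfamiliar with the metric, the following is a minimal sketch of how a character error rate is commonly computed: the character-level edit distance between reference and hypothesis, summed over utterances and divided by the total number of reference characters. The function names `edit_distance` and `ccer` are hypothetical, and the challenge's official scoring script may apply text normalization that is not reproduced here.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def ccer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Character error rate over a set of utterances (assumed definition)."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    total_chars = sum(len(r) for r in refs)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_errors / total_chars
```

Under this definition, a hypothesis with one substituted and one deleted character against a six-character reference scores 2/6. Taken at face value, the figures in the abstract imply official baseline CCERs of roughly 61.8% (dev) and 62.7% (test), since the reported reductions are absolute.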
Keywords
multi-modality, speech recognition, end-to-end