The DKU Audio-Visual Wake Word Spotting System for the 2021 MISP Challenge

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2022)

Abstract
This paper describes the system developed by the DKU team for the MISP Challenge 2021. We present a two-stage approach, built from end-to-end neural networks, for the audio-visual wake word spotting task. In the first stage, we process the audio and video data to give them a similar structure and then separately train two unimodal models that share a unified network architecture. In the second stage, we propose a Hierarchical Modality Aggregation (HMA) module that fuses multi-scale audio-visual information from the pre-trained unimodal models. The resulting framework is clear and concise. With this framework and extensive data augmentation, our system achieves a false reject rate of 3.85% and a false alarm rate of 3.42% on far-field audio in the development set of the competition database, and it ranks 2nd in the wake word spotting track of the MISP Challenge.
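The two reported metrics can be illustrated with a short sketch. It is not the challenge's official scoring script: the function name `frr_far`, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration. It uses the standard definitions, where the false reject rate is the fraction of wake word samples the system misses and the false alarm rate is the fraction of non-wake-word samples it wrongly accepts.

```python
def frr_far(labels, scores, threshold=0.5):
    """Return (false reject rate, false alarm rate).

    labels: 1 for a wake word sample, 0 for a non-wake-word sample.
    scores: model confidence that the sample contains the wake word.
    threshold: hypothetical decision threshold (an assumption, not from the paper).
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == 0)
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative sample")

    # A false reject is a wake word sample scored below the threshold.
    false_rejects = sum(
        1 for y, s in zip(labels, scores) if y == 1 and s < threshold
    )
    # A false alarm is a non-wake-word sample scored at or above the threshold.
    false_alarms = sum(
        1 for y, s in zip(labels, scores) if y == 0 and s >= threshold
    )
    return false_rejects / positives, false_alarms / negatives


# Toy data, invented purely for illustration.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.1, 0.6, 0.2, 0.3]
frr, far = frr_far(labels, scores)
print(f"FRR = {frr:.2%}, FAR = {far:.2%}")
```

Raising the threshold generally trades false alarms for false rejects, so the two rates are usually reported together at a single operating point, as in the abstract.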
Keywords
MISP Challenge, Audio-visual Wake Word Spotting, Deep Neural Network, Multimodal Fusion