Are You Speaking: Real-Time Speech Activity Detection via Landmark Pooling Network

2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019)

Abstract
In this paper, we propose a novel framework based on visual information to solve the real-time speech activity detection problem. Unlike conventional methods, which commonly use the audio signal as input, our approach incorporates facial information into a deep neural network for feature learning. Instead of using the whole input image, we further develop a novel end-to-end landmark pooling network that acts as an attention-guided scheme, helping the deep neural network focus only on the relevant portion of the input image. This allows the network to precisely and efficiently learn highly discriminative features for speech activities. Moreover, we employ a recurrent neural network with gated recurrent units to exploit the sequential information in the video and produce the final decision. To give a comprehensive evaluation of the proposed method, we collect a large-scale dataset of unconstrained speech activities, which consists of a large number of speech/non-speech video sequences under various kinds of degradation. Experimental results demonstrate the superiority of the proposed pipeline over previous approaches in terms of both performance and efficiency.
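The abstract describes a pipeline that pools visual features at facial-landmark locations and feeds the per-frame descriptors to a GRU for the speech/non-speech decision. The sketch below is a minimal, hypothetical PyTorch illustration of that idea, not the authors' implementation: the backbone, feature dimensions, landmark count, and the use of bilinear grid sampling as the "landmark pooling" step are all assumptions.

```python
import torch
import torch.nn as nn

class LandmarkPoolingSpeechDetector(nn.Module):
    """Hypothetical sketch: sample CNN features at facial-landmark positions,
    then aggregate the per-frame descriptors with a GRU for speech detection."""

    def __init__(self, feat_dim=64, num_landmarks=20, hidden_dim=128):
        super().__init__()
        # Lightweight CNN backbone producing a spatial feature map per frame
        # (stand-in for whatever backbone the paper actually uses).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # GRU aggregates the sequence of landmark-pooled frame descriptors.
        self.gru = nn.GRU(feat_dim * num_landmarks, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, 2)  # speech / non-speech

    def forward(self, frames, landmarks):
        # frames:    (B, T, 3, H, W) video clips
        # landmarks: (B, T, L, 2) landmark coordinates normalized to [-1, 1]
        B, T = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1))          # (B*T, C, h, w)
        # "Landmark pooling" here: bilinearly sample the feature map at each landmark.
        grid = landmarks.flatten(0, 1).unsqueeze(2)          # (B*T, L, 1, 2)
        pooled = nn.functional.grid_sample(feats, grid, align_corners=False)
        pooled = pooled.squeeze(-1).transpose(1, 2)          # (B*T, L, C)
        seq = pooled.reshape(B, T, -1)                       # (B, T, L*C)
        out, _ = self.gru(seq)
        return self.classifier(out[:, -1])                   # clip-level logits


# Example: 2 clips of 16 frames, 96x96 face crops, 20 landmarks per frame.
model = LandmarkPoolingSpeechDetector()
logits = model(torch.randn(2, 16, 3, 96, 96), torch.rand(2, 16, 20, 2) * 2 - 1)
print(logits.shape)  # torch.Size([2, 2])
```

Restricting the pooled features to landmark locations keeps the frame descriptor small (L x C values per frame), which is consistent with the real-time efficiency claim in the abstract.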
Keywords
audio signal, facial information, deep neural network, feature learning, attention-guided scheme, recurrent neural network, speech activities, visual information, speech activity detection, end-to-end landmark pooling network, landmark pooling network