Improved Word-level Lipreading with Temporal Shrinkage Network and NetVLAD

Multimodal Interfaces and Machine Learning for Multimodal Interaction (2022)

Abstract
In most recent word-level lipreading architectures, the temporal feature extraction module tends to employ a Multi-scale Temporal Convolution Network (MS-TCN). In our experiments, we observed that MS-TCN struggles with the noisy information that image sequences may contain. To address this problem, we propose a lipreading architecture based on a temporal shrinkage network and NetVLAD. We first propose a Temporal Shrinkage Unit, derived from the Residual Shrinkage Network, and use it to replace the temporal convolution unit. The improved network, named Multi-scale Temporal Shrinkage Network (MS-TSN), can focus more on relevant information. After MS-TSN, which handles noisy frames, we introduce NetVLAD to integrate local information into a global feature. Compared with Global Average Pooling, NetVLAD can extract key features by clustering. Our experiments on Lip Reading in the Wild (LRW) show that the proposed architecture achieves an accuracy of 89.41%, a new state of the art in word-level lipreading. In addition, we build a new Mandarin Chinese lipreading dataset named MCLR-100 and verify the proposed architecture on it.
Keywords
Word-level lipreading, Temporal Shrinkage Network, NetVLAD
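
The two building blocks named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract gives no layer details, so this only shows the generic operations the names refer to, namely soft thresholding (the shrinkage step used in residual shrinkage networks) and NetVLAD-style pooling (soft assignment of frame features to cluster centers, followed by aggregation of residuals). The function names, the fixed `scale` factor, and the random parameters are all assumptions for illustration; in a trained network the threshold and the cluster parameters would be learned.

```python
import numpy as np

def soft_threshold(x, tau):
    """Shrink values toward zero: sign(x) * max(|x| - tau, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def shrink_channels(features, scale=0.5):
    """Apply a per-channel soft threshold to features of shape (T, C).

    Assumption: the threshold is scale * mean(|x|) over time, with a fixed
    scale. A residual shrinkage network would predict the scale with a
    small learned sub-network instead.
    """
    tau = scale * np.abs(features).mean(axis=0, keepdims=True)
    return soft_threshold(features, tau)

def netvlad_pool(features, centers, weights, biases):
    """NetVLAD-style pooling of (T, C) features into one (K * C,) vector.

    centers: (K, C) cluster centers, weights: (C, K), biases: (K,).
    All three are hypothetical stand-ins for learned parameters.
    """
    logits = features @ weights + biases              # (T, K)
    logits = logits - logits.max(axis=1, keepdims=True)
    assign = np.exp(logits)
    assign = assign / assign.sum(axis=1, keepdims=True)  # soft assignment
    residuals = features[:, None, :] - centers[None, :, :]  # (T, K, C)
    vlad = (assign[:, :, None] * residuals).sum(axis=0)     # (K, C)
    # Intra-normalization per cluster, then global L2 normalization.
    vlad = vlad / (np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12)
    vlad = vlad.ravel()
    return vlad / (np.linalg.norm(vlad) + 1e-12)

# Hypothetical sizes: T frames, C channels, K clusters.
rng = np.random.default_rng(0)
T, C, K = 29, 8, 4
frames = rng.standard_normal((T, C))
denoised = shrink_channels(frames)
pooled = netvlad_pool(
    denoised,
    centers=rng.standard_normal((K, C)),
    weights=rng.standard_normal((C, K)),
    biases=np.zeros(K),
)
print(pooled.shape)
```

Global Average Pooling would reduce the same `(T, C)` input to a single `C`-dimensional mean, whereas the pooled vector here keeps one residual summary per cluster, which is the sense in which the abstract contrasts the two.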