Landmark-Guided Cross-Speaker Lip Reading with Mutual Information Regularization
arXiv (2024)
Abstract
Lip reading, the process of interpreting silent speech from visual lip
movements, has attracted increasing attention for its wide range of
real-world applications. Deep learning approaches have greatly improved
current lip reading systems. However, lip reading in cross-speaker
scenarios, where the speaker identity changes, poses a challenging problem
due to inter-speaker variability.
A well-trained lip reading system may perform poorly when handling a brand new
speaker. To learn a speaker-robust lip reading model, a key insight is to
reduce visual variations across speakers, avoiding the model overfitting to
specific speakers. In this work, in view of both input visual clues and latent
representations based on a hybrid CTC/attention architecture, we propose to
exploit the lip landmark-guided fine-grained visual clues instead of
frequently-used mouth-cropped images as input features, diminishing
speaker-specific appearance characteristics. Furthermore, a max-min mutual
information regularization approach is proposed to capture speaker-insensitive
latent representations. Experimental evaluations on public lip reading datasets
demonstrate the effectiveness of the proposed approach under the intra-speaker
and inter-speaker conditions.
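The core idea of the regularization, penalizing the mutual information between latent representations and speaker identity so that the latents become speaker-insensitive, can be illustrated with a toy computation. The sketch below is not the paper's estimator (the paper operates on continuous latents with a max-min objective); it merely computes the empirical mutual information I(Z; S) between discretized latent codes and speaker labels, the quantity such a regularizer drives toward zero. The function name and example data are hypothetical.

```python
import math
from collections import Counter

def empirical_mi(z_codes, speakers):
    """Empirical mutual information I(Z; S) in nats between discrete
    latent codes and speaker labels, computed from joint counts."""
    n = len(z_codes)
    joint = Counter(zip(z_codes, speakers))   # joint counts of (z, s)
    pz = Counter(z_codes)                     # marginal counts of z
    ps = Counter(speakers)                    # marginal counts of s
    mi = 0.0
    for (z, s), c in joint.items():
        p_zs = c / n
        # p(z,s) * log( p(z,s) / (p(z) p(s)) )
        mi += p_zs * math.log(p_zs * n * n / (pz[z] * ps[s]))
    return mi

# Codes that perfectly predict the speaker carry log(2) nats about identity:
z_dep = [0, 0, 1, 1]
spk   = ['A', 'A', 'B', 'B']
print(empirical_mi(z_dep, spk))  # ~0.693 (log 2)

# Codes distributed identically across speakers carry none:
z_ind = [0, 1, 0, 1]
print(empirical_mi(z_ind, spk))  # 0.0
```

A speaker-insensitivity regularizer pushes the representation from the first regime toward the second, so that speaker identity cannot be recovered from the latents.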