Audio-Visual Speech Enhancement in Noisy Environments via Emotion-Based Contextual Cues
CoRR (2024)
Abstract
In real-world environments, background noise significantly degrades the
intelligibility and clarity of human speech. Audio-visual speech enhancement
(AVSE) attempts to restore speech quality, but existing methods often fall
short, particularly in dynamic noise conditions. This study investigates the
inclusion of emotion as a novel contextual cue within AVSE, hypothesizing that
incorporating emotional understanding can improve speech enhancement
performance. We propose a novel emotion-aware AVSE system that leverages both
auditory and visual information. It extracts emotional features from the facial
landmarks of the speaker and fuses them with corresponding audio and visual
modalities. These enriched representations serve as input to a deep UNet-based
encoder-decoder network designed to fuse the emotion-enriched multimodal
information. The network iteratively refines the enhanced speech
representation, guided by perceptually inspired loss functions for joint
learning and optimization. We train and evaluate the model on the CMU Multimodal Opinion
Sentiment and Emotion Intensity (CMU-MOSEI) dataset, a rich repository of
audio-visual recordings with annotated emotions. Our comprehensive evaluation
demonstrates the effectiveness of emotion as a contextual cue for AVSE. By
integrating emotional features, the proposed system achieves significant
improvements in both objective and subjective assessments of speech quality and
intelligibility, especially in challenging noise environments. Compared to
baseline AVSE and audio-only speech enhancement systems, our approach exhibits
a noticeable increase in PESQ (Perceptual Evaluation of Speech Quality) and
STOI (Short-Time Objective Intelligibility) scores, indicating higher
perceptual quality and intelligibility. Large-scale listening tests
corroborate these findings,
suggesting improved human understanding of enhanced speech.