Target Speech Extraction with Pre-trained AV-HuBERT and Mask-And-Recover Strategy
arXiv (2024)
Abstract
Audio-visual target speech extraction (AV-TSE) is one of the enabling
technologies in robotics and many audio-visual applications. One of the
challenges of AV-TSE is how to effectively utilize audio-visual synchronization
information in the process. AV-HuBERT is a pre-trained model that is useful for
lip-reading, but it has not yet been adopted for AV-TSE. In this paper, we
explore how to integrate a pre-trained AV-HuBERT into our AV-TSE system, which
we expect to improve performance. To benefit from
inter- and intra-modality correlations, we also propose a novel Mask-And-Recover
(MAR) strategy for self-supervised learning. Experimental results on the
VoxCeleb2 dataset show that our proposed model outperforms the baselines on
both subjective and objective metrics, suggesting that the pre-trained
AV-HuBERT model provides more informative visual cues for target speech
extraction. Furthermore, through a comparative study, we confirm that the
proposed Mask-And-Recover strategy is significantly effective.