A Visual-Pilot Deep Fusion For Target Speech Separation In Multi-Talker Noisy Environment

2020 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2020)

Abstract
Separating the target speech in a multi-talker noisy environment is a challenging problem for audio-only source separation algorithms. The core difficulty is that the separated speech from the same talker can switch among the outputs across consecutive segments, causing the talker permutation issue. In this paper, we deploy face tracking and propose low-dimensional hand-crafted visual features and low-cost deep fusion architectures to separate unseen but visible target sources in a multi-talker noisy environment. We show that our approach not only addresses the talker permutation issue but also yields additional separation improvement on challenging mixtures, such as same-gender overlapping speech, on the public dataset. We also show that a significant improvement in target speech recognition is achieved on the simulated real-world dataset. Our training is independent of the number of visible sources, providing flexibility in deployment.
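The abstract does not specify the model in detail; the following is a minimal sketch, in PyTorch, of the general audio-visual mask-estimation pattern it describes: low-dimensional visual features from a tracked target face are fused with the mixture's spectrogram features, and the network predicts a time-frequency mask for the target speaker. The layer sizes, the BLSTM fusion, and names such as AVMaskEstimator are illustrative assumptions, not the authors' configuration.

    # A minimal sketch of audio-visual mask estimation for target speech
    # separation. NOT the paper's architecture: all sizes and names here
    # are illustrative assumptions.
    import torch
    import torch.nn as nn

    class AVMaskEstimator(nn.Module):
        def __init__(self, n_freq=257, visual_dim=36, hidden=256):
            super().__init__()
            # Audio branch: per-frame magnitude spectrum -> embedding.
            self.audio_fc = nn.Linear(n_freq, hidden)
            # Visual branch: low-dimensional hand-crafted features from the
            # tracked target face (e.g., landmark-based), one vector per frame.
            self.visual_fc = nn.Linear(visual_dim, hidden)
            # Fusion: concatenate both streams, then model temporal context.
            self.blstm = nn.LSTM(2 * hidden, hidden, num_layers=2,
                                 batch_first=True, bidirectional=True)
            # Output: a sigmoid mask in [0, 1] for every T-F bin.
            self.mask_fc = nn.Linear(2 * hidden, n_freq)

        def forward(self, mix_spec, visual_feats):
            # mix_spec:     (batch, frames, n_freq)  mixture magnitudes
            # visual_feats: (batch, frames, visual_dim), time-aligned to audio
            a = torch.relu(self.audio_fc(mix_spec))
            v = torch.relu(self.visual_fc(visual_feats))
            fused, _ = self.blstm(torch.cat([a, v], dim=-1))
            mask = torch.sigmoid(self.mask_fc(fused))
            # Target estimate = masked mixture magnitudes; the mixture
            # phase is typically reused for waveform resynthesis.
            return mask * mix_spec

    if __name__ == "__main__":
        model = AVMaskEstimator()
        mix = torch.randn(4, 100, 257).abs()  # 4 mixtures, 100 frames
        vis = torch.randn(4, 100, 36)         # matching visual features
        print(model(mix, vis).shape)          # torch.Size([4, 100, 257])

Conditioning the mask on the tracked target's visual stream is what pins a single output to that talker across segments, which is how this family of models sidesteps the permutation issue that audio-only separators face.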
Keywords
audio-visual speech separation, target speech separation, facial tracking, mask estimation