Multimodal pre-train then transfer learning approach for speaker recognition

Summaira Jabeen, Muhammad Shoib Amin, Xi Li

Multimedia Tools and Applications (2024)

Abstract
Cognitive science has firmly established the correlation between faces and voices, because the neuro-cognitive pathways that process the two kinds of information share the same structure. Recently, face-voice association has come to the attention of the computer vision community with the introduction of large-scale face-voice data. Our work leverages the shared structure of faces and voices, along with the availability of large-scale face-voice data, to improve speaker recognition tasks, namely identification and verification. We propose two novel multimodal systems, one with weight sharing and one without, that learn joint representations of the two modalities and thereby establish the face-voice association. Features are then extracted from the trained multimodal networks, which capture this association, and used to perform the speaker recognition tasks. We evaluate the proposed multimodal networks on speaker recognition and on face-voice association using the challenging benchmark datasets VoxCeleb1 and MAV-Celeb. Our results show that adding facial information improves performance on speaker recognition tasks.
Keywords
Multimodal, Face-voice association, Speaker recognition
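
The distinction the abstract draws between a system with weight sharing and one without can be illustrated with a minimal sketch. This is not the authors' architecture: it assumes face and voice features have already been extracted as fixed-length vectors of equal dimension, uses a single random linear projection per branch, and omits training entirely. The names `TwoBranchEmbedder`, `make_linear`, `apply_linear`, and `cosine` are hypothetical.

```python
import math
import random


def make_linear(in_dim, out_dim, rng):
    """Return a random weight matrix with out_dim rows and in_dim columns."""
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)]
            for _ in range(out_dim)]


def apply_linear(weights, x):
    """Project x with the weight matrix, then L2-normalise the result."""
    y = [sum(w * v for w, v in zip(row, x)) for row in weights]
    norm = math.sqrt(sum(v * v for v in y)) or 1.0
    return [v / norm for v in y]


class TwoBranchEmbedder:
    """Hypothetical two-branch model mapping face and voice features
    into one joint embedding space (untrained, for illustration only)."""

    def __init__(self, feat_dim, embed_dim, share_weights, seed=0):
        rng = random.Random(seed)
        self.face_proj = make_linear(feat_dim, embed_dim, rng)
        # With weight sharing, both branches reuse the same matrix object;
        # without it, the voice branch gets its own parameters.
        if share_weights:
            self.voice_proj = self.face_proj
        else:
            self.voice_proj = make_linear(feat_dim, embed_dim, rng)

    def embed_face(self, x):
        return apply_linear(self.face_proj, x)

    def embed_voice(self, x):
        return apply_linear(self.voice_proj, x)


def cosine(a, b):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(p * q for p, q in zip(a, b))


# Assumed toy input: one 8-dimensional feature vector used for both branches.
features = [0.1 * (i + 1) for i in range(8)]
shared = TwoBranchEmbedder(feat_dim=8, embed_dim=4, share_weights=True)
separate = TwoBranchEmbedder(feat_dim=8, embed_dim=4, share_weights=False)
shared_score = cosine(shared.embed_face(features), shared.embed_voice(features))
separate_score = cosine(separate.embed_face(features),
                        separate.embed_voice(features))
```

In a verification setting, a score such as `shared_score` would be compared against a threshold to decide whether a face and a voice belong to the same identity; the threshold and the training objective are left out here because the abstract does not specify them.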