"Sheldon Speaking, Bonjour !" - Leveraging Multilingual Tracks For (Weakly) Supervised Speaker Identification

MM '14: 2014 ACM Multimedia Conference Orlando Florida USA November, 2014(2014)

引用 7|浏览16
暂无评分
摘要
We address the problem of speaker identification in multimedia data, and TV series in particular. While speaker identification is traditionally a supervised machine-learning task, our first contribution is to significantly reduce the need for costly preliminary manual annotations through the use of automatically aligned (and potentially noisy) fan-generated transcripts and subtitles. We show that both speech activity detection and speech turn identification modules trained in this weakly supervised manner achieve similar performance as their fully supervised counterparts (i.e. relying on fine manual speech/non-speech/speaker annotation). Our second contribution relates to the use of multilingual audio tracks usually available with this kind of content to significantly improve the overall speaker identification performance. Reproducible experiments (including dataset, manual annotations and source code) performed on the first six episodes of The Big Bang Theory TV series show that combining the French audio track (containing dubbed actor voices) with the English one (with the original actor voices) improves the overall English speaker identification performance by 5% absolute and up to 70% relative on the five main characters.
更多
查看译文
关键词
speech activity detection,speaker identification,multimedia data,weak supervision,multilingual fusion
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要