DIVE: End-to-End Speech Diarization via Iterative Speaker Embedding

2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021

Abstract
We introduce DIVE, an end-to-end speaker diarization system. DIVE presents the diarization task as an iterative process: it repeatedly builds a representation for each speaker before predicting their voice activity conditioned on the extracted representations. This strategy intrinsically resolves the speaker ordering ambiguity without requiring the classical permutation invariant training loss. In contrast with prior work, our model does not rely on pretrained speaker representations and jointly optimizes all parameters of the system with a multi-speaker voice activity loss. DIVE does not require the training speaker identities and allows efficient window-based training. Importantly, our loss explicitly excludes unreliable speaker turn boundaries from training, which is adapted to the standard collar-based Diarization Error Rate (DER) evaluation. Overall, these contributions yield a system redefining the state-of-the-art on the CALLHOME benchmark, with 6.7% DER compared to 7.8% for the best alternative.
Keywords
diarization, speech, end-to-end learning