DIVE: End-to-End Speech Diarization via Iterative Speaker Embedding

2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021

Abstract
We introduce DIVE, an end-to-end speaker diarization system. DIVE presents the diarization task as an iterative process: it repeatedly builds a representation for each speaker before predicting their voice activity conditioned on the extracted representations. This strategy intrinsically resolves the speaker ordering ambiguity without requiring the classical permutation invariant training loss. In contrast with prior work, our model does not rely on pretrained speaker representations and jointly optimizes all parameters of the system with a multi-speaker voice activity loss. DIVE does not require the training speaker identities and allows efficient window-based training. Importantly, our loss explicitly excludes unreliable speaker turn boundaries from training, which is adapted to the standard collar-based Diarization Error Rate (DER) evaluation. Overall, these contributions yield a system redefining the state-of-the-art on the CALLHOME benchmark, with 6.7% DER compared to 7.8% for the best alternative.
Keywords
diarization, speech, end-to-end learning