Revisiting the Statistics Pooling Layer in Deep Speaker Embedding Learning

2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)(2021)

Abstract
The pooling function plays a vital role in the segment-level deep speaker embedding learning framework. A common approach is to compute statistics of the frame-level temporal features; two typical examples are temporal average pooling (TAP), which uses only the mean, and temporal statistics pooling (TSTP), which combines the mean and the standard deviation. Empirically, researchers have observed a large performance degradation in x-vector systems when the standard deviation is removed. Motivated by this observation, we design a set of experiments to quantitatively analyze the effectiveness of different statistics, investigating and comparing pooling functions based on the standard deviation, covariance, and ℓp-norm. Experiments are carried out on VoxCeleb and SRE16. The results show that pooling functions based on second-order statistics outperform TAP, and that the simple standard deviation alone achieves the best performance on all evaluation conditions.
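To make the difference between the two typical pooling functions concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the input is assumed to be a list of frame-level feature vectors of equal dimension, and the standard deviation is assumed to be the population (divide by T) form.

```python
import math


def temporal_average_pooling(frames):
    """TAP: per-dimension mean over time. `frames` is a T x D list of lists."""
    num_frames = len(frames)
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / num_frames for d in range(dim)]


def temporal_std_pooling(frames):
    """Per-dimension standard deviation over time.

    Assumption: population form (divide by T), chosen only for illustration.
    """
    num_frames = len(frames)
    mean = temporal_average_pooling(frames)
    return [
        math.sqrt(sum((frame[d] - mean[d]) ** 2 for frame in frames) / num_frames)
        for d in range(len(mean))
    ]


def temporal_statistics_pooling(frames):
    """TSTP: concatenate the mean and standard deviation into a 2*D vector."""
    return temporal_average_pooling(frames) + temporal_std_pooling(frames)


# Toy input: T = 3 frames, D = 2 feature dimensions.
frames = [[1.0, 2.0], [3.0, 2.0], [5.0, 8.0]]
tap = temporal_average_pooling(frames)
std = temporal_std_pooling(frames)
tstp = temporal_statistics_pooling(frames)
```

On this toy input, TAP yields a D-dimensional vector, while TSTP yields a 2D-dimensional vector whose second half carries the second-order information that the abstract reports as important.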
Keywords
speaker embedding, statistics pooling, speaker recognition