Multiview Shared Subspace Learning Across Speakers and Speech Commands

INTERSPEECH (2019)

Abstract
In many speech processing applications, the objective is to model different modes of variability to obtain robust speech features. In this paper, we learn speech representations in a multiview paradigm by constraining the views to known modes of variability such as speakers or spoken words. We use deep multiset canonical correlation analysis (dMCCA) because it can model more than two views in parallel to learn a shared subspace across them. To model thousands of views (e.g., speakers), we demonstrate that stochastically sampling a small number of views generalizes dMCCA to the larger set of views. To evaluate our approach, we study two different modes of variability in the Speech Commands Dataset: variability among speakers and among speech commands. We show that, by treating observations from one mode of variability as multiple parallel views, we can learn representations that are discriminative with respect to the other mode. We first treat different speakers as views of the same word and learn their shared subspace to represent an utterance. We then constrain the different words spoken by the same person as multiple views to learn speaker representations. Using classification and unsupervised clustering, we evaluate the efficacy of these multiview representations for identifying speech commands and speakers.
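The abstract does not include an implementation, so the following is a minimal sketch of the two ideas it names: a multiset-CCA-style objective (between-view versus within-view scatter over aligned views) and stochastic sampling of a small subset of views per step. This is a linear NumPy illustration, not the paper's deep networks; the function name `mcca_objective`, the toy data, and the subset size of 8 are illustrative assumptions.

```python
import numpy as np

def mcca_objective(views):
    """Multiset-CCA-style score for a list of aligned views.

    views: list of (n_samples, d) arrays; row i of every view is the same
    underlying item (e.g., the same word spoken by different speakers).
    Returns the ratio of between-view to within-view scatter; maximizing
    it encourages a subspace shared across the views. (Assumed linear
    stand-in for the dMCCA objective, which uses deep encoders.)
    """
    views = [v - v.mean(axis=0) for v in views]   # center each view
    mean_view = np.mean(views, axis=0)            # per-row average over views
    # Within-view scatter: deviation of each view from the across-view mean.
    within = sum((v - mean_view).T @ (v - mean_view) for v in views)
    # Between-view scatter: scatter of the shared (mean) signal,
    # from the decomposition sum_i v_i^T v_i = within + N * m^T m.
    between = len(views) * (mean_view.T @ mean_view)
    return np.trace(between) / (np.trace(within) + 1e-8)

# Stochastic view sampling: with thousands of views (speakers), draw a
# small subset per optimization step, as the abstract suggests.
rng = np.random.default_rng(0)
all_views = [rng.standard_normal((128, 32)) for _ in range(1000)]  # toy data
subset = rng.choice(len(all_views), size=8, replace=False)
score = mcca_objective([all_views[i] for i in subset])
print(f"objective on sampled views: {score:.4f}")
```

In a training loop, one would resample the view subset at every step so that, in expectation, the objective covers the full set of views.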
Keywords
multiview learning, speech commands, multiset canonical correlation analysis