Permutation invariant training of deep models for speaker-independent multi-talker speech separation

Kohei Takahashi, Toshihiko Shiraishi

MECHANICAL ENGINEERING JOURNAL (2023)

Abstract
Previous research on speech separation has significantly improved separation performance with the time-domain approach, which consists of an encoder, a separator, and a decoder. Most studies have focused on revising the architecture of the separator, while a single 1-D convolution layer and a single 1-D transposed convolution layer have typically served as the encoder and decoder, respectively. This study proposes deep encoder and decoder architectures for time-domain speech separation, built from stacked 1-D convolution layers, 1-D transposed convolution layers, or residual blocks. The aim is to improve separation performance and to overcome the tradeoff between separation performance and computational cost caused by their stride, by enhancing their mapping ability. We applied the proposed architectures to Conv-TasNet, a typical time-domain speech separation model. Our results indicate that separation performance improves as the number of encoder and decoder layers increases: changing their depth from 1 to 12 layers yields more than 1 dB improvement in SI-SDR on WSJ0-2mix. The results also suggest that the encoder and decoder should be made deeper for larger strides, since their task becomes more difficult as the stride grows. This study demonstrates the importance of improving these architectures as well as the separator.
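To make the idea concrete, the following is a minimal PyTorch sketch of a deep encoder and decoder of the kind the abstract describes: a strided 1-D convolution (or transposed convolution) framing the waveform, wrapped by a stack of residual 1-D convolution blocks. All hyperparameters here (filter count, kernel size, stride, depth) are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn


class DeepEncoder(nn.Module):
    """Strided 1-D conv front-end followed by a stack of residual
    conv blocks (depth and kernel sizes are illustrative)."""

    def __init__(self, n_filters=64, kernel_size=16, stride=8, depth=4):
        super().__init__()
        self.front = nn.Conv1d(1, n_filters, kernel_size, stride=stride)
        self.blocks = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(n_filters, n_filters, 3, padding=1),
                nn.ReLU(),
            )
            for _ in range(depth)
        )

    def forward(self, x):                # x: (batch, 1, time)
        h = torch.relu(self.front(x))
        for block in self.blocks:
            h = h + block(h)             # residual connection
        return h                         # (batch, n_filters, frames)


class DeepDecoder(nn.Module):
    """Mirror image: residual conv stack, then a transposed conv
    mapping latent frames back to waveform samples."""

    def __init__(self, n_filters=64, kernel_size=16, stride=8, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(n_filters, n_filters, 3, padding=1),
                nn.ReLU(),
            )
            for _ in range(depth)
        )
        self.back = nn.ConvTranspose1d(n_filters, 1, kernel_size, stride=stride)

    def forward(self, h):                # h: (batch, n_filters, frames)
        for block in self.blocks:
            h = h + block(h)
        return self.back(h)              # (batch, 1, ~time)


enc, dec = DeepEncoder(), DeepDecoder()
mix = torch.randn(2, 1, 8000)            # two half-second mixtures at 16 kHz
latent = enc(mix)                        # (2, 64, 999)
recon = dec(latent)                      # (2, 1, 8000)
```

In a Conv-TasNet-style pipeline, a separator would estimate per-speaker masks on `latent` before decoding; here the depth of the two stacks is the variable the paper studies, with deeper stacks compensating for larger strides.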
Keywords
Deep encoder, Deep decoder, Deep learning, Speech separation, Time-domain