Multi-resolution HuBERT: Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction
arXiv (Cornell University), 2023
Abstract
Existing Self-Supervised Learning (SSL) models for speech typically process
speech signals at a fixed resolution of 20 milliseconds. This approach
overlooks the varying informational content present at different resolutions in
speech signals. In contrast, this paper aims to incorporate multi-resolution
information into speech self-supervised representation learning. We introduce an
SSL model that leverages a hierarchical Transformer architecture, complemented
by HuBERT-style masked prediction objectives, to process speech at multiple
resolutions. Experimental results indicate that the proposed model not only
achieves more efficient inference but also exhibits superior or comparable
performance to the original HuBERT model across various tasks. Specifically,
significant performance improvements over the original HuBERT have been
observed in fine-tuning experiments on the LibriSpeech speech recognition
benchmark as well as in evaluations using the Speech Universal PERformance
Benchmark (SUPERB) and Multilingual SUPERB (ML-SUPERB).
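The abstract describes HuBERT-style masked unit prediction applied at several temporal resolutions. As an illustration only, the sketch below shows one plausible way to combine masked-prediction losses at a base 20 ms resolution and at coarser, pooled resolutions; the pooling scheme, module names (`MaskedUnitPredictor`, `multi_resolution_loss`), and tensor shapes are assumptions for exposition, not the paper's actual implementation.

```python
# Illustrative sketch of multi-resolution masked unit prediction.
# All names, dimensions, and the mean-pooling downsampling scheme are
# assumptions; this is not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedUnitPredictor(nn.Module):
    """HuBERT-style masked prediction head at one temporal resolution."""

    def __init__(self, dim: int, num_units: int):
        super().__init__()
        self.proj = nn.Linear(dim, num_units)  # logits over discrete units

    def forward(self, feats: torch.Tensor, targets: torch.Tensor,
                mask: torch.Tensor) -> torch.Tensor:
        # As in HuBERT, compute the loss only on masked frames.
        logits = self.proj(feats)                    # (B, T, num_units)
        return F.cross_entropy(logits[mask], targets[mask])


def multi_resolution_loss(feats_20ms: torch.Tensor,
                          targets_by_res: dict[int, torch.Tensor],
                          mask_20ms: torch.Tensor,
                          heads: dict[int, MaskedUnitPredictor]) -> torch.Tensor:
    """Sum masked-prediction losses over resolutions (e.g. 20 ms and 40 ms).

    feats_20ms:     (B, T, D) encoder output at the base 20 ms resolution.
    targets_by_res: discrete unit labels keyed by downsampling factor
                    (1 = 20 ms, 2 = 40 ms, ...), aligned to each resolution.
    mask_20ms:      (B, T) boolean mask of masked frames at 20 ms.
    """
    total = feats_20ms.new_zeros(())
    for factor, head in heads.items():
        if factor == 1:
            feats, mask = feats_20ms, mask_20ms
        else:
            # Coarsen features by averaging `factor` consecutive frames;
            # subsample the mask to match the pooled length.
            feats = F.avg_pool1d(feats_20ms.transpose(1, 2),
                                 kernel_size=factor).transpose(1, 2)
            mask = mask_20ms[:, ::factor][:, :feats.size(1)]
        total = total + head(feats, targets_by_res[factor], mask)
    return total
```

Summing per-resolution losses this way lets a single encoder receive training signal from both fine-grained and coarse-grained unit targets; whether the paper weights the terms equally is not stated in the abstract.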
Keywords
speech, learning, prediction, multi-resolution, self-supervised