MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training
ICLR 2024
Abstract
Self-supervised learning (SSL) has recently emerged as a promising paradigm
for training generalisable models on large-scale data in the fields of vision,
text, and speech. Although SSL has been proven effective in speech and audio,
its application to music audio has yet to be thoroughly explored. This is
partially due to the distinctive challenges associated with modelling musical
knowledge, particularly tonal and pitched characteristics of music. To address
this research gap, we propose an acoustic Music undERstanding model with
large-scale self-supervised Training (MERT), which incorporates teacher models
to provide pseudo labels in the masked language modelling (MLM) style acoustic
pre-training. In our exploration, we identified an effective combination of
teacher models that outperforms conventional speech and audio approaches.
This combination includes an acoustic teacher based on
Residual Vector Quantisation - Variational AutoEncoder (RVQ-VAE) and a musical
teacher based on the Constant-Q Transform (CQT). Furthermore, we explore a wide
range of settings to overcome the instability in acoustic language model
pre-training, which allows our designed paradigm to scale from 95M to 330M
parameters. Experimental results indicate that our model can generalise and
perform well on 14 music understanding tasks and attain state-of-the-art (SOTA)
overall scores.
Keywords
self-supervised learning, music, audio, language model
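The abstract describes MLM-style acoustic pre-training in which teacher models supply pseudo labels and the student is trained to predict them on masked frames. The following is a minimal NumPy sketch of that training objective, not the authors' implementation: the helper names (`mask_frames`, `masked_prediction_loss`), the span-masking parameters, and the randomly generated teacher codes are hypothetical stand-ins for MERT's actual RVQ-VAE and CQT teachers.

```python
import numpy as np

def mask_frames(num_frames, mask_prob=0.3, span=5, rng=None):
    """Span masking over a frame sequence: each frame is chosen as a
    span start with probability mask_prob, and `span` consecutive
    frames from each start are masked (illustrative parameters)."""
    if rng is None:
        rng = np.random.default_rng(0)
    mask = np.zeros(num_frames, dtype=bool)
    starts = rng.random(num_frames) < mask_prob
    for s in np.flatnonzero(starts):
        mask[s:s + span] = True
    return mask

def masked_prediction_loss(student_logits, teacher_codes, mask):
    """Cross-entropy between the student's predictions and the teacher's
    discrete pseudo labels, computed only on masked frames.

    student_logits: (T, V) scores over a codebook of size V
    teacher_codes:  (T,) integer pseudo labels from a teacher model
    mask:           (T,) boolean, True where frames were masked
    """
    logits = student_logits[mask]                           # (M, V)
    # Log-softmax over the codebook dimension.
    logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    # Negative log-likelihood of the teacher's code at each masked frame.
    return -logp[np.arange(mask.sum()), teacher_codes[mask]].mean()
```

In MERT itself, the pseudo labels would come from the acoustic teacher (RVQ-VAE codebook indices) and the musical teacher (CQT-derived targets); random integers stand in for them in this sketch.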