TIPAA-SSL: Text Independent Phone-to-Audio Alignment based on Self-Supervised Learning and Knowledge Transfer
arXiv (2024)
Abstract
In this paper, we present a novel approach to text-independent
phone-to-audio alignment based on phoneme recognition, representation learning,
and knowledge transfer. Our method combines a self-supervised model (wav2vec2)
fine-tuned for phoneme recognition with a Connectionist Temporal
Classification (CTC) loss, a dimension-reduction model, and a frame-level
phoneme classifier trained on forced-alignment labels (from the Montreal
Forced Aligner) to produce multilingual phonetic representations, thus
requiring minimal additional training. We evaluate our model on synthetic
native data from the TIMIT dataset and the SCRIBE dataset, for American and
British English respectively. The proposed model outperforms the
state-of-the-art (charsiu) on statistical metrics and has applications in
language learning and speech-processing systems. We leave experiments on other
languages to future work, but the design of the system makes it easily
adaptable to other languages.
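To make the frame-level classifier's role concrete, the sketch below (a hypothetical illustration, not the paper's released code) shows the final step of a frame-based aligner: collapsing per-frame phoneme predictions into phone segments with start and end frames. The function name, blank symbol, and example labels are assumptions for illustration.

```python
# Hypothetical sketch: merging per-frame phoneme predictions into phone
# segments, the last step of a frame-level phone-to-audio aligner.

def frames_to_segments(frame_labels, blank="<blank>"):
    """Merge runs of identical frame labels into (phone, start_frame,
    end_frame) tuples, dropping blank frames. Frame indices convert to
    seconds by multiplying with the model's frame hop (e.g. 20 ms for
    wav2vec2)."""
    segments = []
    prev, start = None, 0
    for i, lab in enumerate(frame_labels):
        if lab != prev:
            if prev is not None and prev != blank:
                segments.append((prev, start, i))
            prev, start = lab, i
    if prev is not None and prev != blank:
        segments.append((prev, start, len(frame_labels)))
    return segments

# Example: per-frame argmax labels for the word "cat" (/k ae t/)
frames = ["k", "k", "<blank>", "ae", "ae", "ae", "t", "t"]
print(frames_to_segments(frames))  # [('k', 0, 2), ('ae', 3, 6), ('t', 6, 8)]
```

In practice the per-frame labels would come from the classifier's argmax over phoneme posteriors, and segment boundaries could be refined with the posterior probabilities rather than hard labels.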