Tibetan Syllable Prediction with Pre-trained Cross-lingual Language Model

2022 IEEE 5th International Conference on Computer and Communication Engineering Technology (CCET), 2022

Abstract
In recent years, as Tibetan language information technology has developed, the volume of Tibetan text on the Internet has grown year by year. Tibetan input methods and Tibetan error correction both depend on language prediction, which makes Tibetan prediction a pressing problem. The task is difficult for three reasons: Tibetan syllable composition is complex, the vocabulary of Tibetan words built from syllables is extremely large, and Tibetan word segmentation technology is not yet mature. To address these problems, this paper proposes a Tibetan syllable prediction method based on a pre-trained cross-lingual language model, using Tibetan syllables rather than Tibetan words as the prediction tokens. The method takes the cross-lingual language model XLM-R and fine-tunes it on Tibetan news texts to make it better suited to predicting Tibetan in the news domain. We conduct syllable prediction experiments on texts crawled from a Tibetan news website. The experiments show that our model achieves higher precision on Tibetan text prediction than current n-gram methods.
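To make the syllable-as-token idea concrete, the sketch below splits Tibetan text into syllables and builds a simple bigram next-syllable predictor of the kind the abstract uses as a point of comparison. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not describe its preprocessing or its n-gram configuration, so splitting on the tsheg (U+0F0B), treating the shad (U+0F0D) as a boundary, and the names `split_syllables`, `train_bigram`, and `predict_next` are all hypothetical choices made here.

```python
from collections import Counter, defaultdict

TSHEG = "\u0f0b"  # Tibetan intersyllabic mark (assumed syllable delimiter)
SHAD = "\u0f0d"   # Tibetan clause-ending mark (treated here as a boundary)


def split_syllables(text):
    """Split Tibetan text into syllables on the tsheg, dropping shad and spaces."""
    cleaned = text.replace(SHAD, TSHEG).replace(" ", TSHEG)
    return [s for s in cleaned.split(TSHEG) if s]


def train_bigram(sentences):
    """Count, for each syllable, which syllables follow it in the training text."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        syllables = split_syllables(sentence)
        for prev, nxt in zip(syllables, syllables[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts, prev, k=3):
    """Return up to k most frequent successors of `prev`; empty list if unseen."""
    return [s for s, _ in counts.get(prev, Counter()).most_common(k)]


if __name__ == "__main__":
    # A tiny illustrative corpus, not data from the paper.
    corpus = [
        TSHEG.join(["བཀྲ", "ཤིས", "བདེ", "ལེགས"]) + SHAD,
        TSHEG.join(["བཀྲ", "ཤིས", "ལོ", "གསར"]) + SHAD,
    ]
    model = train_bigram(corpus)
    print(predict_next(model, "བཀྲ", k=1))
```

A count-based model like this can only rank successors it has already seen after the same context, which is one reason to compare it against a fine-tuned pre-trained model such as XLM-R.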
Keywords
Tibetan text, syllable prediction, cross-lingual language model, fine-tuning