Language identification, a tool at the service of Corsican and the evaluation of linguistic resources

TRAITEMENT AUTOMATIQUE DES LANGUES (2021)

Abstract
Building corpora is one of the first priorities for less-resourced languages. The emergence of Internet-based resources of increasing size, covering more and more languages, may suggest that this issue has been resolved, but this is not the case. Following Caswell et al. (2021), who evaluated several large resources, including one with Corsican content, we analyzed two corpora that include this language: An Crubadan and W2C. Alongside a manual evaluation, we considered using one or more language identification modules to filter the content of these resources; this turns out to be possible, but at the cost of low recall. For this task, we tested and retrained various systems to adapt them to Corsican. This work yields a model that identifies 17 European languages as well as Corsican.
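The trade-off mentioned above (filtering is possible, but recall is low) can be illustrated with a minimal sketch. It assumes a language identifier has already produced a (label, confidence) pair for each line of a corpus; the function name `filter_and_score`, the 0.9 threshold, and the toy data are hypothetical and are not taken from the paper. The label `co` is the ISO 639-1 code for Corsican.

```python
from typing import List, Tuple


def filter_and_score(
    predictions: List[Tuple[str, float]],
    gold: List[str],
    target: str = "co",
    threshold: float = 0.9,
) -> Tuple[List[int], float, float]:
    """Keep lines predicted as `target` with enough confidence.

    Returns the kept line indices plus the precision and recall of the
    filter, measured against manually assigned gold labels.
    """
    # Indices of lines the filter keeps.
    kept = [
        i
        for i, (label, conf) in enumerate(predictions)
        if label == target and conf >= threshold
    ]
    # Kept lines that really are in the target language.
    true_pos = sum(1 for i in kept if gold[i] == target)
    # All lines that really are in the target language.
    total_target = sum(1 for g in gold if g == target)

    precision = true_pos / len(kept) if kept else 0.0
    recall = true_pos / total_target if total_target else 0.0
    return kept, precision, recall


# Made-up identifier output and gold labels, for illustration only.
predictions = [("co", 0.97), ("co", 0.55), ("it", 0.80), ("co", 0.92), ("fr", 0.99)]
gold = ["co", "co", "co", "it", "fr"]

kept, precision, recall = filter_and_score(predictions, gold)
print(kept, precision, recall)
```

Raising the threshold in such a filter tends to discard genuine target-language lines along with the noise, which is one way a filter can end up with low recall.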
Keywords
corpora, quality, language identification, less-resourced languages, Corsican