Machine Learning and Deep Neural Network-Based Lemmatization and Morphosyntactic Tagging for Serbian.

LREC (2020)

Abstract
The training of new tagger models for Serbian is primarily motivated by the enhancement of the existing tagset with the grammatical category of gender. The harmonization of resources that were manually annotated within different projects over a long period of time was an important task, enabled by the development of tools that support partial automation. The supporting tools take into account different taggers and tagsets. This paper focuses on the TreeTagger and spaCy taggers, and on the annotation schema alignment between Serbian morphological dictionaries, MULTEXT-East, and the Universal Part-of-Speech tagset. The trained models will be used to publish the new version of the Corpus of Contemporary Serbian as well as the Serbian literary corpus. The performance of the developed taggers was compared and the impact of training set size was investigated; both new models reached around 98% PoS-tagging precision per token. The SR BASIC annotated dataset will also be published.
Keywords
Part-of-Speech tagging, lemmatization, corpus, evaluation, Serbian, morphological dictionary