Dataset Enhancement and Multilingual Transfer for Named Entity Recognition in the Indonesian Language

ACM Transactions on Asian and Low-Resource Language Information Processing(2023)

引用 0|浏览2
暂无评分
摘要
Named entity recognition in the Indonesian language has significantly developed in recent years. However, it still lacks standardized publicly available corpora; a small dataset is available but suffers from inconsistent annotations. Therefore, we re-annotated the dataset to improve its consistency and benefit the community. Our re-annotation led to better training results from an effective baseline model consisting of bidirectional long short-term memory and conditional random fields. To fully utilize the limited available data, we utilized better contextualization and transferred external knowledge by exploiting monolingual and multilingual pre-trained language models, such as IndoBERT and XLM-RoBERTa. In addition to the general improvement from the language models, we observed that the monolingual model is more sensitive, while the multilingual ones show advantages in rich morphological knowledge. We also applied cross-lingual transfer learning to utilize high-resource corpora in other languages. We adopted English, Spanish, Dutch, and German as the source languages for the target Indonesian language and found that Dutch plays a special role in the data transfer method due to morphological similarity attributable to historical reasons.
更多
查看译文
关键词
named entity recognition,multilingual transfer,dataset,language
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要