Expanding UlyssesNER-Br Named Entity Recognition Corpus with Informal User-Generated Text

PROGRESS IN ARTIFICIAL INTELLIGENCE, EPIA 2022 (2022)

Abstract
Named Entity Recognition (NER) is a challenging Natural Language Processing task for a language as rich as Portuguese. When applied to informal language and short texts, the task acquires a new layer of complexity, since it must handle a lexicon particular to the domain in question. In this paper, we expand the UlyssesNER-Br corpus for the NER task with Brazilian Portuguese comments about bills. Additionally, we enrich the annotated set with a formal corpus in order to analyze whether combining formal and informal texts from the same domain can improve NER. Finally, we carry out experiments with a Conditional Random Fields (CRF) model and a Bidirectional LSTM-CRF (BiLSTM-CRF) model, and subsequently fine-tune a BERT language model on the NER task with our dataset. We conclude that formal texts helped the identification of entities in informal texts. The best model was the fine-tuned BERT, which achieved an F1-score of 73.90%, beating the benchmark of related works.
Keywords
Named Entity Recognition, Corpus annotation, Informal text, Legal text