DODFMiner: An automated tool for Named Entity Recognition from Official Gazettes

Gabriel M. C. Guimaraes, Felipe X. B. da Silva, Andrei L. Queiroz,Ricardo M. Marcacini,Thiago P. Faleiros, Vinicius R. P. Borges,Luis P. F. Garcia

NEUROCOMPUTING(2024)

引用 0|浏览4
暂无评分
摘要
Official gazettes are documents published by governments to publicize their actions, spanning long periods of time and making an important transparency mechanism. These documents have information on laws, contracts, and bidding processes, as well as on civil servants and their careers in public service. Automatic information extraction of these documents may contribute to public transparency, with two tasks being especially useful: the classification of the different segments of these documents, the so called acts; and the Named Entity Recognition (NER) within the acts. The variety of official gazettes and their patterns brings up the necessity of constructing different tools for specific gazettes. In this paper, we propose DODFMiner, a command-line interface tool to classify acts and extract named entities from the Official Gazette of the Federal District. The tool follows a 3-step approach: the pre-processing of the input data; text classification using rule-based systems with regular expressions; and NER with Machine Learning algorithms. It allows users to input JSON files and receive CSV as output, providing information that allows users to track government procurements through years, contracts duration and total amount, among others. We also propose a set of experiments to support the choice of models included in the tool, covering the classification and NER steps. Text classification achieved a mean F1-score of 0.778, while to the NER, we compared 3 different architectures, CRF with a mean F1-score of 0.851, CNN-biLSTM-CRF with 0.787 and CNN-CNN-LSTM with 0.841.
更多
查看译文
关键词
Legal documents,Official gazettes,Natural Language Processing,Classification,Named Entity Recognition
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要