Assessing and Minimizing the Impact of OCR Quality on Named Entity Recognition
TPDL (2020)
Abstract
In digital libraries, the accessibility of digitized documents is directly related to the way they are indexed. Named entities are one of the main entry points used to search and retrieve digital documents. However, most digitized documents are indexed through their OCRed version, and OCR errors may hinder their accessibility. This paper aims to quantitatively estimate the impact of OCR quality on the performance of named entity recognition (NER). We tested state-of-the-art NER techniques over several evaluation benchmarks and experimented with various levels and types of synthesised OCR noise to estimate its impact on NER performance. We share all corresponding datasets. To the best of our knowledge, no other research work has systematically studied the impact of OCR on named entity recognition over datasets in multiple languages. The final outcome of this study is an evaluation over historical newspaper data of the National Library of Finland, resulting in an increase of around 11 percentage points in F1-measure over the best-known results to date.
Keywords
Digitized documents, Indexing, OCR, Named entity recognition
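
To make the experimental idea in the abstract concrete, here is a minimal sketch of injecting character-level noise at a chosen rate and scoring entity predictions with F1. The abstract does not describe how the authors synthesised their OCR noise, so everything here is an illustrative assumption rather than their method: the function names `add_ocr_noise` and `entity_f1`, the three edit operations, and the exact-match span scoring are all hypothetical.

```python
import random
import string


def add_ocr_noise(text: str, rate: float, seed: int = 0) -> str:
    """Select each non-space character for corruption with probability `rate`.

    Assumption: noise is modelled as random substitution, deletion, or
    insertion of lowercase letters. Real OCR errors are not uniform.
    """
    rng = random.Random(seed)  # seeded so that runs are reproducible
    out = []
    for ch in text:
        if ch.isspace() or rng.random() >= rate:
            out.append(ch)  # keep whitespace and unselected characters
            continue
        op = rng.choice(("substitute", "delete", "insert"))
        if op == "substitute":
            out.append(rng.choice(string.ascii_lowercase))
        elif op == "insert":
            out.append(ch)
            out.append(rng.choice(string.ascii_lowercase))
        # "delete": append nothing
    return "".join(out)


def entity_f1(gold: set, predicted: set) -> float:
    """Exact-match F1 over sets of (start, end, label) entity spans."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

In an experiment of this kind, one would run the same NER system on the clean text and on copies produced with increasing `rate` values, then compare the resulting F1 scores against the gold annotations.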