A Tool for Facilitating OCR Postediting in Historical Documents

Poncelas, Alberto ORCID: 0000-0002-5089-1687 <https://orcid.org/0000-0002-5089-1687>, Aboomar, Mohammad ORCID: 0000-0002-1391-5061 <https://orcid.org/0000-0002-1391-5061>, Buts, Jan ORCID: 0000-0002-7657-804X <https://orcid.org/0000-0002-7657-804X>, Hadley, James ORCID: 0000-0003-1950-2679 <https://orcid.org/0000-0003-1950-2679> and Way, Andy ORCID: 0000-0001-5736-5930 <https://orcid.org/0000-0001-5736-5930> (2020) A tool for facilitating OCR postediting in historical documents. In: Workshop on Language Technologies for Historical and Ancient Languages, LT4HALA (2020), 12 May 2020, Marseille, France. (In Press)

Abstract
Optical character recognition (OCR) for historical documents is a complex procedure subject to a unique set of material issues, including inconsistencies in typefaces and low quality scanning. Consequently, even the most sophisticated OCR engines produce errors. This paper reports on a tool built for postediting the output of Tesseract, more specifically for correcting common errors in digitized historical documents. The proposed tool suggests alternatives for word forms not found in a specified vocabulary. The assumed error is replaced by a presumably correct alternative in the post-edition based on the scores of a Language Model (LM). The tool is tested on a chapter of the book An Essay Towards Regulating the Trade and Employing the Poor of this Kingdom (Cary, 1719). As demonstrated below, the tool is successful in correcting a number of common errors. If sometimes unreliable, it is also transparent and subject to human intervention.
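The correction loop described in the abstract (flag word forms missing from a vocabulary, propose alternatives, keep the one the LM scores highest) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and not the authors' implementation: the toy vocabulary and corpus, the use of Levenshtein distance to generate candidates, the add-one smoothed bigram model standing in for the LM, and the function names `edit_distance`, `bigram_logprob`, and `postedit`.

```python
import math
from collections import Counter

# Hypothetical toy data; a real run would load a large vocabulary and a trained LM.
VOCAB = {"the", "trade", "of", "this", "kingdom", "poor", "and", "trace"}
CORPUS = "the trade of this kingdom and the poor of this kingdom".split()

UNIGRAMS = Counter(CORPUS)
BIGRAMS = Counter(zip(CORPUS, CORPUS[1:]))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def bigram_logprob(prev_word: str, word: str) -> float:
    """Add-one smoothed bigram log-probability (assumed stand-in for the LM score)."""
    numerator = BIGRAMS[(prev_word, word)] + 1
    denominator = UNIGRAMS[prev_word] + len(VOCAB)
    return math.log(numerator / denominator)


def postedit(tokens, max_distance=2):
    """Replace out-of-vocabulary tokens with the best-scoring nearby vocabulary word."""
    output = []
    for token in tokens:
        if token in VOCAB:
            output.append(token)
            continue
        # Candidate alternatives: vocabulary words within a small edit distance.
        candidates = sorted(
            w for w in VOCAB if edit_distance(token, w) <= max_distance
        )
        if not candidates:
            # No plausible alternative: leave the token for a human posteditor.
            output.append(token)
            continue
        previous = output[-1] if output else "<s>"
        output.append(max(candidates, key=lambda w: bigram_logprob(previous, w)))
    return output


if __name__ == "__main__":
    print(" ".join(postedit("the trabe of tbis kingdom".split())))
```

In this sketch, "trabe" is equally close to "trade" and "trace", so the choice between them is made by the language model score given the preceding word. Tokens with no nearby vocabulary entry are left untouched, which keeps the output open to the human intervention the abstract mentions.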