User-driven correction of OCR errors: combining crowdsourcing and information retrieval technology


In this paper we describe a new approach to correcting noisy OCR text that combines the power of crowdsourcing with information retrieval technology. Searching a given full text, validating the results, and correcting single terms are modeled as a joint effort. Users can correct exactly those words in a document collection that are of specific interest to them; that is, they can influence the precision and recall of specific search terms and directly correct erroneous strings. The graphical user interface (GUI) offers two main features for this task. To improve precision, it provides a view of the word snippets for a specific search string and lets the user validate each snippet with a simple yes/no decision. To improve recall, it draws on standard features of the search engine, such as fuzzy search and wildcards. The corrected and/or approved words are immediately available for searching, and the underlying XML files are updated simultaneously. The method has been developed as a prototype tool based on standard technology (Java, Lucene, Ajax). It will be published as an open-source software package during 2014 (working title: corr4ocr).
Keywords: digital humanities, digitization, human factors, languages, large text archives, measurement, OCR correction tool, optical character recognition, reliability, verification
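
The search, validate, and correct loop described in the abstract can be illustrated with a small sketch. It is not the corr4ocr implementation: the prototype is built on Java and Lucene, whereas this Python example uses a plain Levenshtein distance as a stand-in for the search engine's fuzzy search. The names `Collection`, `search`, `snippet`, and `validate` are hypothetical, and the XML update step is left out.

```python
from dataclasses import dataclass, field


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions cost 1 each."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


@dataclass
class Collection:
    """Hypothetical in-memory stand-in for an indexed OCR text."""
    tokens: list
    approved: set = field(default_factory=set)  # positions a user confirmed

    def search(self, term: str, max_edits: int = 0) -> list:
        """Positions whose token is within max_edits of the search term.

        max_edits=0 is an exact search; larger values widen the match
        and so raise recall for misrecognized words.
        """
        term = term.lower()
        return [i for i, tok in enumerate(self.tokens)
                if edit_distance(tok.lower(), term) <= max_edits]

    def snippet(self, pos: int, width: int = 2) -> str:
        """A few words of context around a hit, shown to the user."""
        lo = max(0, pos - width)
        return " ".join(self.tokens[lo:pos + width + 1])

    def validate(self, pos: int, term: str, is_match: bool) -> None:
        """Apply a yes/no decision; 'yes' corrects the token in place."""
        if is_match:
            self.tokens[pos] = term
            self.approved.add(pos)


# Assumed sample text: "rnill" is an OCR misreading of "mill".
text = Collection("the old rnill by the river and the milk cart near the mill".split())
candidates = text.search("mill", max_edits=2)   # fuzzy search widens recall
decisions = {"rnill": True, "milk": False, "mill": True}  # simulated user answers
for pos in candidates:
    text.validate(pos, "mill", decisions[text.tokens[pos]])
```

In this sketch the fuzzy search also returns the unrelated word "milk", which is where the yes/no decision matters: rejecting it keeps precision intact, while accepting "rnill" makes that occurrence findable by an exact search straight away.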