User-driven correction of OCR errors: combining crowdsourcing and information retrieval technology

DATeCH (2014)

Abstract
In this paper we describe a new approach to correcting noisy OCR text that combines crowdsourcing with information retrieval technology. Searching a full text, validating the results, and correcting single terms are modeled as a joint effort. Users can correct exactly those words in a document collection that are of specific interest to them; that is, they can influence the precision and recall of specific search terms and directly correct erroneous strings. The graphical user interface (GUI) offers two main features for this task. To improve precision, it displays the word snippets for a specific search string and lets the user validate each snippet with a simple yes/no decision. To improve recall, it uses standard search engine features such as fuzzy search and wildcards. Corrected and approved words are immediately available for searching, and the underlying XML files are updated simultaneously. The method has been developed as a prototype tool based on standard technology (Java, Lucene, Ajax). It will be published as an open source software package during 2014 (working title: corr4ocr).
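The search, validate, and correct loop described in the abstract can be illustrated with a minimal sketch. It is not the corr4ocr implementation: the `Token` class and every function name are hypothetical, the in-memory list stands in for the Lucene index, `difflib` similarity stands in for Lucene's fuzzy queries, and the simultaneous update of the XML files is left out.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from fnmatch import fnmatchcase


@dataclass
class Token:
    """One OCR word occurrence (hypothetical model, not the tool's schema)."""
    doc_id: str
    position: int
    text: str
    verified: bool = False  # set once a user has approved or corrected it


def exact_search(tokens, term):
    """Precision view: occurrences whose OCR text equals the search term."""
    term = term.lower()
    return [t for t in tokens if t.text.lower() == term]


def fuzzy_search(tokens, term, min_ratio=0.75):
    """Recall view: occurrences whose text is similar to the search term.

    Assumption: difflib's similarity ratio replaces a real fuzzy query.
    """
    term = term.lower()
    return [
        t for t in tokens
        if SequenceMatcher(None, t.text.lower(), term).ratio() >= min_ratio
    ]


def wildcard_search(tokens, pattern):
    """Recall view: shell-style wildcards (* and ?) against the OCR text."""
    pattern = pattern.lower()
    return [t for t in tokens if fnmatchcase(t.text.lower(), pattern)]


def approve(token):
    """User answered 'yes': the snippet really shows the searched word."""
    token.verified = True


def correct(token, new_text):
    """User fixed an erroneous string; the change is searchable at once
    because searches read the same token list."""
    token.text = new_text
    token.verified = True


if __name__ == "__main__":
    words = ["Zeitung", "Zeitnng", "Zeituug", "Leitung", "Haus"]
    tokens = [Token("doc1", i, w) for i, w in enumerate(words)]

    # Fuzzy search widens recall, so it also returns a false hit ("Leitung").
    for hit in fuzzy_search(tokens, "Zeitung"):
        if hit.text == "Leitung":
            continue            # user answers "no": a different word
        if hit.text == "Zeitung":
            approve(hit)        # user answers "yes"
        else:
            correct(hit, "Zeitung")  # user fixes the OCR error

    print([t.text for t in exact_search(tokens, "Zeitung")])
```

In this sketch the fuzzy query deliberately over-matches, and the yes/no step is what restores precision: the user rejects "Leitung", approves the correct occurrence, and repairs the two misrecognized ones so that an exact search finds all three afterwards.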
Keywords
digital humanities, digitization, human factors, languages, large text archives, measurement, OCR correction tool, optical character recognition, reliability, verification