Lexically Aware Semi-Supervised Learning for OCR Post-Correction

TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS(2021)

引用 9|浏览48
暂无评分
摘要
Much of the existing linguistic data in many languages of the world is locked away in nondigitized books and documents. Optical character recognition (OCR) can be used to produce digitized text, and previous work has demonstrated the utility of neural post-correction methods that improve the results of general-purpose OCR systems on recognition of lesswell-resourced languages. However, these methods rely on manually curated postcorrection data, which are relatively scarce compared to the non-annotated raw images that need to be digitized. In this paper, we present a semi-supervised learning method that makes it possible to utilize these raw images to improve performance, specifically through the use of self-training, a technique where a model is iteratively trained on its own outputs. In addition, to enforce consistency in the recognized vocabulary, we introduce a lexically aware decoding method that augments the neural post-correction model with a count-based language model constructed from the recognized texts, implemented using weighted finite-state automata (WFSA) for efficient and effective decoding. Results on four endangered languages demonstrate the utility of the proposed method, with relative error reductions of 15%-29%, where we find the combination of self-training and lexically aware decoding essential for achieving consistent improvements.(1)
更多
查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要