Information Extraction in Handwritten Marriage Licenses Books

Proceedings of the 5th International Workshop on Historical Document Imaging and Processing(2019)

引用 5|浏览1
暂无评分
摘要
Handwritten marriage licenses books are characterized by a simple structure of the text in the records with an evolutionary vocabulary, mainly composed of proper names that change along the time. This distinct vocabulary makes automatic transcription and semantic information extraction difficult tasks. Previous works have shown that the use of category-based language models and a Grammatical Inference technique known as MGGI can improve the accuracy of these tasks. However, the application of the MGGI algorithm requires an a priori knowledge to label the words of the training strings, that is not always easy to obtain. In this paper we study how to automatically obtain the information required by the MGGI algorithm using a technique based on Confusion Networks. Using the resulting language model, full handwritten text recognition and information extraction experiments have been carried out with results supporting the proposed approach.
更多
查看译文
关键词
Handwritten recognition, Information extraction, Marriage Licenses
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要