The “ScribbleLens” Dutch Historical Handwriting Corpus

17th International Conference on Frontiers in Handwriting Recognition (ICFHR), 2020

Abstract
Historical handwritten documents guard an important part of human knowledge that is within reach of only a few scholars and experts. Recent developments in machine learning have the potential to make this information accessible to a larger audience. Data-driven approaches to automatic manuscript recognition require large amounts of transcribed scans to work. To this end, we introduce a new handwritten corpus based on 400-year-old, cursive, early modern Dutch documents such as ship journals and daily logbooks. This is a 1000-page collection, segmented into lines to facilitate fully supervised, weakly supervised, and unsupervised research, with textual transcriptions for 20% of the pages. Other annotations such as handwriting slant, year of origin, complexity, and writer identity have been manually added. With over 80 writers, this corpus is significantly larger and more varied than other existing historical data sets such as the Spanish RODRIGO. We provide train/test splits, experimental results from an automatic transcription baseline, and tools to facilitate its use in deep learning research. The manuscripts span over 150 years of significant journeys by captains and traders from the Vereenigde Oostindische Compagnie (VOC), such as Tasman, Brouwer, and Van Neck, making this resource also valuable to historians and the paleography community.
Keywords
Handwriting analysis, Document analysis, Optical character recognition, Machine learning, Paleography