EEBO-Verse: Sifting for Poetry in Large Early Modern Corpora Using Visual Features.

ICDAR (5)(2023)

Abstract
An important branch of digital humanities research focuses on the study of poetry and verse, leveraging large corpora to reveal patterns and trends. However, this work is limited by currently available poetry corpora, which are restricted to a few languages and consist mainly of works by well-known classic poets. In this paper, we develop a new large-scale poetry collection, EEBO-verse (code and dataset are available at https://github.com/taineleau/ebbo-verse), by automatically identifying the poems in a large collection of Early Modern books, English Early-modern Printed Books Online (EEBO). Instead of training text-based classifiers to sub-select the 3.5% of EEBO that actually consists of poetry, we develop an image-based classifier that operates directly on page scans, removing the need to perform OCR, which is often unreliable in this domain. We leverage large visual document encoders (DiT and BEiT), which are pretrained on general-domain document images, by fine-tuning them on an in-domain annotated subset of EEBO. In experiments, we find that an appropriately trained image-only classifier performs as well as or better than text-based poetry classifiers on human-transcribed text, and far surpasses the performance of text-based classifiers on OCR output.
Keywords
large early modern corpora,poetry,visual features,sifting,eebo-verse
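
The abstract states that only 3.5% of EEBO consists of poetry, so any page-level poetry classifier faces a heavily imbalanced task. The sketch below is an illustration only: it uses synthetic labels and hypothetical function names, and it assumes precision, recall, and F1 on the poetry class as the measures of interest (the abstract does not say which metrics the paper reports). It shows why raw accuracy can be misleading at this base rate.

```python
# Illustrative sketch with synthetic data. The function names, labels, and
# classifier outputs are hypothetical and are not taken from the paper or
# its repository.

def accuracy(y_true, y_pred):
    """Fraction of pages whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F1) for the positive class (1 = poetry)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# 1000 synthetic pages, 35 of them poetry (the 3.5% rate from the abstract).
y_true = [1] * 35 + [0] * 965

# Trivial baseline that never predicts poetry.
never_poetry = [0] * 1000

# Made-up classifier output: finds 28 of the 35 poetry pages and raises
# 7 false alarms on non-poetry pages.
toy_classifier = [1] * 28 + [0] * 7 + [1] * 7 + [0] * 958

for name, y_pred in [("never poetry", never_poetry), ("toy classifier", toy_classifier)]:
    p, r, f1 = precision_recall_f1(y_true, y_pred)
    print(f"{name}: accuracy={accuracy(y_true, y_pred):.3f} "
          f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

On this synthetic split the trivial baseline is right on most pages while retrieving no poetry at all, which is why positive-class metrics are the more informative comparison between image-based and text-based classifiers.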