DuoSearch: A Novel Search Engine for Bulgarian Historical Documents

Angel Beshirov, Suzan Hadzhieva, Ivan Koychev, Milena Dobreva

Advances in Information Retrieval, Pt II (2022)

Abstract
Search in collections of digitised historical documents is hindered by a two-pronged problem: orthographic variety and optical character recognition (OCR) errors. We present a new search engine for historical documents, DuoSearch, which uses ElasticSearch and machine learning methods based on deep neural networks to address this problem. It was tested on a collection of historical newspapers in Bulgarian from the mid-19th to the mid-20th century. The system provides an interactive and intuitive interface for end users, allowing them to enter search terms in modern Bulgarian and search across historical spellings. This is the first solution facilitating the use of digitised historical documents in Bulgarian.
Keywords
Historical newspapers search engine, Orthographic variety, Post-OCR text correction, BERT