Making Latin Manuscripts Searchable using gHMMs

NIPS(2004)

引用 94|浏览82
暂无评分
摘要
We describe a method that can make a scanned, handwritten mediaeval latin manuscript accessible to full text search. A generalized HMM is Þtted, using transcribed latin to obtain a transition model and one exam- ple each of 22 letters to obtain an emission model. We show results for unigram, bigram and trigram models. Our method transcribes 25 pages of a manuscript of Terence with fair accuracy (75% of letters correctly transcribed). Search results are very strong; we use examples of vari- ant spellings to demonstrate that the search respects the ink of the doc- ument. Furthermore, our model produces fair searches on a document from which we obtained no training data. 1. Intoduction There are many large corpora of handwritten scanned documents, and their number is grow- ing rapidly. Collections range from the complete works of Mark Twain to thousands of pages of zoological notes spanning two centuries. Large scale analyses of such corpora is currently very difÞcult, because handwriting recognition works poorly. Recently, Rath and Manmatha have demonstrated that one can use small bodies of aligned material as supervised data to train a word spotting mechanism (6). The result can make scanned handwritten documents searchable. Current techniques assume a closed vocabulary — one can search only for words in the training set — and search for instances of whole words. This approach is particularly unattractive for an inßected language, because individual words can take so many forms that one is unlikely to see all in the training set. Furthermore, one would like the method used to require very little aligned training data, so that it is possible to process documents written by different scribes with little overhead. Mediaeval Latin manuscripts are a natural Þrst corpus for studying this problem, because there are many scanned manuscripts and because the handwriting is relatively regular. We expect the primary user need to be search over a large body of documents — to allow comparisons between documents — rather than transcription of a particular document (which is usually relatively easy to do by hand). Desirable features for a system are: First, that it use little or no aligned training data (an
更多
查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要