An Efficient Framework for Searching Text in Noisy Document Images

Document Analysis Systems (2012)

Abstract
An efficient word spotting framework is proposed for searching text in scanned books. The method allows one to search for words when optical character recognition (OCR) fails due to noise, or for languages for which no OCR exists. Given a query word image, the aim is to retrieve matching words in the book, sorted by similarity. In the offline stage, SIFT descriptors are extracted at the corner points of each word image. These features are quantized into visual terms (visterms) using a hierarchical K-Means algorithm and indexed using an inverted file. In the query resolution stage, candidate matches are efficiently identified using the inverted index. These word images are then forwarded to the next stage, where the configuration of visterms on the image plane is tested. Configuration matching is performed efficiently by projecting the visterms onto the horizontal axis and searching for the Longest Common Subsequence (LCS) between the sequences of visterms. The proposed framework is tested on one English and two Telugu books. It is shown that the proposed method resolves a typical user query in under 10 milliseconds while providing very high retrieval accuracy (Mean Average Precision 0.93). The search accuracy for the English book is comparable to that of searching text in the high-accuracy output of a commercial OCR engine.
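
The two query stages described in the abstract (candidate lookup in an inverted index, then LCS-based configuration matching) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes each word image has already been reduced to a list of integer visterm ids, so SIFT extraction and hierarchical K-Means quantization are not shown. The function names, the `min_shared` threshold, and the normalization of the LCS length by the longer sequence are all hypothetical choices made for illustration only.

```python
from collections import defaultdict


def project_to_sequence(keypoints):
    """Order visterms by x-coordinate (projection onto the horizontal axis).

    keypoints: iterable of (x, visterm_id) pairs (assumed input format).
    """
    return [visterm for _, visterm in sorted(keypoints)]


def build_inverted_index(word_visterms):
    """Map each visterm id to the set of word-image ids that contain it."""
    index = defaultdict(set)
    for word_id, visterms in word_visterms.items():
        for visterm in visterms:
            index[visterm].add(word_id)
    return index


def candidates(index, query, min_shared=2):
    """Word ids sharing at least min_shared distinct visterms with the query."""
    counts = defaultdict(int)
    for visterm in set(query):
        for word_id in index.get(visterm, ()):
            counts[word_id] += 1
    return {word_id for word_id, n in counts.items() if n >= min_shared}


def lcs_length(a, b):
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def search(word_visterms, index, query, min_shared=2):
    """Rank candidate words by LCS length normalized by the longer sequence.

    The normalization is an assumption of this sketch, not taken from the paper.
    """
    scored = []
    for word_id in candidates(index, query, min_shared):
        target = word_visterms[word_id]
        score = lcs_length(query, target) / max(len(query), len(target))
        scored.append((word_id, score))
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Toy data: visterm sequences already ordered left to right.
words = {
    "w1": [3, 7, 7, 2, 9],
    "w2": [3, 7, 2, 9],
    "w3": [9, 2, 7, 3],  # same visterms as the query, reversed order
    "w4": [1, 4, 5],     # shares no visterms with the query
}
index = build_inverted_index(words)
results = search(words, index, [3, 7, 2, 9])
```

In this toy example, `w3` contains exactly the same visterms as the query and so passes the inverted-index stage, but it scores poorly in the LCS stage because the visterms appear in a different left-to-right order. This is the role the abstract assigns to configuration matching.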
Keywords
image plane, high accuracy output, commercial OCR engine, high retrieval accuracy, noisy document images, searching text, word image, query word image, efficient word, proposed framework, efficient framework, next stage, indexation, image resolution, indexing, image retrieval, inverted file, noise, k means algorithm, feature extraction, accuracy, natural language processing, text analysis, longest common subsequence, mean average precision, optical character recognition, detectors, inverted index