BLSTM Neural Network Based Word Retrieval for Hindi Documents

Document Analysis and Recognition (2011)

Abstract
Retrieval from Hindi document image collections is a challenging task. This is partly due to the complexity of the script, which has more than 800 unique ligatures. In addition, segmentation and recognition of individual characters often become difficult due to the writing style as well as degradations in the print. For these reasons, robust OCR systems are nonexistent for Hindi, and Hindi document repositories are therefore not amenable to indexing and retrieval. In this paper, we propose a scheme for retrieving relevant Hindi documents in response to a query word. This approach uses bidirectional long short-term memory (BLSTM) neural networks. Designed to take contextual information into account, these networks can handle word images that cannot be robustly segmented into individual characters. By zoning the Hindi words, we simplify the problem and obtain high retrieval rates. Our simplification suits the retrieval problem, although it does not apply to recognition. Our scalable retrieval scheme avoids explicit recognition of characters. An experimental evaluation on a dataset of word images gathered from two complete books demonstrates good accuracy even in the presence of printing variations and degradations. The performance is compared with baseline methods.
Keywords
Hindi word, BLSTM neural network, Hindi documents, relevant Hindi document, Hindi document image collection, scalable retrieval scheme, word retrieval, explicit recognition, retrieval problem, Hindi document repository, word image, individual character, high retrieval rate, image retrieval, vectors, Hindi, robustness, image segmentation, natural language processing, neural networks, optical character recognition, neural nets