Extraction of Tabular Data from Document Images

W4A (2017)

Abstract
In this paper, we propose a heuristics-based method for the automatic detection and extraction of tabular data from document images. The approach uses page segmentation techniques together with an OCR engine to obtain the text and bounding box of each word in the document. These elements are then grouped bottom-up, according to a series of rules, to identify and reconstruct tabular arrangements of data. Based on this methodology, we have implemented an open-source, cross-platform tool that recognizes the semantic structure of documents containing tabular data, thus widening the range of document types that can be successfully converted into alternative accessible formats suitable for users with visual impairments.
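
The abstract does not list the grouping rules themselves, so the following is only a minimal sketch of what bottom-up grouping of OCR word boxes could look like, not the authors' method. The `Word` type, the function names, and the two thresholds (`min_overlap`, `max_gap`) are hypothetical choices made for illustration. Words are first merged into rows by vertical overlap, and each row is then split into cells wherever the horizontal gap between neighbouring words is wide.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One OCR word with its bounding box (hypothetical structure)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def group_rows(words, min_overlap=0.5):
    """Group words into rows when they overlap vertically (assumed rule)."""
    rows = []
    for w in sorted(words, key=lambda w: (w.y0, w.x0)):
        for row in rows:
            ref = row[0]
            overlap = min(w.y1, ref.y1) - max(w.y0, ref.y0)
            height = min(w.y1 - w.y0, ref.y1 - ref.y0)
            if height > 0 and overlap / height >= min_overlap:
                row.append(w)
                break
        else:
            # No existing row overlaps enough, so start a new one.
            rows.append([w])
    return [sorted(r, key=lambda w: w.x0) for r in rows]


def split_cells(row, max_gap=15.0):
    """Merge neighbouring words into cells; a wide gap starts a new cell."""
    cells = [[row[0]]]
    for prev, cur in zip(row, row[1:]):
        if cur.x0 - prev.x1 > max_gap:
            cells.append([cur])
        else:
            cells[-1].append(cur)
    return [" ".join(w.text for w in cell) for cell in cells]


def extract_table(words):
    """Return the table as a list of rows, each a list of cell strings."""
    return [split_cells(row) for row in group_rows(words)]


# Made-up word boxes standing in for OCR output.
sample_words = [
    Word("Name", 10, 10, 50, 22),
    Word("Age", 120, 10, 150, 22),
    Word("Ada", 10, 30, 40, 42),
    Word("Lovelace", 45, 30, 100, 42),
    Word("36", 120, 30, 135, 42),
]
table = extract_table(sample_words)
```

With the sample boxes above, `table` holds two rows of two cells each, and "Ada" and "Lovelace" are merged into one cell because the gap between them is below `max_gap`. A practical system would also need rules to align cells into columns across rows and to tell tables apart from ordinary paragraphs, which this sketch leaves out.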