Making scanned Arabic documents machine accessible using an ensemble of SVM classifiers

IJDAR (2018)

Abstract
Raster-image PDF files that originate from scanning or photographing paper documents are inaccessible both to text search engines and to the screen readers used by people with visual impairments. Here we focus on the comparatively under-researched problem of converting raster-image files containing Arabic script into machine-accessible documents. Our method, called ECDP for “Ensemble-based classification of document patches,” segments the physical layout of the document, classifies image patches as containing text or graphics, assembles homogeneous document regions, and passes the text regions to an optical character recognition engine for conversion into natural-language text. Classification is based on majority voting by an ensemble of support vector machines. When tested on the dataset BCE-Arabic [Saad et al. in: ACM 9th annual international conference on pervasive technologies related to assistive environments (PETRA’16), Corfu, 2016], ECDP yielded an average patch classification accuracy of 97.3
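The majority-voting step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the features, kernels, or ensemble size used by ECDP, so everything below is an assumption made for illustration: the three rule-based functions stand in for trained support vector machines, the two-element feature vector (ink density, edge ratio) is hypothetical, and an odd ensemble size is chosen so that a two-class vote cannot tie.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical patch representation: (ink_density, edge_ratio).
# The abstract does not specify the actual features used by ECDP.
Patch = Sequence[float]
Classifier = Callable[[Patch], str]  # returns "text" or "graphics"


def stand_in_a(patch: Patch) -> str:
    # Stand-in for a trained SVM: low ink density suggests text.
    return "text" if patch[0] < 0.5 else "graphics"


def stand_in_b(patch: Patch) -> str:
    # Stand-in for a trained SVM: high edge ratio suggests text.
    return "text" if patch[1] > 0.3 else "graphics"


def stand_in_c(patch: Patch) -> str:
    # Stand-in for a trained SVM: combines both hypothetical features.
    return "text" if patch[0] + patch[1] < 1.0 else "graphics"


def majority_vote(classifiers: Sequence[Classifier], patch: Patch) -> str:
    """Return the label predicted by the largest number of classifiers."""
    votes = Counter(clf(patch) for clf in classifiers)
    return votes.most_common(1)[0][0]


# An odd number of members avoids ties between the two classes.
ENSEMBLE = [stand_in_a, stand_in_b, stand_in_c]

label_1 = majority_vote(ENSEMBLE, (0.2, 0.6))  # all three members vote "text"
label_2 = majority_vote(ENSEMBLE, (0.9, 0.4))  # two of three vote "graphics"
```

In a real pipeline each stand-in would be replaced by a trained classifier's prediction function; the voting logic itself would stay the same.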
Keywords
Arabic document analysis, Physical layout analysis, Page layout analysis, Optical character recognition (OCR), Screen readers, Classifier ensemble, Page zone classification, Creation of structured metadata