Writing type, script and language identification in heterogeneous documents.

IJISTA (2017)

Abstract
In this paper, we propose a text classification method that automatically identifies the writing type, script and language of texts segmented from heterogeneous document images. The documents are written in Arabic, French and English and mix machine-printed and handwritten text. To handle this problem, we scan each text-line or word image with a fixed-length sliding window. Each window is represented by 23 simple and efficient features, which are modelled with Gaussian mixture models (GMMs) to identify the writing type and the script. Language identification is based on a bi-gram analysis of the output of an optical character recognition (OCR) system. Experiments were conducted on handwritten and machine-printed text-blocks, text-lines and words extracted from the Maurdor database. The results show the feasibility of the proposed method for writing type, script and language identification.
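
The bi-gram language identification step can be illustrated with a minimal sketch. The abstract does not say how the bi-grams are scored, so everything below is an assumption made for illustration: character bi-grams, add-one smoothed log-likelihood scoring, the function names (`bigrams`, `train_profile`, `identify_language`) and the tiny training sentences are all hypothetical and do not come from the paper or from the Maurdor database.

```python
import math
from collections import Counter


def bigrams(text):
    """Return the character bi-grams of a lower-cased, space-padded string."""
    s = " " + " ".join(text.lower().split()) + " "
    return [s[i:i + 2] for i in range(len(s) - 1)]


def train_profile(samples):
    """Count bi-grams over the training strings of one language."""
    counts = Counter()
    for sample in samples:
        counts.update(bigrams(sample))
    return counts


def log_likelihood(text, counts, vocab_size):
    """Add-one smoothed log-probability of the text's bi-grams (assumed scoring)."""
    total = sum(counts.values())
    return sum(
        math.log((counts[bg] + 1) / (total + vocab_size)) for bg in bigrams(text)
    )


def identify_language(text, profiles):
    """Pick the language whose bi-gram profile gives the highest score."""
    vocab = set()
    for counts in profiles.values():
        vocab.update(counts)
    vocab_size = len(vocab) + 1  # one extra slot for unseen bi-grams
    return max(
        profiles,
        key=lambda lang: log_likelihood(text, profiles[lang], vocab_size),
    )


# Toy training data standing in for OCR output (hypothetical, not from the paper).
TRAINING = {
    "english": [
        "the quick brown fox jumps over the lazy dog",
        "this is the text of the document",
        "where is the nearest train station",
    ],
    "french": [
        "le renard brun saute par dessus le chien paresseux",
        "ceci est le texte du document",
        "où se trouve la gare la plus proche",
    ],
}
PROFILES = {lang: train_profile(samples) for lang, samples in TRAINING.items()}

if __name__ == "__main__":
    for line in ["where is the train station", "où se trouve le chien"]:
        print(line, "->", identify_language(line, PROFILES))
```

In practice the profiles would be estimated from far larger corpora, and the input would be noisy OCR output rather than clean text. Smoothing matters in that setting because recognition errors produce bi-grams that never occur in the training data.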