Text and non-text segmentation based on connected component features
International Conference on Document Analysis and Recognition (2015)
Abstract
Document image segmentation is crucial to OCR and other digitization processes. In this paper, we present a learning-based approach for separating text from non-text in document images. Training features are extracted at the level of connected components, an intermediate level between the slow, noise-sensitive pixel level and the segmentation-dependent zone level. To handle connected components of all types, shapes and sizes, we extract a set of features based on the size, shape, stroke width and position of each connected component. AdaBoost with decision trees is used to label connected components. Finally, the text/non-text label of each connected component is corrected using the classification probabilities together with a size and stroke width analysis of its nearest neighbors. The performance of our approach has been evaluated on two standard datasets: UW-III and the ICDAR-2009 competition dataset for document layout analysis. Our results demonstrate that the proposed approach achieves competitive performance for segmenting text and non-text in document images of variable content and degradation.
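As an illustration of the feature-extraction step described in the abstract, the following is a minimal sketch, not the authors' implementation. It labels 8-connected components in a small binary image and computes a few size, shape, stroke width and position features per component. The function names, the feature set, and the run-length proxy for stroke width are assumptions made for this example; the paper's actual feature definitions are not given in the abstract.

```python
from collections import deque


def connected_components(img):
    """Return the 8-connected foreground components as lists of (row, col)."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for r in range(h):
        for c in range(w):
            if img[r][c] and not seen[r][c]:
                seen[r][c] = True
                queue, pixels = deque([(r, c)]), []
                while queue:  # breadth-first flood fill
                    y, x = queue.popleft()
                    pixels.append((y, x))
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and img[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((ny, nx))
                comps.append(pixels)
    return comps


def component_features(pixels, page_h, page_w):
    """Hypothetical feature set: size, shape, stroke width proxy, position."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    top, left = min(ys), min(xs)
    height = max(ys) - top + 1
    width = max(xs) - left + 1
    area = len(pixels)
    # Assumption: mean horizontal run length stands in for stroke width.
    pixel_set = set(pixels)
    runs = []
    for y, x in pixels:
        if (y, x - 1) not in pixel_set:  # pixel starts a horizontal run
            n = 0
            while (y, x + n) in pixel_set:
                n += 1
            runs.append(n)
    return {
        "height": height,
        "width": width,
        "area": area,
        "aspect_ratio": width / height,
        "density": area / (width * height),  # fill ratio of the bounding box
        "stroke_width": sum(runs) / len(runs),
        "center_x": (left + width / 2) / page_w,  # normalized page position
        "center_y": (top + height / 2) / page_h,
    }


# Toy 4x6 binary page: a 2x2 blob and a 2-pixel vertical stroke.
page = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0],
]
feats = [component_features(p, len(page), len(page[0]))
         for p in connected_components(page)]
```

In a pipeline like the one the abstract outlines, feature vectors of this kind would be passed to a boosted decision tree classifier (for instance scikit-learn's `AdaBoostClassifier`), and the resulting labels would then be revised using neighboring components.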
Keywords
connected component feature, document image segmentation, OCR, digitization processes, learning-based approach, nontext separation, text separation, document images, training feature extraction, noise-sensitive pixel level, segmentation-dependent zone level, Adaboosting, decision trees, classification probabilities, classification size, stroke width analysis, UW-III, ICDAR-2009 competition, document layout analysis, variable content, text segmentation, nontext segmentation