Text Zone Classification Using Unsupervised Feature Learning

2015 13th International Conference on Document Analysis and Recognition (ICDAR), 2015

Abstract
Text zone classification is a vital step in the digitization process; without it, OCR systems perform poorly. Prior approaches to document zone classification have relied on large sets of hand-crafted features for training zone classifiers. Such features are usually database-dependent and time-consuming to compute. In this work we propose a novel method for text zone classification based on unsupervised feature learning. In our method, feature vectors of document zones are learned automatically through patch extraction, encoding, and pooling, where feature encoding is based on a codebook of visual words. The training phase of the text classifier accounts for the imbalance between text zones and non-text zones of all types. The proposed method was tested on publicly available standard databases and achieved results competitive with or better than those of state-of-the-art methods. The results show that our approach is well suited to the task of text classification and is robust to zone shape, orientation, and size.
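
The pipeline named in the abstract (patch extraction, codebook-based encoding, pooling, and imbalance-aware training) can be illustrated with a minimal sketch. The abstract does not specify the codebook learner, the encoding rule, the pooling operator, or the weighting scheme, so everything below is an assumption chosen for illustration: k-means for the codebook, hard nearest-word assignment for encoding, histogram (average) pooling over the whole zone, and inverse-frequency class weights. All function names and parameter values (`patch_size`, `stride`, `k`) are hypothetical and are not taken from the paper.

```python
import numpy as np

def extract_patches(zone, patch_size=8, stride=4):
    """Collect flattened square patches from a 2-D grayscale zone image.

    Assumes the zone is at least patch_size pixels in each dimension.
    """
    h, w = zone.shape
    patches = [
        zone[r:r + patch_size, c:c + patch_size].ravel()
        for r in range(0, h - patch_size + 1, stride)
        for c in range(0, w - patch_size + 1, stride)
    ]
    return np.asarray(patches, dtype=np.float64)

def normalize(patches, eps=1e-8):
    """Zero-mean, unit-variance normalization of each patch."""
    mean = patches.mean(axis=1, keepdims=True)
    std = patches.std(axis=1, keepdims=True)
    return (patches - mean) / (std + eps)

def nearest_word(patches, codebook):
    """Index of the closest codebook entry (squared Euclidean) per patch."""
    dists = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def learn_codebook(patches, k=16, iters=10, seed=0):
    """Plain k-means (an assumed choice) to learn k visual words."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(patches), size=k, replace=False)
    codebook = patches[start].copy()
    for _ in range(iters):
        labels = nearest_word(patches, codebook)
        for j in range(k):
            members = patches[labels == j]
            if len(members):  # leave an empty cluster's centroid unchanged
                codebook[j] = members.mean(axis=0)
    return codebook

def encode_and_pool(zone, codebook, patch_size=8, stride=4):
    """Hard-assign each patch to a visual word, then pool into a histogram.

    The output length depends only on the codebook size, not on the
    zone's shape or size.
    """
    patches = normalize(extract_patches(zone, patch_size, stride))
    labels = nearest_word(patches, codebook)
    hist = np.bincount(labels, minlength=len(codebook)).astype(np.float64)
    return hist / hist.sum()

def balanced_class_weights(labels):
    """Inverse-frequency weights (assumed scheme): n / (n_classes * count)."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))
```

In this sketch, patches from all training zones would be normalized and stacked to learn the codebook once; each zone is then mapped to a fixed-length vector by `encode_and_pool`, and the weights from `balanced_class_weights` could be passed to any classifier that accepts per-class weights.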
Keywords
text zone classification, unsupervised feature learning, digitization process, document zone classification, hand-crafted features, zone classifier training, database-dependent feature, feature vector, patches extraction, pooling, feature encoding, visual word codebook, text classifier, text classification