Layout Analysis Algorithm Based on Probabilistic Graphical Model for Dunhuang Historical Documents

HIP@ICDAR(2015)

Abstract
The Dunhuang historical documents are of great significance to the study of ancient Chinese Buddhist culture and other topics. Full-text information generated by historical document recognition technology would greatly benefit both the protection and the study of these documents. However, many historical documents from Dunhuang are old and damaged, and, to make matters more challenging, their style and layout are informal as well. Traditional layout analysis algorithms have paid little attention to these problems. In this paper, a new layout analysis algorithm based on a probabilistic graphical model is proposed, comprising a rough segmentation step and a fine segmentation step. After the input historical document images are pre-processed by Gaussian smoothing and binarization, the rough segmentation step uses projection information to obtain rough text-column regions. In the fine segmentation step, a connected component analysis algorithm based on a probabilistic graphical model is developed. The method models the extracted connected components with a Markov random field and combines connected components to produce the output text columns. Experiments were conducted on a set of Dunhuang historical documents, and the proposed method correctly segmented text columns with a recall rate of 90.0% and an accuracy of 77.7%. The segmented text-column regions covered 99.2% of the characters in the historical document images. The results show that the proposed layout analysis algorithm can be successfully applied to degraded historical document images.
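
The pre-processing and rough segmentation steps described in the abstract might look roughly like the following sketch. It is not the authors' implementation: the function names (gaussian_blur, rough_columns), the fixed global threshold, the sigma value, and the min_ink parameter are all assumptions made for illustration, and the abstract does not say which binarization method or projection rule the paper uses. The Markov random field stage of the fine segmentation step is not covered here.

```python
import numpy as np


def gaussian_blur(img, sigma=1.0):
    """Separable Gaussian blur with edge padding (pure NumPy, illustrative)."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(img.astype(float), radius, mode="edge")
    # Blur along each row, then along each column; "valid" undoes the padding.
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)


def rough_columns(img, sigma=1.0, threshold=128, min_ink=1):
    """Return (start, end) x-ranges (end exclusive) of candidate text columns.

    Assumes a grayscale image with dark ink on a light background and
    vertical text columns, so ink is projected onto the x axis.
    """
    # Assumed binarization: a fixed global threshold after smoothing.
    binary = gaussian_blur(img, sigma) < threshold
    # Vertical projection profile: number of ink pixels at each x position.
    profile = binary.sum(axis=0)
    has_ink = profile >= min_ink
    # Locate the boundaries of consecutive runs of inked x positions.
    flags = np.concatenate(([False], has_ink, [False])).astype(int)
    steps = np.diff(flags)
    starts = np.flatnonzero(steps == 1)
    ends = np.flatnonzero(steps == -1)
    return list(zip(starts.tolist(), ends.tolist()))
```

Calling rough_columns on a grayscale page array returns one x-range per run of inked positions. On real degraded pages, stains and touching columns would likely merge or split these runs, which is presumably why the paper follows this step with a finer, connected-component-based stage.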