A New Khmer Palm Leaf Manuscript Dataset for Document Analysis and Recognition: SleukRith Set.

HIP@ICDAR(2017)

Abstract
Analysis of ancient Khmer documents can be quite challenging because of the elaborate shapes of Khmer handwritten characters combined with the complex way in which words are formed from those characters. Palm leaf manuscripts, one of the best-known types of old Khmer documents, are being digitized and centralized; document analysis functions such as text search are therefore necessary but remain unavailable for this type of document. To contribute to the progress of related research, we introduce in this paper a new dataset called SleukRith Set, comprising 657 pages of Khmer palm leaf manuscripts randomly selected from various collections that vary in quality and digitization method. The dataset contains three types of data: isolated characters, words, and lines. Each type of data is annotated with ground truth information, which is useful both for evaluation and as a training set for common document analysis tasks such as character/text recognition, word/line segmentation, and word spotting. To serve as a baseline, we also present the results of an evaluation study of Khmer isolated character recognition that we conducted on SleukRith Set using a Convolutional Neural Network.