Linguistic Resources for Arabic Handwriting Recognition

msra(2009)

Abstract
MADCAT (Multilingual Automatic Document Classification Analysis and Translation) is a five-year DARPA program that will produce systems to automatically convert foreign-language text images into English transcripts for use by humans and by downstream processes, including summarization and information extraction. The first two phases of MADCAT focus on handwritten Arabic. The Linguistic Data Consortium (LDC) creates and distributes linguistic resources for MADCAT, including data, annotations, specifications and tools for system training and evaluation. To date, LDC has recruited over 300 scribes from around the Arabic-speaking world to produce handwritten text for MADCAT. A web-based collection toolkit supports scribe recruitment, registration, data assignment and tracking, progress reporting, quality control and compensation, both at LDC and at remote collection sites. Handwritten pages are scanned at high resolution and manually annotated with information including bounding boxes for each line and word on the page. Corresponding digital text and English translations are generated, and the multiple data layers are unified into a single XML output file containing: a text layer consisting of source text, tokenization and sentence segmentation; an image layer consisting of bounding boxes; a scribe demographic layer consisting of scribe ID and partition (train/dev/test); and a document metadata layer. LDC has collected, annotated and distributed over 38,000 handwritten pages thus far, and collection continues at a rapid pace. Most linguistic resources developed for the program will also be published in LDC's catalog, making them generally available to the larger research community; the MADCAT Phase 1 Training Corpus is expected to be published in late 2009.
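To make the four-layer file description concrete, the following is a minimal sketch of how such a unified record might be assembled and read back. It is not the MADCAT schema: the element and attribute names (`page`, `metadata`, `scribe`, `text`, `image`, `sentence`, `token`, `box`) and the helper functions `build_page_record` and `boxes_by_token` are all assumptions made for illustration, and only word-level boxes are modeled, although the abstract also mentions line-level boxes.

```python
import xml.etree.ElementTree as ET

# Partition names taken from the abstract (train/dev/test).
PARTITIONS = {"train", "dev", "test"}


def build_page_record(doc_id, scribe_id, partition, sentences):
    """Build a hypothetical four-layer XML record for one handwritten page.

    `sentences` is a list of sentences; each sentence is a list of
    (token_text, (x, y, width, height)) pairs. All element and attribute
    names are illustrative assumptions, not the actual MADCAT format.
    """
    if partition not in PARTITIONS:
        raise ValueError(f"unknown partition: {partition!r}")

    root = ET.Element("page")

    # Document metadata layer.
    ET.SubElement(root, "metadata", {"doc_id": doc_id})

    # Scribe demographic layer: scribe ID and partition.
    ET.SubElement(root, "scribe", {"id": scribe_id, "partition": partition})

    # Text layer (sentences and tokens) and image layer (bounding boxes).
    text_layer = ET.SubElement(root, "text")
    image_layer = ET.SubElement(root, "image")

    token_index = 0
    for sent_index, tokens in enumerate(sentences):
        sent_el = ET.SubElement(text_layer, "sentence", {"id": f"s{sent_index}"})
        for token_text, (x, y, w, h) in tokens:
            token_id = f"t{token_index}"
            tok_el = ET.SubElement(sent_el, "token", {"id": token_id})
            tok_el.text = token_text
            # One word-level box per token, linked to the text layer by token ID.
            ET.SubElement(
                image_layer,
                "box",
                {"token": token_id, "x": str(x), "y": str(y),
                 "w": str(w), "h": str(h)},
            )
            token_index += 1
    return root


def boxes_by_token(root):
    """Map each token ID to its (x, y, w, h) bounding box."""
    return {
        box.get("token"): tuple(int(box.get(k)) for k in ("x", "y", "w", "h"))
        for box in root.iter("box")
    }


if __name__ == "__main__":
    record = build_page_record(
        "doc001", "scribe042", "train",
        [[("كتاب", (10, 20, 50, 30)), ("جديد", (70, 20, 45, 30))]],
    )
    print(ET.tostring(record, encoding="unicode"))
    print(boxes_by_token(record))
```

Keeping the text and image layers separate and linking them by token ID, as sketched here, is one way to let a recognition system consume either layer independently; the real corpus may organize this linkage differently.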