PDF-to-Text Reanalysis for Linguistic Data Mining.

LREC (2018)

Abstract
Extracting semi-structured text from scientific writing in PDF files is a difficult task that researchers have faced for decades. In the 1990s, this task was largely a computer vision and OCR problem, as PDF files were often the result of scanning printed documents. Today, PDFs have standardized digital typesetting without the need for OCR, but extraction of semi-structured text from these documents remains a nontrivial task. In this paper, we present a system for the reanalysis of glyph-level PDF-extracted text that performs block detection, respacing, and tabular data analysis for the purposes of linguistic data mining. We further present our reanalyzed output format, which attempts to eliminate the extreme verbosity of XML output while leaving important positional information available for downstream processes.
Keywords
Low Resource Languages, Interlinear Glossed Text (IGT), Corpus Creation
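
The abstract names respacing of glyph-level extracted text as one of the system's steps but does not describe how it is done. As a rough illustration only, the sketch below shows one common heuristic: insert a space wherever the horizontal gap between neighbouring glyphs is large relative to the median glyph width. The `Glyph` record, the `respace_line` function, and the `gap_ratio` threshold are all hypothetical names and values chosen for this example; they are not taken from the paper.

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass
class Glyph:
    """Hypothetical glyph record: one character plus its horizontal extent."""
    char: str
    x0: float  # left edge of the glyph
    x1: float  # right edge of the glyph


def respace_line(glyphs: List[Glyph], gap_ratio: float = 0.3) -> str:
    """Rebuild one line of text from glyphs.

    A space is inserted wherever the gap between two neighbouring glyphs
    exceeds gap_ratio times the median glyph width on the line. The 0.3
    default is an illustrative assumption, not a value from the paper.
    """
    if not glyphs:
        return ""
    # Sort left to right so that gaps are measured between true neighbours.
    ordered = sorted(glyphs, key=lambda g: g.x0)
    threshold = gap_ratio * median(g.x1 - g.x0 for g in ordered)
    out = [ordered[0].char]
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.x0 - prev.x1 > threshold:
            out.append(" ")
        out.append(cur.char)
    return "".join(out)


if __name__ == "__main__":
    line = [Glyph("a", 0, 5), Glyph("b", 5, 10), Glyph("c", 13, 18), Glyph("d", 18, 23)]
    print(respace_line(line))
```

A relative threshold is used here rather than an absolute one so that the same setting works across font sizes; a real system would likely also need to handle kerning, justified text, and multiple lines, none of which this sketch attempts.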