Towards High-Quality Text Stream Extraction from PDF. Technical Background to the ACL 2012 Contributed Task.

Øyvind Raddum Berg, Stephan Oepen, Jonathon Read

ACL '12: Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries (2012)

Abstract
Extracting textual content and document structure from PDF presents a surprisingly (depressingly, to some, in fact) difficult challenge, owing to the purely display-oriented design of the PDF document standard. While a variety of lower-level PDF extraction toolkits exist, none fully support the recovery of original text (in reading order) and relevant structural elements, even for so-called born-digital PDFs, i.e. those prepared electronically using typesetting systems like LaTeX, OpenOffice, and the like. This short paper summarizes a new tool for high-quality extraction of text and structure from PDFs, combining state-of-the-art PDF parsing, font interpretation, layout analysis, and TEI-compliant output of text and logical document markup.
Keywords
PDF document standard, lower-level PDF extraction toolkits, state-of-the-art PDF parsing, document structure, logical document markup, original text, high-quality extraction, so-called born-digital PDFs, TEI-compliant output, difficult challenge, Towards high-quality text stream, technical background