Detecting In-line Mathematical Expressions in Scientific Documents.

DocEng (2017)

Abstract
One of the issues in extracting natural language sentences from PDF documents is the identification of non-textual elements in a sentence. In this paper, we report our preliminary results on the identification of in-line mathematical expressions. We first construct a manually annotated corpus and then apply a conditional random field (CRF) to math-zone identification, using both layout features, such as font types, and linguistic features, such as context n-grams, obtained from PDF documents. Although our method is naive and uses only a small amount of annotated training data, it achieved an F-measure of 88.95%, compared with 22.81% for existing math OCR software.
Keywords
PDF structure analysis, mathematical formula recognition, in-line mathematical expression detection, math IR, scientific paper mining
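
The abstract names two feature families for the CRF: layout features such as font types, and linguistic features such as context n-grams. As a rough illustration only, the sketch below builds per-token feature dictionaries of that kind for a sequence labeler. The token representation (a text and font-name pair), the feature names, and the helper functions `token_features` and `sentence_features` are all assumptions made for this example, not the authors' actual feature set or implementation.

```python
from typing import Dict, List, Tuple

# Hypothetical token representation: (surface text, font name from the PDF).
# This shape is an assumption for illustration, not taken from the paper.
Token = Tuple[str, str]


def token_features(sentence: List[Token], i: int) -> Dict[str, object]:
    """Build layout and context features for the token at position i."""
    text, font = sentence[i]
    feats: Dict[str, object] = {
        "bias": 1.0,
        # Linguistic features of the token itself.
        "text.lower": text.lower(),
        "is_single_char": len(text) == 1,
        "has_digit": any(c.isdigit() for c in text),
        # Layout feature: the font the token is set in.
        "font": font,
    }
    if i > 0:
        prev_text, prev_font = sentence[i - 1]
        # Context bigram with the previous token.
        feats["bigram.prev"] = prev_text.lower() + " " + text.lower()
        # A font switch often coincides with a math-zone boundary.
        feats["font.changed"] = prev_font != font
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(sentence) - 1:
        next_text = sentence[i + 1][0]
        # Context bigram with the next token.
        feats["bigram.next"] = text.lower() + " " + next_text.lower()
    else:
        feats["EOS"] = True  # end of sentence
    return feats


def sentence_features(sentence: List[Token]) -> List[Dict[str, object]]:
    """Feature dictionaries for every token in one sentence."""
    return [token_features(sentence, i) for i in range(len(sentence))]


# Toy sentence: "x" is set in a math font, the rest in a text font.
example: List[Token] = [
    ("Let", "Times-Roman"),
    ("x", "CMMI10"),
    ("be", "Times-Roman"),
    ("a", "Times-Roman"),
    ("real", "Times-Roman"),
    ("number", "Times-Roman"),
]
features = sentence_features(example)
```

A list of such dictionaries per sentence, paired with per-token labels marking math and non-math tokens, is the kind of input a typical CRF toolkit expects. The labeling scheme and the choice of toolkit are left open here because the abstract does not specify them.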