Dataset Construction for Scientific-Document Writing Support by Extracting Related Work Section and Citations from PDF Papers

Keita Kobayashi, Kohei Koyama, Hiromi Narimatsu, Yasuhiro Minami

International Conference on Language Resources and Evaluation (LREC), 2022

Abstract
To augment the datasets used in research on scientific-document writing support, we extract the text of "Related Work" sections and citation information from PDF-formatted papers published in English. The previous dataset was constructed entirely from TeX-formatted papers, from which citation information is easy to extract. However, since many publicly available papers in various fields are provided only in PDF format, a dataset built from TeX papers alone has limited utility. To resolve this problem, we augment the existing dataset by extracting section titles using the visual features of PDF documents and then extracting the Related Work section text using the explicit title information. Since text generated from figures and footnotes appearing in the extraction target areas is considered noise, we remove such text. Moreover, we map the cited-paper information obtained with existing tools to citation marks detected by regular-expression rules, resulting in pairs of cited-paper information and Related Work section text. An evaluation of body-text extraction and citation mapping in the constructed dataset found that its accuracy is close to that of the previous dataset. Accordingly, we demonstrate the possibility of building a significantly augmented dataset.
Keywords
Scientific Document Analysis, PDF Text Analytics, PDF Information Extraction, Corpus, Bibliometrics
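
The abstract describes mapping cited-paper information to citation marks detected by regular-expression rules. The abstract does not give the actual rules or name the tools used, so the following Python sketch is only an illustration under stated assumptions: it handles numeric marks such as [1] or [2, 4-5] only, and the names CITATION_MARK, expand_mark, and map_citations, as well as the shape of the references dictionary, are hypothetical.

```python
import re

# Assumption: citation marks are numeric and bracketed, e.g. [1], [2, 3], [4-6].
# Author-year styles such as (Smith, 2020) would need additional rules.
CITATION_MARK = re.compile(r"\[(\d+(?:\s*[-,]\s*\d+)*)\]")


def expand_mark(inner):
    """Expand the inside of a mark, e.g. '2, 4-6' -> [2, 4, 5, 6]."""
    numbers = []
    for part in inner.split(","):
        start, dash, end = part.partition("-")
        if dash:
            # A range such as '4-6' covers every number from 4 to 6 inclusive.
            numbers.extend(range(int(start), int(end) + 1))
        else:
            numbers.append(int(start))
    return numbers


def map_citations(text, references):
    """Pair each citation number found in `text` with its reference entry.

    `references` is a hypothetical dict from reference number to
    cited-paper information (for example, output of a bibliography parser).
    Numbers with no matching entry are kept with `reference` set to None.
    """
    pairs = []
    for match in CITATION_MARK.finditer(text):
        for number in expand_mark(match.group(1)):
            pairs.append({
                "mark": match.group(0),
                "offset": match.start(),
                "number": number,
                "reference": references.get(number),
            })
    return pairs


if __name__ == "__main__":
    section = "Prior work [1] uses TeX sources, while others [2, 4-5] parse PDFs."
    refs = {1: {"title": "A"}, 2: {"title": "B"}, 4: {"title": "C"}}
    for pair in map_citations(section, refs):
        print(pair)
```

Keeping unresolved numbers in the output with a None reference, rather than dropping them, makes it easier to measure how often the mapping fails, which is the kind of quantity a citation-mapping evaluation would need.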