Unlocking Science: Novel Dataset and Benchmark for Cross-Modality Scientific Information Extraction.
CoRR(2023)
摘要
Extracting key information from scientific papers has the potential to help
researchers work more efficiently and accelerate the pace of scientific
progress. Over the last few years, research on Scientific Information
Extraction (SciIE) witnessed the release of several new systems and benchmarks.
However, existing paper-focused datasets mostly focus only on specific parts of
a manuscript (e.g., abstracts) and are single-modality (i.e., text- or
table-only), due to complex processing and expensive annotations. Moreover,
core information can be present in either text or tables or across both. To
close this gap in data availability and enable cross-modality IE, while
alleviating labeling costs, we propose a semi-supervised pipeline for
annotating entities in text, as well as entities and relations in tables, in an
iterative procedure. Based on this pipeline, we release novel resources for the
scientific community, including a high-quality benchmark, a large-scale corpus,
and a semi-supervised annotation pipeline. We further report the performance of
state-of-the-art IE models on the proposed benchmark dataset, as a baseline.
Lastly, we explore the potential capability of large language models such as
ChatGPT for the current task. Our new dataset, results, and analysis validate
the effectiveness and efficiency of our semi-supervised pipeline, and we
discuss its remaining limitations.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要