Question Answering Versus Named Entity Recognition for Extracting Unknown Datasets

Yousef Younes,Ansgar Scherp

IEEE Access(2023)

引用 0|浏览0
暂无评分
摘要
Dataset mention extraction is a difficult problem due to the unstructured nature of text, the sparsity of dataset mentions, and the various ways the same dataset can be mentioned. Extracting unknown dataset mentions which are not part of the training data of the model is even harder. We address this challenge in two ways. First, we consider a two-step approach where a binary classifier filters out positive contexts, i.e., detects sentences with a dataset mention. We consider multiple transformer-based models and strong baselines for this task. Subsequently, the dataset is extracted from the positive context. Second, we consider a one-step approach and directly aim to detect and extract a possible dataset mention. For the extraction of datasets, we consider transformer models in named entity recognition (NER) mode. We contrast NER with the transformers' capabilities for question answering (QA). We use the Coleridge Initiative "Show US the Data"dataset consisting of 14.3k scientific papers with about 35k mentions of datasets. We found that using transformers in QA mode is a better choice than NER for extracting unknown datasets. The rationale is that detecting new datasets is an out-of-vocabulary task, i.e., the dataset name has not been seen once during training. Comparing the two-step versus the one-step approach, we found contrasting strengths. A two-step dataset extraction using an MLP for filtering and RoBERTa in QA mode extracts more dataset mentions than a one-step system, but at the cost of a lower F1-score of 62.7%. A one-step extraction with DeBERTa in QA achieves the highest F1-score of 92.88% at the cost of missing dataset mentions. We recommend the one-step approach for the case when accuracy is more important, and the two-step approach when there is a postprocessing mechanism for the extracted dataset mentions, e.g., a manual check. The source code is available at https://github.com/yousef-younes/dataset_mention_extraction.
更多
查看译文
关键词
Binary text classification,dataset mentions,named entity recognition,question answering
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要