Mapping the plague through natural language processing

medRxiv (Cold Spring Harbor Laboratory) (2021)

Abstract
Pandemic diseases such as plague have produced a vast literature that documents the spatiotemporal extent of past epidemics, circumstances of transmission, symptoms, and countermeasures. However, manually extracting such information from running text is tedious, and much of it has therefore remained locked in narrative form. Natural language processing (NLP) is a promising tool for the automated extraction of epidemiological data from texts and can facilitate the establishment of datasets. In this paper, we explore the utility of NLP to assist in the creation of a plague outbreak dataset. We first produced a gold standard list of toponyms by manual annotation of a German plague treatise published by Sticker in 1908. We then investigated the performance of five pre-trained NLP libraries (Google NLP, Stanford CoreNLP, spaCy, germaNER and Geoparser.io) for the automated extraction of location data from the treatise, compared to the gold standard. Of all tested algorithms, spaCy performed best (sensitivity 0.92, F1 score 0.83), followed closely by Stanford CoreNLP (sensitivity 0.81, F1 score 0.87). Google NLP had a slightly lower performance (sensitivity 0.78, F1 score 0.72). Geoparser and germaNER had poor sensitivity (0.41 and 0.61, respectively). From the gold standard list we produced a plague dataset by linking dates and outbreak places with GIS coordinates. We then evaluated how well the automated geocoding services Google geocoding, Geonames and Geoparser located these outbreaks. All geocoding services performed poorly, returning the correct GIS information in only 60.4%, 52.7% and 33.8% of all cases, respectively. The rate of correct matches was particularly low for historical regions and places. Finally, we compared our newly digitized plague dataset to a re-digitized version of the plague treatise by Biraben and provide an update of the spatiotemporal extent of the second pandemic plague outbreaks.
We conclude that NLP tools have their limitations, but they are potentially useful to accelerate the collection of data and the generation of a global plague outbreak database.
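The sensitivity and F1 figures quoted above can be illustrated with a minimal Python sketch that scores a list of extracted toponyms against a gold standard list. It is an illustration under stated assumptions, not the authors' evaluation code: the function name `evaluate_toponyms`, the case-insensitive set-based matching, and the toy place names are all hypothetical, and the paper may well have counted matches per occurrence rather than per unique name.

```python
def evaluate_toponyms(gold, predicted):
    """Score extracted place names against a gold standard list.

    Assumption: names are compared as case-insensitive unique strings,
    so repeated mentions of the same place count once.
    """
    gold_set = {name.strip().lower() for name in gold}
    pred_set = {name.strip().lower() for name in predicted}

    true_pos = len(gold_set & pred_set)    # found and in the gold standard
    false_pos = len(pred_set - gold_set)   # found but not in the gold standard
    false_neg = len(gold_set - pred_set)   # in the gold standard but missed

    # Sensitivity (recall): share of gold toponyms that were found.
    sensitivity = true_pos / (true_pos + false_neg) if gold_set else 0.0
    # Precision: share of extracted toponyms that are correct.
    precision = true_pos / (true_pos + false_pos) if pred_set else 0.0
    # F1: harmonic mean of precision and sensitivity.
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0

    return {"sensitivity": sensitivity, "precision": precision, "f1": f1}


# Invented toy data for illustration only; not taken from the paper.
gold = ["Venedig", "Marseille", "Konstantinopel", "Wien"]
predicted = ["Venedig", "Wien", "Marseille", "Pest"]

scores = evaluate_toponyms(gold, predicted)
print(scores)
```

Because F1 combines sensitivity with precision, a tool can have the higher sensitivity and still the lower F1 score if it also returns more false positives.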
Keywords
natural language processing,plague,natural language,mapping