Overview of the chemical compound and drug name recognition (CHEMDNER) task

semanticscholar(2013)

Abstract
There is an increasing need to facilitate automated access to information relevant for chemical compounds and drugs described in text, including scientific articles, patents or health agency reports. A number of recent efforts have implemented natural language processing (NLP) and text mining technologies for the chemical domain (ChemNLP or chemical text mining). Due to the lack of manually labeled Gold Standard datasets together with comprehensive annotation guidelines, both the implementation and the comparative assessment of ChemNLP technologies are opaque. Two key components of most chemical text mining technologies are the indexing of documents with chemicals (chemical document indexing, CDI) and finding the mentions of chemicals in text (chemical entity mention recognition, CEM). These two tasks formed part of the chemical compound and drug named entity recognition (CHEMDNER) task introduced at the fourth BioCreative challenge, a community effort to evaluate biomedical text mining applications. For this task, the CHEMDNER text corpus was constructed, consisting of 10,000 abstracts containing a total of 84,355 mentions of chemical compounds and drugs that have been manually labeled by domain experts following specific annotation guidelines. This corpus covers representative abstracts from major chemistry-related sub-disciplines such as medicinal chemistry, biochemistry, organic chemistry and toxicology. A total of 27 teams (23 academic and 4 commercial groups, comprising 87 researchers) submitted results for this task. Of these teams, 26 provided submissions for the CEM subtask and 23 for the CDI subtask. Teams were provided with the manual annotations of 7,000 abstracts to implement and train their systems and then had to return predictions for the 3,000 test set abstracts during a short period of time.
When comparing exact matches of the automated results against the manually labeled Gold Standard annotations, the best teams reached an F-score of 87.39% in the CEM task and of 88.20% in the CDI task. This can be regarded as a very competitive result when compared to the expected upper boundary, the agreement between two human annotators, at 91%. In general, the technologies used by the teams to detect chemicals and drugs included machine learning methods (particularly CRFs using a considerable range of different features), interaction of chemistry-related lexical resources and manual rules (e.g., to cover abbreviations, chemical formulas or chemical identifiers). By promoting the availability of the software of the participating systems as well as through the release of the CHEMDNER corpus to enable implementation of new tools, this work fosters the development of text mining applications like the automatic extraction of biochemical reactions, toxicological properties of compounds, or the detection of associations between genes or mutations and drugs in the context of pharmacogenomics.

Proceedings of the fourth BioCreative challenge evaluation workshop, vol. 2
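The exact-match F-score reported in the abstract can be illustrated with a minimal sketch. This is not the official BioCreative evaluation script. It assumes each mention is a hypothetical `(document id, start offset, end offset)` tuple and that a prediction counts as correct only when all three fields equal those of a Gold Standard mention. The function name `exact_match_scores` and the sample data are invented for illustration.

```python
from typing import Iterable, Tuple

# Assumed mention layout (hypothetical): (document id, start offset, end offset)
Mention = Tuple[str, int, int]


def exact_match_scores(
    gold: Iterable[Mention], predicted: Iterable[Mention]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F-score), counting only exact span matches."""
    gold_set, pred_set = set(gold), set(predicted)
    # A true positive is a predicted mention identical to a gold mention.
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Balanced F-score: harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Invented toy data: one predicted span is off by one character and
# one is spurious, so only two of four predictions match exactly.
gold = [("doc1", 0, 7), ("doc1", 20, 29), ("doc2", 5, 12)]
predicted = [("doc1", 0, 7), ("doc1", 20, 28), ("doc2", 5, 12), ("doc2", 30, 34)]

precision, recall, f_score = exact_match_scores(gold, predicted)
print(f"{precision:.2f} {recall:.2f} {f_score:.2f}")  # prints "0.50 0.67 0.57"
```

Under this strict criterion a span that is off by a single character earns no credit, which is one reason the 91% agreement between human annotators is a natural upper reference point for system scores.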