BioBERT: a pre-trained biomedical language representation model for biomedical text mining
Bioinformatics, pp. 1234–1240, 2019.
Keywords:
large scale; Naver Smart Machine Learning; Conditional Random Field; biomedical literature; F1 score
Abstract:
Motivation: Biomedical text mining is becoming increasingly important as the number of biomedical documents rapidly grows. With the progress in natural language processing (NLP), extracting valuable information from biomedical literature has gained popularity among researchers, and deep learning has boosted the development of effective biomedical text mining models.
Introduction
- The volume of biomedical literature continues to rapidly increase. On average, more than 3000 new articles are published every day in peer-reviewed journals, excluding pre-prints and technical reports such as clinical trial reports in various archives.
- Demand is growing for accurate biomedical text mining tools that can extract information from this literature.
- Recent progress in biomedical text mining models was made possible by advances in the deep learning techniques used in natural language processing (NLP).
- Other deep learning-based models have improved biomedical text mining tasks such as relation extraction (RE) (Bhasuran and Natarajan, 2018; Lim and Kang, 2018) and question answering (QA) (Wiese et al., 2017).
Highlights
- The volume of biomedical literature continues to rapidly increase
- Compared with most previous biomedical text mining models that are mainly focused on a single task such as named entity recognition (NER) or question answering (QA), our model BioBERT achieves state-of-the-art performance on various biomedical text mining tasks, while requiring only minimal architectural modifications
- Note that although several other high-quality biomedical NER datasets have been introduced recently (Mohan and Li, 2019), we use datasets that are frequently used by many biomedical natural language processing (NLP) researchers, which makes it much easier to compare our work with theirs.
- F1 scores were used for NER and relation extraction (RE), and mean reciprocal rank (MRR) scores were used for QA (a minimal sketch of both metrics follows this list).
- We showed that pre-training BERT on biomedical corpora is crucial in applying it to the biomedical domain
- The following updated versions of BioBERT will be available to the bioNLP community: (i) BioBERT-Base and BioBERT-Large trained only on PubMed abstracts, without initialization from the existing BERT model, and (ii) BioBERT-Base and BioBERT-Large trained with a domain-specific vocabulary based on WordPiece.
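As a quick reference for the evaluation metrics mentioned above, here is a minimal sketch in Python of F1 (computed from precision and recall) and MRR. The function names and signatures are illustrative; they are not taken from the BioBERT codebase.

```python
# Minimal sketches of the two evaluation metrics used in the paper.
# Names and signatures are illustrative, not from the BioBERT code.

def f1_score(precision, recall):
    """Harmonic mean of precision and recall, as reported for NER and RE."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

def mean_reciprocal_rank(ranked_candidates, gold_answers):
    """MRR for QA: average of 1/rank of the first correct answer.

    ranked_candidates: list (one per question) of answer lists, best first.
    gold_answers: list (one per question) of sets of acceptable answers.
    """
    total = 0.0
    for candidates, gold in zip(ranked_candidates, gold_answers):
        for rank, candidate in enumerate(candidates, start=1):
            if candidate in gold:
                total += 1.0 / rank
                break  # only the first correct answer contributes
    return total / len(ranked_candidates)
```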
Methods
- BioBERT has essentially the same architecture as BERT. The authors briefly review the recently proposed BERT and then describe the pre-training and fine-tuning process of BioBERT in detail (see the fine-tuning sketch at the end of this section).
3.1 BERT: bidirectional encoder representations from transformers
Learning word representations from a large amount of unannotated text is a long-established method.
- According to the authors of BERT, incorporating information from bidirectional representations, rather than unidirectional representations, is crucial for representing words in natural language.
- The authors hypothesize that such bidirectional representations are critical in biomedical text mining, as complex relationships between biomedical terms often exist in a biomedical corpus (Krallinger et al., 2017).
- Due to space limitations, the authors refer readers to Devlin et al. (2019) for a more detailed description of BERT.
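To make the fine-tuning step concrete, below is a minimal sketch using the Hugging Face transformers library. The checkpoint name dmis-lab/biobert-base-cased-v1.1 refers to weights the authors later published on the Hugging Face Hub and is an assumption here, not part of the paper's original release.

```python
# Minimal BioBERT setup for token-level NER fine-tuning.
# Assumes: pip install torch transformers, and the dmis-lab Hub checkpoint.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

checkpoint = "dmis-lab/biobert-base-cased-v1.1"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint,
    num_labels=3,  # BIO tagging scheme: B, I, O
)

sentence = "BRCA1 mutations increase the risk of breast cancer."
inputs = tokenizer(sentence, return_tensors="pt")

# One classification logit vector per WordPiece sub-token.
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # (1, sequence_length, 3)
```

From here, standard fine-tuning trains all weights end-to-end on labelled NER data with a cross-entropy loss over the per-token logits, which matches the paper's point that only a minimal task-specific layer is added on top of the pre-trained encoder.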
Results
4.1 Datasets
The statistics of the biomedical NER datasets are listed in Table 3.
- The authors used the pre-processed versions of all the NER datasets provided by Wang et al. (2018), except for the 2010 i2b2/VA, JNLPBA and Species-800 datasets (a loader sketch for the shared file format follows at the end of this section).
- The relatively low scores on the LINNAEUS dataset can be attributed to the following: (i) the lack of a silver-standard dataset for training previous state-of-the-art models and (ii) different training/test set splits used in previous work (Giorgi and Bader, 2018), which were unavailable.
- BioBERT achieved the highest F1 scores on 2 out of 3 biomedical datasets
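Pre-processed NER corpora of this kind are commonly distributed in a CoNLL-style layout: one token and one BIO tag per line, with blank lines separating sentences. Under that assumption (the exact column separator and layout may vary by dataset), a minimal loader looks like this:

```python
# Minimal reader for CoNLL-style NER files ("token tag" per line,
# blank line between sentences). The exact layout of each pre-processed
# dataset may differ; this format is an illustrative assumption.

def read_conll(path):
    sentences, tokens, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for raw in f:
            line = raw.strip()
            if not line:  # blank line = sentence boundary
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
                continue
            parts = line.split()
            tokens.append(parts[0])   # surface token
            tags.append(parts[-1])    # BIO tag in the last column
    if tokens:  # flush a final sentence with no trailing blank line
        sentences.append((tokens, tags))
    return sentences
```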
Conclusion
- The authors used additional corpora of different sizes for pre-training and investigated their effect on performance.
- Figure 2(a) shows that the performance of BioBERT v1.0 (+ PubMed) on three NER datasets (NCBI Disease, BC2GM, BC4CHEMD) changes in relation to the size of the PubMed corpus.
- Figure 2(b) shows the performance changes of BioBERT v1.0 (+ PubMed) on the same three NER datasets in relation to the number of pre-training steps.
- The pre-released version of BioBERT (January 2019) has already been shown to be very effective in many biomedical text mining tasks such as NER for clinical notes (Alsentzer et al., 2019), human phenotype-gene RE (Sousa et al., 2019) and clinical temporal RE (Lin et al., 2019).
- The following updated versions of BioBERT will be available to the bioNLP community: (i) BioBERT-Base and BioBERT-Large trained only on PubMed abstracts, without initialization from the existing BERT model, and (ii) BioBERT-Base and BioBERT-Large trained with a domain-specific vocabulary based on WordPiece.
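The motivation for point (ii) can be seen directly from tokenization behaviour: with the original BERT WordPiece vocabulary (which BioBERT reuses), unseen biomedical terms fragment into many sub-word pieces, whereas a domain-specific vocabulary would keep them more intact. A small illustration using the transformers library; the exact output pieces depend on the vocabulary, so they are printed rather than asserted:

```python
# Illustration: the general-domain WordPiece vocabulary splits biomedical
# terms into many sub-word pieces. Run to inspect the actual pieces.
from transformers import AutoTokenizer

general_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
for term in ["Immunoglobulin", "dexamethasone", "chemotherapy"]:
    print(term, "->", general_tokenizer.tokenize(term))
```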
Tables
- Table 1: List of text corpora used for BioBERT
- Table 2: Pre-training BioBERT on different combinations of the following text corpora: English Wikipedia (Wiki), BooksCorpus (Books), PubMed abstracts (PubMed) and PMC full-text articles (PMC)
- Table 3: Statistics of the biomedical named entity recognition datasets
- Table 4: Statistics of the biomedical relation extraction datasets
- Table 5: Statistics of biomedical question answering datasets
- Table 6: Test results in biomedical named entity recognition
- Table 7: Biomedical relation extraction test results
- Table 8: Biomedical question answering test results
- Table 9: Prediction samples from BERT and BioBERT on NER and QA datasets
Funding
- This research was supported by the National Research Foundation of Korea (NRF) funded by the Korea government (NRF-2017R1A2A1A17069645, NRF-2017M3C4A7065887, NRF-2014M3C9A3063541).
References
- Alsentzer,E. et al. (2019) Publicly available clinical BERT embeddings. In: Proceedings of the 2nd Clinical Natural Language Processing Workshop, Minneapolis, MN, USA. pp. 72–78. Association for Computational Linguistics. https://www.aclweb.org/anthology/W19-1909.
- Bhasuran,B. and Natarajan,J. (2018) Automatic extraction of gene-disease associations from literature using joint ensemble learning. PLoS One, 13, e0200699.
- Bravo,A. et al. (2015) Extraction of relations between genes and diseases from text and large-scale data analysis: implications for translational research. BMC Bioinformatics, 16, 55.
- Devlin,J. et al. (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA. pp. 4171–4186. Association for Computational Linguistics. https://www.aclweb.org/anthology/N19-1423.
- Dogan,R.I. et al. (2014) NCBI disease corpus: a resource for disease name recognition and concept normalization. J. Biomed. Inform., 47, 1–10.
- Gerner,M. et al. (2010) LINNAEUS: a species name identification system for biomedical literature. BMC Bioinformatics, 11, 85.
- Giorgi,J.M. and Bader,G.D. (2018) Transfer learning for biomedical named entity recognition with neural networks. Bioinformatics, 34, 4087.
- Habibi,M. et al. (2017) Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics, 33, i37–i48.
- Kim,J.-D. et al. (2004) Introduction to the bio-entity recognition task at JNLPBA. In: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA/BioNLP), Geneva, Switzerland. pp. 73–78. COLING. https://www.aclweb.org/anthology/W04-1213.
- Krallinger,M. et al. (2015) The CHEMDNER corpus of chemicals and drugs and its annotation principles. J. Cheminform., 7.
- Krallinger,M. et al. (2017) Overview of the BioCreative VI chemical-protein interaction track. In: Proceedings of the BioCreative VI Workshop, Bethesda, MD, USA, pp. 141–146. https://academic.oup.com/database/article/doi/10.1093/database/bay073/5055578.
- Li,J. et al. (2016) BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database, 2016.
- Lim,S. and Kang,J. (2018) Chemical–gene relation extraction using recursive neural network. Database, 2018.
- Lin,C. et al. (2019) A BERT-based universal model for both within- and cross-sentence clinical temporal relation extraction. In: Proceedings of the 2nd Clinical Natural Language Processing Workshop, Minneapolis, MN, USA. pp. 65–71. Association for Computational Linguistics. https://www.aclweb.org/anthology/W19-1908.
- Lou,Y. et al. (2017) A transition-based joint model for disease named entity recognition and normalization. Bioinformatics, 33, 2363–2371.
- Luo,L. et al. (2018) An attention-based BiLSTM-CRF approach to document-level chemical named entity recognition. Bioinformatics, 34, 1381–1388.
- McCann,B. et al. (2017) Learned in translation: contextualized word vectors. In: Guyon,I. et al. (eds.), Advances in Neural Information Processing Systems 30, Curran Associates, Inc., pp. 6294–6305. http://papers.nips.cc/paper/7209-learned-in-translation-contextualized-word-vectors.pdf.
- Mikolov,T. et al. (2013) Distributed representations of words and phrases and their compositionality. In: Burges,C.J.C. (eds.), Advances in Neural Information Processing Systems 26, Curran Associates, Inc., pp. 3111–3119. http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf.
- Mohan,S. and Li,D. (2019) MedMentions: a large biomedical corpus annotated with UMLS concepts. arXiv preprint arXiv:1902.09476.
- Pafilis,E. et al. (2013) The SPECIES and ORGANISMS resources for fast and accurate identification of taxonomic names in text. PLoS One, 8, e65390.
- Pennington,J. et al. (2014) GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar. pp. 1532–1543. Association for Computational Linguistics. https://www.aclweb.org/anthology/D14-1162.
- Peters,M.E. et al. (2018) Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, LA. pp. 2227–2237. Association for Computational Linguistics. https://www.aclweb.org/anthology/N18-1202.
- Pyysalo,S. et al. (2013) Distributional semantics resources for biomedical text processing. In: Proceedings of the 5th International Symposium on Languages in Biology and Medicine, Tokyo, Japan, pp. 39–43. https://academic.oup.com/bioinformatics/article/33/14/i37/3953940.
- Rajpurkar,P. et al. (2016) SQuAD: 100,000+ questions for machine comprehension of text. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX. pp. 2383–2392. Association for Computational Linguistics. https://www.aclweb.org/anthology/D16-1264.
- Sachan,D.S. et al. (2018) Effective use of bidirectional language modeling for transfer learning in biomedical named entity recognition. In: Finale,D.-V. et al. (eds.), Proceedings of Machine Learning Research, Palo Alto, CA, Vol. 85, pp. 383–402. PMLR. http://proceedings.mlr.press/v85/sachan18a.html.
- Smith,L. et al. (2008) Overview of BioCreative II gene mention recognition. Genome Biol., 9, S2.
- Sousa,D. et al. (2019) A silver standard corpus of human phenotype-gene relations. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN. pp. 1487–1492. Association for Computational Linguistics. https://www.aclweb.org/anthology/N19-1152.
- Sung,N. et al. (2017) NSML: a machine learning platform that enables you to focus on your models. arXiv preprint arXiv:1712.05902.
- Tsatsaronis,G. et al. (2015) An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics, 16, 138.
- Uzuner,O. et al. (2011) 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. J. Am. Med. Inform. Assoc., 18, 552–556.
- Van Mulligen,E.M. et al. (2012) The EU-ADR corpus: annotated drugs, diseases, targets, and their relationships. J. Biomed. Inform., 45, 879–884.
- Vaswani,A. et al. (2017) Attention is all you need. In: Guyon,I. et al. (eds.), Advances in Neural Information Processing Systems, pp. 5998–6008. Curran Associates, Inc. http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.
- Wang,X. et al. (2018) Cross-type biomedical named entity recognition with deep multi-task learning. Bioinformatics, 35, 1745–1752.
- Wiese,G. et al. (2017) Neural domain adaptation for biomedical question answering. In: Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), Vancouver, Canada. pp. 281–289. Association for Computational Linguistics. https://www.aclweb.org/anthology/K17-1029.
- Wu,Y. et al. (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
- Xu,K. et al. (2019) Document-level attention-based BiLSTM-CRF incorporating disease dictionary for disease named entity recognition. Comput. Biol. Med., 108, 122–132.
- Yoon,W. et al. (2019) CollaboNet: collaboration of deep neural networks for biomedical named entity recognition. BMC Bioinformatics, 20, 249.
- Zhu,H. et al. (2018) Clinical concept extraction with contextual word embedding. NIPS Machine Learning for Health Workshop. http://par.nsf.gov/biblio/10098080.