BioBERT: a pre-trained biomedical language representation model for biomedical text mining

Wonjin Yoon
Sungdong Kim
Chan Ho So

Bioinformatics (Oxford, England), pp. 1234–1240, 2019.

Keywords:
large scale, Naver Smart Machine Learning, Conditional Random Field, biomedical literature, F1 score

Abstract:

Biomedical text mining is becoming increasingly important as the number of biomedical documents rapidly grows. With the progress in machine learning, extracting valuable information from biomedical literature has gained popularity among researchers, and deep learning has boosted the development of effective biomedical text mining models. …

Introduction
  • The volume of biomedical literature continues to rapidly increase. On average, more than 3000 new articles are published every day in peer-reviewed journals, excluding pre-prints and technical reports such as clinical trial reports in various archives.
  • Demand for accurate biomedical text mining tools that extract information from this literature is growing accordingly.
  • Recent progress in biomedical text mining models has been driven by advances in deep learning techniques used in natural language processing (NLP).
  • Other deep learning-based models have improved biomedical text mining tasks such as relation extraction (RE) (Bhasuran and Natarajan, 2018; Lim and Kang, 2018) and question answering (QA) (Wiese et al., 2017).
Highlights
  • The volume of biomedical literature continues to increase rapidly.
  • Compared with most previous biomedical text mining models, which mainly focus on a single task such as named entity recognition (NER) or question answering (QA), our model BioBERT achieves state-of-the-art performance on various biomedical text mining tasks while requiring only minimal architectural modifications.
  • Although there are several other recently introduced high-quality biomedical NER datasets (Mohan and Li, 2019), we use datasets that are frequently used by many biomedical natural language processing (NLP) researchers, which makes it much easier to compare our work with theirs.
  • F1 scores were used for NER and relation extraction (RE), and mean reciprocal rank (MRR) scores were used for QA; a minimal sketch of both metrics follows this list.
  • We showed that pre-training BERT on biomedical corpora is crucial for applying it to the biomedical domain.
  • The following updated versions of BioBERT will be available to the bioNLP community: (i) BioBERT-Base and BioBERT-Large trained on only PubMed abstracts without initialization from the existing BERT model and (ii) BioBERT-Base and BioBERT-Large trained on a domain-specific vocabulary based on WordPiece.
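As a quick reference for the two metrics named above, here is a minimal Python sketch of entity-level F1 and MRR; the counts and ranks in the usage lines are illustrative placeholders, not results from the paper.

```python
# Minimal sketch of the two evaluation metrics used in this paper:
# entity-level F1 (harmonic mean of precision and recall) and mean
# reciprocal rank (MRR).

def f1_score(tp, fp, fn):
    """F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def mean_reciprocal_rank(ranks):
    """MRR over 1-based ranks of the first correct answer per question;
    None marks a question whose returned answers are all wrong."""
    return sum(0.0 if r is None else 1.0 / r for r in ranks) / len(ranks)

# Illustrative usage (placeholder numbers):
print(f1_score(tp=850, fp=100, fn=120))       # ~0.885
print(mean_reciprocal_rank([1, 2, None, 4]))  # (1 + 0.5 + 0 + 0.25) / 4 = 0.4375
```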
Methods
  • BioBERT has essentially the same architecture as BERT. The authors briefly review the recently proposed BERT and then describe the pre-training and fine-tuning process of BioBERT in detail (a minimal fine-tuning sketch follows this list).

    3.1 BERT: bidirectional encoder representations from transformers

    Learning word representations from a large amount of unannotated text is a long-established method.
  • According to the authors of BERT, incorporating information from bidirectional representations, rather than unidirectional representations, is crucial for representing words in natural language.
  • The authors hypothesize that such bidirectional representations are critical in biomedical text mining, as complex relationships between biomedical terms often exist in a biomedical corpus (Krallinger et al., 2017).
  • Owing to space limitations, the authors refer readers to Devlin et al. (2019) for a more detailed description of BERT.
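To make the fine-tuning step concrete, the following is a minimal sketch using the Hugging Face transformers library. The checkpoint name dmis-lab/biobert-v1.1 and the three-label BIO tag set are assumptions for illustration, not details taken from this summary.

```python
# Minimal sketch: fine-tuning a BioBERT checkpoint for NER as token
# classification. Assumes the `transformers` library and the community
# checkpoint "dmis-lab/biobert-v1.1"; the label set and the single toy
# training example are hypothetical placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-Disease", "I-Disease"]  # hypothetical BIO tag set
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1")
model = AutoModelForTokenClassification.from_pretrained(
    "dmis-lab/biobert-v1.1", num_labels=len(labels)
)

# Encode one toy sentence; real fine-tuning iterates over a full dataset.
enc = tokenizer("Familial adenomatous polyposis is inherited.",
                return_tensors="pt")
# Dummy per-token labels aligned to the WordPiece-tokenized sequence
# (all "O" here); real labels come from the annotated corpus.
tags = torch.zeros_like(enc["input_ids"])

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
loss = model(**enc, labels=tags).loss  # cross-entropy over the tag logits
loss.backward()
optimizer.step()
```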
Results
  • 4.1 Datasets

    The statistics of biomedical NER datasets are listed in Table 3.
  • The authors used the pre-processed versions of all the NER datasets provided by Wang et al. (2018) except the 2010 i2b2/VA, JNLPBA and Species800 datasets (a sketch of loading such pre-processed files follows this list).
  • The relatively low scores on the LINNAEUS dataset can be attributed to the following: (i) the lack of a silver-standard dataset for training previous state-of-the-art models and (ii) different training/test set splits used in previous work (Giorgi and Bader, 2018), which were unavailable.
  • BioBERT achieved the highest F1 scores on 2 out of 3 biomedical datasets.
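Pre-processed NER files such as those above are commonly distributed in a two-column CoNLL-style layout, one token and one BIO tag per line with blank lines between sentences; the loader below is a sketch under that assumption, and the file name in the usage comment is hypothetical.

```python
# Sketch of a loader for CoNLL-style NER files (token<TAB>tag per line,
# blank line between sentences). The layout is an assumption; check the
# actual distribution format of each dataset.
def read_conll(path):
    sentences, tokens, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:  # a blank line closes the current sentence
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
                continue
            token, tag = line.split("\t")[:2]
            tokens.append(token)
            tags.append(tag)
    if tokens:  # flush a sentence with no trailing blank line
        sentences.append((tokens, tags))
    return sentences

# Hypothetical usage:
# train = read_conll("NCBI-disease/train.tsv")
```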
Conclusion
  • The authors used additional corpora of different sizes for pre-training and investigated their effect on performance.
  • Figure 2(a) shows that the performance of BioBERT v1.0 (+ PubMed) on three NER datasets (NCBI Disease, BC2GM, BC4CHEMD) changes in relation to the size of the PubMed corpus.
  • Figure 2(b) shows the performance changes of BioBERT v1.0 (+ PubMed) on the same three NER datasets in relation to the number of pre-training steps.
  • The pre-released version of BioBERT (January 2019) has already been shown to be very effective in many biomedical text mining tasks such as NER for clinical notes (Alsentzer et al., 2019), human phenotype-gene RE (Sousa et al., 2019) and clinical temporal RE (Lin et al., 2019).
  • The following updated versions of BioBERT will be available to the bioNLP community: (i) BioBERT-Base and BioBERT-Large trained on only PubMed abstracts without initialization from the existing BERT model and (ii) BioBERT-Base and BioBERT-Large trained on a domain-specific vocabulary based on WordPiece (see the tokenization sketch below).
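The domain-specific WordPiece vocabulary planned in (ii) matters because a general-domain vocabulary fragments biomedical terms into many sub-word pieces. The BioBERT versions discussed above reuse BERT's original vocabulary, so the sketch below uses SciBERT's scientific vocabulary purely as a stand-in to illustrate the effect; both checkpoint names are assumptions about what is publicly available.

```python
# Sketch: how the choice of WordPiece vocabulary changes tokenization.
# "bert-base-cased" carries BERT's general-domain vocabulary; SciBERT's
# in-domain vocabulary stands in for the domain-specific vocabulary that
# the updated BioBERT releases describe (model choice is an assumption).
from transformers import AutoTokenizer

general = AutoTokenizer.from_pretrained("bert-base-cased")
domain = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")

term = "dexamethasone"
print(general.tokenize(term))  # many sub-word fragments
print(domain.tokenize(term))   # typically far fewer pieces in-domain
```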
Tables
  • Table 1: List of text corpora used for BioBERT
  • Table 2: Pre-training BioBERT on different combinations of the following text corpora: English Wikipedia (Wiki), BooksCorpus (Books), PubMed abstracts (PubMed) and PMC full-text articles (PMC)
  • Table 3: Statistics of the biomedical named entity recognition datasets
  • Table 4: Statistics of the biomedical relation extraction datasets
  • Table 5: Statistics of the biomedical question answering datasets
  • Table 6: Test results in biomedical named entity recognition
  • Table 7: Biomedical relation extraction test results
  • Table 8: Biomedical question answering test results
  • Table 9: Prediction samples from BERT and BioBERT on NER and QA datasets
Funding
  • This research was supported by the National Research Foundation of Korea (NRF) funded by the Korea government (NRF-2017R1A2A1A17069645, NRF-2017M3C4A7065887, NRF-2014M3C9A3063541).
References
  • Alsentzer,E. et al. (2019) Publicly available clinical BERT embeddings. In: Proceedings of the 2nd Clinical Natural Language Processing Workshop, Minneapolis, MN, USA, pp. 72–78. Association for Computational Linguistics. https://www.aclweb.org/anthology/W19-1909.
  • Bhasuran,B. and Natarajan,J. (2018) Automatic extraction of gene-disease associations from literature using joint ensemble learning. PLoS One, 13, e0200699.
  • Bravo,A. et al. (2015) Extraction of relations between genes and diseases from text and large-scale data analysis: implications for translational research. BMC Bioinformatics, 16, 55.
  • Devlin,J. et al. (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, pp. 4171–4186. Association for Computational Linguistics. https://www.aclweb.org/anthology/N19-1423.
  • Dogan,R.I. et al. (2014) NCBI disease corpus: a resource for disease name recognition and concept normalization. J. Biomed. Inform., 47, 1–10.
  • Gerner,M. et al. (2010) LINNAEUS: a species name identification system for biomedical literature. BMC Bioinformatics, 11, 85.
  • Giorgi,J.M. and Bader,G.D. (2018) Transfer learning for biomedical named entity recognition with neural networks. Bioinformatics, 34, 4087.
  • Habibi,M. et al. (2017) Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics, 33, i37–i48.
  • Kim,J.-D. et al. (2004) Introduction to the bio-entity recognition task at JNLPBA. In: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA/BioNLP), Geneva, Switzerland, pp. 73–78. COLING. https://www.aclweb.org/anthology/W04-1213.
  • Krallinger,M. et al. (2015) The CHEMDNER corpus of chemicals and drugs and its annotation principles. J. Cheminform., 7.
  • Krallinger,M. et al. (2017) Overview of the BioCreative VI chemical-protein interaction track. In: Proceedings of the BioCreative VI Workshop, Bethesda, MD, USA, pp. 141–146. https://academic.oup.com/database/article/doi/10.1093/database/bay073/5055578.
  • Li,J. et al. (2016) BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database, 2016.
  • Lim,S. and Kang,J. (2018) Chemical–gene relation extraction using recursive neural network. Database, 2018.
  • Lin,C. et al. (2019) A BERT-based universal model for both within- and cross-sentence clinical temporal relation extraction. In: Proceedings of the 2nd Clinical Natural Language Processing Workshop, Minneapolis, MN, USA, pp. 65–71. Association for Computational Linguistics. https://www.aclweb.org/anthology/W19-1908.
  • Lou,Y. et al. (2017) A transition-based joint model for disease named entity recognition and normalization. Bioinformatics, 33, 2363–2371.
  • Luo,L. et al. (2018) An attention-based BiLSTM-CRF approach to document-level chemical named entity recognition. Bioinformatics, 34, 1381–1388.
  • McCann,B. et al. (2017) Learned in translation: contextualized word vectors. In: Guyon,I. et al. (eds.), Advances in Neural Information Processing Systems 30, Curran Associates, Inc., pp. 6294–6305. http://papers.nips.cc/paper/7209-learned-in-translation-contextualized-word-vectors.pdf.
  • Mikolov,T. et al. (2013) Distributed representations of words and phrases and their compositionality. In: Burges,C.J.C. et al. (eds.), Advances in Neural Information Processing Systems 26, Curran Associates, Inc., pp. 3111–3119. http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf.
  • Mohan,S. and Li,D. (2019) MedMentions: a large biomedical corpus annotated with UMLS concepts. arXiv preprint arXiv:1902.09476.
  • Pafilis,E. et al. (2013) The SPECIES and ORGANISMS resources for fast and accurate identification of taxonomic names in text. PLoS One, 8, e65390.
  • Pennington,J. et al. (2014) GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1532–1543. Association for Computational Linguistics. https://www.aclweb.org/anthology/D14-1162.
  • Peters,M.E. et al. (2018) Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, LA, pp. 2227–2237. Association for Computational Linguistics. https://www.aclweb.org/anthology/N18-1202.
  • Pyysalo,S. et al. (2013) Distributional semantics resources for biomedical text processing. In: Proceedings of the 5th International Symposium on Languages in Biology and Medicine, Tokyo, Japan, pp. 39–43.
  • Rajpurkar,P. et al. (2016) SQuAD: 100,000+ questions for machine comprehension of text. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, pp. 2383–2392. Association for Computational Linguistics. https://www.aclweb.org/anthology/D16-1264.
  • Sachan,D.S. et al. (2018) Effective use of bidirectional language modeling for transfer learning in biomedical named entity recognition. In: Doshi-Velez,F. et al. (eds.), Proceedings of Machine Learning Research, Palo Alto, CA, Vol. 85, pp. 383–402. PMLR. http://proceedings.mlr.press/v85/sachan18a.html.
  • Smith,L. et al. (2008) Overview of BioCreative II gene mention recognition. Genome Biol., 9, S2.
  • Sousa,D. et al. (2019) A silver standard corpus of human phenotype-gene relations. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, pp. 1487–1492. Association for Computational Linguistics. https://www.aclweb.org/anthology/N19-1152.
  • Sung,N. et al. (2017) NSML: a machine learning platform that enables you to focus on your models. arXiv preprint arXiv:1712.05902.
  • Tsatsaronis,G. et al. (2015) An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics, 16, 138.
  • Uzuner,O. et al. (2011) 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. J. Am. Med. Inform. Assoc., 18, 552–556.
  • Van Mulligen,E.M. et al. (2012) The EU-ADR corpus: annotated drugs, diseases, targets, and their relationships. J. Biomed. Inform., 45, 879–884.
  • Vaswani,A. et al. (2017) Attention is all you need. In: Guyon,I. et al. (eds.), Advances in Neural Information Processing Systems 30, Curran Associates, Inc., pp. 5998–6008. http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.
  • Wang,X. et al. (2018) Cross-type biomedical named entity recognition with deep multi-task learning. Bioinformatics, 35, 1745–1752.
  • Wiese,G. et al. (2017) Neural domain adaptation for biomedical question answering. In: Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), Vancouver, Canada, pp. 281–289. Association for Computational Linguistics. https://www.aclweb.org/anthology/K17-1029.
  • Wu,Y. et al. (2016) Google's neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
  • Xu,K. et al. (2019) Document-level attention-based BiLSTM-CRF incorporating disease dictionary for disease named entity recognition. Comput. Biol. Med., 108, 122–132.
  • Yoon,W. et al. (2019) CollaboNet: collaboration of deep neural networks for biomedical named entity recognition. BMC Bioinformatics, 20, 249.
  • Zhu,H. et al. (2018) Clinical concept extraction with contextual word embedding. In: NIPS Machine Learning for Health Workshop. http://par.nsf.gov/biblio/10098080.