exKidneyBERT: a language model for kidney transplant pathology reports and the crucial role of extended vocabularies

PeerJ Computer Science (2024)

Abstract
Background. Pathology reports contain key information about the patient's diagnosis as well as important gross and microscopic findings. These information-rich clinical reports offer an invaluable resource for clinical studies, but data extraction and analysis from such unstructured texts are often manual and tedious. While neural information retrieval systems (typically implemented as deep learning methods for natural language processing) are automatic and flexible, they usually require a large domain-specific text corpus for training, making them infeasible for many medical subdomains. Thus, an automated data extraction method for pathology reports that does not require a large training corpus would be of significant value and utility.

Objective. To develop a language model-based neural information retrieval system that can be trained on small datasets, and to validate it by training it on renal transplant pathology reports to extract relevant information for two predefined questions: (1) "What kind of rejection does the patient show?"; (2) "What is the grade of interstitial fibrosis and tubular atrophy (IFTA)?"

Methods. Kidney BERT was developed by pre-training Clinical BERT on 3.4K renal transplant pathology reports (1.5M words). exKidneyBERT was then developed by extending Clinical BERT's tokenizer with six technical keywords and repeating the pre-training procedure, thereby extending the model's vocabulary. All three models were fine-tuned with information retrieval heads.

Results. The model with the extended vocabulary, exKidneyBERT, outperformed Clinical BERT and Kidney BERT on both questions. For rejection, exKidneyBERT achieved an 83.3% overlap ratio for antibody-mediated rejection (ABMR) and 79.2% for T-cell mediated rejection (TCMR). For IFTA, exKidneyBERT had a 95.8% exact match rate.

Conclusion. ExKidneyBERT is a high-performing model for extracting information from renal pathology reports. Additional pre-training of BERT language models on small specialized domains does not necessarily improve performance. Extending the BERT tokenizer's vocabulary is essential in specialized domains to improve performance, especially when pre-training on small corpora.
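The vocabulary-extension step described in the Methods can be sketched with the Hugging Face transformers API. This is a minimal illustration under stated assumptions, not the authors' implementation: the starting checkpoint name and the six domain keywords below are placeholders, since the abstract does not specify them.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumed publicly available Clinical BERT checkpoint; the paper's exact
# starting checkpoint is not specified in the abstract.
checkpoint = "emilyalsentzer/Bio_ClinicalBERT"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Hypothetical renal-pathology keywords standing in for the paper's six
# technical terms, which the abstract does not enumerate.
domain_keywords = ["ABMR", "TCMR", "IFTA", "glomerulitis", "tubulitis", "arteritis"]

# Register only the terms that are not already single tokens in the vocabulary.
added = tokenizer.add_tokens(
    [w for w in domain_keywords if w not in tokenizer.get_vocab()]
)

# Grow the embedding matrix so each new token id gets its own (randomly
# initialized) embedding row, to be learned during continued pre-training
# on the renal transplant pathology corpus.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {added} domain tokens; vocabulary size is now {len(tokenizer)}")
```

Continued masked-language-model pre-training on the report corpus and fine-tuning with an extractive question-answering head would then follow, as outlined in the Methods.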
Keywords
Natural language processing, NLP, Transformer, BERT, Kidney, Renal, Pathology, Transplant, Language model