Supplementing domain knowledge to BERT with semi-structured information of documents

EXPERT SYSTEMS WITH APPLICATIONS (2024)

Abstract
Domain adaptation is an effective way to boost BERT's performance on domain-specific natural language processing (NLP) tasks. Common domain adaptation methods, however, can fall short in capturing domain knowledge, and the context fragmentation inherent in Transformer-based models further hinders its acquisition. Considering the semi-structural characteristics of documents and their potential for alleviating these problems, we leverage the semi-structured information of documents to supplement BERT with domain knowledge. To this end, we propose a topic-based domain adaptation method that enhances the capture of domain knowledge at multiple levels of text granularity: topic masked language modeling is designed at the paragraph level for pre-training, and a topic-subsection matching degree dataset is automatically constructed at the subsection level for intermediate fine-tuning. Experiments are conducted on four biomedical NLP tasks across six datasets. The results show that our method benefits BERT, RoBERTa, SpanBERT, BioBERT, and PubMedBERT in nearly all cases, with significant gains on the two question answering (QA) tasks, especially the topic-related consumer health QA task, where average accuracy improves by 4.8%. Thus, the semi-structured information of documents can be exploited to help BERT capture domain knowledge more effectively.
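The abstract only names topic masked language modeling at the paragraph level. As a rough illustration of the idea, here is a minimal Python sketch, assuming a paragraph's "topic" is its subsection heading and that tokens shared with the heading are masked preferentially; the function name, masking ratio, and preference rule are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of topic-preferential masking for MLM pre-training.
# Assumption: the paragraph's topic is its subsection heading; tokens
# appearing in the heading are always masked, others with prob. mask_prob.
import random
import torch
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def topic_mask(paragraph: str, topic: str, mask_prob: float = 0.15):
    """Return (inputs, labels) with topic-term tokens masked preferentially."""
    topic_ids = set(tokenizer(topic, add_special_tokens=False)["input_ids"])
    inputs = tokenizer(paragraph, return_tensors="pt", truncation=True)
    ids = inputs["input_ids"][0]            # view into inputs, edited in place
    labels = torch.full_like(ids, -100)     # -100 is ignored by the MLM loss

    for i, tok in enumerate(ids.tolist()):
        if tok in tokenizer.all_special_ids:
            continue                        # never mask [CLS], [SEP], etc.
        if tok in topic_ids or random.random() < mask_prob:
            labels[i] = tok                 # predict the original token here
            ids[i] = tokenizer.mask_token_id
    return inputs, labels.unsqueeze(0)      # labels shaped (1, seq_len)
```

Inputs and labels produced this way can be fed to a BertForMaskedLM-style model during continued pre-training; the bias toward heading terms is what pushes the model to learn topic-bearing domain vocabulary.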
Keywords
BERT, Domain adaptation, Semi-structured information, Biomedical question answering, Pre-trained language model