LinkBERT: Pretraining Language Models with Document Links

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022), Volume 1: Long Papers

Abstract
Language model (LM) pretraining captures various knowledge from text corpora, helping downstream NLP tasks. However, existing methods such as BERT model a single document and fail to capture dependencies and knowledge that span across documents. In this work, we propose LinkBERT, an effective LM pretraining method that incorporates document links, such as hyperlinks. Given a pretraining corpus, we view it as a graph of documents and create LM inputs by placing linked documents in the same context. We then train the LM with two joint self-supervised tasks: masked language modeling and our newly proposed task, document relation prediction. We study LinkBERT in two domains: the general domain (pretrained on Wikipedia with hyperlinks) and the biomedical domain (pretrained on PubMed with citation links). LinkBERT outperforms BERT on various downstream tasks in both domains. It is especially effective for multi-hop reasoning and few-shot QA (+5% absolute improvement on HotpotQA and TriviaQA), and our biomedical LinkBERT sets new state-of-the-art results on various BioNLP tasks (+7% on BioASQ and USMLE). We release the pretrained models, LinkBERT and BioLinkBERT, as well as code and data.
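To make the pretraining setup concrete, the following is a minimal illustrative sketch (not the authors' released code) of how one might build LinkBERT-style input pairs from a toy corpus with hyperlinks. The function name make_drp_instances, the toy documents, and the 1/3 sampling split are assumptions for illustration; the three pair types correspond to the document relation prediction (DRP) labels described in the abstract: contiguous (same document), linked (via hyperlink), or random.

```python
# Illustrative sketch, not the authors' implementation: sample segment pairs
# for LinkBERT-style pretraining with a 3-way DRP label.
import random

def make_drp_instances(docs, links, seed=0):
    """docs: {doc_id: [segment, ...]}, links: {doc_id: [linked doc_ids]}."""
    rng = random.Random(seed)
    instances = []
    doc_ids = list(docs)
    for doc_id, segments in docs.items():
        for i, seg_a in enumerate(segments):
            r = rng.random()
            if r < 1 / 3 and i + 1 < len(segments):
                # Contiguous: next segment of the same document.
                seg_b, label = segments[i + 1], "contiguous"
            elif r < 2 / 3 and links.get(doc_id):
                # Linked: a segment from a hyperlinked document.
                linked = rng.choice(links[doc_id])
                seg_b, label = rng.choice(docs[linked]), "linked"
            else:
                # Random: a segment from an unrelated document.
                other = rng.choice([d for d in doc_ids if d != doc_id])
                seg_b, label = rng.choice(docs[other]), "random"
            # Each pair would be fed to the LM as [CLS] seg_a [SEP] seg_b [SEP],
            # trained jointly with masked language modeling and DRP on `label`.
            instances.append((seg_a, seg_b, label))
    return instances

# Toy corpus: document "A" hyperlinks to document "B".
docs = {
    "A": ["Paris is the capital of France.", "It hosts the Louvre museum."],
    "B": ["The Louvre is the world's most-visited art museum."],
}
links = {"A": ["B"]}
for a, b, label in make_drp_instances(docs, links):
    print(label, "|", a, "||", b)
```

In this sketch, placing linked documents in the same input context is what lets the model see knowledge spanning documents, and the DRP label provides the second self-supervised signal alongside masked language modeling.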
Keywords
language models, document links