REALM: Retrieval-Augmented Language Model Pre-Training

Kelvin Guu
Zora Tung
Keywords:
language model pre-training, knowledge retriever, reading comprehension, Maximum Inner Product Search, Open-domain Question Answering

Abstract:

Language model pre-training has been shown to capture a surprising amount of world knowledge, crucial for NLP tasks such as question answering. However, this knowledge is stored implicitly in the parameters of a neural network, requiring ever-larger networks to cover more facts. To capture knowledge in a more modular and interpretable way, we augment language model pre-training with a latent knowledge retriever, which allows the model to retrieve and attend over documents from a large corpus such as Wikipedia, used during pre-training, fine-tuning and inference. For the first time, we show how to pre-train such a knowledge retriever in an unsupervised manner, using masked language modeling as the learning signal and backpropagating through a retrieval step that considers millions of documents. We demonstrate the effectiveness of Retrieval-Augmented Language Model pre-training (REALM) by fine-tuning on the challenging task of Open-domain Question Answering (Open-QA), where it outperforms all previous methods by a significant margin while also providing qualitative benefits such as interpretability and modularity.

Introduction
  • Recent advances in language model pre-training have shown that models such as BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019) and T5 (Raffel et al., 2019) store a surprising amount of world knowledge, acquired from the massive text corpora they are trained on (Petroni et al., 2019).
  • For example, BERT is able to correctly predict the missing word in the following sentence: “The ___ is the currency of the United Kingdom” (answer: “pound”).
  • In these language models, the learned world knowledge is stored implicitly in the parameters of the underlying neural network.
  • The language model uses the retriever to retrieve documents from a large corpus such as Wikipedia, and attends over those documents to help inform its prediction.
  • Learning this model end-to-end requires backpropagating through a retrieval step that considers an entire corpus of textual knowledge, as shown in Figure 1; a minimal sketch of this retrieve-then-predict computation follows this list.
  • A good masked language model (MLM) must learn to encode syntactic and semantic information as well as some world knowledge.
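To make the retrieve-then-predict step above concrete, here is a minimal sketch of marginalizing a masked-token prediction over retrieved documents, with the retrieval distribution given by a softmax over inner products between a query embedding and document embeddings. This is an illustrative PyTorch toy, not the authors' implementation; the sizes, variable names, and the random stand-in for the knowledge-augmented encoder are assumptions.

```python
# Minimal sketch of REALM-style retrieve-then-predict (illustrative only):
# p(y|x) = sum_z p(y|z, x) * p(z|x), where p(z|x) is a softmax over inner
# products between a query embedding and document embeddings.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim, num_docs, vocab_size, k = 128, 1000, 30522, 5           # illustrative sizes

query_emb = torch.randn(dim, requires_grad=True)              # stands in for Embed_input(x)
doc_embs = torch.randn(num_docs, dim, requires_grad=True)     # stands in for Embed_doc(z)

# Retrieval distribution p(z|x): relevance scores are inner products; in practice
# the top-k documents are found with (approximate) Maximum Inner Product Search.
scores = doc_embs @ query_emb                      # shape [num_docs]
top = torch.topk(scores, k=k)
p_z_given_x = F.softmax(top.values, dim=0)         # renormalized over the top-k documents

# Stand-in for the knowledge-augmented encoder p(y|z, x): a per-document
# distribution over the vocabulary for the masked token (random here).
p_y_given_zx = F.softmax(torch.randn(k, vocab_size), dim=-1)

# Marginalize over retrieved documents and take the MLM loss for the gold token.
p_y_given_x = (p_z_given_x.unsqueeze(-1) * p_y_given_zx).sum(dim=0)
gold_token_id = 1234                               # hypothetical wordpiece id
loss = -torch.log(p_y_given_x[gold_token_id])
loss.backward()

# Gradients reach both the query and the retrieved document embeddings, i.e. the
# retriever is trained end-to-end from the language modeling signal.
print(query_emb.grad.norm(), doc_embs.grad.norm())
```

Because the document weights are differentiable, the masked language modeling loss backpropagates into the retriever; in practice only the top-scoring documents enter the sum, which is why efficient inner-product search over the document index matters.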
Highlights
  • Recent advances in language model pre-training have shown that models such as BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019) and T5 (Raffel et al., 2019) store a surprising amount of world knowledge, acquired from the massive text corpora they are trained on (Petroni et al., 2019).
  • To capture knowledge in a more interpretable and modular way, we propose a novel framework, Retrieval-Augmented Language Model (REALM) pre-training, which augments language model pre-training algorithms with a learned textual knowledge retriever
  • We evaluate our approach by fine-tuning the models pre-trained with REALM on the task of Open-domain Question Answering (Open-QA), one of the most knowledge-intensive tasks in natural language processing.
  • We present several alternate ways of viewing REALM that connect it to a broader set of ideas beyond Open-QA, such as language modeling with the corpus as context: language representation models have been incorporating contexts of increasingly large scope when making predictions.
  • REALM has a similar approach, except that the model learns for itself which texts are most useful for reducing perplexity
  • By jointly learning the retriever, REALM has the capacity to depend on information beyond lexical overlap; a brief sketch of the inner-product retrieval this relies on follows this list.
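Since the retriever scores documents by an inner product, retrieving from a corpus the size of Wikipedia is typically served from a precomputed index of document embeddings (the keywords above mention Maximum Inner Product Search). Below is a brute-force NumPy stand-in for such an index, with illustrative names and sizes; real systems replace the exhaustive scoring with an approximate MIPS structure, and the paper refreshes its index asynchronously as the retriever's parameters change.

```python
# Brute-force top-k inner-product retrieval as a stand-in for an approximate
# MIPS index (illustrative sketch; names and sizes are assumptions).
import numpy as np

rng = np.random.default_rng(0)
dim, num_docs = 128, 100_000

doc_index = rng.standard_normal((num_docs, dim)).astype(np.float32)  # precomputed document embeddings
query_emb = rng.standard_normal(dim).astype(np.float32)              # query embedding

def retrieve_top_k(query, index, k=5):
    """Return ids and scores of the k documents with the largest inner product."""
    scores = index @ query                      # one relevance score per document
    top = np.argpartition(-scores, k)[:k]       # unordered top-k candidates
    top = top[np.argsort(-scores[top])]         # order candidates by score
    return top, scores[top]

doc_ids, doc_scores = retrieve_top_k(query_emb, doc_index, k=5)
print(doc_ids, doc_scores)
```

The point of the sketch is only the interface: given a query embedding, return the ids and scores of the k documents with the highest inner product, which the language model then attends over.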
Methods
  • The authors evaluate the approach on the Open-QA task, and describe in detail the benchmarks used and the different approaches to which they compare empirically.
  • The authors focus on datasets where the question writers did not already know the answer.
  • This yields questions that reflect more realistic information-seeking needs, and avoids artifacts that can arise if the question is formulated with a particular answer in mind.
  • The predicted answer is evaluated via exact match with any reference answer, following previous Open-QA work (Chen et al., 2017); a minimal sketch of this metric follows this list.
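For concreteness, here is a sketch of the exact-match metric as it is commonly implemented in Open-QA evaluation: lowercase, strip punctuation and English articles, collapse whitespace, then compare the prediction against every reference answer. The normalization details follow common practice and may differ from the authors' exact evaluation script.

```python
# Exact-match scoring as commonly used for Open-QA (sketch; normalization
# details may differ from the official evaluation script).
import re
import string

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, references):
    """True if the normalized prediction equals any normalized reference answer."""
    return any(normalize(prediction) == normalize(ref) for ref in references)

print(exact_match("The Pound", ["pound", "pound sterling"]))   # True
print(exact_match("pound sterling", ["pound"]))                # False (EM is strict)
```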
Results
  • The authors select and mask one of these salient spans (named entities and dates) within a sentence for the masked language modeling task; a minimal sketch of this masking follows this list.
  • The authors show that this significantly outperforms other masking strategies in Section 4.5.
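A minimal sketch of salient span masking is given below. In the paper, salient spans are identified with a BERT-based named-entity tagger and a regular expression for dates; here two crude regexes (capitalized phrases and four-digit years) stand in for those components, and the example sentence and [MASK] convention are illustrative assumptions.

```python
# Salient span masking sketch: choose one salient span and replace it with a
# mask token. Crude regexes stand in for the NER tagger + date regex the
# paper uses; everything here is illustrative.
import random
import re

# Capitalized phrases as a rough entity proxy, four-digit numbers as a rough date proxy.
SALIENT = re.compile(r"\b(?:[A-Z][a-z]+(?:\s[A-Z][a-z]+)*|\d{4})\b")

def mask_salient_span(sentence, rng):
    """Replace one randomly chosen salient span with [MASK]."""
    spans = [m.span() for m in SALIENT.finditer(sentence)]
    if not spans:
        return sentence            # no salient span found; leave the sentence unmasked
    start, end = rng.choice(spans)
    return sentence[:start] + "[MASK]" + sentence[end:]

rng = random.Random(0)
print(mask_salient_span("United Kingdom adopted decimal currency in 1971.", rng))
# -> e.g. "United Kingdom adopted decimal currency in [MASK]."
```

The intuition is that recovering such spans forces the model to consult world knowledge (retrieved documents) rather than local syntactic cues alone.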
Conclusion
  • The authors present several alternate ways of viewing REALM that connect it to a broader set of ideas beyond Open-QA, such as language modeling with the corpus as context: language representation models have been incorporating contexts of increasingly large scope when making predictions.
  • Examples of this progression include models that condition on surrounding words (Mikolov et al., 2013a;b), sentences (Kiros et al., 2015; Peters et al., 2018), and paragraphs (Radford et al., 2018; Devlin et al., 2018).
  • By jointly learning the retriever, REALM has the capacity to depend on information beyond lexical overlap
Tables
  • Table 1: Test results on Open-QA benchmarks. The numbers of train/test examples are shown in parentheses below each benchmark. Predictions are evaluated with exact match against any reference answer. Sparse retrieval denotes methods that use sparse features such as TF-IDF and BM25. Our model, REALM, outperforms all existing systems.
  • Table 2: Ablation experiments on NQ’s development set.
  • Table 3: An example where REALM utilizes retrieved documents to better predict masked tokens. It assigns much higher probability (0.129) to the correct term, “Fermat”, compared to BERT. (Note that the blank corresponds to 3 BERT wordpieces.)
  • Table 4: An example where REALM adapts to an updated knowledge corpus. The Wikipedia page “Excellent Cadaver” was added in 2019, so the model was not able to recover the masked word when the knowledge corpus was outdated (2018): its top predictions were smith (0.01), brown (0.01) and jones (0.01). Interestingly, the same REALM model pre-trained on the 2018 corpus is able to retrieve the document in the updated corpus (2020) and generate the correct token, “Lawrence”, with top predictions lawrence (0.13), brown (0.01) and smith (0.01).
Reference
  • Asai, A., Hashimoto, K., Hajishirzi, H., Socher, R., and Xiong, C. Learning to retrieve reasoning paths over Wikipedia graph for question answering. arXiv preprint arXiv:1911.10470, 2019.
  • Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
  • Berant, J., Chou, A., Frostig, R., and Liang, P. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1533–1544, 2013.
  • Brill, E., Dumais, S., and Banko, M. An analysis of the AskMSR question-answering system. In Empirical Methods in Natural Language Processing, 2002.
  • Chen, D., Fisch, A., Weston, J., and Bordes, A. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pp. 1870–1879, 2017.
  • Clark, C. and Gardner, M. Simple and effective multi-paragraph reading comprehension. In Annual Meeting of the Association for Computational Linguistics, 2017.
  • Dai, A. M. and Le, Q. V. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, pp. 3079–3087, 2015.
  • Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • Graves, A., Wayne, G., and Danihelka, I. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.
  • Guu, K., Hashimoto, T. B., Oren, Y., and Liang, P. Generating sentences by editing prototypes. Transactions of the Association for Computational Linguistics, 6:437–450, 2018.
  • Hashimoto, T. B., Guu, K., Oren, Y., and Liang, P. S. A retrieve-and-edit framework for predicting structured outputs. In Advances in Neural Information Processing Systems, pp. 10052–10062, 2018.
  • Joshi, M., Chen, D., Liu, Y., Weld, D. S., Zettlemoyer, L., and Levy, O. SpanBERT: Improving pre-training by representing and predicting spans. arXiv preprint arXiv:1907.10529, 2019.
  • Khandelwal, U., Levy, O., Jurafsky, D., Zettlemoyer, L., and Lewis, M. Generalization through memorization: Nearest neighbor language models. arXiv preprint arXiv:1911.00172, 2019.
  • Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., and Fidler, S. Skip-thought vectors. In Advances in Neural Information Processing Systems, pp. 3294–3302, 2015.
  • Kwiatkowski, T., Palomaki, J., Rhinehart, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Kelcey, M., Devlin, J., et al. Natural Questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 2019.
  • Lample, G., Sablayrolles, A., Ranzato, M., Denoyer, L., and Jegou, H. Large memory layers with product keys. In Advances in Neural Information Processing Systems, pp. 8546–8557, 2019.
  • Lee, K., Salant, S., Kwiatkowski, T., Parikh, A., Das, D., and Berant, J. Learning recurrent span representations for extractive question answering. arXiv preprint arXiv:1611.01436, 2016.
  • Lee, K., Chang, M.-W., and Toutanova, K. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the Conference of the Association for Computational Linguistics, 2019.
  • Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., and Zettlemoyer, L. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019.
  • Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
  • Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013a.
  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119, 2013b.
  • Miller, A., Fisch, A., Dodge, J., Karimi, A.-H., Bordes, A., and Weston, J. Key-value memory networks for directly reading documents. arXiv preprint arXiv:1606.03126, 2016.
  • Min, S., Chen, D., Hajishirzi, H., and Zettlemoyer, L. A discrete hard EM approach for weakly supervised question answering. arXiv preprint arXiv:1909.04849, 2019a.
  • Min, S., Chen, D., Zettlemoyer, L., and Hajishirzi, H. Knowledge guided text retrieval and reading for open domain question answering. arXiv preprint arXiv:1911.03868, 2019b.
  • Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. Deep contextualized word representations. In Proceedings of NAACL, 2018.
  • Peters, M. E., Neumann, M., Logan IV, R. L., Schwartz, R., Joshi, V., Singh, S., and Smith, N. A. Knowledge enhanced contextual word representations. 2019.
  • Petroni, F., Rocktaschel, T., Lewis, P., Bakhtin, A., Wu, Y., Miller, A. H., and Riedel, S. Language models as knowledge bases? arXiv preprint arXiv:1909.01066, 2019.
  • Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. Improving language understanding with unsupervised learning. Technical report, OpenAI, 2018.
  • Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog, 2019.
  • Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.
  • Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383–2392, 2016.
  • Rajpurkar, P., Jia, R., and Liang, P. Know what you don’t know: Unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822, 2018.
  • Ram, P. and Gray, A. G. Maximum inner-product search using cone trees. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 931–939, 2012.
  • Roberts, A., Raffel, C., and Shazeer, N. How much knowledge can you pack into the parameters of a language model? arXiv preprint arXiv:TBD, 2020.
  • Robertson, S., Zaragoza, H., et al. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389, 2009.
  • Sang, E. T. K. and De Meulder, F. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pp. 142–147, 2003.
  • Seo, M., Kembhavi, A., Farhadi, A., and Hajishirzi, H. Bidirectional attention flow for machine comprehension. In International Conference on Learning Representations, 2016.
  • Shen, F., Liu, W., Zhang, S., Yang, Y., and Tao Shen, H. Learning binary codes for maximum inner product search. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4148–4156, 2015.
  • Shrivastava, A. and Li, P. Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In Advances in Neural Information Processing Systems, pp. 2321–2329, 2014.
  • Sukhbaatar, S., Weston, J., Fergus, R., et al. End-to-end memory networks. In Advances in Neural Information Processing Systems, 2015.
  • Weston, J., Chopra, S., and Bordes, A. Memory networks. arXiv preprint arXiv:1410.3916, 2014.