Sparse, Dense, and Attentional Representations for Text Retrieval
Transactions of the Association for Computational Linguistics (2021): 329–345
- Retrieving relevant documents is a core task for language technology and is a component of other applications, such as information extraction (e.g., Narasimhan et al., 2016) and question answering (e.g., Kwok et al., 2001; Voorhees, 2001).
- In this pipeline, documents are first retrieved using sparse, high-dimensional query/document representations and are then reranked with learned neural models (see Mitra and Craswell (2018) for an overview).
- This two-stage approach is powerful and has achieved state-of-the-art results on multiple IR benchmarks (Nogueira and Cho, 2019; Yang et al., 2019; Nogueira et al., 2019a), especially since large-scale annotated data has become available for training deep neural models (Dietz et al., 2018; Craswell et al., 2020).
- One approach to taking advantage of neural models while still employing sparse term-based retrieval is to expand the documents with neural models before indexing (Nogueira et al., 2019b) or to learn contextual term weights (Dai and Callan, 2020).
- While classical information retrieval has focused on heuristic weights for sparse bag-of-words representations (Spärck Jones, 1972), more recent work has adopted a two-stage retrieval and ranking pipeline, in which a large number of candidates (e.g., 1000) is first retrieved by an efficient system and then reranked by a more expensive model.
- Recent history in NLP might suggest that learned dense representations should always outperform sparse features, but this is not necessarily true: as shown in Figure 1, the BM25 model (Robertson et al., 2009) can outperform a dual encoder based on BERT on longer documents (see § 7). This raises questions about the utility and limitations of dual encoders, and about the circumstances in which these powerful models do not yet reach the state of the art. We explore these questions using both theoretical and empirical tools, and propose new architectures that leverage the strengths of dual encoders while avoiding some of their weaknesses.
- We focus on the capacity of the dual encoder model because capacity limitations impose a strict upper bound on performance, and because they do not depend on details of the training data or learning algorithm.
- The computational demands of large-scale retrieval push us to seek other architectures: cross-attention over contextualized embeddings is too slow, but dual encoding over fixed-length vectors may be insufficiently expressive, failing even to match the performance of sparse bag-of-words competitors. We have used both theoretical and empirical techniques to characterize the limitations of fixed-length dual encoders, focusing on the role of document length.
- The authors' theoretical results relate the dimensionality of compressive dual encoders to their ability to accurately approximate rankings defined by bag-of-words representations like BM25.
- The distribution of natural language texts may have a special structure, which in turn could enable precise approximation of sparse bag-of-words models with a lower-dimensional compressive dual encoder.
- Dual encoders can introduce trained distributed representations of texts that are better equipped to capture graded notions of semantic similarity; however, if they cannot make the distinctions that sparse models make, they may face a performance ceiling.
- The state-of-the-art prior work follows the two-stage retrieval and reranking approach, where an efficient first-stage system retrieves a list of candidates from the document collection and a more expensive second-stage model, such as a cross-attention BERT ranker, reranks the candidates (a minimal sketch of this pipeline follows this list).
- The authors focus on improving the efficient first retrieval stage and compare to prior work in two settings: Retrieval (top part of the table), where only first-stage efficient retrieval systems are used, and Reranking (bottom part of the table), where more expensive second-stage models are employed to re-rank candidates.
- DeepCT-Index produces term weights that can be stored in an ordinary inverted index for first-stage passage retrieval. IDST is a two-stage cascade ranking pipeline proposed by Yan et al. (2020) that uses both document expansion and cross-attention ensemble reranking with tailored BERT model pre-training. Leaderboard is the best reported development-set score on the MS MARCO passage leaderboard.
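To make the two-stage pipeline concrete, here is a minimal Python sketch of the two scoring regimes compared above: a sparse term-weight score of the kind an inverted index supports, and a dual-encoder score computed as an inner product of fixed-length vectors. The `encode_query` and `encode_doc` callables stand in for a learned encoder such as BERT, and the document format is hypothetical; this illustrates the interface, not the authors' implementation.

```python
import numpy as np

def sparse_score(query_terms, doc_term_weights):
    """Sparse bag-of-words score: sum the document's stored weights
    (e.g., BM25 or learned DeepCT weights) for each query term."""
    return sum(doc_term_weights.get(t, 0.0) for t in query_terms)

def dense_score(query_vec, doc_vec):
    """Dual-encoder score: inner product of two fixed-length embeddings."""
    return float(np.dot(query_vec, doc_vec))

def retrieve_then_rerank(query_terms, docs, encode_query, encode_doc, k=1000):
    """Two-stage pipeline: cheap first-stage retrieval of k candidates,
    then reranking with a more expensive scorer."""
    # Stage 1: score the whole collection with the efficient sparse model.
    candidates = sorted(
        docs,
        key=lambda d: sparse_score(query_terms, d["weights"]),
        reverse=True,
    )[:k]
    # Stage 2: rerank only the k candidates with the dense model.
    q = encode_query(query_terms)
    return sorted(
        candidates,
        key=lambda d: dense_score(q, encode_doc(d["text"])),
        reverse=True,
    )
```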
- Transformers perform well on an unreasonable range of problems in natural language processing.
- Table 1: Short answer exact match on the Natural Questions open-domain test set for retrieval models over collections with varying document length.
- Table 2: Results on MS MARCO-Passage (MS-Passage), MS MARCO-Document (MS-Doc), and TREC-CAR datasets. We report MRR@10 and MAP@1000 to align with prior work. For the MS MARCO datasets, results are on the development set; the TREC-CAR results are on the test set.
- Table 3: MRR@10 when reranking at different depths for the MS MARCO passage and document tasks.
- We have mentioned research improving the accuracy of retrieval and ranking from a large space throughout the paper. Here we focus on prior work related to our research questions on the capacity of dense dual-encoder representations relative to sparse, high-dimensional bag-of-words representations.
A number of other works relate to the general problem of recovering bag-of-words representations from dense encodings. For example, the literature on compressive sensing shows that it is possible to recover a bag-of-words vector x from the projection Ax for suitable A. Bounds for the sufficient dimensionality of isotropic Gaussian projections (Candes and Tao, 2005; Arora et al., 2018) are a factor of T log v worse than the bound described in § 3, but this is unsurprising because the task of recovering bags-of-words from a compressed measurement is strictly harder than recovering inner products.
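As a concrete illustration of the inner-product side of this argument, the NumPy sketch below projects sparse bag-of-words vectors with a Gaussian matrix A whose entries have variance 1/k, so that the projected inner product approximates the exact one in expectation. The vocabulary size, dimension, and synthetic term weights are arbitrary choices for illustration, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
v, k = 10_000, 512  # vocabulary size, projection dimension

def random_bow(nnz):
    """A sparse synthetic bag-of-words vector with nnz nonzero weights."""
    x = np.zeros(v)
    x[rng.choice(v, size=nnz, replace=False)] = rng.random(nnz)
    return x

q, d = random_bow(10), random_bow(200)  # short query, longer document

# Gaussian projection with i.i.d. N(0, 1/k) entries, so that
# E[<Aq, Ad>] = <q, d>.
A = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, v))

print("exact inner product    :", q @ d)
print("projected inner product:", (A @ q) @ (A @ d))
```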
Subramani et al. (2019) ask whether it is possible to exactly recover sentences (token sequences) from pretrained decoders, using vector embeddings that are added as a bias to the decoder hidden state. Because their decoding model is more expressive (and thus more computationally intensive) than inner product retrieval, the theoretical bounds derived here do not apply. Nonetheless, Subramani et al. empirically observe a similar dependence between sentence length and embedding size. Wieting and Kiela (2019) represent sentences as bags of random projections, finding that high-dimensional projections (k = 4096) perform nearly as well as trained encoding models such as SkipThought (Kiros et al., 2015) and InferSent (Conneau et al., 2017). These results may provide further empirical support for the hypothesis that bag-of-words vectors from real text are “hard to embed” in the sense of Larsen and Nelson (2017). Our contribution is to systematically explore the relationship between document length and encoding dimension, focusing on the case of exact inner product-based retrieval. Approximate retrieval (Indyk and Motwani, 1998; Har-Peled et al., 2012) is often necessary in practice. We leave the combination of representation learning and approximate retrieval for future work.
- Using the MS MARCO document retrieval dataset (see § 9 for data processing details), we evaluate the ability of Rademacher random projections to achieve accuracy of at least 95% on pairwise rankings (q, d1, d2), with respect to both boolean (Figure 2) and BM25 sparse representations (Figure 3); a simplified version of this test is sketched after this list.
- BM25-bi achieves over 90% accuracy across document collections for this task
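A simplified version of that pairwise test can be written as follows: project query and document vectors with a random ±1 (Rademacher) matrix and count how often the projected inner products order a pair (d1, d2) the same way as the exact inner products. The synthetic bag-of-words vectors here are a stand-in for the boolean and BM25 representations used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
v, k, trials = 5_000, 256, 200  # vocabulary, projection dim, sampled triples

# Rademacher projection: i.i.d. +/-1 entries scaled by 1/sqrt(k), so inner
# products are preserved in expectation.
A = rng.choice([-1.0, 1.0], size=(k, v)) / np.sqrt(k)

def random_bow(nnz):
    """A sparse synthetic bag-of-words vector with nnz nonzero weights."""
    x = np.zeros(v)
    x[rng.choice(v, size=nnz, replace=False)] = rng.random(nnz)
    return x

agree = 0
for _ in range(trials):
    q, d1, d2 = random_bow(10), random_bow(200), random_bow(200)
    exact = (q @ d1) - (q @ d2)          # true preference between d1 and d2
    aq = A @ q
    approx = aq @ (A @ d1) - aq @ (A @ d2)
    agree += (exact > 0) == (approx > 0)  # same ranking decision?

print(f"pairwise ranking agreement: {agree / trials:.1%}")
```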
- Dimitris Achlioptas. 2003. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687.
- Alexandr Andoni, Piotr Indyk, and Ilya Razenshteyn. 2019. Approximate nearest neighbor search in high dimensions. Proceedings of the International Congress of Mathematicians (ICM 2018).
- Sanjeev Arora, Mikhail Khodak, Nikunj Saunshi, and Kiran Vodrahalli. 2018. A compressed sensing view of unsupervised text embeddings, bag-of-n-grams, and LSTMs. In Proceedings of the International Conference on Learning Representations (ICLR).
- Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations (ICLR).
- Shai Ben-David, Nadav Eiron, and Hans Ulrich Simon. 2002. Limitations of learning via embeddings in Euclidean half spaces. Journal of Machine Learning Research, 3(Nov):441–461.
- Emmanuel J. Candes and Terence Tao. 2005. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215.
- Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP), pages 670–680.
- Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Ellen M. Voorhees. 2020. Overview of the TREC 2019 deep learning track. In Text REtrieval Conference (TREC).
- Zhuyun Dai and Jamie Callan. 2020. Context-aware sentence/passage term importance estimation for first stage retrieval. In Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).
- Laura Dietz, Ben Gamari, Jeff Dalton, and Nick Craswell. 2018. TREC complex answer retrieval overview. In Text REtrieval Conference (TREC).
- Daniel Gillick, Sayali Kulkarni, Larry Lansing, Alessandro Presta, Jason Baldridge, Eugene Ie, and Diego Garcia-Olano. 2019. Learning dense representations for entity retrieval. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 528–537.
- Ruiqi Guo, Sanjiv Kumar, Krzysztof Choromanski, and David Simcha. 2016. Quantization based fast inner product search. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pages 482–490.
- Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. REALM: Retrieval-augmented language model pre-training.
- Yanchao Hao, Yuanzhe Zhang, Kang Liu, Shizhu He, Zhanyi Liu, Hua Wu, and Jun Zhao. 2017. An end-to-end model for question answering over knowledge base with cross-attention combining global knowledge. In Proceedings of the Association for Computational Linguistics (ACL), pages 221–231.
- Sariel Har-Peled, Piotr Indyk, and Rajeev Motwani. 2012. Approximate nearest neighbor: Towards removing the curse of dimensionality. Theory of Computing, 8(1):321–350.
- Harold Stanley Heaps. 1978. Information retrieval, computational and theoretical aspects. Academic Press.
- Gustav Herdan. 1960. Type-token mathematics: A textbook of mathematical linguistics, volume 4. Mouton.
- Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), pages 2333–2338.
- Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. 2020. Poly-encoders: Transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring. In Proceedings of the International Conference on Learning Representations (ICLR).
- Piotr Indyk and Rajeev Motwani. 1998. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pages 604–613.
- Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering.
- Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems, pages 3294–3302.
- Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics.
- Cody Kwok, Oren Etzioni, and Daniel S Weld. 2001. Scaling question answering to the web. ACM Transactions on Information Systems (TOIS), 19(3):242–262.
- Kasper Green Larsen and Jelani Nelson. 2017. Optimality of the Johnson-Lindenstrauss lemma. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 633–638. IEEE.
- Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the Association for Computational Linguistics (ACL).
- Yankai Lin, Haozhe Ji, Zhiyuan Liu, and Maosong Sun. 2018. Denoising distantly supervised open-domain question answering. In Proceedings of the Association for Computational Linguistics (ACL), pages 1736–1745.
- Sewon Min, Danqi Chen, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2019. Knowledge guided text retrieval and reading for open domain question answering. arXiv preprint arXiv:1911.03868.
- Bhaskar Mitra and Nick Craswell. 2018. An introduction to neural information retrieval. Foundations and Trends in Information Retrieval, 13(1):1–126.
- Karthik Narasimhan, Adam Yala, and Regina Barzilay. 2016. Improving information extraction by acquiring external evidence with reinforcement learning. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP), pages 2355–2365.
- Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset.
- Rodrigo Nogueira and Kyunghyun Cho. 2019. Passage re-ranking with BERT. CoRR, abs/1901.04085.
- Rodrigo Nogueira, Wei Yang, Kyunghyun Cho, and Jimmy Lin. 2019a. Multi-stage document ranking with BERT.
- Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. 2019b. Document expansion by query prediction. CoRR, abs/1904.08375.
- Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP), pages 3982–3992.
- Stephen Robertson, Hugo Zaragoza, et al. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389.
- Minjoon Seo, Jinhyuk Lee, Tom Kwiatkowski, Ankur Parikh, Ali Farhadi, and Hannaneh Hajishirzi. 2019. Real-time open-domain question answering with dense-sparse phrase index. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4430–4441.
- Karen Spärck Jones. 1972. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11–21.
- Nishant Subramani, Samuel Bowman, and Kyunghyun Cho. 2019. Can unconditional language models recover arbitrary sentences? In Advances in Neural Information Processing Systems, pages 15232–15242.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
- Santosh S Vempala. 2004. The random projection method, volume 65. American Mathematical Society.
- Ellen M. Voorhees. 2001. The TREC question answering track. Natural Language Engineering, 7(4):361–378.
- John Wieting and Douwe Kiela. 2019. No training required: Exploring random encoders for sentence classification. In Proceedings of the International Conference on Learning Representations (ICLR).
- Ledell Wu, Fabio Petroni, Martin Josifoski, Sebastian Riedel, and Luke Zettlemoyer. 2019. Zero-shot entity linking with dense entity retrieval.
- Ming Yan, Chenliang Li, Chen Wu, Bin Bi, Wei Wang, Jiangnan Xia, and Luo Si. 2020. IDST at TREC 2019 deep learning track: Deep cascade ranking with generation-based document expansion and pre-trained language modeling. In Text REtrieval Conference (TREC).
- Liu Yang, Qingyao Ai, Jiafeng Guo, and W Bruce Croft. 2016. aNMM: Ranking short answer texts with attention-based neural matching model. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), pages 287–296.
- Wei Yang, Haotian Zhang, and Jimmy Lin. 2019. Simple applications of BERT for ad hoc document retrieval. CoRR, abs/1903.10972.
- Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen. 2018. Breaking the softmax bottleneck: A high-rank RNN language model. In Proceedings of the International Conference on Learning Representations (ICLR).