Self-supervised Information Retrieval Trained from Self-generated Sets of Queries and Relevant Documents

SIMILARITY SEARCH AND APPLICATIONS (SISAP 2022)(2022)

引用 1|浏览0
暂无评分
摘要
Large corpora of textual data such as scientific papers, patents, legal documents, reviews, etc., represent precious unstructured knowledge that needs semantic information retrieval engines to be extracted. Current best information retrieval solutions use supervised deep learning approaches, requiring large labelled training sets of queries and corresponding relevant documents, often unavailable, or their preparation is economically infeasible for most organizations. In this work, we present a new self-supervised method to train a neural solution to model and efficiently search large corpora of documents against arbitrary queries without requiring labelled dataset of queries and associated relevant papers. The core points of our self-supervised approach are (i) a method to self-generate the training set of queries and their relevant documents from the corpus itself, without any kind of human supervision, (ii) a deep metric learning approach to model their semantic space of relationships, and (iii) the incorporation of a multi-dimensional index for this neural semantic space over which running queries efficiently. To better stress the performance of the approach, we applied it to a totally unsupervised corpus with complex contents of over half a million Italian legal documents.
更多
查看译文
关键词
Semantic search,Self-supervised learning,Large italian dataset
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要