Bayesian locality sensitive hashing for fast similarity search

Proceedings of the VLDB Endowment(2012)

引用 203|浏览55
暂无评分
摘要
Given a collection of objects and an associated similarity measure, the all-pairs similarity search problem asks us to find all pairs of objects with similarity greater than a certain user-specified threshold. Locality-sensitive hashing (LSH) based methods have become a very popular approach for this problem. However, most such methods only use LSH for the first phase of similarity search - i.e. efficient indexing for candidate generation. In this paper, we present BayesLSH, a principled Bayesian algorithm for the subsequent phase of similarity search - performing candidate pruning and similarity estimation using LSH. A simpler variant, BayesLSH-Lite, which calculates similarities exactly, is also presented. Our algorithms are able to quickly prune away a large majority of the false positive candidate pairs, leading to significant speedups over baseline approaches. For BayesLSH, we also provide probabilistic guarantees on the quality of the output, both in terms of accuracy and recall. Finally, the quality of BayesLSH's output can be easily tuned and does not require any manual setting of the number of hashes to use for similarity estimation, unlike standard approaches. For two state-of-the-art candidate generation algorithms, AllPairs and LSH, BayesLSH enables significant speedups, typically in the range 2x-20x for a wide variety of datasets.
更多
查看译文
关键词
all-pairs similarity search problem,fast similarity search,candidate generation,subsequent phase,false positive candidate pair,associated similarity measure,candidate pruning,state-of-the-art candidate generation algorithm,significant speedup,similarity search,similarity estimation,indexation,generic algorithm,false positive,information retrieval,locality sensitive hashing,artificial intelligent,data structure
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要