L2AP: Fast cosine similarity search with prefix L-2 norm bounds

Data Engineering(2014)

引用 66|浏览57
暂无评分
摘要
The All-Pairs similarity search, or self-similarity join problem, finds all pairs of vectors in a high dimensional sparse dataset with a similarity value higher than a given threshold. The problem has been classically solved using a dynamically built inverted index. The search time is reduced by early pruning of candidates using size and value-based bounds on the similarity. In the context of cosine similarity and weighted vectors, leveraging the Cauchy-Schwarz inequality, we propose new ℓ2-norm bounds for reducing the inverted index size, candidate pool size, and the number of full dot-product computations. We tighten previous candidate generation and verification bounds and introduce several new ones to further improve our algorithm's performance. Our new pruning strategies enable significant speedups over baseline approaches, most times outperforming even approximate solutions. We perform an extensive evaluation of our algorithm, L2AP, and compare against state-of-the-art exact and approximate methods, AllPairs, MMJoin, and BayesLSH, across a variety of real-world datasets and similarity thresholds.
更多
查看译文
关键词
information filtering,AllPairs method,BayesLSH method,Cauchy-Schwarz inequality,L2AP algorithm,MMJoin method,all-pairs similarity search,cosine similarity context,dynamically built inverted index,fast cosine similarity search,filtering strategies,high dimensional sparse dataset,prefix l2-norm bounds,pruning strategies,search time,self-similarity join problem,similarity thresholds,similarity value,size-based bounds,value-based bounds,weighted vectors context
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要