Top-k String Similarity Joins.

SSDBM(2020)

引用 0|浏览62
暂无评分
摘要
Top-k joins have been extensively studied in relational databases as ranking operations when every object has, among others, at least one ranking attribute. However, the focus has mostly been the case when the join attributes are of primitive data types (e.g., numerical values) and the join predicate is equality. In this work, we consider string objects assigned such ranking attributes or simply scores. Given two collection of string objects and a string similarity measure (e.g., the Edit distance), we introduce the top-k string similarity join () which returns k sufficiently similar pairs of objects with respect to a similarity threshold ϵ, which have the highest combined score computed by a monotone aggregate function γ (e.g., SUM). Such a join operation finds application in data integration, data cleaning and de-duplication scenarios, and in emerging scientific fields such as bioinformatics. We investigate how existing top-k join methods can be adapted and optimized for , taking into account the semantics and the special characteristics of string similarity joins. We present techniques to avoid computing the entire string join and indexing that enables pruning candidates with respect to both the string join and the ranking component of the query. Our extensive experimental analysis demonstrates the efficiency of our methodology for by comparing solutions that either prioritize the ranking/join component or are able to handle both components of the query at the same time.
更多
查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要