MassJoin: A MapReduce-based method for scalable string similarity joins

ICDE(2014)

Citations: 156 | Views: 113
Abstract
String similarity join is an essential operation in data integration. The era of big data calls for scalable algorithms to support large-scale string similarity joins. In this paper, we study scalable string similarity joins using MapReduce. We propose a MapReduce-based framework, called MASSJOIN, which supports both set-based similarity functions and character-based similarity functions. We extend the existing partition-based signature scheme to support set-based similarity functions. We utilize the signatures to generate key-value pairs. To reduce the transmission cost, we merge key-value pairs to significantly reduce the number of key-value pairs, from cubic to linear complexity, without sacrificing pruning power. To improve performance, we incorporate “light-weight” filter units into the key-value pairs, which can be used to prune a large number of dissimilar pairs without significantly increasing the transmission cost. Experimental results on real-world datasets show that our method significantly outperforms state-of-the-art approaches.
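The filter-and-verify idea behind a partition-based signature scheme can be illustrated with a small, single-process sketch. This is not the MASSJOIN algorithm itself: it assumes edit distance as the similarity function, uses hypothetical function names (`segment_bounds`, `map_signatures`, `similarity_self_join`), and simulates the shuffle with an in-memory dictionary. It deliberately emits every substring of the matching length, with no key-value merging and no filter units, which is the kind of naive key-value generation that the abstract says the paper improves on.

```python
from collections import defaultdict


def edit_distance(a, b):
    """Standard dynamic-programming edit distance (Levenshtein)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def segment_bounds(length, tau):
    """Split a string of the given length into tau + 1 near-even segments.

    Returns (start, segment_length) pairs; the split depends only on length.
    """
    k = tau + 1
    base, extra = divmod(length, k)
    bounds, start = [], 0
    for i in range(k):
        seg_len = base + (1 if i >= k - extra else 0)
        bounds.append((start, seg_len))
        start += seg_len
    return bounds


def map_signatures(sid, s, tau):
    """Simulated map step: emit (key, value) pairs for one string.

    Index side: one key per segment of s.
    Probe side: every substring of s that could equal a segment of a
    string whose length is within tau of len(s).
    """
    for i, (start, seg_len) in enumerate(segment_bounds(len(s), tau)):
        yield (len(s), i, s[start:start + seg_len]), ("index", sid)
    for other_len in range(max(0, len(s) - tau), len(s) + tau + 1):
        for i, (_, seg_len) in enumerate(segment_bounds(other_len, tau)):
            if seg_len > len(s):
                continue
            substrings = {s[p:p + seg_len] for p in range(len(s) - seg_len + 1)}
            for sub in substrings:
                yield (other_len, i, sub), ("probe", sid)


def similarity_self_join(strings, tau):
    """Return sorted id pairs (i, j), i < j, with edit distance <= tau."""
    # Simulated shuffle: group emitted values by signature key.
    groups = defaultdict(lambda: {"index": set(), "probe": set()})
    for sid, s in enumerate(strings):
        for key, (side, value) in map_signatures(sid, s, tau):
            groups[key][side].add(value)

    # Simulated reduce: pair index-side ids with probe-side ids per key.
    candidates = set()
    for sides in groups.values():
        for a in sides["index"]:
            for b in sides["probe"]:
                if a != b:
                    candidates.add((min(a, b), max(a, b)))

    # Verification: keep only candidates that really are similar.
    return sorted(p for p in candidates
                  if edit_distance(strings[p[0]], strings[p[1]]) <= tau)
```

The sketch is complete for edit distance because of a pigeonhole argument: at most `tau` edits cannot touch all `tau + 1` segments, so two similar strings always share at least one signature key. For example, `similarity_self_join(["kaushik", "kaushuk", "chadha", "chaduri", "mapreduce", "mapreduse"], 1)` returns `[(0, 1), (4, 5)]`.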
Keywords
MapReduce-based method, transmission cost reduction, scalable algorithm, MassJoin, string matching, big data, linear complexity, character-based similarity functions, computational complexity, MapReduce-based framework, large-scale string similarity join, cubic complexity, light-weight filter units, scalable string similarity joins, set-based similarity functions, data integration, cost reduction, key-value pairs, partition-based signature scheme, erbium, open systems, filtering