Crowd-Based Deduplication: An Adaptive Approach

MOD(2015)

Abstract
Data deduplication is a building block for data integration and data cleaning. State-of-the-art techniques focus on exploiting crowdsourcing to improve the accuracy of deduplication. However, they either incur significant overheads on the crowd or offer inferior accuracy. This paper presents ACD, a new crowd-based algorithm for data deduplication. The basic idea of ACD is to adopt correlation clustering (a classic machine-based algorithm for data deduplication) in a crowd-based setting. We propose non-trivial techniques to reduce the time required to perform correlation clustering with the crowd, and devise methods to post-process the results of correlation clustering for better deduplication accuracy. With extensive experiments on Amazon Mechanical Turk, we demonstrate that ACD outperforms the state of the art by offering high deduplication precision while incurring moderate crowdsourcing overheads.
Keywords
Crowdsourcing, Data Deduplication, Correlation Clustering
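
The abstract does not spell out how ACD works, so the following is only a minimal sketch of the general idea it names: running correlation clustering with pairwise "same entity?" answers supplied by a crowd. It uses the classic randomized pivot heuristic for correlation clustering, which is not claimed to be the ACD algorithm. The function names, the sample records, and the `simulated_crowd` oracle (a stand-in for real crowd answers, here backed by a made-up ground truth) are all assumptions for illustration.

```python
import random
from typing import Callable, List, Sequence


def pivot_correlation_clustering(
    records: Sequence[str],
    same_entity: Callable[[str, str], bool],
    seed: int = 0,
) -> List[List[str]]:
    """Randomized pivot heuristic for correlation clustering.

    Repeatedly picks a random pivot record, asks the oracle whether each
    remaining record refers to the same entity as the pivot, and groups
    the positive answers with the pivot into one cluster.
    """
    rng = random.Random(seed)
    remaining = list(records)
    clusters: List[List[str]] = []
    while remaining:
        # Choose a pivot uniformly at random and remove it from the pool.
        pivot = remaining.pop(rng.randrange(len(remaining)))
        cluster = [pivot]
        rest = []
        for record in remaining:
            # One oracle call corresponds to one pairwise crowd question.
            if same_entity(pivot, record):
                cluster.append(record)
            else:
                rest.append(record)
        remaining = rest
        clusters.append(cluster)
    return clusters


# Hypothetical data: record -> true entity id (assumed, for simulation only).
truth = {
    "IBM": 1,
    "I.B.M.": 1,
    "Intl. Business Machines": 1,
    "Apple": 2,
    "Apple Inc.": 2,
}


def simulated_crowd(a: str, b: str) -> bool:
    """Stand-in for a crowd answer; a real crowd would be noisy and costly."""
    return truth[a] == truth[b]


clusters = pivot_correlation_clustering(list(truth), simulated_crowd)
for cluster in clusters:
    print(sorted(cluster))
```

With a perfect oracle the resulting partition matches the ground truth regardless of pivot order. With real, noisy crowd answers the outcome depends on the pivots chosen, which is one reason a post-processing step such as the one the abstract mentions could matter.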