ProbClean: A Probabilistic Duplicate Detection System

26th International Conference on Data Engineering (ICDE 2010)

Abstract
One of the most prominent data quality problems is the existence of duplicate records. Current data cleaning systems usually produce one clean instance (repair) of the input data, by carefully choosing the parameters of the duplicate detection algorithms. Finding the right parameter settings can be hard, and in many cases, perfect settings do not exist. We propose ProbClean, a system that treats duplicate detection procedures as data processing tasks with uncertain outcomes. We use a novel uncertainty model that compactly encodes the space of possible repairs corresponding to different parameter settings. ProbClean efficiently supports relational queries and allows new types of queries against a set of possible repairs.
Keywords
computer science, databases, business, relational databases, data mining, maintenance engineering, data cleaning, data processing, data warehouses, probabilistic logic, uncertainty, data integrity, clustering algorithms, data quality
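
The abstract's idea of encoding many possible repairs at once, one per parameter setting, can be illustrated with a small sketch. This is not ProbClean's actual model, which the abstract does not specify. It assumes, purely for illustration, that records are numeric values, that duplicates are grouped by single-linkage clustering, and that the only parameter is a distance threshold. The function names `threshold_intervals` and `repair_at` are hypothetical. Each cluster is stored once, together with the range of thresholds under which it belongs to the repair, so the repair for any single threshold can be read off without re-running the clustering.

```python
from itertools import combinations


def threshold_intervals(values):
    """Return (members, birth, death) triples for single-linkage clusters.

    A cluster is part of the repair for every threshold t with
    birth <= t < death. Illustrative assumption: two records are linked
    when their absolute difference is <= t.
    """
    # Active clusters: id -> (member indices, threshold at which it formed).
    active = {i: (frozenset([i]), 0) for i in range(len(values))}
    intervals = []
    next_id = len(values)

    while len(active) > 1:
        def gap(pair):
            # Single-linkage distance: smallest gap between any two members.
            a, b = pair
            return min(
                abs(values[i] - values[j])
                for i in active[a][0]
                for j in active[b][0]
            )

        a, b = min(combinations(active, 2), key=gap)
        d = gap((a, b))

        merged = frozenset()
        for cid in (a, b):
            members, birth = active.pop(cid)
            if birth < d:  # skip clusters that exist for no threshold
                intervals.append((members, birth, d))
            merged |= members
        active[next_id] = (merged, d)
        next_id += 1

    # Whatever remains is never merged again.
    for members, birth in active.values():
        intervals.append((members, birth, float("inf")))
    return intervals


def repair_at(intervals, t):
    """Return the clustering (one possible repair) for threshold t."""
    return {m for m, birth, death in intervals if birth <= t < death}


if __name__ == "__main__":
    encoded = threshold_intervals([10, 12, 50, 55])
    for t in (1, 3, 5, 38):
        print(t, sorted(sorted(c) for c in repair_at(encoded, t)))
```

In this toy setting the four records yield seven stored clusters, which cover every repair obtainable from any threshold. A query such as "are records 0 and 1 duplicates?" can then be answered for a whole range of thresholds by scanning the intervals rather than by fixing one threshold up front.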