Modeling Uncertainty in Duplicate Elimination

msra(2008)

Abstract
Real-world databases suffer from data quality problems with many causes, including heterogeneity of consolidated data sources, imprecision of reading devices, and data entry errors. The existence of duplicate records is a prominent data quality problem. The process of duplicate elimination often involves uncertainty in deciding on the true duplicates. Current tools resolve such uncertainty either through expert intervention, which is not always possible, or by making destructive decisions that may lead to unrecoverable errors.

In this paper, we approach duplicate elimination from a new perspective, treating deduplication procedures as data processing tasks with uncertain outcomes. We propose a complete uncertainty model that compactly encodes the space of clean instances of the input data, and introduce efficient model implementations. We extend our model to capture the behavior of the deduplication process, and allow revising and updating the modeled uncertainty. We apply our model and techniques to state-of-the-art deduplication algorithms to demonstrate the added value of our methods. Our experimental study evaluates the complexity and scalability of our techniques in different configurations.