Cleaning Data with Forbidden Itemsets

IEEE Transactions on Knowledge and Data Engineering(2020)

引用 36|浏览95
暂无评分
摘要
Methods for cleaning dirty data typically employ additional information about the data such as user-provided constraints specifying when data is dirty, e.g., domain restrictions, illegal value combinations, or logical rules. However, real-world scenarios usually only have dirty data available, without known constraints. In such settings, constraints are automatically discovered on dirty data and discovered constraints are used to detect and repair errors. Typical repairing processes stop there. Yet, when constraint discovery algorithms are re-run on the repaired data (assumed to be clean), new constraints and thus errors are often found. The repairing process then introduces new constraint violations. We present a different type of repairing method, which prevents introducing new constraint violations, according to a discovery algorithm. Summarily, our repairs guarantee that all errors identified by constraints discovered on the dirty data are fixed; and the constraint discovery process cannot identify new constraint violations. We do this for a new kind of constraints, called forbidden itemsets (FBIs), capturing unlikely value co-occurrences. We show that FBIs detect errors with high precision. Evaluation on real-world data shows that our repair method obtains high-quality repairs without introducing new FBIs. Optional user interaction is readily integrated, with users deciding how much effort to invest.
更多
查看译文
关键词
Itemsets,Maintenance engineering,Cleaning,Data integrity,Data mining,Heuristic algorithms
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要