An Interactive Data Quality Test Approach for Constraint Discovery and Fault Detection

2019 IEEE International Conference on Big Data (Big Data)(2019)

引用 5|浏览23
暂无评分
摘要
Data quality tests validate heterogeneous data to detect violations of syntactic and semantic constraints. The specification of these constraints can be incomplete because domain experts typically specify them in an ad hoc manner. Existing automated test approaches can generate false alarms and do not explain the constraint violations while reporting faulty data records. In previous work, we proposed ADQuaTe, which is an automated data quality test approach that uses an unsupervised deep learning techni que (1) to discover constraints from big datasets that may have been missed by experts, and (2) to label as suspicious those records that violate the constraints. These records are grouped and explanations for constraint violations are presented to domain experts who determine whether or not the groups are actually faulty. This paper presents ADQuaTe2, which extends ADQuaTe to use an interactive learning technique that incorporates expert feedback to retrain the learning model and improve the accuracy of constraint discovery and fault detection. We evaluate the effectiveness of the approach on real-world datasets from a health data warehouse and a plant diagnosis database. We also use datasets with known faults from the UCI repository to evaluate the improvement in the accuracy of the approach after incorporating ground truth knowledge.
更多
查看译文
关键词
Big Data,Data quality tests,Explainable learning,Interactive learning,Unsupervised learning
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要