A Statistical Method for Integrated Data Cleaning and Imputation

msra(2009)

Abstract
Real-world databases often contain both syntactic and semantic errors, in spite of integrity constraints and other safety measures incorporated into standard DBMSs. This is primarily due to the broad scope of incorrect data values that are difficult to fully express using the general types of constraints available. As a result, many errors are subtle and laborious to detect with manually specified rules. However, combining statistical methods with extensions to conventional integrity constraints makes it possible to develop automated data cleaning methods for a variety of relational dependencies. In this work, we focus on exploiting the statistical dependencies among tuples in relational domains such as sensor networks, supply chain systems, and fraud detection. We identify potential statistical dependencies among the data values of related tuples and develop algorithms to automatically estimate these dependencies, utilizing them to jointly fill in missing values at the same time as identifying and correcting errors. The key features of our method are that (1) it uses an efficient approximate inference algorithm that is easily implemented in standard DBMSs and scales well to large database sizes, and (2) it uses shrinkage and joint inference to accurately infer correct values even in the presence of both missing and corrupt values. We evaluate the method empirically on both synthetic and real-world genealogy data and compare it to a baseline statistical method that uses Bayesian networks with exact inference. The results show that our algorithm achieves accuracy comparable to the baseline with respect to inferring missing values. However, our algorithm scales linearly rather than exponentially and can also simultaneously identify and correct corrupted values with high accuracy.

I. INTRODUCTION

Although the database community has produced a large amount of research on integrity constraints and other safety measures to maintain and ensure the quality of information stored in relational databases, real-world databases often still contain a non-trivial number of errors. These errors, both syntactic and semantic, are generally subtle mistakes, which are difficult or even impossible to express (and detect) using the general types of constraints available in modern database management systems. In addition, quality control on data input is decreasing as collaborative efforts increase, with the
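The abstract's use of shrinkage — blending a sparse estimate from a tuple's related neighbors with a global estimate over the whole relation — can be illustrated with a minimal sketch. This is not the paper's algorithm; the function names, the m-estimate blend, and the smoothing weight `m=5.0` are illustrative assumptions.

```python
def shrinkage_estimate(local_counts, global_counts, m=5.0):
    """Blend a sparse local value distribution with a global one.

    Uses an m-estimate: the weight on the local distribution grows
    with the local sample size n, so tuples with many related
    neighbors trust their neighborhood, while tuples with few fall
    back toward the relation-wide statistics.  (Illustrative only.)
    """
    n = sum(local_counts.values())
    g = sum(global_counts.values())
    blended = {}
    for v in set(local_counts) | set(global_counts):
        p_local = local_counts.get(v, 0) / n if n else 0.0
        p_global = global_counts.get(v, 0) / g if g else 0.0
        blended[v] = (n * p_local + m * p_global) / (n + m)
    return blended


def impute(local_counts, global_counts):
    """Fill a missing value with the most probable blended value."""
    dist = shrinkage_estimate(local_counts, global_counts)
    return max(dist, key=dist.get)


# One related tuple saying "A" is outweighed by a strong global
# preference for "B"; twenty related tuples saying "A" are not.
print(impute({"A": 1}, {"B": 90, "A": 10}))    # -> B
print(impute({"A": 20}, {"B": 90, "A": 10}))   # -> A
```

The same blended distribution can also serve the paper's second task, error detection: an observed value whose blended probability falls below a threshold is a candidate for correction.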
Keywords
data cleaning, missing values, quality of information, database management system, Bayesian network, integrity constraints, sensor network, quality control, relational database, supply chain