Statistical inference and data cleaning in relational database systems (2011)

Abstract
Real-world databases often contain syntactic and semantic errors, in spite of integrity constraints and other safety measures available in modern DBMSs. We present an iterative statistical framework for inferring missing information and correcting such errors automatically. The key insight of our approach is to exploit dependencies not only within tuples, but also between attributes of related tuples. We draw on techniques from statistical relational learning to develop an efficient approximate inference algorithm that can be implemented in standard DBMSs using SQL and user-defined functions. The resulting framework performs the inference and data cleaning tasks in an integrated manner, using novel techniques to infer correct values accurately even in the presence of dirty data. We evaluate our methods empirically using multiple synthetic and real data sets. The results show that our algorithm infers missing values with accuracy comparable to that of baseline statistical methods, such as exact inference in Bayesian networks. However, our framework simultaneously identifies and corrects corrupted values with high precision, and is significantly more efficient because of its database-level implementation.
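
As a rough illustration of the key insight, the sketch below infers one missing attribute by combining evidence from within the tuple and from a related tuple. It is not the paper's algorithm: it uses a simple naive-Bayes-style product of smoothed conditional frequencies, and every name in it (`people`, `links`, `infer_income`, the `edu` and `income` attributes, the `smoothing` parameter) is a hypothetical chosen for the example.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: None marks a missing value.
people = {
    1: {"edu": "college", "income": "high"},
    2: {"edu": "college", "income": "high"},
    3: {"edu": "school", "income": "low"},
    4: {"edu": "school", "income": "low"},
    5: {"edu": "college", "income": None},  # value to infer
}
# Hypothetical relation between tuples (e.g. members of one household).
links = [(1, 2), (3, 4), (5, 1)]


def conditional_counts(pairs):
    """Count how often each target value co-occurs with each evidence value."""
    table = defaultdict(Counter)
    for evidence, target in pairs:
        table[evidence][target] += 1
    return table


def infer_income(pid, people, links, smoothing=1.0):
    """Return a normalised score per candidate income value for tuple `pid`."""
    values = sorted({p["income"] for p in people.values() if p["income"] is not None})

    # Within-tuple dependency: income given the same tuple's education.
    by_edu = conditional_counts(
        (p["edu"], p["income"]) for p in people.values() if p["income"] is not None
    )

    # Cross-tuple dependency: income given a linked tuple's income.
    sym = links + [(b, a) for a, b in links]  # treat links as undirected
    by_neighbour = conditional_counts(
        (people[b]["income"], people[a]["income"])
        for a, b in sym
        if people[a]["income"] is not None and people[b]["income"] is not None
    )
    neighbours = [
        people[b]["income"]
        for a, b in sym
        if a == pid and people[b]["income"] is not None
    ]

    def prob(table, evidence, value):
        # Laplace-smoothed conditional frequency.
        counts = table[evidence]
        return (counts[value] + smoothing) / (sum(counts.values()) + smoothing * len(values))

    scores = {}
    for v in values:
        score = prob(by_edu, people[pid]["edu"], v)
        for n in neighbours:
            score *= prob(by_neighbour, n, v)
        scores[v] = score
    total = sum(scores.values())
    return {v: s / total for v, s in scores.items()}


posterior = infer_income(5, people, links)
best = max(posterior, key=posterior.get)
```

In this toy setting both sources of evidence agree, so the combined score favours the same value more strongly than either source alone. An iterative framework of the kind the abstract describes would repeat such a step over all uncertain values and also question observed values, which this sketch does not attempt.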
Keywords
missing information, efficient approximate inference algorithm, relational database system, real data set, exact inference, dirty data, baseline statistical method, resulting framework, statistical relational, algorithm infers, iterative statistical framework, Statistical inference