A Formal Framework For Probabilistic Unclean Databases.

International Conference on Database Theory (2019)

Abstract
Traditional modeling of inconsistency in database theory casts all possible "repairs" as equally likely. Yet, effective data cleaning needs to incorporate statistical reasoning. For example, a yearly salary of $100k and an age of 22 are more likely than $100k and an age of 122, and two people with the same address are likely to share their last name (i.e., a functional dependency tends to hold but may occasionally be violated). We propose a formal framework for unclean databases, where two types of statistical knowledge are incorporated. The first represents a belief of how the intended (clean) data is generated, and the second represents a belief of how the actual database is realized through the introduction of noise. Formally, a Probabilistic Unclean Database (PUD) is a triple that consists of a probabilistic database that we call the intention, a probabilistic data transformator that we call the realization, and a dirty database that we call the observation. We define three computational problems in this framework: cleaning (find the most likely intention), probabilistic query answering (compute the probability of an answer tuple), and learning (find the most likely parameters given examples of clean and dirty databases). We illustrate the framework on concrete representations of PUDs, show that they generalize traditional concepts of repairs such as cardinality and value repairs, draw connections to consistent query answering, and prove tractability results. We further show that parameters can be learned in practical instantiations and, in fact, prove that under certain conditions we can learn directly from a single dirty database without any need for clean examples.
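
To make the cleaning problem concrete, the following is a minimal Python sketch of a toy PUD, not the paper's construction. It assumes a single "age" attribute, independent tuples, a small hypothetical prior (`AGE_PRIOR`) playing the role of the intention, and a hypothetical uniform-replacement noise model (`NOISE`, `realization_prob`) playing the role of the realization. Cleaning is done by brute force: enumerate candidate clean databases and keep the one that maximizes Pr(intention) × Pr(observation | intention).

```python
from itertools import product

# Hypothetical toy parameters, chosen for illustration only (not from the paper).
# Intention: a belief about how clean "age" values are generated.
AGE_PRIOR = {22: 0.6, 35: 0.399, 122: 0.001}
# Realization: probability that a clean value is replaced by a different one.
NOISE = 0.1


def realization_prob(clean, observed, domain):
    """Pr(observed | clean): keep the value with prob. 1 - NOISE,
    otherwise replace it by another domain value chosen uniformly."""
    if clean == observed:
        return 1.0 - NOISE
    return NOISE / (len(domain) - 1)


def most_likely_intention(observation):
    """Brute-force cleaning: argmax over clean databases I of
    Pr(I) * Pr(observation | I), assuming independent tuples."""
    domain = list(AGE_PRIOR)
    best, best_score = None, -1.0
    for candidate in product(domain, repeat=len(observation)):
        score = 1.0
        for clean, observed in zip(candidate, observation):
            score *= AGE_PRIOR[clean] * realization_prob(clean, observed, domain)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


# Observation: a dirty database with two age values.
cleaned, score = most_likely_intention((122, 22))
print(cleaned, score)
```

Under these assumed parameters, the implausible age 122 is repaired because the prior outweighs the small noise penalty, while the plausible value 22 is left unchanged. The exhaustive enumeration is exponential in the number of tuples and is meant only to illustrate the definition.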