Towards certain fixes with editing rules and master data

The VLDB Journal: The International Journal on Very Large Data Bases, no. 2 (2012): 213–238

Cited by: 231

Abstract

A variety of integrity constraints have been studied for data cleaning. While these constraints can detect the presence of errors, they fall short of guiding us to correct the errors. Indeed, data repairing based on these constraints may not find certain fixes that are absolutely correct, and worse, may introduce new errors when repairing...

Introduction
  • Dirty data costs US businesses alone 600 billion dollars each year [10].
  • These figures highlight the need for data cleaning, to catch and fix errors in the data.
  • An important functionality expected from a data cleaning tool is data monitoring [6, 26]: when a tuple t is created, the tool should find errors in t and correct them.
  • As noted by [26], it is far less costly to correct t at the point of entry than to fix it afterward.
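Correcting a tuple at the point of entry with an editing rule can be sketched as follows; the relation, attribute names, and sample values are illustrative only, and the matching is simplified to exact equality:

```python
# Minimal sketch of applying one editing rule ((X, Xm) -> (B, Bm), tp) to an
# input tuple t using a master relation Dm: if t satisfies the pattern tp and
# agrees with some master tuple tm on (X, Xm), copy tm[Bm] into t[B].

def apply_editing_rule(t, master, X, Xm, B, Bm, pattern):
    # t must satisfy the pattern condition tp (constants on some attributes)
    if any(t.get(a) != v for a, v in pattern.items()):
        return t
    for tm in master:
        # t[X] = tm[Xm]: the input tuple matches this master tuple
        if all(t.get(x) == tm.get(xm) for x, xm in zip(X, Xm)):
            fixed = dict(t)
            for b, bm in zip(B, Bm):
                fixed[b] = tm[bm]  # update with the (assumed correct) master value
            return fixed
    return t

# Hypothetical example: fix 'city' via a phone-number match, provided the
# phone type is 'landline' (the pattern condition).
master = [{"phone": "131-1234", "city": "EDI"}]
t = {"phone": "131-1234", "type": "landline", "city": "LDN"}
fixed = apply_editing_rule(t, master,
                           X=["phone"], Xm=["phone"],
                           B=["city"], Bm=["city"],
                           pattern={"type": "landline"})
```

The pattern condition restricts when the rule may fire (e.g., only landline numbers determine a city), which is what distinguishes editing rules from unconditional value copying.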
Highlights
  • Real-life data is often dirty: 1%–5% of business data contains errors [25].
  • We used the real-life datasets hosp and dblp, and synthetic tpc-h data, to verify the effectiveness of the certain regions found by our heuristic compCRegions.
  • The tests varied three parameters: d%, the probability that an input tuple matches a tuple in Dm; |Dm|, the cardinality of the master data; and n%, the noise rate, i.e., the percentage of attributes with errors in the input tuples.
  • We have proposed editing rules that, in contrast to constraints used in data cleaning, are able to find certain fixes by updating input tuples with master data
  • We have identified fundamental problems for deciding certain fixes and certain regions, and established their complexity bounds
  • We have developed a graph-based algorithm for deriving certain regions from editing rules and master data
Methods
  • FindCliques is presented following the algorithm given in [21] for ease of understanding; the experiments used the algorithm in [24].
  • These algorithms output maximal cliques of a graph G(V, E) in O(|V||E|) time per clique, in lexicographic order of the nodes.
  • Procedure findCliques first generates a total order for the eRs in Σ, and then recursively generates K maximal cliques.
  • Given a clique C, Procedure cvrtClique derives a set of certain regions, using the heuristic given in Section 4.1
  • It first extracts Z2 and Zm from the set ΣC of eRs in C.
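To illustrate what "maximal cliques of the compatible graph" means, here is a minimal Bron–Kerbosch sketch; note this is not the lexicographic algorithm of [21, 24] that findCliques follows, and the toy graph over rules e1..e4 is invented:

```python
# Enumerate all maximal cliques of an undirected graph given as an
# adjacency dict {node: set(neighbours)}, via basic Bron-Kerbosch.

def maximal_cliques(adj):
    cliques = []
    def expand(R, P, X):
        # R: current clique; P: candidates; X: already-processed nodes
        if not P and not X:
            cliques.append(R)  # R cannot be extended: it is maximal
            return
        for v in list(P):
            expand(R | {v}, P & adj[v], X & adj[v])
            P = P - {v}
            X = X | {v}
    expand(set(), set(adj), set())
    return cliques

# Toy compatible graph over four editing rules e1..e4: an edge means the
# two rules can be applied together without conflict.
adj = {"e1": {"e2", "e3"}, "e2": {"e1", "e3"},
       "e3": {"e1", "e2", "e4"}, "e4": {"e3"}}
cliques = maximal_cliques(adj)
# The maximal cliques here are {e1, e2, e3} and {e3, e4}.
```

Each maximal clique then yields a set ΣC of mutually compatible eRs, from which cvrtClique derives certain regions.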
Results
  • Exp-1: Effectiveness.
  • The authors used the real-life datasets hosp and dblp, and synthetic tpc-h data, to verify the effectiveness of the certain regions found by the heuristic compCRegions.
  • The tests varied three parameters: d%, the probability that an input tuple matches a tuple in Dm; |Dm|, the cardinality of the master data; and n%, the noise rate, i.e., the percentage of attributes with errors in the input tuples.
  • The comparisons were quantified with two measures, at the tuple level and at the attribute level, respectively.
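The two measures can be sketched as follows, under the natural reading that tuple-level accuracy counts a repaired tuple as correct only when all of its attributes are correct, while attribute-level accuracy counts correct attributes individually; the paper's exact definitions may differ, and the sample data is invented:

```python
# Tuple-level vs. attribute-level accuracy of a repair, against ground truth.

def tuple_level_accuracy(repaired, truth):
    # A tuple counts only if every attribute matches the ground truth.
    ok = sum(1 for r, g in zip(repaired, truth) if r == g)
    return ok / len(truth)

def attr_level_accuracy(repaired, truth):
    # Each attribute is scored independently.
    total = correct = 0
    for r, g in zip(repaired, truth):
        for a in g:
            total += 1
            correct += (r.get(a) == g[a])
    return correct / total

repaired = [{"city": "EDI", "zip": "EH8"}, {"city": "LDN", "zip": "XXX"}]
truth    = [{"city": "EDI", "zip": "EH8"}, {"city": "LDN", "zip": "NW1"}]
# Tuple level: 1 of 2 tuples fully correct; attribute level: 3 of 4 correct.
```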
Conclusion
  • The authors have proposed editing rules that, in contrast to constraints used in data cleaning, are able to find certain fixes by updating input tuples with master data.
  • The authors have developed a graph-based algorithm for deriving certain regions from editing rules and master data.
  • The authors are exploring optimization methods to improve the derivation algorithm.
  • Another topic is to develop methods for discovering editing rules from sample inputs and master data, along the same lines as discovering other data quality rules [7, 19].
Related work
  • Several classes of constraints have been studied for data cleaning (e.g., [3, 4, 8, 5, 12, 22, 30]; see [11] for a survey). As remarked earlier, editing rules differ from those constraints in the following: (a) they are defined in terms of updates, and (b) their reasoning is relative to master data and is based on its dynamic semantics, a departure from our familiar terrain of dependency analysis. They are also quite different from edits studied for census data repairing [15, 18, 20], which are conditions defined on a single record and are used to detect errors.

    Closer to editing rules are matching dependencies (mds [13]). We shall elaborate their differences in Section 2.

    Rules have also been studied for active databases (see [29] for a survey). Those rules are far more general than editing rules, specifying events, conditions and actions. Indeed, even the termination problem for those rules is undecidable, as opposed to the coNP upper bounds for editing rules. Results on those rules do not carry over to editing rules.
Funding
  • Fan and Ma are supported in part by EPSRC EP/E029213/1.
Reference
  • [1] F-measure. http://en.wikipedia.org/wiki/F-measure.
  • [2] T. Akutsu and F. Bao. Approximating minimum keys and optimal substructure screens. In COCOON, 1996.
  • [3] M. Arenas, L. E. Bertossi, and J. Chomicki. Consistent query answers in inconsistent databases. In PODS, 1999.
  • [4] P. Bohannon, W. Fan, M. Flaster, and R. Rastogi. A cost-based model and effective heuristic for repairing constraints by value modification. In SIGMOD, 2005.
  • [5] L. Bravo, W. Fan, and S. Ma. Extending dependencies with conditions. In VLDB, 2007.
  • [6] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. Robust and efficient fuzzy match for online data cleaning. In SIGMOD, 2003.
  • [7] F. Chiang and R. Miller. Discovering data quality rules. In VLDB, 2008.
  • [8] J. Chomicki and J. Marcinkowski. Minimal-change integrity maintenance using tuple deletions. Inf. Comput., 197(1–2):90–121, 2005.
  • [9] G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma. Improving data quality: Consistency and accuracy. In VLDB, 2007.
  • [10] W. W. Eckerson. Data quality and the bottom line: Achieving business success through a commitment to high quality data. The Data Warehousing Institute, 2002.
  • [11] W. Fan. Dependencies revisited for improving data quality. In PODS, 2008.
  • [12] W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for capturing data inconsistencies. TODS, 33(2), 2008.
  • [13] W. Fan, X. Jia, J. Li, and S. Ma. Reasoning about record matching rules. PVLDB, 2(1), 2009.
  • [14] T. Faruquie et al. Data cleansing as a transient service. In ICDE, 2010.
  • [15] I. Fellegi and D. Holt. A systematic approach to automatic edit and imputation. J. American Statistical Association, 71(353):17–35, 1976.
  • [16] M. Garey and D. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, 1979.
  • [17] Gartner. Forecast: Data quality tools, worldwide, 2006–2011. Technical report, Gartner, 2007.
  • [18] P. Giles. A model for generalized edit and imputation of survey data. The Canadian J. of Statistics, 16:57–73, 1988.
  • [19] L. Golab, H. J. Karloff, F. Korn, D. Srivastava, and B. Yu. On generating near-optimal tableaux for conditional functional dependencies. In VLDB, 2008.
  • [20] T. N. Herzog, F. J. Scheuren, and W. E. Winkler. Data Quality and Record Linkage Techniques. Springer, 2009.
  • [21] D. S. Johnson, C. H. Papadimitriou, and M. Yannakakis. On generating all maximal independent sets. Inf. Process. Lett., 27(3):119–123, 1988.
  • [22] S. Kolahi and L. Lakshmanan. On approximating optimum repairs for functional dependency violations. In ICDT, 2009.
  • [23] D. Loshin. Master Data Management. Knowledge Integrity, Inc., 2009.
  • [24] K. Makino and T. Uno. New algorithms for enumerating all maximal cliques. In SWAT, 2004.
  • [25] T. Redman. The impact of poor data quality on the typical enterprise. Commun. ACM, 41(2):79–82, 1998.
  • [26] G. Sauter, B. Mathews, and E. Ostic. Information service patterns, part 3: Data cleansing pattern. IBM, 2007.
  • [27] L. G. Valiant. The complexity of enumeration and reliability problems. SIAM J. Comput., 8(3):410–421, 1979.
  • [28] V. V. Vazirani. Approximation Algorithms. Springer, 2003.
  • [29] J. Widom and S. Ceri. Active database systems: triggers and rules for advanced database processing. Morgan Kaufmann, 1996.
  • [30] J. Wijsen. Database repairing using updates. TODS, 30(3):722–768, 2005.
Findings
  • (1) We show the problem is in NP, by providing an NP algorithm that, given Z, returns 'yes' iff there exists a non-empty pattern tableau Tc such that (Z, Tc) is a certain region for (Σ, Dm). Observe that if so, there must exist a tuple tc consisting of only constants such that (Z, {tc}) is a certain region for (Σ, Dm). Thus it suffices to consider pattern tuples consisting of constants only.
  • (2) We show the problem is NP-hard by reduction from 3SAT. Given an instance φ of 3SAT, we construct schemas R and Rm, a master relation Dm of Rm, a set Z of attributes of R, and a set Σ of eRs such that Z is valid iff φ is satisfiable.
  • (1) We show the problem is in NP by giving an NP algorithm. Consider a set Σ of eRs over schemas (R, Rm), and a positive integer K ≤ |R|. The algorithm works as follows. (a) Guess a set Z of attributes in R such that |Z| ≤ K. (b) Guess a pattern tuple tc, and check whether (Z, tc[Z]) is a certain region for (Σ, Dm). (c) If so, it returns 'yes'; otherwise it returns 'no'.
  • (2) We show that the problem is NP-hard by reduction from the minimum key problem, which is NP-complete [2].
  • 1. For each eR φ ∈ Σ such that lhs(φ) ∈ Z2 and rhs(φ) ∉ (Z1Z2), we first build a hash index based on lhsm(φ) for the master tuples in Ds. This takes O(|Σ||Ds| log |Ds|) time; this part is not shown in the pseudo-code.
  • 2. There are at most O(|Ds||Σ|) loops (lines 2–7). Each innermost loop takes O(log |Ds|) time (lines 5–6). Hence in total it takes O(|Σ||Ds| log |Ds|) time.
  • Algorithm compCRegions: 1. Compute Ds from Dm by removing conflict tuples; 2. Build the compressed compatible graph Gc for (Σ, Ds); 3. M := ∅; Γ := findCliques(K, Gc); 4. for each clique C in Γ do
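The compCRegions steps listed above can be sketched as a pipeline skeleton; the four helpers stand in for the paper's procedures (conflict removal, graph construction, findCliques, cvrtClique) and are passed in as parameters rather than implemented:

```python
# Skeleton of the compCRegions pipeline: remove conflicting master tuples,
# build the compressed compatible graph, enumerate up to K maximal cliques,
# and convert each clique into certain regions. The injected callables are
# placeholders, not implementations of the paper's procedures.

def comp_c_regions(sigma, Dm, K,
                   remove_conflicts, build_graph, find_cliques, cvrt_clique):
    Ds = remove_conflicts(sigma, Dm)        # step 1: conflict-free master data
    Gc = build_graph(sigma, Ds)             # step 2: compressed compatible graph
    regions = []                            # step 3: M := empty set
    for C in find_cliques(K, Gc):           #         Gamma := findCliques(K, Gc)
        regions.extend(cvrt_clique(C, Ds))  # step 4: each clique -> certain regions
    return regions
```

Structuring the pipeline this way makes each stage independently testable with stubs before plugging in the real procedures.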