# Towards certain fixes with editing rules and master data

The VLDB Journal — The International Journal on Very Large Data Bases, no. 2 (2012): 213-238

Abstract

A variety of integrity constraints have been studied for data cleaning. While these constraints can detect the presence of errors, they fall short of guiding us to correct the errors. Indeed, data repairing based on these constraints may not find certain fixes that are absolutely correct, and worse, may introduce new errors when repairing…

Introduction

- Dirty data costs US companies alone 600 billion dollars each year [10].
- These figures highlight the need for data cleaning tools that catch and fix errors in the data.
- An important functionality expected of a data cleaning tool is data monitoring [6, 26]: when a tuple t is created, the tool finds errors in t and corrects them.
- As noted by [26], it is far less costly to correct t at the point of entry than to fix it afterward.

Highlights

- Real-life data is often dirty: 1%–5% of business data contains errors [25]
- We have proposed editing rules that, in contrast to constraints used in data cleaning, are able to find certain fixes by updating input tuples with master data
- We have identified fundamental problems for deciding certain fixes and certain regions, and established their complexity bounds
- We have developed a graph-based algorithm for deriving certain regions from editing rules and master data
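To make the editing-rule idea concrete, here is a minimal, hypothetical sketch of applying one rule: when an input tuple satisfies the rule's pattern condition and agrees with a master tuple on the rule's left-hand-side attributes, the master value is copied into the right-hand-side attribute, yielding a fix taken from trusted master data. The names (`EditingRule`, `apply_rule`) and the zip/city sample are illustrative only; the paper's formalism is richer.

```python
from dataclasses import dataclass

@dataclass
class EditingRule:
    lhs: list      # attributes on which input and master tuples must agree
    rhs: str       # attribute whose value is fixed from master data
    pattern: dict  # constant conditions the input tuple must satisfy

def apply_rule(rule, t, master):
    """Return a fixed copy of t, or t unchanged if the rule does not apply."""
    # The rule only applies to tuples matching its pattern condition.
    if any(t.get(a) != v for a, v in rule.pattern.items()):
        return t
    for s in master:
        # Agreement on the lhs attributes identifies the matching master tuple.
        if all(t.get(a) == s.get(a) for a in rule.lhs):
            fixed = dict(t)
            fixed[rule.rhs] = s[rule.rhs]  # copy the trusted master value
            return fixed
    return t

master = [{"zip": "EH8 9AB", "city": "Edinburgh"}]
rule = EditingRule(lhs=["zip"], rhs="city", pattern={"country": "UK"})
t = {"country": "UK", "zip": "EH8 9AB", "city": "Edinburh"}  # misspelled city
print(apply_rule(rule, t, master)["city"])  # -> Edinburgh
```

Because the correction comes from master data rather than from minimizing edit distance among dirty tuples, the fix is certain whenever the match is, which is the contrast with constraint-based repairing drawn above.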

Methods

- findCliques is presented following the algorithm given in [21], for ease of understanding; the authors used the algorithm of [24] in the experiments.
- Both algorithms output a maximal clique of a graph G(V, E) in O(|V||E|) time, in lexicographic order of the nodes.
- Procedure findCliques first generates a total order on the eRs in Σ and then recursively generates K maximal cliques.
- Given a clique C, procedure cvrtClique derives a set of certain regions using the heuristic given in Section 4.1; it first extracts Z2 and Zm from the set ΣC of eRs in C.
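The clique step can be illustrated with a simplified stand-in: greedily extending a clique in lexicographic node order until no further node is adjacent to every member. This is only a sketch of the idea; the paper relies on the enumeration algorithms of [21] and [24], which are more involved. `lex_maximal_clique` and the toy adjacency map are hypothetical.

```python
def lex_maximal_clique(adj, start):
    """Greedily grow a maximal clique containing `start`,
    scanning candidate nodes in lexicographic order."""
    clique = [start]
    for v in sorted(adj):
        # v joins the clique only if it is adjacent to every current member.
        if v not in clique and all(v in adj[u] for u in clique):
            clique.append(v)
    return sorted(clique)

# Toy undirected graph as an adjacency map.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(lex_maximal_clique(adj, "a"))  # -> ['a', 'b', 'c']
```

Enumerating several cliques (the K cliques mentioned above) would repeat this from different seeds while avoiding duplicates, which is exactly what the cited enumeration algorithms do efficiently.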

Results

- Exp-1: Effectiveness.
- The authors used real-life datasets hosp and dblp, and synthetic tpc-h data to verify the effectiveness of certain regions found by the heuristic compCRegions.
- The tests were conducted upon varying three parameters: d%, |Dm| and n%, where d% means the probability that an input tuple can match a tuple in Dm; |Dm| is the cardinality of master data; n% is the noise rate, which represents the percentage of attributes with errors in the input tuples.
- The comparisons were quantified with two measures, one at the tuple level and one at the attribute level.
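As a rough illustration of the two levels of measurement, the sketch below computes a tuple-level accuracy (a tuple counts as correct only if every attribute matches the ground truth) and an attribute-level accuracy (each attribute value counts individually). The exact measures used in the paper may differ; the function names and sample data are illustrative.

```python
def tuple_accuracy(fixed, truth):
    """Fraction of tuples whose every attribute matches the ground truth."""
    ok = sum(1 for f, t in zip(fixed, truth) if f == t)
    return ok / len(truth)

def attribute_accuracy(fixed, truth):
    """Fraction of individual attribute values that match the ground truth."""
    ok = total = 0
    for f, t in zip(fixed, truth):
        for a in t:
            total += 1
            ok += f.get(a) == t[a]
    return ok / total

fixed = [{"city": "Edinburgh", "zip": "EH8"}, {"city": "Glasgow", "zip": "G1"}]
truth = [{"city": "Edinburgh", "zip": "EH8"}, {"city": "Glasgow", "zip": "G2"}]
print(tuple_accuracy(fixed, truth))      # -> 0.5  (one tuple fully correct)
print(attribute_accuracy(fixed, truth))  # -> 0.75 (three of four values correct)
```

Tuple-level accuracy is the stricter measure: a single wrong attribute disqualifies the whole tuple, so it is never higher than the attribute-level score.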

Conclusion

- The authors are exploring optimization methods to improve the derivation algorithm.
- Another topic is to develop methods for discovering editing rules from sample inputs and master data, along the same lines as discovering other data quality rules [7, 19].

Related work

- Several classes of constraints have been studied for data cleaning (e.g., [3, 4, 8, 5, 12, 22, 30]; see [11] for a survey). As remarked earlier, editing rules differ from those constraints in the following: (a) they are defined in terms of updates, and (b) their reasoning is relative to master data and is based on its dynamic semantics, a departure from our familiar terrain of dependency analysis. They are also quite different from edits studied for census data repairing [15, 18, 20], which are conditions defined on a single record and are used to detect errors.

Closer to editing rules are matching dependencies (mds) [13]. We elaborate on their differences in Section 2.

Rules have also been studied for active databases (see [29] for a survey). Those rules are far more general than editing rules, specifying events, conditions, and actions. Indeed, even the termination problem for those rules is undecidable, as opposed to the coNP upper bounds for editing rules. Results on those rules do not carry over to editing rules.

Funding

- Fan and Ma are supported in part by EPSRC grant EP/E029213/1.

References

- [1] F-measure. http://en.wikipedia.org/wiki/F-measure.
- [2] T. Akutsu and F. Bao. Approximating minimum keys and optimal substructure screens. In COCOON, 1996.
- [3] M. Arenas, L. E. Bertossi, and J. Chomicki. Consistent query answers in inconsistent databases. In PODS, 1999.
- [4] P. Bohannon, W. Fan, M. Flaster, and R. Rastogi. A cost-based model and effective heuristic for repairing constraints by value modification. In SIGMOD, 2005.
- [5] L. Bravo, W. Fan, and S. Ma. Extending dependencies with conditions. In VLDB, 2007.
- [6] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. Robust and efficient fuzzy match for online data cleaning. In SIGMOD, 2003.
- [7] F. Chiang and R. Miller. Discovering data quality rules. In VLDB, 2008.
- [8] J. Chomicki and J. Marcinkowski. Minimal-change integrity maintenance using tuple deletions. Inf. Comput., 197(1–2):90–121, 2005.
- [9] G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma. Improving data quality: Consistency and accuracy. In VLDB, 2007.
- [10] W. W. Eckerson. Data quality and the bottom line: Achieving business success through a commitment to high quality data. The Data Warehousing Institute, 2002.
- [11] W. Fan. Dependencies revisited for improving data quality. In PODS, 2008.
- [12] W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for capturing data inconsistencies. TODS, 33(2), 2008.
- [13] W. Fan, X. Jia, J. Li, and S. Ma. Reasoning about record matching rules. PVLDB, 2(1), 2009.
- [14] T. Faruquie et al. Data cleansing as a transient service. In ICDE, 2010.
- [15] I. Fellegi and D. Holt. A systematic approach to automatic edit and imputation. J. American Statistical Association, 71(353):17–35, 1976.
- [16] M. Garey and D. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, 1979.
- [17] Gartner. Forecast: Data quality tools, worldwide, 2006–2011. Technical report, Gartner, 2007.
- [18] P. Giles. A model for generalized edit and imputation of survey data. The Canadian J. of Statistics, 16:57–73, 1988.
- [19] L. Golab, H. J. Karloff, F. Korn, D. Srivastava, and B. Yu. On generating near-optimal tableaux for conditional functional dependencies. In VLDB, 2008.
- [20] T. N. Herzog, F. J. Scheuren, and W. E. Winkler. Data Quality and Record Linkage Techniques. Springer, 2009.
- [21] D. S. Johnson, C. H. Papadimitriou, and M. Yannakakis. On generating all maximal independent sets. Inf. Process. Lett., 27(3):119–123, 1988.
- [22] S. Kolahi and L. Lakshmanan. On approximating optimum repairs for functional dependency violations. In ICDT, 2009.
- [23] D. Loshin. Master Data Management. Knowledge Integrity, Inc., 2009.
- [24] K. Makino and T. Uno. New algorithms for enumerating all maximal cliques. In SWAT, 2004.
- [25] T. Redman. The impact of poor data quality on the typical enterprise. Commun. ACM, 41(2):79–82, 1998.
- [26] G. Sauter, B. Mathews, and E. Ostic. Information service patterns, part 3: Data cleansing pattern. IBM, 2007.
- [27] L. G. Valiant. The complexity of enumeration and reliability problems. SIAM J. Comput., 8(3):410–421, 1979.
- [28] V. V. Vazirani. Approximation Algorithms. Springer, 2003.
- [29] J. Widom and S. Ceri. Active database systems: triggers and rules for advanced database processing. Morgan Kaufmann, 1996.
- [30] J. Wijsen. Database repairing using updates. TODS, 30(3):722–768, 2005.

Proof sketches

- (1) Membership in NP: there is an NP algorithm that, given Z, returns 'yes' iff there exists a non-empty pattern tableau Tc such that (Z, Tc) is a certain region for (Σ, Dm). Observe that if such a tableau exists, there must exist a tuple tc consisting of constants only such that (Z, {tc}) is a certain region for (Σ, Dm). Thus it suffices to consider pattern tuples consisting of constants only.
- (2) NP-hardness is shown by reduction from 3SAT: given an instance φ of 3SAT, one constructs schemas R and Rm, a master relation Dm of Rm, a set Z of attributes of R, and a set Σ of eRs such that Z is valid iff φ is satisfiable.
- (1) For the second problem, membership in NP is shown by the following NP algorithm. Consider a set Σ of eRs over schemas (R, Rm) and a positive integer K ≤ |R|. (a) Guess a set Z of attributes in R with |Z| ≤ K. (b) Guess a pattern tuple tc and check whether (Z, tc[Z]) is a certain region for (Σ, Dm). (c) If so, return 'yes'; otherwise return 'no'.
- (2) NP-hardness follows by reduction from the minimum key problem, which is NP-complete [2].

Complexity analysis

- For each eR φ ∈ Σ with lhs(φ) ∈ Z2 and rhs(φ) ∉ (Z1Z2), a hash index on lhsm(φ) is first built for the master tuples in Ds. This takes O(|Σ||Ds| log |Ds|) time and is not shown in the pseudo-code.
- There are at most O(|Ds||Σ|) loop iterations (lines 2–7), each innermost iteration taking O(log |Ds|) time (lines 5–6); hence the loop takes O(|Σ||Ds| log |Ds|) time in total.

Algorithm outline (compCRegions)

1. Compute Ds from Dm by removing conflict tuples;
2. Build the compressed compatible graph Gc for (Σ, Ds);
3. M := ∅; Γ := findCliques(K, Gc);
4. for each clique C in Γ do …
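The numbered steps above can be sketched as a control-flow skeleton. The helpers (conflict removal, graph construction, clique enumeration, region conversion) are placeholders passed in as functions, since their real definitions are given in the paper; only the pipeline structure is illustrated here.

```python
def comp_c_regions(sigma, d_m, k, remove_conflicts, build_graph,
                   find_cliques, cvrt_clique):
    """Skeleton of the compCRegions pipeline; all helpers are placeholders."""
    d_s = remove_conflicts(sigma, d_m)       # step 1: drop conflicting master tuples
    g_c = build_graph(sigma, d_s)            # step 2: compressed compatible graph
    regions = []
    for clique in find_cliques(k, g_c):      # step 3: up to K maximal cliques
        regions.extend(cvrt_clique(clique))  # step 4: derive certain regions
    return regions

# Toy stand-ins, just to exercise the control flow (not the real semantics):
regions = comp_c_regions(
    sigma=None, d_m=["s1", "s2"], k=1,
    remove_conflicts=lambda sig, dm: dm,
    build_graph=lambda sig, ds: ds,
    find_cliques=lambda k, g: [g],
    cvrt_clique=lambda c: [tuple(c)],
)
print(regions)  # -> [('s1', 's2')]
```

Passing the helpers as parameters keeps the skeleton honest about what is and is not specified here: each placeholder corresponds to one procedure named in the outline.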
