Interactive and Deterministic Data Cleaning

    SIGMOD/PODS'16: International Conference on Management of Data, San Francisco, California, USA, June 2016, pp. 893-907.

    Keywords:
    functional dependencies, support vector machine, data quality rule, integrity constraints, breadth-first search

    Abstract:

    We present Falcon, an interactive, deterministic, and declarative data cleaning system, which uses SQL update queries as the language to repair data. Falcon does not rely on the existence of a set of pre-defined data quality rules. On the contrary, it encourages users to explore the data, identify possible problems, and make updates to fi...

    Introduction
    • High quality data is important to all businesses, and data cleaning is an important but tedious step.
    • Removing errors in order to get high quality data takes most of data analysts’ time [31], and some studies predict a shortage of people with the skills and the know-how for these tasks [33].
    • In the evolving scenario of data cleaning, rule-based approaches show a serious limitation.
    • They assume that data quality rules are declared upfront by domain experts who understand the data and write logical formulas or procedural code.
    • These systems have fallen short in terms of adoption in industrial tools.
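    To make the notion of an upfront-declared data quality rule concrete, here is a minimal Python sketch of one common kind of integrity constraint, a functional dependency check. It is an illustration only and is not taken from the paper; the function name `fd_violations`, the column names, and the sample rows are all assumptions.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return groups of rows that agree on `lhs` but disagree on `rhs`.

    Such groups violate the functional dependency lhs -> rhs.
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[a] for a in lhs)
        groups[key].append(row)
    return {
        key: members
        for key, members in groups.items()
        if len({m[rhs] for m in members}) > 1
    }


# Hypothetical toy relation; column names and values are illustrative only.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},
    {"zip": "94105", "city": "San Francisco"},
]

# The rule zip -> city has to be known and written down before any error is found.
violations = fd_violations(rows, lhs=["zip"], rhs="city")
```

    The rule itself (`zip -> city`) must be supplied by someone who already understands the data, which is the assumption the bullets above call a limitation.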
    Highlights
    • High quality data is important to all businesses, and data cleaning is an important but tedious step
    • Besides traditional one-hop, SQL-based traversal algorithms (e.g., breadth-first search or depth-first search), we describe novel multi-hop search algorithms that can ...
    • Rule-based data repairing consists of using integrity constraints to identify data errors [11, 12, 17, 25, 40], and automated algorithms to enforce these constraints over the data [7, 22, 23, 32, 43]
    • In order to efficiently manage all potential updates, and effectively interact with users, we propose Falcon, which works as follows
    • Although more sophisticated combinations are possible, we found that the simple sum gives a global overview of the algorithms' behaviour that is close to the real overall experience of the users
    • While we discover rules using any combination of columns, Refine either generates rules for the entire column, which is unlikely to hold for data errors, or rules that update a single tuple
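    One way to picture the search space behind the traversal algorithms mentioned above is as a lattice of candidate SQL updates that generalize a single user edit, one candidate per subset of attributes used in the WHERE clause. The Python sketch below walks that subset lattice breadth-first, from the most general candidates to the most specific, and counts how many tuples each candidate would change. It is a simplified illustration under stated assumptions, not Falcon's actual algorithm: the table `tdrug`, its columns, the sample values, and the helper names `lattice_levels` and `affected_count` are all hypothetical.

```python
import sqlite3
from itertools import combinations


def lattice_levels(attrs):
    """Yield attribute subsets level by level (breadth-first over the subset lattice)."""
    for size in range(1, len(attrs) + 1):
        yield from combinations(attrs, size)


def affected_count(conn, table, target, new_value, row, subset):
    """Count tuples that the generalized update restricted to `subset` would change.

    Identifiers are interpolated directly, so they must come from trusted code.
    """
    where = " AND ".join(f"{a} = ?" for a in subset)
    sql = f"SELECT COUNT(*) FROM {table} WHERE {where} AND {target} <> ?"
    params = [row[a] for a in subset] + [new_value]
    return conn.execute(sql, params).fetchone()[0]


# Hypothetical table and values, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tdrug (molecule TEXT, laboratory TEXT, quantity INTEGER)")
conn.executemany(
    "INSERT INTO tdrug VALUES (?, ?, ?)",
    [
        ("statin", "Austin", 200),
        ("statin", "Austin", 100),
        ("statin", "Boston", 200),
        ("C22H28F", "Austin", 200),
    ],
)

# The user edits one cell: molecule 'statin' -> 'C22H28F' in the tuple below.
edited_row = {"molecule": "statin", "laboratory": "Austin", "quantity": 200}

counts = {
    subset: affected_count(conn, "tdrug", "molecule", "C22H28F", edited_row, subset)
    for subset in lattice_levels(list(edited_row))
}
```

    With three attributes the lattice has seven non-empty nodes. Each node corresponds to a question the system could put to the user ("should every tuple matching this WHERE clause receive the same repair?"), so the order in which nodes are visited determines how many interactions are needed.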
    Methods
    • The authors conducted five experiments.
    • (i) Exp-1 compares the benefits of the various lattice-traversal algorithms with different budget values, and shows that CoDive maximizes the benefit.
    • (ii) Exp-2 studies the impact of different ...
    • [Figure, panel (a) Budget=2: benefit of DFS, BFS, Ducc, Dive, and CoDive on the Soccer, Hospital, Synth 10k, Synth 1M, DBLP, and BUS datasets]
    Conclusion
    • Falcon is an interactive and deterministic data cleaning system.
    • The authors have demonstrated that Falcon can effectively interact with users to generalize user-solicited updates, and clean up data with a significant benefit w.r.t. the number of required interactions.
    • A number of possible future studies using Falcon are apparent.
    • The authors plan to extend it by using external sources, as remarked in Appendix B.
    • The authors will leverage the information obtained from previous interactions with the user over multiple data updates.
    Tables
    • Table 1: Dataset Tdrug with drug tests
    • Table 2: A 2-way contingency table
    • Table 3: Notations used in the paper
    • Table 4: Features of node DML
    • Table 5: Correlation of attributes in Soccer dataset when Stadium is updated
    • Table 6: Comparison of the lattice search algorithms with B = 3: U is the number of user updates, A is the number of user answers, and |Q(T)| is the total number of errors
    • Table 7: Comparison of the baselines. Here T is the total interaction cost for the user, Rep is the number of repaired
    Related work
    • Data transformation. Interactive systems for data transformation [27,37,44] also reason about the updated attribute to learn transformation rules. They mainly focus on string manipulation and reformatting at the text level. In contrast, we use more expressive SQL scripts. Consequently, we discover not only rules that contain one attribute that is being updated syntactically, but also rules that combine multiple attributes to semantically determine new repairs. Our language and algorithms can lead to smaller interaction cost, as discussed in Section 6 Exp-3.
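    The contrast drawn above between text-level reformatting and multi-attribute SQL rules can be sketched in a few lines of Python using the standard `sqlite3` module. This is an illustration only, not code from the paper: the table `soccer`, its columns `team` and `city`, and all values are assumptions (only the Stadium attribute of a Soccer dataset is named in the source).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE soccer (team TEXT, city TEXT, stadium TEXT)")
conn.executemany(
    "INSERT INTO soccer VALUES (?, ?, ?)",
    [
        ("Team A", "City X", " Old Arena "),
        ("Team A", "City X", "Old Arena"),
        ("Team B", "City X", "Old Arena"),
    ],
)

# Syntactic, single-attribute rule: reformat the text of the updated column itself.
conn.execute("UPDATE soccer SET stadium = TRIM(stadium)")

# Multi-attribute rule: values of other columns determine which tuples get the repair.
cursor = conn.execute(
    "UPDATE soccer SET stadium = ? WHERE team = ? AND city = ?",
    ("New Arena", "Team A", "City X"),
)
repaired = cursor.rowcount

stadiums = [
    r[0] for r in conn.execute("SELECT stadium FROM soccer ORDER BY rowid")
]
```

    The first statement can only rewrite a value based on its own text, whereas the second uses `team` and `city` to decide where the new stadium value applies, which is the kind of semantic repair a string-transformation rule cannot express.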
    Funding
    • This work was partly supported by the 973 Program of China (2015CB358700), NSF of China (61422205, 61472198), Huawei, Shenzhou, Tencent, FDCT/116/2013/A3, MYRG105(Y1-L3)-FST13-GZ, National High-Tech R&D (863) Program of China (2012AA012600), and the Chinese Special Project of Science and Technology (2013zx01039-002-002).
    Reference
    • [2] A. Abouzied, J. M. Hellerstein, and A. Silberschatz. Playful query specification with dataplay. PVLDB, 5(12):1938–1941, 2012.
    • [4] B. Alexe, L. Chiticariu, R. J. Miller, and W. C. Tan. Muse: Mapping understanding and design by example. In ICDE, pages 10–19, 2008.
    • [6] P. C. Arocena, B. Glavic, G. Mecca, R. J. Miller, P. Papotti, and D. Santoro. Messing up with BART: error generation for evaluating data-cleaning algorithms. PVLDB, 9(2), 2015.
    • [7] P. Bohannon, M. Flaster, W. Fan, and R. Rastogi. A cost-based model and effective heuristic for repairing constraints by value modification. In SIGMOD, 2005.
    • [8] A. Bonifati, R. Ciucanu, and S. Staworko. Interactive inference of join queries. In EDBT, 2014.
    • [9] A. Bonifati, R. Ciucanu, and S. Staworko. Interactive join query inference with JIM. PVLDB, 7(13), 2014.
    • [10] C. Chang and C. Lin. LIBSVM: A library for support vector machines. ACM TIST, 2(3):27, 2011.
    • [11] F. Chiang and R. J. Miller. Discovering data quality rules. PVLDB, 1(1), 2008.
    • [12] X. Chu, I. F. Ilyas, and P. Papotti. Discovering denial constraints. PVLDB, 6(13), 2013.
    • [13] M. Dallachiesa, A. Ebaid, A. Eldawy, A. K. Elmagarmid, I. F. Ilyas, M. Ouzzani, and N. Tang. NADEEF: a commodity data cleaning system. In SIGMOD, 2013.
    • [14] O. Deshpande, D. S. Lamba, M. Tourn, S. Das, S. Subramaniam, A. Rajaraman, V. Harinarayan, and A. Doan. Building, maintaining, and using knowledge bases: A report from the trenches. In SIGMOD, 2013.
    • [15] A. Ebaid, A. K. Elmagarmid, I. F. Ilyas, M. Ouzzani, J. Quiane-Ruiz, N. Tang, and S. Yin. NADEEF: A generalized data cleaning system. PVLDB, 6(12):1218–1221, 2013.
    • [16] W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for capturing data inconsistencies. ACM Trans. Database Syst., 33(2), 2008.
    • [17] W. Fan, F. Geerts, J. Li, and M. Xiong. Discovering conditional functional dependencies. IEEE Trans. Knowl. Data Eng., 23(5), 2011.
    • [18] W. Fan, F. Geerts, N. Tang, and W. Yu. Inferring data currency and consistency for conflict resolution. In ICDE, 2013.
    • [19] W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Interaction between record matching and data repairing. In SIGMOD, 2011.
    • [20] W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Towards certain fixes with editing rules and master data. VLDB J., 21(2), 2012.
    • [21] H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C.-A. Saita. Declarative data cleaning: Language, model, and algorithms. In VLDB, 2001.
    • [22] F. Geerts, G. Mecca, P. Papotti, and D. Santoro. The LLUNATIC data-cleaning framework. PVLDB, 6(9), 2013.
    • [23] F. Geerts, G. Mecca, P. Papotti, and D. Santoro. Mapping and Cleaning. In ICDE, pages 232–243, 2014.
    • [24] F. Geerts, G. Mecca, P. Papotti, and D. Santoro. That’s all folks! LLUNATIC goes open source. PVLDB, 7(13):1565–1568, 2014.
    • [25] L. Golab, H. J. Karloff, F. Korn, B. Saha, and D. Srivastava. Discovering conservation rules. In ICDE, 2012.
    • [27] J. Heer, J. M. Hellerstein, and S. Kandel. Predictive interaction for data transformation. In CIDR, 2015.
    • [28] A. Heise, J. Quiane-Ruiz, Z. Abedjan, A. Jentzsch, and F. Naumann. Scalable discovery of unique column combinations. PVLDB, 7(4), 2013.
    • [29] I. F. Ilyas, V. Markl, P. J. Haas, P. Brown, and A. Aboulnaga. CORDS: automatic discovery of correlations and soft functional dependencies. In SIGMOD, pages 647–658, 2004.
    • [30] M. Interlandi and N. Tang. Proof positive and negative in data cleaning. In ICDE, 2015.
    • [31] S. Kandel, A. Paepcke, J. M. Hellerstein, and J. Heer. Enterprise data analysis and visualization: An interview study. IEEE Trans. Vis. Comput. Graph., 18(12), 2012.
    • [32] Z. Khayyat, I. F. Ilyas, A. Jindal, S. Madden, M. Ouzzani, J.-A. Quiane-Ruiz, P. Papotti, N. Tang, and S. Yin. BigDansing: a system for big data cleansing. In SIGMOD, 2015.
    • [35] L. Qian, M. J. Cafarella, and H. V. Jagadish. Sample-driven schema mapping. In SIGMOD, pages 73–84, 2012.
    • [36] G. Ramalingam and T. W. Reps. A categorized bibliography on incremental computation. In POPL, 1993.
    • [37] V. Raman and J. M. Hellerstein. Potter’s wheel: An interactive data cleaning system. In VLDB, pages 381–390, 2001.
    • [38] Y. Shen, K. Chakrabarti, S. Chaudhuri, B. Ding, and L. Novik. Discovering queries based on example tuples. In SIGMOD, pages 493–504, 2014.
    • [39] D. D. Sleator and R. E. Tarjan. Amortized efficiency of list update and paging rules. Commun. ACM, 28(2), 1985.
    • [40] S. Song and L. Chen. Efficient discovery of similarity constraints for matching dependencies. Data Knowl. Eng., 87, 2013.
    • [41] M. Volkovs, F. Chiang, J. Szlichta, and R. J. Miller. Continuous data cleaning. In ICDE, 2014.
    • [42] J. Wang, J. Han, and J. Pei. CLOSET+: searching for the best strategies for mining frequent closed itemsets. In SIGKDD, 2003.
    • [43] J. Wang and N. Tang. Towards dependable data repairing with fixing rules. In SIGMOD, 2014.
    • [44] B. Wu and C. A. Knoblock. An iterative approach to synthesize data transformation programs. In IJCAI, pages 1726–1732, 2015.
    • [45] M. Yakout, L. Berti-Equille, and A. K. Elmagarmid. Don’t be scared: use scalable automatic repairing with maximal likelihood and bounded changes. In SIGMOD, pages 553–564, 2013.
    • [46] M. Yakout, A. K. Elmagarmid, J. Neville, M. Ouzzani, and I. F. Ilyas. Guided data repair. PVLDB, 2011.
    • [47] Z. Yan, N. Zheng, Z. G. Ives, P. P. Talukdar, and C. Yu. Actively soliciting feedback for query answers in keyword search-based data integration. PVLDB, 6(3):205–216, 2013.
    • [48] M. J. Zaki and W. Meira. Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press, 2014.
    • [49] M. Zhang, H. Elmeleegy, C. M. Procopiuc, and D. Srivastava. Reverse engineering complex join queries. In SIGMOD, 2013.