C 2 Metadata: Automating the Capture of Data Transformations from Statistical Scripts in Data Documentation

Proceedings of the 2019 International Conference on Management of Data(2019)

引用 6|浏览32
暂无评分
摘要
Datasets are often derived by manipulating raw data with statistical software packages. The derivation of a dataset must be recorded in terms of both the raw input and the manipulations applied to it. Statistics packages typically provide limited help in documenting provenance for the resulting derived data. At best, the operations performed by the statistical package are described in a script. Disparate representations make these scripts hard to understand for users. To address these challenges, we created Continuous Capture of Metadata (C2Metadata), a system to capture data transformations in scripts for statistical packages and represent it as metadata in a standard format that is easy to understand. We do so by devising a Structured Data Transformation Algebra (SDTA), which uses a small set of algebraic operators to express a large fraction of data manipulation performed in practce. We then implement SDTA, inspired by relational algebra, in a data transformation specification language we call SDTL. In this demonstration, we showcase C2metadata's capture of data transformations from a pool of sample transformation scripts in at least two languages: SPSS and Stata (SAS and R are under development), for social science data in a large academic repository. We will allow the audience to explore C2Metadata using a web-based interface, visualize the intermediate steps and trace the provenance and changes of data at different levels for better understanding of the process.
更多
查看译文
关键词
data documentation, data provenance, data transformation
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要