Automatic Module Detection in Data Cleaning Workflows: Enabling Transparency and Recipe Reuse

International Journal of Digital Curation (2022)

Abstract
Before data from multiple sources can be analyzed, data cleaning workflows (“recipes”) are usually needed to improve data quality. We identify a number of technical problems that make it challenging to apply the FAIR principles to data cleaning recipes. We then demonstrate how the transparency and reusability of recipes can be improved by analyzing the dataflow dependencies within them. In particular, column-level dependencies can be used to automatically detect independent subworkflows, which can then be reused individually as data cleaning modules. We have implemented a prototype of this approach as part of an ongoing project to develop open-source companion tools for OpenRefine.

Keywords: Data Cleaning, Provenance, Workflow Analysis
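To make the idea of column-level module detection concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical recipe representation in which each step declares the columns it reads and writes; the step names, the `reads`/`writes` fields, and the `detect_modules` function are illustrative assumptions and do not reflect OpenRefine's actual operation history format. As a conservative simplification, any two steps that touch a common column are placed in the same module.

```python
from collections import defaultdict

# Hypothetical recipe: every step lists the columns it reads and writes.
# These names and fields are assumptions made for illustration only.
RECIPE = [
    {"op": "trim_whitespace", "reads": ["name"], "writes": ["name"]},
    {"op": "to_date", "reads": ["date"], "writes": ["date"]},
    {"op": "split_name", "reads": ["name"], "writes": ["first", "last"]},
    {"op": "to_uppercase", "reads": ["city"], "writes": ["city"]},
    {"op": "extract_year", "reads": ["date"], "writes": ["year"]},
]


def detect_modules(recipe):
    """Group steps into independent modules via shared columns (union-find)."""
    parent = list(range(len(recipe)))

    def find(i):
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the smaller index as root so modules stay in recipe order.
            parent[max(ri, rj)] = min(ri, rj)

    last_touch = {}  # column name -> index of the latest step touching it
    for i, step in enumerate(recipe):
        for col in set(step["reads"]) | set(step["writes"]):
            if col in last_touch:
                union(i, last_touch[col])
            last_touch[col] = i

    # Collect step names per module, preserving the original step order.
    groups = defaultdict(list)
    for i, step in enumerate(recipe):
        groups[find(i)].append(step["op"])
    return list(groups.values())


modules = detect_modules(RECIPE)
for number, ops in enumerate(modules, start=1):
    print(f"module {number}: {ops}")
```

Under these assumptions, the steps on `name`, the steps on `date`, and the single step on `city` fall into three separate modules, each of which could be reapplied to another dataset without the others. A finer-grained analysis would distinguish reads from writes, since two steps that only read the same column do not depend on each other.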
Keywords
data cleaning workflows, automatic module detection