On Efficiently Processing Workflow Provenance Queries In Spark

2019 39TH IEEE INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS (ICDCS 2019)(2019)

Abstract
In this paper, we examine how the Spark platform can be leveraged to efficiently process fine-grained provenance queries over large volumes of workflow provenance data. Simple Spark solutions based on recursive querying incur a large data-scanning cost and hence do not perform well. We propose a novel provenance framework engineered to quickly identify a small volume of data containing the entire lineage of the queried data item. This small volume of data is then processed recursively to determine the provenance of the queried data item. We study the effectiveness of the proposed framework on a provenance trace obtained from a financial-domain text curation workflow and report our observations. We show that the proposed framework easily outperforms the naive approaches.
Keywords
Workflow provenance, graph partitioning, workflow entity dependency graph, weakly connected components, weakly connected sets
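
The two-step idea in the abstract (first narrow the data to a small portion that contains the whole lineage, then recurse only over that portion) can be illustrated with a minimal single-machine sketch. This is not the authors' Spark implementation: the function names, the (parent, child) edge representation, and the use of union-find to label weakly connected components are all assumptions made for illustration, and the paper's weakly connected sets and graph partitioning scheme are not described in enough detail here to reproduce.

```python
from collections import defaultdict


def find(parent, x):
    """Return the representative of x's set (union-find with path halving)."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def weakly_connected_components(edges):
    """Map each node to a component label, ignoring edge direction."""
    parent = {}
    for src, dst in edges:
        parent.setdefault(src, src)
        parent.setdefault(dst, dst)
        root_src, root_dst = find(parent, src), find(parent, dst)
        if root_src != root_dst:
            parent[root_src] = root_dst
    return {node: find(parent, node) for node in parent}


def lineage(edges, item):
    """Return all ancestors of item; edges are hypothetical (parent, child) pairs."""
    component = weakly_connected_components(edges)
    if item not in component:
        return set()
    target = component[item]

    # Step 1: keep only the edges in the queried item's component.
    parents_of = defaultdict(set)
    for src, dst in edges:
        if component[dst] == target:
            parents_of[dst].add(src)

    # Step 2: walk backwards over the reduced edge set only.
    seen, stack = set(), [item]
    while stack:
        node = stack.pop()
        for p in parents_of[node]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


# Made-up example data: two independent derivation chains.
edges = [
    ("raw1", "clean1"), ("raw1b", "clean1"), ("clean1", "report1"),
    ("raw2", "clean2"), ("clean2", "report2"),
]
ancestors = lineage(edges, "report1")
```

In a distributed setting the component labels would presumably be precomputed once and used as a filter or partition key, so that a query scans only the matching portion of the provenance data instead of the full trace; that mapping onto Spark is an assumption, not something the abstract specifies.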