On Efficiently Processing Workflow Provenance Queries In Spark
2019 39th IEEE International Conference on Distributed Computing Systems (ICDCS 2019)
Abstract
In this paper, we examine how the Spark platform can be leveraged to efficiently process fine-grained provenance queries on large volumes of workflow provenance data. Naive Spark solutions based on recursive querying incur a high data-scanning cost and hence perform poorly. We propose a novel provenance framework engineered to quickly identify a small volume of data that contains the entire lineage of the queried data item. This small volume of data is then processed recursively to determine the provenance of the queried data item. We study the effectiveness of the proposed framework on a provenance trace obtained from a financial-domain text curation workflow and report our observations. We show that the proposed framework easily outperforms the naive approaches.
Keywords
Workflow provenance, graph partitioning, workflow entity dependency graph, weakly connected components, weakly connected sets
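The two-step idea described in the abstract (first narrow the data to a small subset guaranteed to contain the full lineage, then resolve the lineage recursively within that subset) can be illustrated with a minimal in-memory sketch. This is not the paper's implementation: it assumes the narrowing step uses weakly connected components of the dependency graph, as the keywords suggest, and it runs in plain Python rather than on Spark. The function names, the `(source, derived)` edge format, and the sample item names are all hypothetical.

```python
from collections import defaultdict


def weakly_connected_components(edges):
    """Map every node to a component label, ignoring edge direction (union-find)."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for src, dst in edges:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            parent[root_src] = root_dst
    return {node: find(node) for node in list(parent)}


def lineage(edges, item):
    """Return every upstream ancestor of `item` (assumes an acyclic graph).

    Edges are hypothetical (source, derived) pairs meaning
    "derived was produced from source".
    """
    component = weakly_connected_components(edges)
    if item not in component:
        return set()
    target = component[item]

    # Step 1: keep only the edges in the queried item's component.
    parents_of = defaultdict(set)
    for src, dst in edges:
        if component[dst] == target:
            parents_of[dst].add(src)

    # Step 2: walk upstream inside that small subset.
    ancestors = set()
    stack = [item]
    while stack:
        node = stack.pop()
        for source in parents_of[node]:
            if source not in ancestors:
                ancestors.add(source)
                stack.append(source)
    return ancestors


# Hypothetical trace: two independent derivation chains.
EDGES = [
    ("raw_a", "clean_a"),
    ("clean_a", "report_a"),
    ("raw_b", "clean_b"),
]
ANCESTORS_OF_REPORT_A = lineage(EDGES, "report_a")
```

In a real deployment the component labels would presumably be computed once and stored with the provenance records, so that a query only filters by label instead of recomputing components as this sketch does on every call.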