Making It Tractable to Catch Duplicates and Conflicts in Graphs.

Proc. ACM Manag. Data(2023)

引用 0|浏览23
暂无评分
摘要
This paper proposes an approach for entity resolution (ER) and conflict resolution (CR) in large-scale graphs. It is based on a class of Graph Cleaning Rules (GCRs), which support the primitives of relational data cleaning rules, and may embed machine learning classifiers as predicates. As opposed to previous graph rules, GCRs are defined with a dual graph pattern to accommodate irregular structures of schemaless graphs, and adopt patterns of a star form to reduce the complexity. We show that the satisfiability, implication and validation problems are all in polynomial time (PTIME) for GCRs, as opposed to the intractability of these classical problems for previous graph dependencies. We develop a parallel algorithm to discover GCRs by combining the generations of patterns and predicates, and a parallel PTIME algorithm for "deep" ER and CR by recursively applying the mined GCRs. We show that these algorithms guarantee to reduce runtime when more processors are used. Using real-life and synthetic graphs, we experimentally verify that rule discovery and error detection with GCRs are substantially faster than with previous graph dependencies, with improved accuracy.
更多
查看译文
关键词
graphs,duplicates,conflicts
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要