Minimal Edit-Based Diffs for Large Trees

CIKM '20: The 29th ACM International Conference on Information and Knowledge Management Virtual Event Ireland October, 2020(2020)

引用 4|浏览12
暂无评分
摘要
Hierarchically structured data are commonly represented as trees and have given rise to popular data formats like XML or JSON. An interesting query computes the difference between two versions of a tree, expressed as the minimum set of node edits (deletion, insertion, label rename) that transform one tree into another, commonly known as the tree edit distance. Unfortunately, the fastest tree edit distance algorithms run in cubic time and quadratic space and are therefore not feasible for large inputs. In this paper, we leverage the fact that the difference between two versions of a tree is typically much smaller than the overall tree size. We propose a new tree edit distance algorithm that is linear in the tree size for similar trees. Our algorithm is based on the new concept of top node pairs and avoids redundant distance computations, the main issue with previous solutions for tree diffs. We empirically evaluate the runtime of our algorithm on large synthetic and real-world trees; our algorithm clearly outperforms the state of the art, often by orders of magnitude.
更多
查看译文
关键词
tree edit distance, minimal tree difference, tree-structured data, similarity search
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要