OmniMatch: Effective Self-Supervised Any-Join Discovery in Tabular Data Repositories
arxiv(2024)
Abstract
How can we discover join relationships among columns of tabular data in a
data repository? Can this be done effectively when metadata is missing?
Traditional column matching approaches rely mainly on similarity measures based
on exact value overlaps, and so miss important semantics or fail to handle
noise in the data. At the same time, recent dataset discovery methods that focus
on deep table representation learning do not take into
consideration the rich set of column similarity signals found in prior matching
and discovery methods. Finally, existing methods depend heavily on
user-provided similarity thresholds, hindering their deployability in
real-world settings. In this paper, we propose OmniMatch, a novel join
discovery technique that detects equi-joins and fuzzy-joins between columns by
combining column-pair similarity measures with Graph Neural Networks (GNNs).
OmniMatch's GNN can capture column relatedness leveraging graph transitivity,
significantly improving the recall of join discovery tasks. At the same time,
OmniMatch also increases the precision by augmenting its training data with
negative column join examples through an automated negative example generation
process. Most importantly, compared to the state-of-the-art matching and
discovery methods, OmniMatch exhibits up to 14% higher effectiveness in F1
score and AUC without relying on metadata or user-provided thresholds for each
similarity metric.
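
To make the idea of column-pair similarity signals and graph transitivity concrete, the following is a minimal sketch and not OmniMatch's actual pipeline. It assumes a single similarity signal (Jaccard overlap of value sets) and uses a simple shared-neighbor rule as a stand-in for what a trained GNN might learn; the function names (`jaccard`, `similarity_edges`, `two_hop_candidates`) and the toy column names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two value collections, compared as sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def similarity_edges(columns):
    """Build weighted edges (col_a, col_b, score) for every column pair
    with non-zero value overlap. No user-provided threshold is applied."""
    edges = []
    for (name_a, vals_a), (name_b, vals_b) in combinations(columns.items(), 2):
        score = jaccard(vals_a, vals_b)
        if score > 0:
            edges.append((name_a, name_b, score))
    return edges


def two_hop_candidates(edges):
    """Column pairs with no direct edge that share a common neighbor.
    This hand-written rule only illustrates transitivity; it is an
    assumption of this sketch, not the paper's GNN."""
    neighbors = {}
    for a, b, _ in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    direct = {frozenset((a, b)) for a, b, _ in edges}
    candidates = set()
    for nbrs in neighbors.values():
        for x, y in combinations(sorted(nbrs), 2):
            if frozenset((x, y)) not in direct:
                candidates.add((x, y))
    return sorted(candidates)


# Toy repository: three columns from three hypothetical tables.
columns = {
    "orders.customer_id": ["c1", "c2", "c3", "c4"],
    "customers.id": ["c3", "c4", "c5", "c6"],
    "invoices.cust": ["c5", "c6", "c7", "c8"],
}

edges = similarity_edges(columns)
candidates = two_hop_candidates(edges)
```

In this toy data, `orders.customer_id` and `invoices.cust` share no values, so an overlap-only matcher would never relate them, yet both overlap with `customers.id`. The shared-neighbor rule surfaces that pair as a candidate, which is the kind of transitive relatedness the abstract credits with improving recall.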