WarpGate: A Semantic Join Discovery System for Cloud Data Warehouses

arXiv (2023)

Abstract
Data discovery is a major challenge in enterprise data analysis: users often struggle to find data relevant to their analysis goals or even to navigate through data across data sources, each of which may easily contain thousands of tables. One common user need is to discover tables joinable with a given table. This need is particularly critical because join is a ubiquitous operation in data analysis, and join paths are mostly obscure to users, especially across databases. Furthermore, users are typically interested in finding "semantically" joinable tables, that is, tables whose columns can be transformed to become joinable even if they are not joinable as currently represented in the data store. We present WarpGate, a system prototype for data discovery over cloud data warehouses. WarpGate implements an embedding-based solution to semantic join discovery, which encodes columns into a high-dimensional vector space such that joinable columns map to points that are near each other. Through experiments on several table corpora, we show that WarpGate (i) captures semantic relationships between tables, especially those across databases, and (ii) is sample efficient and thus scalable to very large tables of millions of rows. We also showcase an application of WarpGate within an enterprise product for cloud data analytics.
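The embedding idea in the abstract can be illustrated with a minimal sketch. This is not WarpGate's encoder: it assumes a toy stand-in that hashes character trigrams of a small sample of column values into a fixed-size vector, then ranks candidate columns by cosine similarity. The names `embed_value`, `embed_column`, `rank_joinable`, the dimension `DIM`, and the sample size are all hypothetical choices made for illustration.

```python
import hashlib
import math
import random

DIM = 64  # embedding dimensionality; an arbitrary choice for this sketch


def embed_value(value, dim=DIM):
    """Hash the character trigrams of one cell value into a count vector."""
    vec = [0.0] * dim
    # Lowercase and strip so that simple formatting differences vanish.
    text = f"  {value.strip().lower()}  "
    for i in range(len(text) - 2):
        gram = text[i:i + 3]
        digest = hashlib.md5(gram.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    return vec


def embed_column(values, sample_size=100, seed=0):
    """Embed a column from a random sample of its values (unit-normalized)."""
    values = list(values)
    rng = random.Random(seed)
    if len(values) > sample_size:
        values = rng.sample(values, sample_size)
    col = [0.0] * DIM
    for v in values:
        for j, x in enumerate(embed_value(str(v))):
            col[j] += x
    norm = math.sqrt(sum(x * x for x in col)) or 1.0
    return [x / norm for x in col]


def cosine(a, b):
    """Cosine similarity of two unit-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_joinable(query_values, candidates):
    """Rank candidate columns (name -> values) by similarity to the query."""
    q = embed_column(query_values)
    scored = [(name, cosine(q, embed_column(vals)))
              for name, vals in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    query = ["New York", "Boston", "Chicago"]
    candidates = {
        "city_upper": ["NEW YORK", "BOSTON", "CHICAGO "],
        "price": ["12.50", "3.99", "100.00"],
    }
    for name, score in rank_joinable(query, candidates):
        print(f"{name}: {score:.3f}")
```

In this toy setup the `city_upper` column is not joinable with the query column as stored (case and trailing whitespace differ), yet it lands next to it in the vector space, which is the behavior the abstract describes. Embedding from a sample rather than the full column mirrors the sample-efficiency claim only loosely; a real system would use a learned embedding model and an approximate nearest-neighbor index.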
Keywords
semantic join discovery system,cloud,data