Amalur: Next-generation Data Integration in Data Lakes

Conference on Innovative Data Systems Research (CIDR), 2022

Abstract
Data science workflows require extracting, preparing, and integrating data from multiple data sources. For lack of proper tooling, this is a cumbersome process that hinders the productivity of data scientists. It is also slow: most of the time, data scientists prepare data in a data processing system or a data lake and export it as a table so that it can be consumed by a Machine Learning (ML) algorithm. Recent advances in factorized ML allow us to push down certain linear algebra (LA) operators and execute them closer to the data sources [2, 1]. At the same time, novel data exploration and discovery tools, as well as dataset relatedness and matching algorithms, have proliferated [6, 5]. In this work we argue that now is the right moment to revisit all the components of classic data integration (DI) systems and to see how they fit into modern data lakes that are meant to support LA as a first-class citizen. We first investigate how advances in factorized ML and modern data integration techniques influence and can benefit from each other, opening new research opportunities. We then describe Amalur: a reference architecture for a next-generation data lake system that facilitates linear algebra processing over heterogeneous sources. We propose a formal representation based on matrices, which connects to the schema mapping formalism in first-order logic [3, 4] and enables LA factorization over joinable or unionable data in a data lake. Finally, we outline future research challenges related to next-generation data lake systems.
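
To make the idea of pushing LA operators down to the sources concrete, the following is a minimal sketch of factorized matrix-vector products over a key/foreign-key join. It is not taken from the paper and does not reproduce Amalur's matrix representation; the table names `S` and `R`, the foreign-key array `fk`, and all function names are hypothetical, and NumPy is assumed to be available. The point it illustrates is only that `T @ w` and `T.T @ v` can be computed from the source tables without materializing the joined table `T`.

```python
import numpy as np

def materialized_join(S, R, fk):
    """Baseline: build the joined table T = [S | R[fk]] explicitly."""
    return np.hstack([S, R[fk]])

def factorized_matvec(S, R, fk, w):
    """Compute T @ w without building T.

    The weight vector is split by source table; the product with R is
    computed once per R row and then expanded via the foreign key.
    """
    d_s = S.shape[1]
    w_s, w_r = w[:d_s], w[d_s:]
    partial_r = R @ w_r              # one value per row of R
    return S @ w_s + partial_r[fk]   # expand to one value per row of S

def factorized_rmatvec(S, R, fk, v):
    """Compute T.T @ v without building T.

    Entries of v are first summed per referenced R row, so R is
    touched once per distinct row instead of once per joined row.
    """
    grouped = np.bincount(fk, weights=v, minlength=R.shape[0])
    return np.concatenate([S.T @ v, R.T @ grouped])

# Hypothetical toy data: 6 rows in S, each referencing one of 3 rows in R.
rng = np.random.default_rng(0)
S = rng.normal(size=(6, 2))
R = rng.normal(size=(3, 4))
fk = np.array([0, 2, 1, 0, 2, 2])
w = rng.normal(size=S.shape[1] + R.shape[1])
v = rng.normal(size=S.shape[0])

T = materialized_join(S, R, fk)
assert np.allclose(T @ w, factorized_matvec(S, R, fk, w))
assert np.allclose(T.T @ v, factorized_rmatvec(S, R, fk, v))
```

The saving comes from redundancy in the join: when many rows of `S` reference the same row of `R`, the factorized versions do the work on `R` once per distinct row. Unionable sources and heterogeneous schemas, which the abstract also mentions, are outside the scope of this sketch.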