Magellan: toward building ecosystems of entity matching solutions

Communications of the ACM (2020)

Abstract
Entity matching (EM) finds data instances that refer to the same real-world entity. In 2015, we started the Magellan project at UW-Madison, jointly with industrial partners, to build EM systems. Most current EM systems are stand-alone monoliths. In contrast, Magellan borrows ideas from the field of data science (DS) to build a new kind of EM system: an ecosystem of interoperable tools for multiple execution environments, such as on-premise, cloud, and mobile. This paper describes Magellan, focusing on the system aspects. We argue that EM can be viewed as a special class of DS problems and thus can benefit from system-building ideas in DS. We discuss how these ideas have been adapted to build PyMatcher, a set of sophisticated on-premise tools for power users, and CloudMatcher, a set of self-service cloud tools for lay users. These tools exploit techniques from the fields of machine learning, big data scaling, efficient user interaction, databases, and cloud systems. They have been successfully used in 13 companies and domain science groups, have been pushed into production for many customers, and are being commercialized. We discuss the lessons learned and explore applying the Magellan template to other tasks in data exploration, cleaning, and integration.
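
To make the EM task in the abstract concrete, the following is a minimal sketch in Python of matching records across two small tables. It is not the PyMatcher or CloudMatcher API: the function names, the word-level Jaccard similarity, the 0.5 threshold, and the sample records are all assumptions chosen for illustration only.

```python
# Toy entity matching: NOT Magellan's method, only an illustration of the task.
# All names, the similarity measure, and the threshold are hypothetical.

def tokens(text):
    """Lowercase a string and split it into a set of word tokens."""
    return set(text.lower().split())

def jaccard(a, b):
    """Word-level Jaccard similarity between two strings (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    if not union:
        return 0.0  # both strings are empty: treat as no evidence of a match
    return len(ta & tb) / len(union)

def match(table_a, table_b, threshold=0.5):
    """Return ID pairs whose record descriptions are similar enough.

    Compares every pair of records (quadratic), which is only practical
    for tiny tables like the ones in this sketch.
    """
    matches = []
    for id_a, desc_a in table_a:
        for id_b, desc_b in table_b:
            if jaccard(desc_a, desc_b) >= threshold:
                matches.append((id_a, id_b))
    return matches

# Hypothetical sample data: (record ID, description).
table_a = [("a1", "Dave Smith Madison WI"), ("a2", "Joe Wilson San Jose CA")]
table_b = [("b1", "David Smith Madison WI"), ("b2", "Daniel Wilson Palo Alto CA")]

if __name__ == "__main__":
    print(match(table_a, table_b))
```

In this sketch, records a1 and b1 share three of five distinct tokens and are reported as a match, while a2 and b2 share too few tokens to pass the assumed threshold.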