Record Linkage: A Machine Learning Approach, A Toolbox, and A Digital Government Web Service

msra(2003)

Abstract
Data cleaning is a vital process that ensures the quality of data stored in real-world databases. Data cleaning problems are frequently encountered in many research areas, such as knowledge discovery in databases, data warehousing, system integration, and e-services. The process of identifying the record pairs that represent the same entity (duplicate records), commonly known as record linkage, is one of the essential elements of data cleaning. In this paper, we address the record linkage problem by adopting a machine learning approach. Three models are proposed and analyzed empirically. Since no existing model, including those proposed in this paper, has been proved to be superior, we have developed an interactive Record Linkage Toolbox named TAILOR. Users of TAILOR can build their own record linkage models by tuning system parameters and by plugging in in-house developed and public domain tools. The proposed toolbox serves as a framework for the record linkage process, and is designed in an extensible way to interface with existing and future record linkage models. We have conducted an extensive experimental study to evaluate our proposed models using not only synthetic but also real data. Results show that the proposed machine learning record linkage models outperform the existing ones both in accuracy and in performance. As a practical case study, we have incorporated the toolbox as a web service in a digital government web application. Digital government serves as an emerging area for database research, while web services are considered a very suitable approach that meets the needs of governmental services.
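To make the record linkage task concrete, the following is a minimal sketch of pairwise record comparison. It is not one of the paper's three models and it is not TAILOR's interface. The toy records, the field names, the `similarity` and `link_records` helpers, and the fixed threshold are all assumptions chosen for illustration. In a machine learning approach, the threshold rule would typically be replaced by a classifier trained on labeled record pairs.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records; the fields are illustrative assumptions.
RECORDS = [
    {"id": 1, "name": "John Smith", "city": "Seattle"},
    {"id": 2, "name": "Jon Smith", "city": "Seattle"},
    {"id": 3, "name": "Mary Jones", "city": "Boston"},
]
FIELDS = ("name", "city")


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def comparison_vector(r1: dict, r2: dict) -> list:
    """One similarity score per compared field."""
    return [similarity(r1[f], r2[f]) for f in FIELDS]


def link_records(records: list, threshold: float = 0.85) -> list:
    """Return id pairs whose mean field similarity reaches the threshold.

    The fixed threshold is a stand-in for a learned decision model.
    """
    matches = []
    for r1, r2 in combinations(records, 2):
        vec = comparison_vector(r1, r2)
        if sum(vec) / len(vec) >= threshold:
            matches.append((r1["id"], r2["id"]))
    return matches


if __name__ == "__main__":
    print(link_records(RECORDS))
```

Comparing every pair is quadratic in the number of records, so this sketch only suits small inputs.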
Keywords
system integration,web service,data warehousing,public domain,data cleaning,machine learning,record linkage