Optimal Starting Parameters for Unsupervised Data Clustering and Cleaning in the Data Washing Machine

Kenneth Anderson, John R. Talburt, Nicholas Kofi Akortia Hagan, Thomas Zimmerman, Dianne Hagan

Lecture Notes in Networks and Systems (2023)

Abstract
The Data Washing Machine (DWM) is an open-source Python Jupyter Notebook project developed as a proof of concept for unsupervised Data Curation. The DWM addresses a particular use case: multiple sources holding the same information about entities such as customers, patients, or products. It ingests references to these entities without prior data cleansing or metadata alignment and clusters equivalent references to the same entity. The DWM requires several starting parameters to tokenize, cleanse, organize, link, and cluster equivalent references. Although the DWM has produced clustering results similar to those obtained by supervised entity resolution (ER) systems acting on the same dataset, the optimal parameter settings could previously be found only through a grid search guided by the ground truth (truth set). For the DWM to have a practical application, there must be an unsupervised process for setting the optimal starting parameters without access to a truth set. This paper describes an unsupervised Parameter Discovery Process (PDP) that finds 14 optimal starting parameters for a given dataset processed by the DWM. The PDP uses statistics from an input dataset to drive a combination of historical settings, regression formulas, and entropy-guided grid search to determine the optimal DWM starting parameters. Reference Sample runs, with and without a truth set, demonstrate the capability of the PDP to discover optimal starting parameters for the Data Washing Machine and its unsupervised Data Curation process. Mean Square Error (MSE) calculations are used to validate the overall quality of the PDP and DWM System Model.
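
The entropy-guided grid search mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the PDP itself: the abstract does not specify the entropy measure, the 14 parameters, or the search ranges, so everything below is an assumption made for illustration. The names `mu`, `threshold_cluster`, `cluster_score`, and `grid_search` are hypothetical, the clustering step is a toy stand-in for the DWM pipeline, and the score is a simple size-weighted token entropy computed without any truth set.

```python
import math
from collections import Counter
from itertools import product


def shannon_entropy(items):
    """Shannon entropy in bits of the empirical distribution of `items`."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def cluster_score(clusters):
    """Hypothetical quality score: size-weighted mean token entropy per cluster.

    Lower means the references inside each cluster share more tokens.
    """
    total_refs = sum(len(c) for c in clusters)
    return sum(
        len(c) / total_refs * shannon_entropy(tok for ref in c for tok in ref.split())
        for c in clusters
    )


def jaccard(a, b):
    """Jaccard similarity between the token sets of two reference strings."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def threshold_cluster(references, mu):
    """Toy stand-in for the DWM pipeline (assumption, not the real DWM).

    Each reference joins the first cluster that already holds a reference
    with Jaccard similarity >= mu; otherwise it starts a new cluster.
    """
    clusters = []
    for ref in references:
        for cluster in clusters:
            if any(jaccard(ref, other) >= mu for other in cluster):
                cluster.append(ref)
                break
        else:
            clusters.append([ref])
    return clusters


def grid_search(references, grid, cluster_fn, score_fn=cluster_score):
    """Try every parameter combination; the lowest score wins (first on ties)."""
    names = sorted(grid)
    best_params, best_score = None, math.inf
    for values in product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(cluster_fn(references, **params))
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

A call such as `grid_search(refs, {"mu": [0.0, 0.5, 1.0]}, threshold_cluster)` returns the first parameter setting with the lowest score. Note that this toy score does not penalize over-splitting (singleton clusters score as well as correct ones), so a practical criterion would need an additional term; the sketch only shows how an entropy statistic can replace a truth set as the search objective.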
Keywords
data washing machine,unsupervised data clustering,cleaning,optimal starting parameters