HoloCleanX: A Multi-source Heterogeneous Data Cleaning Solution Based on Lakehouse.

International Conference on Health Information Science (HIS), 2022

Abstract
Lakehouse architectures have effectively addressed the storage of multi-source heterogeneous data, but existing systems offer no universal and effective solution for cleaning it. Building on the Lakehouse MHDP, this paper proposes an interactive cleaning scheme for multi-source heterogeneous data based on denial constraints (DCs). First, we optimize HoloClean to achieve better results on small datasets, improving F1 by at least 5%, and we propose algorithms that parse various types of data and effectively reconstruct them. Second, we implement an interactive system with real-time feedback that extracts and visualizes basic metadata and lets users take part in the cleaning work by building DCs. Finally, the cleaned data is saved in the original data format without removing the original data. The experimental results show that our solution cleans multi-source heterogeneous data effectively, with both high accuracy and ease of use.
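To make the notion of a denial constraint concrete, here is a minimal sketch of DC violation detection. It is not the paper's implementation or HoloClean's API: the function `find_violations`, the predicate encoding, and the sample rows are all hypothetical and chosen only for illustration. A DC states that no pair of tuples may satisfy all of its predicates at once; the example encodes "two rows with the same zip must not have different cities".

```python
from itertools import combinations
import operator

# Hypothetical predicate encoding: (attribute of t1, operator, attribute of t2).
OPS = {"==": operator.eq, "!=": operator.ne, "<": operator.lt, ">": operator.gt}


def find_violations(rows, predicates):
    """Return index pairs (i, j), i < j, for which every predicate holds.

    A denial constraint forbids any tuple pair from satisfying all of its
    predicates together, so each returned pair is a violation.
    Assumption: the predicates are symmetric in t1 and t2, so checking
    unordered pairs is enough.
    """
    violations = []
    for (i, t1), (j, t2) in combinations(enumerate(rows), 2):
        if all(OPS[op](t1[a], t2[b]) for a, op, b in predicates):
            violations.append((i, j))
    return violations


# Illustrative data, not taken from the paper.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "Newark"},
    {"zip": "60601", "city": "Chicago"},
]

# DC: not (t1.zip == t2.zip and t1.city != t2.city)
dc = [("zip", "==", "zip"), ("city", "!=", "city")]
violations = find_violations(rows, dc)
```

In this sketch the first two rows share a zip but disagree on the city, so they form the only violating pair. A cleaning system would then decide which cell to repair; that step is outside the scope of this example.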
Keywords
Lakehouse, HoloCleanX, multi-source