Detecting Near Duplicate Dataset

PROCEEDINGS OF THE 13TH INTERNATIONAL CONFERENCE ON SOFT COMPUTING AND PATTERN RECOGNITION (SOCPAR 2021) (2022)

Abstract
This paper introduces the concept of a near-duplicate dataset: a quasi-duplicate version of a dataset that has undergone an unknown number of row and column insertions and deletions (modifications to both schema and instance). This concept is of interest for data exploration, data integration and data quality. To formalise these insertions and deletions, two parameters are introduced. Our technique for detecting these quasi-duplicate datasets is based on feature extraction and machine learning. To train the classifiers, we introduce a method for artificially generating training data. We perform several experiments to determine the best parameters for creating training data and to compare the performance of several classifiers. In the studied cases, these experiments yield an accuracy above 95%.
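To make the idea of artificially generated training data concrete, the following is a minimal sketch and not the authors' implementation. It assumes a dataset is a list of column names plus a list of rows, and that the two parameters are a row modification rate and a column modification rate; the names `make_near_duplicate`, `row_rate`, `col_rate` and `column_overlap` are hypothetical, and the abstract does not say how the paper defines its parameters or features.

```python
import random


def make_near_duplicate(columns, rows, row_rate, col_rate, seed=0):
    """Return (columns, rows) after random deletions and insertions.

    row_rate and col_rate are hypothetical parameters: the fraction of
    rows / columns that is deleted and then replaced by synthetic ones.
    """
    rng = random.Random(seed)

    # Schema modification: drop some columns, then append synthetic ones.
    n_col = int(len(columns) * col_rate)
    kept_cols = sorted(rng.sample(range(len(columns)), len(columns) - n_col))
    new_columns = [columns[i] for i in kept_cols]
    new_columns += [f"extra_{j}" for j in range(n_col)]

    # Instance modification: drop some rows, then append synthetic ones.
    n_row = int(len(rows) * row_rate)
    kept_rows = sorted(rng.sample(range(len(rows)), len(rows) - n_row))
    new_rows = []
    for r in kept_rows:
        surviving = [rows[r][i] for i in kept_cols]
        filler = [rng.random() for _ in range(n_col)]
        new_rows.append(surviving + filler)
    for _ in range(n_row):
        new_rows.append([rng.random() for _ in new_columns])

    return new_columns, new_rows


def column_overlap(cols_a, cols_b):
    """Jaccard similarity of two column-name sets (one illustrative feature)."""
    a, b = set(cols_a), set(cols_b)
    return len(a & b) / len(a | b) if a | b else 1.0
```

Pairs built this way could be labelled as near duplicates and pairs of unrelated datasets as negatives, with features such as `column_overlap` fed to a classifier. Which features and classifiers the paper actually uses is not stated in the abstract beyond the keyword list.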
Keywords
Machine learning, Entity resolution, Record linkage, Data quality, Data integration, Data profiling, CatBoost