Empirical comparison of supervised learning techniques for missing value imputation

Knowledge and Information Systems(2022)

引用 9|浏览13
暂无评分
摘要
Many data mining algorithms cannot handle incomplete datasets where some data samples are missing attribute values. To solve this problem, missing value imputation is usually conducted and commonly based on reasoning from observed data or complete data to provide estimated replacements for missing values. In general, missing imputation methods can be classified into statistical and machine learning methods. The statistical methods are usually based on the mean for continuous attributes or mode for discrete attributes, whereas the machine learning methods are based on supervised learning techniques. However, which machine learning method performs optimally for missing value imputation is unknown. This paper compares five well-known supervised learning techniques, namely k-nearest neighbor, the multilayer perceptron neural network (MLP), the classification and regression tree (CART), naïve Bayes, and the support vector machine, to examine their imputation results for categorical, numerical, and mixed data types. The experimental results demonstrate that CART outperforms the other methods for categorical datasets, whereas the MLP is optimal for numerical and mixed datasets in terms of classification accuracy. However, when computational cost is a factor, CART is superior to the MLP because CART can provide reasonably accurate imputation results and requires the least amount of time to perform missing value imputation. Moreover, CART generates the lowest root-mean-squared error of all methods.
更多
查看译文
关键词
Missing value,Imputation,Supervised learning,Incomplete dataset,Data mining,Data preprocessing
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要