Feature selection and its combination with data over-sampling for multi-class imbalanced datasets

Applied Soft Computing (2024)

Abstract
Feature selection aims at filtering out unrepresentative features from a given dataset in order to construct more effective learning models. Furthermore, ensemble feature selection, which combines multiple feature selection methods, has been shown to outperform single feature selection. However, the performances of different (ensemble) feature selection methods have not been fully examined on multi-class imbalanced datasets. On the other hand, for class imbalanced datasets, one widely considered solution is to re-balance the datasets by data over-sampling, which generates synthetic examples for the minority classes. However, the effect of performing (ensemble) feature selection on over-sampled multi-class imbalanced datasets has not been investigated. Therefore, the first research objective is to examine the performances of single and ensemble feature selection methods, using fifteen well-known filter, wrapper, and embedded algorithms, in terms of classification accuracy. For the second research objective, the two possible orders of combining the feature selection and over-sampling steps are compared in order to identify the best combination procedure as well as the best combined algorithms. The experimental results, based on ten datasets from different domains with low to very high feature dimensions, show that ensemble feature selection methods perform slightly better than single ones, although the differences are small. When combined with the Synthetic Minority Over-sampling Technique (SMOTE), performing feature selection first and over-sampling second outperforms the reverse procedure. Although the best combined algorithms are based on ensemble feature selection, eXtreme Gradient Boosting (XGBoost), as the single best feature selection algorithm, combined with SMOTE provides classification performance very similar to the best combined algorithms. Considering both classification performance and computational cost, the optimal solution is the combination of XGBoost and SMOTE.
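The abstract does not give implementation details, but the recommended procedure (XGBoost-based feature selection first, SMOTE over-sampling second, then training the final classifier) can be illustrated with a minimal sketch. The sketch below assumes the scikit-learn, imbalanced-learn, and xgboost libraries; the synthetic dataset, the median importance threshold, and all hyperparameters are illustrative choices, not values from the paper.

```python
# Minimal sketch: feature selection first (XGBoost importances), SMOTE second.
# All data and parameters below are illustrative assumptions, not from the paper.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import balanced_accuracy_score
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

# Illustrative multi-class imbalanced dataset.
X, y = make_classification(
    n_samples=2000, n_features=50, n_informative=10,
    n_classes=4, weights=[0.6, 0.25, 0.1, 0.05], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42
)

# Step 1: feature selection on the original (still imbalanced) training data,
# ranking features by XGBoost feature importances.
selector = SelectFromModel(
    XGBClassifier(n_estimators=200, random_state=42), threshold="median"
)
X_train_fs = selector.fit_transform(X_train, y_train)
X_test_fs = selector.transform(X_test)

# Step 2: over-sample the minority classes with SMOTE on the reduced feature set.
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train_fs, y_train)

# Train the final classifier on the feature-selected, re-balanced data.
clf = XGBClassifier(n_estimators=200, random_state=42)
clf.fit(X_train_bal, y_train_bal)
print("Balanced accuracy:", balanced_accuracy_score(y_test, clf.predict(X_test_fs)))
```

Note that the selector is fitted before re-balancing, which reflects the paper's finding that the feature-selection-first order works better with SMOTE; reversing the two steps would simply swap the order of the fit_transform and fit_resample calls.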
Keywords
Feature selection,Ensemble feature selection,Machine learning,Class imbalance learning,Over-sampling