Sample Size Estimation for Effective Modelling of Classification Problems in Machine Learning

Neha Vinayak, Shandar Ahmad

Advanced Network Technologies and Intelligent Computing, Communications in Computer and Information Science (2023)

Abstract
High-quality data in sufficient quantity are fundamental to developing any machine learning model. Without a prior estimate of the optimal amount of data needed to model a specific system, data collection either yields too little for effective training or too much, wasting critical resources. Here we examine this issue on several publicly available low-dimensional data sets by developing models with progressively larger data subsets and monitoring their predictive performance, employing random forest as the classification model in each case. We provide an initial estimate of the optimum data size requirement for a given feature set size using the Random Forest Classifier. This sample size is also suggested for other machine learning (ML) models, subject to their trainability on the dataset. We further investigate how data quality affects the size requirement by introducing incremental noise into the original class labels. We observe that the optimal data size remains robust for up to 2% class-label errors, suggesting that ML models can pick out the most informative data instances as long as there is a sufficient number of objects to learn from.
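The learning-curve procedure described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' exact pipeline: the synthetic dataset, subset fractions, and hyperparameters are assumptions for demonstration; only the use of a Random Forest classifier, progressively larger training subsets, and the 2% label-noise probe come from the abstract.

```python
# Sketch: estimate sample-size sufficiency by training a Random Forest on
# progressively larger training subsets and tracking held-out accuracy,
# after flipping 2% of class labels to simulate incremental label noise.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Illustrative low-dimensional binary classification data (an assumption;
# the paper uses publicly available data sets).
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Corrupt a small fraction of class labels (the noise level probed above).
noise_rate = 0.02  # 2% class-label errors
flip = rng.random(len(y)) < noise_rate
y_noisy = np.where(flip, 1 - y, y)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y_noisy, test_size=0.25, random_state=0, stratify=y_noisy)

# Train on progressively larger subsets and record test accuracy;
# the learning curve flattening suggests an adequate sample size.
scores = {}
for frac in (0.1, 0.2, 0.4, 0.6, 0.8, 1.0):
    n = int(frac * len(X_tr))
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_tr[:n], y_tr[:n])
    scores[n] = accuracy_score(y_te, clf.predict(X_te))

for n, acc in sorted(scores.items()):
    print(f"training size {n:4d}: accuracy {acc:.3f}")
```

In practice one would repeat each subset size over several random draws and declare the sample size sufficient once the mean accuracy plateaus within some tolerance.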
Keywords
Sample size estimation,Training data size,Machine learning models,Learning curve,Random forest,Noisy data