Critical Feature Selection and Critical Sampling for Data Mining

Communications in Computer and Information Science (2018)

Abstract
The rapidly growing volume of big data generated by connected sensors, devices, the web, and social network platforms has stimulated the advancement of data science, which holds tremendous potential for problem solving in various domains. How to properly utilize the data in model building to obtain accurate analytics and knowledge discovery is a topic of great importance in data mining, and two issues arise from it: how to select a critical subset of features, and how to select a critical subset of data points for sampling. This paper presents ongoing research that suggests: 1. the critical feature dimension problem is theoretically intractable, but simple heuristic methods may well be sufficient for practical purposes; 2. there are big data analytic problems where evidence suggests that the success of data mining depends more on the critical feature dimension than on the specific features selected, so a random selection of features based on the dataset's critical feature dimension will prove sufficient; and 3. the problem of critical sampling has the same intractable complexity as the critical feature dimension problem, but again simple heuristic methods may well be practicable in most applications; experimental results with several versions of the heuristic method are presented and discussed. Finally, a set of metrics for data quality is proposed based on the concepts of critical features and critical sampling.
Keywords
Data mining, Critical feature selection, Critical sampling, Data quality