Big data decision tree for continuous-valued attributes based on unbalanced cut points

Shixiang Ma,Junhai Zhai

J. Big Data(2023)

引用 0|浏览0
暂无评分
摘要
The decision tree is a widely used decision support model, which can quickly mine effective decision rules based on the dataset. The decision tree induction algorithm for continuous-valued attributes, based on unbalanced cut points, is efficient for mining decision rules; however, extending it to big data remains an unresolved. In this paper, two solutions are proposed to solve this problem: the first one is based on partitioning instance subsets, whereas the second one uses partitioning attribute subsets. The crucial of these two solutions is how to find the global optimal cut point from the set of local optimal cut points. For the first solution, the calculation of the Gini index of the cut points between computing nodes and the selection of the global optimal cut point by communication between these computing nodes is proposed. However, in the second solution, the division of the big data into subsets using attribute subsets in a way that all cut points of an attribute are on the same map node is proposed, the local optimal cut points can be found in this map node, then the global optimal cut point can be obtained by summarizing all local optimal cut points in the reduce node. Finally, the proposed solutions are implemented with two big data platforms, Hadoop and Spark, and compared with three related algorithms on four datasets. Experimental results show that the proposed algorithms can not only effectively solve the scalability problem, but also have lowest running time, the fastest speed and the highest efficiency under the premise of preserving the classification performance.
更多
查看译文
关键词
Big data, Decision tree, Cut point, Hadoop, Spark
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要