Speeding Up Unsupervised Learning Through Early Stopping.

Andris Docaj, Yu Zhuang

2023 IEEE International Conference on Big Data (BigData)

Abstract
Ever-increasing amounts of data make labeling of datasets uneconomic. Unsupervised learning, specifically the K-Means family of clustering algorithms, provides a pragmatic way of dealing with sizable quantities of unlabeled data. The bisecting K-Means algorithm with refinement has been shown to outperform regular K-Means in both clustering speed and quality. Past works on bisecting K-Means revealed that limiting the iterations of the K-Means algorithm is a promising approach to speeding up computation while maintaining clustering quality as measured by the sum of squared errors (SSE). Some prior works employed the centroid difference as an iteration-stopping condition for bisecting K-Means or for variants of the fuzzy C-Means algorithm. In this paper, we show that the centroid difference is dataset-dependent and leads to unreliable clustering quality. We propose a single-parameter criterion for stopping the K-Means iterations. The criterion's parameter is based on SSE, and its selection controls the desired final clustering quality. Moreover, our method is easy to implement and avoids the dataset-dependent issues of the centroid-difference-based stopping condition. To examine the effectiveness of our method, we performed experiments under diverse parameter values and on multiple datasets, testing the effects of our parameter on clustering speed and quality. All experiments used the no-membership-change condition as the benchmark stopping criterion. The early-stopping results reveal speed-up factors of over 200, with an increase in SSE of only about three percent compared with the benchmark. In addition, we applied early stopping to the automatic determination of the optimal number of clusters using the Calinski-Harabasz and Ray-Turi indices. The results reveal that optimal clusterings can be obtained with substantial speed-up factors and minimal loss of cluster quality. Our experimental findings make the early-stopping criterion a promising approach due to its computational savings, preservation of clustering quality, ease of use, and broad dataset applicability. Sample code associated with this paper can be found at https://github.com/adocaj/code.
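To make the idea concrete, the following is a minimal sketch of Lloyd's K-Means with an SSE-based early-stopping rule: iterations halt when the relative decrease in SSE drops below a threshold `eps`. The parameter name `eps`, the relative-improvement form of the test, and the initialization are illustrative assumptions; the paper's exact criterion may differ.

```python
import numpy as np

def kmeans_early_stop(X, k, eps=1e-3, max_iter=300, seed=0):
    """Lloyd's K-Means with an SSE-based early stop.

    Iterations stop when the relative SSE improvement falls below
    `eps` (a hypothetical single parameter in the spirit of the
    paper's criterion, not necessarily its exact formulation).
    """
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    prev_sse = np.inf
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        sse = d2[np.arange(len(X)), labels].sum()
        # Early stop: relative SSE improvement is below eps.
        if sse > (1.0 - eps) * prev_sse:
            break
        prev_sse = sse
        # Update centroids (keep the old one if a cluster empties).
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centroids[j] = pts.mean(axis=0)
    return centroids, labels, sse
```

A larger `eps` trades clustering quality (higher final SSE) for fewer iterations, which is the speed/quality trade-off the single parameter is meant to control.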
Keywords
high dimensional data, big datasets, k-means, optimization, early stopping, machine learning, applied computing, chemistry, evaluation of retrieval results, search, clustering, large p, small n, random projections