# Smart-Sample: An Efficient Algorithm for Clustering Large High-Dimensional Datasets

msra (2014)

Abstract

Finding useful related patterns in a dataset is an important task in many applications. In particular, many algorithms need to separate a given dataset into a small number of clusters, where each cluster is a subset of data points that are considered similar. In some cases, it is also necessary to distinguish data points that are not part of any pattern from the other data points. This paper introduces a new data clustering method named smart-sample and compares its performance to several clustering methodologies. We show that smart-sample successfully clusters large high-dimensional datasets. In addition, smart-sample outperforms the other methodologies in terms of running time. A variation of the smart-sample algorithm, which guarantees efficiency in terms of I/O, is also presented. We describe how to achieve an approximation of the in-memory smart-sample algorithm using a constant number of scans with a single sort operation on the disk.
