The effectiveness of Lloyd-type methods for the k-means problem

In Proc. 47th IEEE Symposium on Foundations of Computer Science (FOCS), Berkeley, CA, 2006: 165-176

Cited by: 485 | Views: 129

Abstract

We investigate variants of Lloyd's heuristic for clustering high-dimensional data in an attempt to explain its popularity (a half century after its introduction) among practitioners, and in order to suggest improvements in its application. We propose and justify a clusterability criterion for data sets. We present variants of Lloyd's heuristic […]


Introduction
  • Consider the following two-step sampling procedure: (a) first pick center c1 by choosing a point x ∈ X with probability proportional to the total squared distance from x to the other points of X; (b) then pick center c2 by choosing a point y ∈ X with probability proportional to ‖x − y‖². Equivalently, the pair (x, y) is chosen with probability proportional to ‖x − y‖² (a sketch of this draw follows this list).
  • This allows them to show in Lemma 3.2 that, with high probability, each initial center ci lies in the core of a distinct optimal cluster, say Xi, and that ‖c1 − c2‖ is much larger than the distance from each ci to the optimal center of its cluster, for i = 1, 2.
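To make the two-step draw above concrete, here is a minimal NumPy sketch. It is only an illustration under the assumption stated above (the pair of seeds ends up chosen with probability proportional to its squared distance); the function name two_means_seed and the dense pairwise-distance matrix are illustrative choices, not the paper's code.

```python
import numpy as np

def two_means_seed(X, rng=None):
    """Two-step seeding sketch for 2-means: step (a) picks x with probability
    proportional to sum_y ||x - y||^2; step (b) picks y with probability
    proportional to ||x - y||^2, so the pair (x, y) is drawn with probability
    proportional to ||x - y||^2 overall."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=float)
    n = len(X)
    # All pairwise squared Euclidean distances (O(n^2 d) time/memory; fine for a sketch).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    # Step (a): pick the first seed.
    w1 = sq.sum(axis=1)
    i = rng.choice(n, p=w1 / w1.sum())
    # Step (b): pick the second seed relative to the first.
    w2 = sq[i]
    j = rng.choice(n, p=w2 / w2.sum())
    return X[i], X[j]
```

Since P(x) is proportional to the summed squared distances from x and P(y | x) is proportional to ‖x − y‖², the pair (x, y) is indeed drawn with probability proportional to ‖x − y‖².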
Highlights
  • Practitioners instead continue to use a variety of heuristics (Lloyd, EM, agglomerative methods, etc.) that have no known performance guarantees
  • Researchers concerned with the runtime of Lloyd's method bemoan the need for n nearest-neighbor computations in each descent step [28]. Interestingly, the last reference provides a data structure that provably speeds up the nearest-neighbor calculations of Lloyd descent steps, under the condition that the optimal clusters are well separated (this is unrelated to providing performance guarantees for the outcome). Their data structure may be used in any Lloyd variant, including ours, and is well suited to the conditions under which we prove performance of our method; ironically, it may not be worthwhile to precompute their data structure, since our method requires so few descent steps.
  • Once we have the initial centers within the cores of the two optimal clusters, we show that a simple, easy-to-analyze Lloyd-like step yields a good performance guarantee: we consider a suitable ball around each center and move the center to the centroid of this ball to obtain the final centers (a sketch of this ball step follows this list).
  • We describe a linear-time constant-factor approximation algorithm, and a PTAS that returns a (1 + ω)-optimal solution in time linear in n and d but exponential in k.
  • Given k seed centers c1, …, ck located sufficiently close to the optimal centers after stage I, we use two procedures in stage II to obtain a near-optimal clustering: the ball k-means step, which yields a (1 + f(ε))-approximation algorithm, or the centroid estimation step, based on a sampling idea of Kumar et al. [30], which yields a PTAS with running time exponential in k.
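As a concrete illustration of the ball step described above, the following is a minimal sketch of one ball k-means step on Euclidean data. The radius used (one third of the distance to the nearest other seed) and the handling of empty balls are illustrative assumptions, not the paper's exact constants.

```python
import numpy as np

def ball_kmeans_step(X, seeds):
    """One ball k-means step (sketch): for each seed, gather the data points
    inside a small ball around it and move the seed to their centroid.
    The radius here (a third of the distance to the nearest other seed) is
    an illustrative constant."""
    X = np.asarray(X, dtype=float)
    seeds = np.asarray(seeds, dtype=float)
    new_centers = []
    for i, c in enumerate(seeds):
        others = np.delete(seeds, i, axis=0)
        radius = np.linalg.norm(others - c, axis=1).min() / 3.0
        in_ball = np.linalg.norm(X - c, axis=1) <= radius
        # Keep the seed unchanged if its ball happens to contain no data point.
        new_centers.append(X[in_ball].mean(axis=0) if in_ball.any() else c)
    return np.array(new_centers)
```

Under the separation assumption, each such ball contains only points of its own optimal cluster, which is why a single step of this kind already lands the centers close to the optimal ones.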
Results
  • In Section 4.1.1, the authors consider a natural generalization of the sampling procedure used for the 2-means case, and show that it picks the k initial centers from the cores of the optimal clusters (a sketch of this distance-squared sampling follows this list).
  • For the k-means problem, if ∆_k²(X) ≤ ε²∆_{k−1}²(X), the authors show that the greedy deletion procedure followed by a clean-up step yields a (1 + f(ε))-approximation algorithm. In Section 4.1.3 the authors combine the sampling and deletion procedures to obtain the initialization procedure used in stage I.
  • The authors sample O(k) centers, which ensures that every cluster has an initial center in a slightly expanded version of the core, and run the deletion procedure on an instance of size O(k) derived from the sampled points to obtain the k seed centers.
  • The authors show that under the separation assumption, the above sampling procedure picks the k initial centers to lie in the cores of the clusters X1, …, Xk.
  • Lemma 4.1 With probability 1 − O(ρ), the first two centers c1, c2 lie in the cores of different clusters; that is, Pr[x ∈ Xi^cor and y ∈ Xj^cor for some i ≠ j] = 1 − O(ρ).
  • The sampling process ensures that, with high probability, every cluster Xi contains a sampled point that lies close to the optimal center of Xi.
  • Given k seed centers c1, …, ck located sufficiently close to the optimal centers after stage I, the authors use two procedures in stage II to obtain a near-optimal clustering: the ball k-means step, which yields a (1 + f(ε))-approximation algorithm, or the centroid estimation step, based on a sampling idea of Kumar et al. [30], which yields a PTAS with running time exponential in k.
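The sampling stage in the bullets above can be sketched as distance-squared seeding: after the first center, each new center is a data point drawn with probability proportional to its squared distance to the nearest center chosen so far. The uniform draw of the very first center below is a simplifying assumption (the paper seeds the first pair as in the 2-means procedure), and the function name is illustrative.

```python
import numpy as np

def d2_sample_centers(X, m, rng=None):
    """Sample m initial centers; after the first, each new center is a data
    point drawn with probability proportional to its squared distance to the
    nearest center already picked (a sketch of the sampling stage)."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=float)
    n = len(X)
    centers = [X[rng.integers(n)]]              # first center: uniform (an assumption)
    d2 = ((X - centers[0]) ** 2).sum(axis=1)    # squared distance to nearest chosen center
    for _ in range(m - 1):
        idx = rng.choice(n, p=d2 / d2.sum())
        centers.append(X[idx])
        d2 = np.minimum(d2, ((X - X[idx]) ** 2).sum(axis=1))
    return np.array(centers)
```

Taking m to be a constant factor larger than k yields the O(k) oversampled seed set described above, which the greedy deletion procedure then prunes back to k centers.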
Conclusion
  • Theorem 4.14 Assuming that ∆_k²(X) ≤ ε²∆_{k−1}²(X) for a small enough ε, there is a PTAS for the k-means problem that returns a (1 + ω)-optimal solution with constant probability in time O(2^{O(k(1+ε²)/ω)} · n d) (a rough numerical check of this separation ratio is sketched after this list).
  • Proof: By appropriately setting ρ in the sampling procedure, the authors can ensure that, with probability Θ(1)^k, it returns centers c1, …, ck lying in the cores of the k optimal clusters.
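As a rough numerical companion to Theorem 4.14, the sketch below estimates the separation ratio ∆_k²(X)/∆_{k−1}²(X), using the best of a few random-restart Lloyd runs as a stand-in for the optimal costs (which are NP-hard to compute exactly). The function names and the restart and iteration counts are illustrative, not from the paper.

```python
import numpy as np

def lloyd_cost(X, k, restarts=10, iters=50, rng=None):
    """Estimate the optimal k-means cost Delta_k^2(X) by the best cost found
    over a few randomly seeded Lloyd runs (a heuristic stand-in)."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=float)
    best = np.inf
    for _ in range(restarts):
        centers = X[rng.choice(len(X), size=k, replace=False)].copy()
        for _ in range(iters):
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            for j in range(k):
                if np.any(labels == j):
                    centers[j] = X[labels == j].mean(axis=0)
        # Cost of the final centers under the optimal assignment.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        best = min(best, d2.min(axis=1).sum())
    return best

def separation_ratio(X, k, **kwargs):
    """Estimate Delta_k^2(X) / Delta_{k-1}^2(X); the guarantees above apply
    when this ratio is at most epsilon^2 for a small epsilon."""
    return lloyd_cost(X, k, **kwargs) / lloyd_cost(X, k - 1, **kwargs)
```

On well-clusterable data, for example two widely separated point clouds with k = 2, this ratio is small, which is exactly the regime in which the theorem's guarantees apply.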
Funding
  • Supported in part by IBM Faculty Award, Xerox Innovation Group Award, a gift from Teradata, Intel equipment grant, and NSF Cybertrust grant no. 0430254
Reference
  • K. Alsabti, S. Ranka, and V. Singh. An efficient k-means clustering algorithm. In Proc. 1st Workshop on High Performance Data Mining, 1998.
  • D. Arthur and S. Vassilvitskii. How slow is the k-means method? In Proc. 22nd SoCG, pages 144–153, 2006.
  • V. Arya, N. Garg, R. Khandekar, A. Meyerson, K. Munagala, and V. Pandit. Local search heuristics for k-median and facility location problems. SICOMP, 33:544–562, 2004.
  • M. Badoiu, S. Har-Peled, and P. Indyk. Approximate clustering via core-sets. In Proc. 34th STOC, pages 250–257, 2002.
  • P. S. Bradley and U. Fayyad. Refining initial points for Kmeans clustering. In Proc. 15th ICML, pages 91–99, 1998.
  • M. Charikar and S. Guha. Improved combinatorial algorithms for the facility location and k-median problems. In Proc. 40th FOCS, pages 378–388, 1999.
  • M. Charikar, S. Guha, E. Tardos, and D. B. Shmoys. A constant-factor approximation algorithm for the k-median problem. J. Comput. and Syst. Sci., 65:129–149, 2002.
  • M. Chrobak, C. Kenyon, and N. Young. The reverse greedy algorithm for the metric k-median problem. Information Processing Letters, 97:68–72, 2006.
  • D. R. Cox. Note on grouping. J. American Stat. Assoc., 52:543–547, 1957.
  • S. Dasgupta. How fast is k-means? In Proc. 16th COLT, page 735, 2003.
  • W. F. de la Vega, M. Karpinski, C. Kenyon, and Y. Rabani. Approximation schemes for clustering problems. In Proc. 35th ACM STOC, pages 50–58, 2003.
  • A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Statist. Soc. B, 39:1–38, 1977.
  • P. Drineas, A. Frieze, R. Kannan, S. Vempala, and V. Vinay. Clustering large graphs via the Singular Value Decomposition. Machine Learning, 56:9–33, 2004.
  • R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley-Interscience, 2000.
  • M. Effros and L. J. Schulman. Deterministic clustering with data nets. Electronic Tech Report ECCC TR04-050, 2004.
  • M. Effros and L. J. Schulman. Deterministic clustering with data nets. In Proc. ISIT, 2004.
  • D. Fisher. Iterative optimization and simplification of hierarchical clusterings. J. Artif. Intell. Res., 4:147–178, 1996.
  • E. Forgey. Cluster analysis of multivariate data: efficiency vs. interpretability of classification. Biometrics, 21:768, 1965.
  • A. Gersho and R. M. Gray. Vector quantization and signal compression. Kluwer, 1992.
  • R. M. Gray and D. L. Neuhoff. Quantization. IEEE Trans. Inform. Theory, 44(6):2325–2384, October 1998.
  • S. Har-Peled and S. Mazumdar. On coresets for k-means and k-median clustering. In Proc. 36th STOC, pages 291–300, 2004.
  • S. Har-Peled and B. Sadri. How fast is the k-means method? Algorithmica, 41:185–202, 2005.
  • R. E. Higgs, K. G. Bemis, I. A. Watson, and J. H. Wikel. Experimental designs for selecting molecules from large chemical databases. J. Chem. Inf. Comp. Sci., 37:861–870, 1997.
  • A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3), September 1999.
  • K. Jain, M. Mahdian, E. Markakis, A. Saberi, and V. Vazirani. Greedy facility location algorithms analyzed using dual-fitting with factor-revealing LP. JACM, 50:795–824, 2003.
  • K. Jain and V. Vazirani. Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and Lagrangian relaxation. JACM, 48:274–296, 2001.
  • T. Kanungo, D. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Wu. A local search approximation algorithm for k-means clustering. Comput. Geom., 28:89–112, 2004.
  • T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. An efficient k-means clustering algorithm: Analysis and implementation. IEEE Trans. Pattern Anal. Mach. Intell., 24:881–892, 2002.
  • L. Kaufman and P. J. Rousseeuw. Finding groups in data. An introduction to cluster analysis. Wiley, 1990.
  • A. Kumar, Y. Sabharwal, and S. Sen. A simple linear time (1 + ε)-approximation algorithm for k-means clustering in any dimensions. In Proc. 45th FOCS, pages 454–462, 2004.
  • Y. Linde, A. Buzo, and R. M. Gray. An algorithm for vector quantization design. IEEE Trans. Commun., COM-28:84–95, January 1980.
  • S. P. Lloyd. Least squares quantization in PCM. Special issue on quantization, IEEE Trans. Inform. Theory, 28:129–137, 1982.
  • J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symp. on Math. Statistics and Probability, pages 281–297, 1967.
  • J. Matousek. On approximate geometric k-clustering. Discrete & Computational Geometry, 24:61–84, 2000.
  • J. Max. Quantizing for minimum distortion. IEEE Trans. Inform. Theory, IT-6(1):7–12, March 1960.
  • M. Meila and D. Heckerman. An experimental comparison of several clustering and initialization methods. In Proc. 14th UAI, pages 386–395, 1998.
  • R. R. Mettu and C. G. Plaxton. Optimal time bounds for approximate clustering. Machine Learning, 56:35–60, 2004.
  • G. Milligan. An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika, 45:325–342, 1980.
  • R. Ostrovsky and Y. Rabani. Polynomial time approximation schemes for geometric clustering problems. JACM, 49(2):139–156, 2002.
  • D. Pelleg and A. Moore. Accelerating exact k-means algorithms with geometric reasoning. In Proc. 5th ACM KDD, pages 277–281, 1999.
  • J. M. Pena, J. A. Lozano, and P. Larranaga. An empirical comparison of four initialization methods for the k-means algorithm. Pattern Recognition Lett., 20:1027–1040, 1999.
  • S. J. Phillips. Acceleration of k-means and related clustering problems. In Proc. 4th ALENEX, 2002.
  • L. J. Schulman. Clustering for edge-cost minimization. In Proc. 32nd ACM STOC, pages 547–555, 2000.
  • M. Snarey, N. K. Terrett, P. Willet, and D. J. Wilton. Comparison of algorithms for dissimilarity-based compound selection. J. Mol. Graphics and Modelling, 15:372–385, 1997.
  • D. Spielman and S. Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. In Proc. 33rd ACM STOC, pages 296–305, 2001.
  • H. Steinhaus. Sur la division des corp materiels en parties. Bull. Acad. Polon. Sci., C1. III vol IV:801–804, 1956.
  • R. C. Tryon and D. E. Bailey. Cluster Analysis. McGraw-Hill, 1970, pages 147–150.