The effectiveness of Lloyd-type methods for the k-means problem
Proc. 47th FOCS, Berkeley, CA, 2006: 165-176
We investigate variants of Lloyd's heuristic for clustering high-dimensional data in an attempt to explain its popularity (a half century after its introduction) among practitioners, and in order to suggest improvements in its application. We propose and justify a clusterability criterion for data sets. We present variants of Lloyd's heur…
- Consider the following two-step sampling procedure: (a) first pick center c1 by choosing a point x ∈ X with probability equal to …
- This allows them to show in Lemma 3.2 that, with high probability, each initial center ci lies in the core of a distinct optimal cluster, say Xi, and that ‖c1 − c2‖ is much larger than the distance ‖ci − c̄i‖ between each initial center and the centroid c̄i of its cluster, for i = 1, 2.
- Practitioners instead continue to use a variety of heuristics (Lloyd, EM, agglomerative methods, etc.) that have no known performance guarantees.
- Researchers concerned with the runtime of Lloyd's method bemoan the need for n nearest-neighbor computations in each descent step. Interestingly, the last reference provides a data structure that provably speeds up the nearest-neighbor calculations of Lloyd descent steps, under the condition that the optimal clusters are well separated. (This is unrelated to providing performance guarantees for the outcome.) Their data structure may be used in any Lloyd variant, including ours, and is well suited to the conditions under which we prove performance of our method; ironically, it may not be worthwhile to precompute their data structure, since our method requires so few descent steps.
- Once we have the initial centers within the cores of the two optimal clusters, we show that a simple, easy-to-analyze Lloyd-like step yields a good performance guarantee: we consider a suitable ball around each center and move the center to the centroid of this ball to obtain the final centers.
- We describe a linear-time constant-factor approximation algorithm, and a PTAS that returns a (1 + ε)-approximate solution.
- Given k seed centers c1, . . . , ck located sufficiently close to the optimal centers after stage I, we use two procedures in stage II to obtain a near-optimal clustering: the ball k-means step, which yields a (1 + f(ε))-approximation algorithm, or the centroid-estimation step, based on a sampling idea of Kumar et al., which yields a PTAS with running time exponential in k.
- In Section 4.1.1, the authors consider a natural generalization of the sampling procedure used for the 2-means case, and show that this picks the k initial centers from the cores of the optimal clusters.
- For the k-means problem, if ∆_k^2(X) ≤ ε^2·∆_{k−1}^2(X), the authors show that the greedy deletion procedure followed by a clean-up step yields a (1 + f(ε))-approximation algorithm. In Section 4.1.3, the authors combine the sampling and deletion procedures to obtain an O(·)-time initialization procedure.
- The authors sample O(k) centers, which ensures that every cluster has an initial center in a slightly expanded version of the core, and run the deletion procedure on an instance of size O(k) derived from the sampled points to obtain the k seed centers.
- The authors show that under the separation assumption, the above sampling procedure will pick the k initial centers to lie in the cores of the clusters X1, . . . , Xk.
- Lemma 4.1 With probability 1 − O(ρ), the first two centers c1, c2 lie in the cores of different clusters, that is, Pr[⋁_{i≠j} (c1 ∈ Xi^cor and c2 ∈ Xj^cor)] = 1 − O(ρ).
- The sampling process ensures that, with high probability, every cluster Xi contains a sampled point ci that is close to the cluster's optimal center c̄i.
- Theorem 4.14 Assuming that ∆_k^2(X) ≤ ε^2·∆_{k−1}^2(X) for a small enough ε, there is a PTAS for the k-means problem that returns a (1 + ω)-optimal solution with constant probability in time O(2^{O(k(1+ε^2)/ω)}·n·d).
- Proof: By appropriately setting ρ in the sampling procedure, the authors can ensure that, with probability Θ(1)^k, it returns centers c1, . . . , ck …
- Supported in part by an IBM Faculty Award, a Xerox Innovation Group Award, a gift from Teradata, an Intel equipment grant, and NSF Cybertrust grant no. 0430254.
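The two-stage procedure described in the points above can be illustrated for the 2-means case: stage I seeds two centers by distance-squared sampling, and stage II applies a single ball k-means step that moves each seed to the centroid of a ball around it. This is a minimal sketch under stated assumptions; the helper names, the exact sampling weights, and the ball radius (here a third of the distance to the nearest other seed) are illustrative choices, not the authors' precise construction.

```python
import math
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def seed_two_centers(points, rng):
    # Stage I (sketch): pick the first center with probability proportional
    # to its squared distance from the overall centroid (its contribution to
    # the 1-means cost), then the second with probability proportional to
    # its squared distance from the first center.
    mean = [sum(coord) / len(points) for coord in zip(*points)]
    c1 = rng.choices(points, weights=[sq_dist(p, mean) for p in points], k=1)[0]
    c2 = rng.choices(points, weights=[sq_dist(p, c1) for p in points], k=1)[0]
    return c1, c2

def ball_kmeans_step(points, centers, shrink=3.0):
    # Stage II (sketch): move each seed to the centroid of the ball around it
    # whose radius is a fraction of the distance to the nearest other seed.
    new_centers = []
    for c in centers:
        r = min(math.sqrt(sq_dist(c, d)) for d in centers if d is not c) / shrink
        ball = [p for p in points if sq_dist(p, c) <= r * r]
        new_centers.append(
            [sum(x) / len(ball) for x in zip(*ball)] if ball else list(c)
        )
    return new_centers

# Demo: two tight blobs around (0, 0) and (10, 0).
pts = [(-0.1, 0.0), (0.1, 0.0), (0.0, 0.1), (0.0, -0.1),
       (9.9, 0.0), (10.1, 0.0), (10.0, 0.1), (10.0, -0.1)]
c1, c2 = seed_two_centers(pts, random.Random(0))
# With seeds near the true centroids, one ball step recovers them exactly.
nc = ball_kmeans_step(pts, [(0.0, 0.0), (10.0, 0.0)])
```

Note how the ball step needs no full Lloyd iteration: because the seeds are assumed to lie in the cores of distinct optimal clusters, each ball captures only points of one cluster, and a single centroid computation suffices.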
- K. Alsabti, S. Ranka, and V. Singh. An efficient k-means clustering algorithm. In Proc. 1st Workshop on High Performance Data Mining, 1998.
- D. Arthur and S. Vassilvitskii. How slow is the k-means method? In Proc. 22nd SoCG, pages 144–153, 2006.
- V. Arya, N. Garg, R. Khandekar, A. Meyerson, K. Munagala, and V. Pandit. Local search heuristics for k-median and facility location problems. SICOMP, 33:544–562, 2004.
- M. Badoiu, S. Har-Peled, and P. Indyk. Approximate clustering via core-sets. In Proc. 34th STOC, pages 250–257, 2002.
- P. S. Bradley and U. Fayyad. Refining initial points for K-means clustering. In Proc. 15th ICML, pages 91–99, 1998.
- M. Charikar and S. Guha. Improved combinatorial algorithms for the facility location and k-median problems. In Proc. 40th FOCS, pages 378–388, 1999.
- M. Charikar, S. Guha, E. Tardos, and D. B. Shmoys. A constant-factor approximation algorithm for the k-median problem. J. Comput. and Syst. Sci., 65:129–149, 2002.
- M. Chrobak, C. Kenyon, and N. Young. The reverse greedy algorithm for the metric k-median problem. Information Processing Letters, 97:68–72, 2006.
- D. R. Cox. Note on grouping. J. American Stat. Assoc., 52:543–547, 1957.
- S. Dasgupta. How fast is k-means? In Proc. 16th COLT, page 735, 2003.
- W. F. de la Vega, M. Karpinski, C. Kenyon, and Y. Rabani. Approximation schemes for clustering problems. In Proc. 35th ACM STOC, pages 50–58, 2003.
- A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Statist. Soc. B, 39:1–38, 1977.
- P. Drineas, A. Frieze, R. Kannan, S. Vempala, and V. Vinay. Clustering large graphs via the Singular Value Decomposition. Machine Learning, 56:9–33, 2004.
- R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley-Interscience, 2000.
- M. Effros and L. J. Schulman. Deterministic clustering with data nets. Electronic Tech Report ECCC TR04-050, 2004.
- M. Effros and L. J. Schulman. Deterministic clustering with data nets. In Proc. ISIT, 2004.
- D. Fisher. Iterative optimization and simplification of hierarchical clusterings. J. Artif. Intell. Res., 4:147–178, 1996.
- E. Forgey. Cluster analysis of multivariate data: efficiency vs. interpretability of classification. Biometrics, 21:768, 1965.
- A. Gersho and R. M. Gray. Vector quantization and signal compression. Kluwer, 1992.
- R. M. Gray and D. L. Neuhoff. Quantization. IEEE Trans. Inform. Theory, 44(6):2325–2384, October 1998.
- S. Har-Peled and S. Mazumdar. On coresets for k-means and k-median clustering. In Proc. 36th STOC, pages 291–300, 2004.
- S. Har-Peled and B. Sadri. How fast is the k-means method? Algorithmica, 41:185–202, 2005.
- R. E. Higgs, K. G. Bemis, I. A. Watson, and J. H. Wikel. Experimental designs for selecting molecules from large chemical databases. J. Chem. Inf. Comp. Sci., 37:861–870, 1997.
- A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3), September 1999.
- K. Jain, M. Mahdian, E. Markakis, A. Saberi, and V. Vazirani. Greedy facility location algorithms analyzed using dual-fitting with factor-revealing LP. JACM, 50:795–824, 2003.
- K. Jain and V. Vazirani. Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and Lagrangian relaxation. JACM, 48:274–296, 2001.
- T. Kanungo, D. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Wu. A local search approximation algorithm for k-means clustering. Comput. Geom., 28:89–112, 2004.
- T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. An efficient k-means clustering algorithm: Analysis and implementation. IEEE Trans. Pattern Anal. Mach. Intell., 24:881–892, 2002.
- L. Kaufman and P. J. Rousseeuw. Finding groups in data. An introduction to cluster analysis. Wiley, 1990.
- A. Kumar, Y. Sabharwal, and S. Sen. A simple linear time (1 + ε)-approximation algorithm for k-means clustering in any dimensions. In Proc. 45th FOCS, pages 454–462, 2004.
- Y. Linde, A. Buzo, and R. M. Gray. An algorithm for vector quantization design. IEEE Trans. Commun., COM-28:84–95, January 1980.
- S. P. Lloyd. Least squares quantization in PCM. Special issue on quantization, IEEE Trans. Inform. Theory, 28:129–137, 1982.
- J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symp. on Math. Statistics and Probability, pages 281–297, 1967.
- J. Matousek. On approximate geometric k-clustering. Discrete & Computational Geometry, 24:61–84, 2000.
- J. Max. Quantizing for minimum distortion. IEEE Trans. Inform. Theory, IT-6(1):7–12, March 1960.
- M. Meila and D. Heckerman. An experimental comparison of several clustering and initialization methods. In Proc. 14th UAI, pages 386–395, 1998.
- R. R. Mettu and C. G. Plaxton. Optimal time bounds for approximate clustering. Machine Learning, 56:35–60, 2004.
- G. Milligan. An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika, 45:325–342, 1980.
- R. Ostrovsky and Y. Rabani. Polynomial time approximation schemes for geometric clustering problems. JACM, 49(2):139–156, 2002.
- D. Pelleg and A. Moore. Accelerating exact k-means algorithms with geometric reasoning. In Proc. 5th ACM KDD, pages 277–281, 1999.
- J. M. Pena, J. A. Lozano, and P. Larranaga. An empirical comparison of four initialization methods for the k-means algorithm. Pattern Recognition Lett., 20:1027–1040, 1999.
- S. J. Phillips. Acceleration of k-means and related clustering problems. In Proc. 4th ALENEX, 2002.
- L. J. Schulman. Clustering for edge-cost minimization. In Proc. 32nd ACM STOC, pages 547–555, 2000.
- M. Snarey, N. K. Terrett, P. Willet, and D. J. Wilton. Comparison of algorithms for dissimilarity-based compound selection. J. Mol. Graphics and Modelling, 15:372–385, 1997.
- D. Spielman and S. Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. In Proc. 33rd ACM STOC, pages 296–305, 2001.
- H. Steinhaus. Sur la division des corp materiels en parties. Bull. Acad. Polon. Sci., C1. III vol IV:801–804, 1956.
- R. C. Tryon and D. E. Bailey. Cluster Analysis. McGraw-Hill, 1970. Pages 147–150.