
# The Effectiveness of Lloyd-Type Methods for the k-Means Problem

Proc. 47th IEEE Symposium on Foundations of Computer Science (FOCS), Berkeley, CA, pages 165–176, 2006


Abstract

We investigate variants of Lloyd's heuristic for clustering high-dimensional data in an attempt to explain its popularity (a half century after its introduction) among practitioners, and in order to suggest improvements in its application. We propose and justify a clusterability criterion for data sets. We present variants of Lloyd's heuristic that quickly lead to provably near-optimal clustering solutions when applied to well-clusterable instances.

Introduction

- Consider the following two-step sampling procedure for the 2-means case: (a) first pick center c₁ by choosing a point x ∈ X with probability (∆₁²(X) + n‖x − c̄(X)‖²) / (2n∆₁²(X)), where c̄(X) is the centroid of X and ∆₁²(X) the optimal 1-means cost, i.e., with probability proportional to ∑_{y∈X} ‖x − y‖²; (b) then pick center c₂ by choosing a point y ∈ X with probability proportional to ‖y − c₁‖². (A sketch of this procedure follows the list.)
- This allows them to show in Lemma 3.2 that with high probability, each initial center cᵢ lies in the core of a distinct optimal cluster, say Xᵢ, and ‖c₁ − c₂‖ is much larger than the distances ‖cᵢ − c̄ᵢ‖ for i = 1, 2, where c̄ᵢ denotes the centroid of Xᵢ.
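
A minimal NumPy sketch of this two-step seeding, assuming X is an n×d array of Euclidean points (the function and variable names here are illustrative, not from the paper):

```python
import numpy as np

def two_means_seed(X, rng=None):
    """Sketch of the two-step 2-means seeding described above.

    (a) pick c1 = x with probability proportional to sum_y ||x - y||^2,
        which equals Delta_1^2(X) + n * ||x - mean||^2 by a standard identity;
    (b) pick c2 = y with probability proportional to ||y - c1||^2.
    """
    rng = rng or np.random.default_rng()
    n = len(X)
    mean = X.mean(axis=0)
    delta1 = ((X - mean) ** 2).sum()                 # optimal 1-means cost
    w1 = delta1 + n * ((X - mean) ** 2).sum(axis=1)  # sum_y ||x - y||^2 per x
    c1 = X[rng.choice(n, p=w1 / w1.sum())]
    w2 = ((X - c1) ** 2).sum(axis=1)                 # squared distances to c1
    c2 = X[rng.choice(n, p=w2 / w2.sum())]
    return c1, c2
```

Picking the pair (x, y) with probability proportional to ‖x − y‖² is equivalent to these two steps, which is what makes the procedure cheap to implement.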

Highlights

- Practitioners instead continue to use a variety of heuristics (Lloyd, EM, agglomerative methods, etc.) that have no known performance guarantees
- Researchers concerned with the runtime of Lloyd's method bemoan the need for n nearest-neighbor computations in each descent step [28]. Interestingly, that reference provides a data structure that provably speeds up the nearest-neighbor calculations of Lloyd descent steps when the optimal clusters are well separated (this is unrelated to providing performance guarantees for the outcome). Their data structure may be used in any Lloyd variant, including ours, and is well suited to the conditions under which we prove performance of our method; ironically, it may not be worthwhile to precompute their data structure, since our method requires so few descent steps
- Once we have the initial centers within the cores of the two optimal clusters, we show that a simple, easy-to-analyze Lloyd-like step yields a good performance guarantee: we consider a suitable ball around each center and move the center to the centroid of this ball to obtain the final centers
- We describe a linear-time constant-factor approximation algorithm, and a PTAS that returns a (1 + ω)-optimal solution with constant probability
- Given k seed centers c₁, …, cₖ located sufficiently close to the optimal centers after stage I, we use two procedures in stage II to obtain a near-optimal clustering: the ball-k-means step (sketched below), which yields a (1 + f(ε))-approximation algorithm, or the centroid estimation step, based on a sampling idea of Kumar et al. [30], which yields a PTAS with running time exponential in k
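
A sketch of the ball-k-means step from stage II, assuming Euclidean data as a NumPy array; the specific radius rule (one third of the distance to the nearest other seed) is an assumption of this sketch, chosen so that the balls are disjoint:

```python
import numpy as np

def ball_kmeans_step(X, seeds):
    """One ball-k-means step (sketch): move each seed to the centroid
    of the points inside a small ball around it.

    Radius rule (assumed here): one third of the distance to the nearest
    other seed, so the balls are disjoint and, when each seed lies in the
    core of an optimal cluster, each ball stays inside that cluster.
    """
    seeds = np.asarray(seeds, dtype=float)
    d = np.linalg.norm(seeds[:, None, :] - seeds[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    radii = d.min(axis=1) / 3.0
    new_centers = seeds.copy()
    for i, (c, r) in enumerate(zip(seeds, radii)):
        in_ball = np.linalg.norm(X - c, axis=1) <= r
        if in_ball.any():            # keep the old seed if the ball is empty
            new_centers[i] = X[in_ball].mean(axis=0)
    return new_centers
```

Note that this is a single recentering step, not an iteration to convergence: the guarantee comes from applying one such step to seeds that already lie in the cluster cores.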

Results

- In Section 4.1.1, the authors consider a natural generalization of the sampling procedure used for the 2-means case, and show that this picks the k initial centers from the cores of the optimal clusters.
- For the k-means problem, if ∆ₖ²(X) ≤ ε²∆ₖ₋₁²(X), the authors show that the greedy deletion procedure followed by a clean-up step yields a (1 + f(ε))-approximation algorithm. In Section 4.1.3 the authors combine the sampling and deletion procedures to obtain an efficient initialization procedure.
- The authors sample O(k) centers, which ensures that every cluster has an initial center in a slightly expanded version of its core, and run the deletion procedure on an instance of size O(k) derived from the sampled points to obtain the k seed centers (see the deletion sketch after this list).
- The authors show that, under the separation assumption, the above sampling procedure picks the k initial centers to lie in the cores of the clusters X₁, …, Xₖ.
- Lemma 4.1: With probability 1 − O(ρ), the first two centers c₁, c₂ lie in the cores of different optimal clusters; that is, Pr[c₁ ∈ Xᵢᶜᵒʳ and c₂ ∈ Xⱼᶜᵒʳ for some i ≠ j] = 1 − O(ρ).
- The sampling process ensures that, with high probability, every cluster Xᵢ contains a sampled center that is close to its optimal centroid.
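
A sketch of the greedy deletion procedure run on the small derived instance, assuming the O(k) sampled candidates come with weights (e.g., how many input points each candidate represents); all names are illustrative:

```python
import numpy as np

def greedy_deletion(P, w, k):
    """Greedy deletion (sketch): starting from all m = O(k) candidate
    centers P with weights w, repeatedly drop the center whose removal
    increases the weighted clustering cost the least, until k remain.
    """
    def cost(active):
        # weighted cost of assigning each candidate to its nearest active center
        d2 = ((P[:, None, :] - P[active][None, :, :]) ** 2).sum(axis=2)
        return (w * d2.min(axis=1)).sum()

    centers = list(range(len(P)))
    while len(centers) > k:
        # remove the center whose deletion hurts the cost the least
        best = min(centers, key=lambda c: cost([x for x in centers if x != c]))
        centers.remove(best)
    return P[centers]
```

Because the derived instance has only O(k) points, each cost evaluation touches O(k²) candidate pairs, so the deletion loop is independent of n; the expensive part of initialization remains the single sampling pass over the data.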

Conclusion

- Theorem 4.14: Assuming that ∆ₖ²(X) ≤ ε²∆ₖ₋₁²(X) for a small enough ε, there is a PTAS for the k-means problem that returns a (1 + ω)-optimal solution with constant probability in time O(2^{O(k(1+ε²)/ω)} · nd), as displayed below.
- Proof sketch: By appropriately setting ρ in the sampling procedure, the authors can ensure that, with probability Θ(1)ᵏ, it returns centers c₁, …, cₖ lying in the cores of distinct optimal clusters.
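
For readability, a display-form restatement of Theorem 4.14, with ε the separation parameter of the clusterability criterion and ω the accuracy parameter:

```latex
% Separation assumption (the clusterability criterion):
\[
  \Delta_k^2(X) \;\le\; \epsilon^2 \, \Delta_{k-1}^2(X)
\]
% Guarantee: for small enough \epsilon, a (1+\omega)-optimal k-means
% solution is returned with constant probability in time
\[
  O\!\left( 2^{\,O\left( k (1+\epsilon^2)/\omega \right)} \; n d \right)
\]
```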

Funding

- Supported in part by an IBM Faculty Award, a Xerox Innovation Group Award, a gift from Teradata, an Intel equipment grant, and NSF Cybertrust grant no. 0430254

References

- K. Alsabti, S. Ranka, and V. Singh. An efficient k-means clustering algorithm. In Proc. 1st Workshop on High Performance Data Mining, 1998.
- D. Arthur and S. Vassilvitskii. How slow is the k-means method? In Proc. 22nd SoCG, pages 144–153, 2006.
- V. Arya, N. Garg, R. Khandekar, A. Meyerson, K. Munagala, and V. Pandit. Local search heuristics for k-median and facility location problems. SICOMP, 33:544–562, 2004.
- M. Badoiu, S. Har-Peled, and P. Indyk. Approximate clustering via core-sets. Proc. 34th STOC, pages 250–257, 2002.
- P. S. Bradley and U. Fayyad. Refining initial points for K-means clustering. In Proc. 15th ICML, pages 91–99, 1998.
- M. Charikar and S. Guha. Improved combinatorial algorithms for the facility location and k-median problems. In Proc. 40th FOCS, pages 378–388, 1999.
- M. Charikar, S. Guha, E. Tardos, and D. B. Shmoys. A constant-factor approximation algorithm for the k-median problem. J. Comput. and Syst. Sci., 65:129–149, 2002.
- M. Chrobak, C. Kenyon, and N. Young. The reverse greedy algorithm for the metric k-median problem. Information Processing Letters, 97:68–72, 2006.
- D. R. Cox. Note on grouping. J. American Stat. Assoc., 52:543–547, 1957.
- S. Dasgupta. How fast is k-means? In Proc. 16th COLT, page 735, 2003.
- W. F. de la Vega, M. Karpinski, C. Kenyon, and Y. Rabani. Approximation schemes for clustering problems. In Proc. 35th ACM STOC, pages 50–58, 2003.
- A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Statist. Soc. B, 39:1–38, 1977.
- P. Drineas, A. Frieze, R. Kannan, S. Vempala, and V. Vinay. Clustering large graphs via the Singular Value Decomposition. Machine Learning, 56:9–33, 2004.
- R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley-Interscience, 2000.
- M. Effros and L. J. Schulman. Deterministic clustering with data nets. Electronic Tech Report ECCC TR04-050, 2004.
- M. Effros and L. J. Schulman. Deterministic clustering with data nets. In Proc. ISIT, 2004.
- D. Fisher. Iterative optimization and simplification of hierarchical clusterings. J. Artif. Intell. Res., 4:147–178, 1996.
- E. Forgey. Cluster analysis of multivariate data: efficiency vs. interpretability of classification. Biometrics, 21:768, 1965.
- A. Gersho and R. M. Gray. Vector quantization and signal compression. Kluwer, 1992.
- R. M. Gray and D. L. Neuhoff. Quantization. IEEE Trans. Inform. Theory, 44(6):2325–2384, October 1998.
- S. Har-Peled and S. Mazumdar. On coresets for k-means and k-median clustering. In Proc. 36th STOC, pages 291–300, 2004.
- S. Har-Peled and B. Sadri. How fast is the k-means method? Algorithmica, 41:185–202, 2005.
- R. E. Higgs, K. G. Bemis, I. A. Watson, and J. H. Wikel. Experimental designs for selecting molecules from large chemical databases. J. Chem. Inf. Comp. Sci., 37:861–870, 1997.
- A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3), September 1999.
- K. Jain, M. Mahdian, E. Markakis, A. Saberi, and V. Vazirani. Greedy facility location algorithms analyzed using dual-fitting with factor-revealing LP. JACM, 50:795–824, 2003.
- K. Jain and V. Vazirani. Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and Lagrangian relaxation. JACM, 48:274–296, 2001.
- T. Kanungo, D. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Wu. A local search approximation algorithm for k-means clustering. Comput. Geom., 28:89–112, 2004.
- T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. An efficient k-means clustering algorithm: Analysis and implementation. IEEE Trans. Pattern Anal. Mach. Intell., 24:881–892, 2002.
- L. Kaufman and P. J. Rousseeuw. Finding groups in data. An introduction to cluster analysis. Wiley, 1990.
- A. Kumar, Y. Sabharwal, and S. Sen. A simple linear time (1 + ε)-approximation algorithm for k-means clustering in any dimensions. In Proc. 45th FOCS, pages 454–462, 2004.
- Y. Linde, A. Buzo, and R. M. Gray. An algorithm for vector quantization design. IEEE Trans. Commun., COM-28:84–95, January 1980.
- S. P. Lloyd. Least squares quantization in PCM. Special issue on quantization, IEEE Trans. Inform. Theory, 28:129–137, 1982.
- J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symp. on Math. Statistics and Probability, pages 281–297, 1967.
- J. Matousek. On approximate geometric k-clustering. Discrete & Computational Geometry, 24:61–84, 2000.
- J. Max. Quantizing for minimum distortion. IEEE Trans. Inform. Theory, IT-6(1):7–12, March 1960.
- M. Meila and D. Heckerman. An experimental comparison of several clustering and initialization methods. In Proc. 14th UAI, pages 386–395, 1998.
- R. R. Mettu and C. G. Plaxton. Optimal time bounds for approximate clustering. Machine Learning, 56:35–60, 2004.
- G. Milligan. An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika, 45:325–342, 1980.
- R. Ostrovsky and Y. Rabani. Polynomial time approximation schemes for geometric clustering problems. JACM, 49(2):139–156, 2002.
- D. Pelleg and A. Moore. Accelerating exact k-means algorithms with geometric reasoning. In Proc. 5th ACM KDD, pages 277–281, 1999.
- J. M. Pena, J. A. Lozano, and P. Larranaga. An empirical comparison of four initialization methods for the k-means algorithm. Pattern Recognition Lett., 20:1027–1040, 1999.
- S. J. Phillips. Acceleration of k-means and related clustering problems. In Proc. 4th ALENEX, 2002.
- L. J. Schulman. Clustering for edge-cost minimization. In Proc. 32nd ACM STOC, pages 547–555, 2000.
- M. Snarey, N. K. Terrett, P. Willet, and D. J. Wilton. Comparison of algorithms for dissimilarity-based compound selection. J. Mol. Graphics and Modelling, 15:372–385, 1997.
- D. Spielman and S. Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. In Proc. 33rd ACM STOC, pages 296–305, 2001.
- H. Steinhaus. Sur la division des corps matériels en parties. Bull. Acad. Polon. Sci., Cl. III, vol. IV:801–804, 1956.
- R. C. Tryon and D. E. Bailey. Cluster Analysis. McGraw-Hill, 1970, pages 147–150.
