Streaming k-means on well-clusterable data


k-means clustering is one of the central problems in data analysis. In recent years, the streaming variant of this problem has received considerable attention in the literature, culminating in a series of results (Har-Peled and Mazumdar; Frahling and Sohler; Frahling, Monemizadeh, and Sohler; Chen) that produced a (1 + ε)-approximation for k-means clustering in the streaming setting. Unfortunately, since optimizing the k-means objective is Max-SNP hard, every algorithm that achieves a (1 + ε)-approximation must take time exponential in k unless P = NP. Thus, to avoid exponential dependence on k, some additional assumptions must be made to guarantee both a high-quality approximation and polynomial running time. A recent paper of Ostrovsky, Rabani, Schulman, and Swamy (FOCS 2006) introduced the very natural assumption of data separability. This assumption closely reflects how k-means is used in practice, and it allowed the authors to obtain a high-quality approximation for k-means clustering in the non-streaming setting with polynomial running time even for large values of k. Their work left open a natural and important question: are similar results possible in a streaming setting? We answer this question in this paper, albeit using substantially different techniques. We show a near-optimal streaming approximation algorithm for k-means in high-dimensional Euclidean space that uses sublinear memory and a single pass, under the same data separability assumption. Our algorithm offers significant improvements in both space and running time over previous work while yielding asymptotically best-possible performance (assuming that the running time must be fully polynomial and P ≠ NP).
The novel techniques we develop along the way imply a number of additional results: we provide a high-probability performance guarantee for online facility location (in contrast, Meyerson's FOCS 2001 algorithm gave bounds only in expectation); we develop a constant approximation method for the general class of semi-metric clustering problems; we reduce by a logarithmic factor the space required for a streaming constant approximation for k-median (even without σ-separability); finally, we design a "re-sampling method" in a streaming setting that converts any constant approximation for clustering into a [1 + O(σ²)]-approximation for σ-separable data.
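The abstract mentions Meyerson's online facility location algorithm without describing it. As a rough illustration only, the following Python sketch shows the classic randomized rule usually attributed to that algorithm: each arriving point opens a new facility with probability min(d / f, 1), where d is its distance to the nearest open facility and f is the facility cost. The function name, its parameters, and the fixed random seed are assumptions made for this example, and the sketch is the plain-metric, expectation-only baseline, not the high-probability variant developed in this paper.

```python
import math
import random


def online_facility_location(points, facility_cost, rng=None):
    """Meyerson-style online facility location (illustrative sketch).

    points: iterable of coordinate tuples, processed in arrival order.
    facility_cost: cost f of opening one facility (assumed uniform).
    rng: optional random.Random; a fixed seed is used by default (assumption).
    Returns (open facilities, total cost = opening cost + service cost).
    """
    rng = rng or random.Random(0)
    facilities = []
    service_cost = 0.0
    for p in points:
        if not facilities:
            # The first point has nowhere to connect, so it always opens a facility.
            facilities.append(p)
            continue
        # Distance from the new point to its nearest open facility.
        d = min(math.dist(p, f) for f in facilities)
        # Open a facility at p with probability min(d / f, 1).
        if rng.random() < min(d / facility_cost, 1.0):
            facilities.append(p)
        else:
            # Otherwise connect p to the nearest facility and pay distance d.
            service_cost += d
    total_cost = facility_cost * len(facilities) + service_cost
    return facilities, total_cost
```

Calling the function on a short stream such as `[(0.0, 0.0), (0.0, 0.0), (100.0, 100.0)]` with `facility_cost=1.0` opens a facility at the first point, connects the duplicate at zero cost, and opens a second facility at the far point, since its distance exceeds the facility cost. For a k-means flavored variant one would presumably replace the distance with the squared distance, but that substitution is an assumption and is not taken from the abstract.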
Keywords: semi-metric clustering problem, k-means clustering, time exponential, high quality approximation, non-streaming setting, constant approximation, constant approximation method, approximation algorithm, high-quality approximation, well-clusterable data, k-means objective, data analysis, Euclidean space, sampling methods, k means clustering, facility location, k means