Turning big data into tiny data: constant-size coresets for k-means, PCA and projective clustering

Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms (2013)

Cited by 629 | Views 508
Abstract
We prove that the sum of the squared Euclidean distances from the n rows of an n × d matrix A to any compact set that is spanned by k vectors in ℝ^d can be approximated up to a (1 + ε)-factor, for an arbitrarily small ε > 0, using the O(k/ε²)-rank approximation of A and a constant. This implies, for example, that the optimal k-means clustering of the rows of A is (1 + ε)-approximated by an optimal k-means clustering of their projection on the O(k/ε²) first right singular vectors (principal components) of A. A (j, k)-coreset for projective clustering is a small set of points that yields a (1 + ε)-approximation to the sum of squared distances from the n rows of A to any set of k affine subspaces, each of dimension at most j. Our embedding yields (0, k)-coresets of size O(k) for handling k-means queries, (j, 1)-coresets of size O(j) for PCA queries, and (j, k)-coresets of size (log n)^O(jk) for any j, k ≥ 1 and constant ε ∈ (0, 1/2). Previous coresets usually have a size that is linearly or even exponentially dependent on d, which makes them useless when d ≈ n. Using our coresets with the merge-and-reduce approach, we obtain embarrassingly parallel streaming algorithms for problems such as k-means, PCA and projective clustering. These algorithms use update time per point and memory that are polynomial in log n and only linear in d. For cost functions other than squared Euclidean distances we suggest a simple recursive coreset construction that produces coresets of size k^(1/ε^O(1)) for k-means and a special class of Bregman divergences that is less dependent on the properties of the squared Euclidean distance.
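
The first claim suggests a concrete recipe that is easy to sketch: project the rows of A onto their first m = O(k/ε²) right singular vectors, cluster the m-dimensional projections, and add back the constant ‖A − A_m‖_F², the squared distance from A to its best rank-m approximation. The snippet below is a minimal numpy/scikit-learn illustration of that statement, not the paper's code: the synthetic data, the choice ε = 0.5, and the use of Lloyd's algorithm (sklearn's KMeans) in place of an optimal k-means solver are all assumptions made for the demonstration.

```python
# Minimal sketch of the abstract's first claim: the k-means cost of the rows
# of A is (1 + eps)-approximated by the cost on their projection onto the
# first m = O(k/eps^2) right singular vectors, plus the constant ||A - A_m||_F^2.
# The theorem is about optimal clusterings; KMeans (Lloyd's) is a heuristic
# stand-in used here only for illustration.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n, d, k, eps = 2000, 500, 5, 0.5
# Synthetic rows with k planted clusters along the all-ones direction.
A = rng.normal(size=(n, d)) + 3.0 * rng.integers(0, k, size=n)[:, None]

m = int(np.ceil(k / eps**2))                  # m = O(k / eps^2) directions
U, S, Vt = np.linalg.svd(A, full_matrices=False)
P = A @ Vt[:m].T                              # rows in the top-m right-singular subspace
c = float(np.sum(S[m:] ** 2))                 # the "constant": ||A - A_m||_F^2

cost_full = KMeans(n_clusters=k, n_init=10, random_state=0).fit(A).inertia_
cost_proj = KMeans(n_clusters=k, n_init=10, random_state=0).fit(P).inertia_
print(f"cost in R^d            : {cost_full:.1f}")
print(f"projected cost + const : {cost_proj + c:.1f}")  # close up to (1 + eps)
```

The payoff of the projection is that the clustering step no longer depends on d; when d ≫ k/ε², this is what makes coresets whose size is independent of d possible.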
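The streaming result invokes the classical merge-and-reduce technique without spelling it out: buckets of points are repeatedly merged pairwise and compressed back to a fixed size, forming a binary tree over the stream so that only O(log n) buckets reside in memory at any time. The skeleton below shows only that bookkeeping, under a clearly labeled assumption: reduce_points uses uniform subsampling as a placeholder for an actual (j, k)-coreset construction, which the paper builds quite differently.

```python
# Generic merge-and-reduce skeleton for streaming summarization. Only the
# tree bookkeeping reflects the technique named in the abstract; the reduce
# step below is a uniform-subsampling PLACEHOLDER, not the paper's coreset.
import numpy as np

def reduce_points(points, size, rng):
    # Placeholder "reduce": keep `size` uniformly sampled rows. A faithful
    # implementation would build a (j, k)-coreset of the input here.
    if len(points) <= size:
        return points
    return points[rng.choice(len(points), size=size, replace=False)]

def stream_coreset(stream, size, rng=None):
    # levels[i] holds at most one bucket of "height" i. Two buckets at the
    # same level are merged, reduced back to `size` points, and promoted to
    # level i + 1, so memory stays at O(size * log n) points of dimension d.
    rng = rng or np.random.default_rng(0)
    levels, buffer = [], []
    for x in stream:
        buffer.append(x)
        if len(buffer) == size:               # a full leaf bucket
            bucket, i = np.asarray(buffer), 0
            buffer = []
            while i < len(levels) and levels[i] is not None:
                bucket = reduce_points(np.vstack([levels[i], bucket]), size, rng)
                levels[i] = None
                i += 1
            if i == len(levels):
                levels.append(None)
            levels[i] = bucket
    parts = [np.asarray(buffer)] + [b for b in levels if b is not None]
    return np.vstack([p for p in parts if len(p) > 0])

points = np.random.default_rng(1).normal(size=(10_000, 50))
summary = stream_coreset(iter(points), size=256)
print(summary.shape)  # one bucket per nonempty level: O(size * log n) rows
```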
Keywords
algorithms, design, general, theory, clustering