Center Based Clustering: A Foundational Perspective

mag(2014)

Abstract
In the first part of this chapter we detail center based clustering methods, namely methods based on finding a “best” set of center points and then assigning data points to their nearest center. In particular, we focus on k-means and k-median clustering, which are two of the most widely used clustering objectives. We describe popular heuristics for these methods and theoretical guarantees associated with them. We also describe how to design worst case approximately optimal algorithms for these problems. In the second part of the chapter we describe recent work on how to improve on these worst case algorithms even further by using insights from the nature of real world clustering problems and data sets. Finally, we also summarize theoretical work on clustering data generated from mixture models such as a mixture of Gaussians.

1 Approximation algorithms for k-means and k-median

One of the most popular approaches to clustering is to define an objective function over the data points and find a partitioning which achieves the optimal solution, or an approximately optimal solution, to the given objective function. Common objective functions include center based objective functions such as k-median and k-means, where one selects k center points and the clustering is obtained by assigning each data point to its closest center point. Here closeness is measured in terms of a pairwise distance function d(), which the clustering algorithm has access to, encoding how dissimilar two data points are. For instance, the data could be points in Euclidean space with d() measuring Euclidean distance, or it could be strings with d() representing an edit distance, or some other dissimilarity score. For mathematical convenience it is also assumed that the distance function d() is a metric. In k-median clustering the objective is to find center points $c_1, c_2, \ldots, c_k$ and a partitioning of the data so as to minimize

$$\Phi_{k\text{-median}} = \sum_{x} \min_{i} d(x, c_i).$$
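The center based objectives described above can be sketched in a few lines of Python. This is a minimal illustration, not the chapter's own code: the function names, the toy points, and the choice of Euclidean distance (`math.dist`) as d() are all assumptions made for the example. The k-means variant uses the standard definition of that objective, which sums squared distances to the nearest center.

```python
import math

def assign_to_nearest(points, centers, d=math.dist):
    """Return, for each point, the index of its closest center under d."""
    return [min(range(len(centers)), key=lambda i: d(x, centers[i]))
            for x in points]

def k_median_cost(points, centers, d=math.dist):
    """Sum over points of the distance to the nearest center."""
    return sum(min(d(x, c) for c in centers) for x in points)

def k_means_cost(points, centers, d=math.dist):
    """Sum over points of the squared distance to the nearest center."""
    return sum(min(d(x, c) for c in centers) ** 2 for x in points)

# Hypothetical toy data: four points in the plane and two candidate centers.
points = [(0.0, 0.0), (0.0, 3.0), (10.0, 0.0), (10.0, 4.0)]
centers = [(0.0, 0.0), (10.0, 0.0)]

labels = assign_to_nearest(points, centers)
median_cost = k_median_cost(points, centers)
means_cost = k_means_cost(points, centers)
```

Any other dissimilarity function, such as an edit distance on strings, could be passed as `d` without changing the rest of the sketch. Note that these functions only evaluate a given set of centers; finding centers that minimize either cost is the hard part.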
The k-median objective is historically very useful and well studied for facility location problems [16, 43]. Similarly, the objective in k-means is to minimize

$$\Phi_{k\text{-means}} = \sum_{x} \min_{i} d(x, c_i)^2.$$

Optimizing this objective is closely related to fitting the maximum likelihood mixture model for a given dataset.

For a given set of centers, the optimal clustering for that set is obtained by assigning each data point to its closest center point. This is known as the Voronoi partitioning of the data. Unfortunately, exactly optimizing the k-median and the k-means objectives is notoriously hard. Intuitively this is expected, since the objective function is a non-convex function of the variables involved. This apparent hardness can also be formally justified by appealing to the