Shared Execution of Clustering Tasks.

International Conference on Big Data (2015)

Abstract
Clustering is a central problem in non-relational data analysis, with k-means being the most popular clustering technique. In various scenarios, it may be necessary to perform clustering over the same input data multiple times (with different values of k, different clustering attributes, or different initial centroids) before arriving at the final solution. In this paper, we propose algorithms for parallel execution of multiple runs of k-means clustering in a way that achieves substantial savings of IO and processing resources. The proposed algorithms can easily be implemented over Hadoop/MapReduce, Spark, etc., with savings in the map and reduce phases. Extensive performance evaluation using real-world datasets shows that the proposed algorithms result in up to 40% savings in response times when compared to other optimization techniques proposed in the literature as well as open-source implementations. The algorithms scale well with increasing data sizes, values of k, and number of clustering tasks.
Keywords
clustering tasks, shared execution
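
The abstract does not spell out the algorithms, so the following is only a minimal sketch of the general idea it describes: several k-means runs over the same input sharing a single scan of the data per iteration, so that each point is read once and used to update every run. It is not the paper's method. The names `KMeansTask`, `shared_iteration`, and `run_shared` are hypothetical, and the sketch is a single-process illustration rather than a Hadoop or Spark job.

```python
from typing import List, Sequence


class KMeansTask:
    """One k-means run (hypothetical helper): its own attributes and initial centroids."""

    def __init__(self, attrs: Sequence[int], centroids: Sequence[Sequence[float]]):
        self.attrs = list(attrs)                        # column indices used by this run
        self.centroids = [list(c) for c in centroids]   # k is len(centroids)

    def project(self, point: Sequence[float]) -> List[float]:
        """Keep only the attributes this run clusters on."""
        return [point[a] for a in self.attrs]


def nearest(centroids: List[List[float]], p: List[float]) -> int:
    """Index of the centroid closest to p (squared Euclidean distance)."""
    best, best_d = 0, float("inf")
    for i, c in enumerate(centroids):
        d = sum((x - y) ** 2 for x, y in zip(p, c))
        if d < best_d:
            best, best_d = i, d
    return best


def shared_iteration(data, tasks: List[KMeansTask]) -> None:
    """One Lloyd iteration for every task, using a single pass over the data."""
    sums = [[[0.0] * len(t.attrs) for _ in t.centroids] for t in tasks]
    counts = [[0] * len(t.centroids) for t in tasks]

    for point in data:                      # the one shared scan of the input
        for ti, t in enumerate(tasks):      # every run consumes the same record
            p = t.project(point)
            ci = nearest(t.centroids, p)
            counts[ti][ci] += 1
            for j, v in enumerate(p):
                sums[ti][ci][j] += v

    # Recompute centroids per task; empty clusters keep their old centroid.
    for ti, t in enumerate(tasks):
        for ci in range(len(t.centroids)):
            if counts[ti][ci]:
                t.centroids[ci] = [s / counts[ti][ci] for s in sums[ti][ci]]


def run_shared(data, tasks: List[KMeansTask], iterations: int = 10) -> int:
    """Run a fixed number of shared iterations; returns the number of data scans."""
    scans = 0
    for _ in range(iterations):
        shared_iteration(data, tasks)
        scans += 1
    return scans
```

With this structure, running T tasks for a fixed number of iterations costs one data scan per iteration instead of T, which is the kind of IO saving the abstract refers to. In a MapReduce or Spark setting, the inner loop over tasks would presumably sit inside the map function and the per-task sums and counts would be combined in the reduce step, but that mapping is an assumption here.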