Scalable Clustering Algorithm For N-Body Simulations In A Shared-Nothing Cluster

SSDBM'10: Proceedings of the 22nd International Conference on Scientific and Statistical Database Management (2010)

Abstract
Scientists' ability to generate and collect massive-scale datasets is increasing. As a result, constraints in data analysis capability, rather than limitations in the availability of data, have become the bottleneck to scientific discovery. MapReduce-style platforms hold the promise to address this growing data analysis problem, but it is not easy to express many scientific analyses in these new frameworks. In this paper, we study data analysis challenges found in the astronomy simulation domain. In particular, we present a scalable, parallel algorithm for data clustering in this domain. Our algorithm makes two contributions. First, it shows how a clustering problem can be efficiently implemented in a MapReduce-style framework. Second, it includes optimizations that enable scalability, even in the presence of skew. We implement our solution in the Dryad parallel data processing system using DryadLINQ. We evaluate its performance and scalability using a real dataset comprising 906 million points, and show that in an 8-node cluster, our algorithm can process even a highly skewed dataset 17 times faster than the conventional implementation and offers near-linear scalability. Our approach matches the performance of an existing hand-optimized implementation used in astrophysics on a dataset with little skew and significantly outperforms it on a skewed dataset.
Keywords
Local Cluster, Spatial Index, Uniform Partitioning, Distributed Shared Memory, Large Spatial Database