Matrix factorizations at scale: A comparison of scientific data analytics in Spark and C+MPI using three case studies

Alex Gittens, Aditya Devarakonda, Evan Racah, Michael F. Ringenburg, Lisa Gerhardt, Jey Kottalam, Jialin Liu, Kristyn J. Maschhoff, Shane Canon, Jatin Chhugani, Pramod Sharma, Jiyan Yang, James Demmel, Jim Harrell, Venkat Krishnamurthy, Michael W. Mahoney, Prabhat

2016 IEEE International Conference on Big Data (Big Data) (2016)

Abstract

We explore the trade-offs of performing linear algebra using Apache Spark, compared to traditional C and MPI implementations on HPC platforms. Spark is designed for data analytics on cluster computing platforms with access to local disks and is optimized for data-parallel tasks. We examine three widely used and important matrix factorizations: NMF (for physical plausibility), PCA (for its ubiquity), and CX (for data interpretability). We apply these methods to 1.6TB particle physics, 2.2TB and 16TB climate modeling, and 1.1TB bioimaging data. The data matrices are tall and skinny, which enables the algorithms to map conveniently into Spark's data-parallel model. We perform scaling experiments on up to 1600 Cray XC40 nodes, describe the sources of slowdowns, and provide tuning guidance to obtain high performance.
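
To illustrate why a tall-and-skinny matrix suits a data-parallel model, the following is a minimal sketch in plain Python, not the implementation used in the paper. It assumes the rows are split into partitions, that each partition computes a small k x k Gram matrix independently (the "map"), and that the partial results are summed (the "reduce"). The function names, the toy data, and the use of power iteration are all illustrative assumptions; mean-centering is omitted, so the result is the leading right singular vector rather than a full PCA.

```python
from functools import reduce


def local_gram(rows):
    """Map step: sum of outer products r^T r over one partition of rows."""
    k = len(rows[0])
    g = [[0.0] * k for _ in range(k)]
    for r in rows:
        for i in range(k):
            for j in range(k):
                g[i][j] += r[i] * r[j]
    return g


def add_matrices(a, b):
    """Reduce step: element-wise sum of two k x k matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def gram(partitions):
    """A^T A assembled from per-partition Gram matrices."""
    return reduce(add_matrices, (local_gram(p) for p in partitions))


def top_eigenvector(g, iters=200):
    """Power iteration on the small k x k matrix (runs on a single node)."""
    k = len(g)
    v = [1.0] * k
    for _ in range(iters):
        w = [sum(g[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


# Hypothetical toy data: 4 rows, 2 columns, split into 2 partitions.
partitions = [
    [[2.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 1.0]],
]
G = gram(partitions)
v = top_eigenvector(G)
```

Because the number of columns k is small, only k x k matrices cross partition boundaries, regardless of how many rows each partition holds. That is the property that makes the row-partitioned layout convenient in this sketch.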

Keywords

matrix factorization, linear algebra, Apache Spark, PCA, NMF
