Efficient development of high performance data analytics in Python

Future Generation Computer Systems (2020)

Abstract
Our society is generating an increasing amount of data at an unprecedented scale, variety, and speed. This also applies to numerous research areas, such as genomics, high energy physics, and astronomy, for which large-scale data processing has become crucial. However, there is still a gap between the traditional scientific computing ecosystem and big data analytics tools and frameworks. On the one hand, high performance computing (HPC) programming models lack productivity and do not provide means for processing large amounts of data in a simple manner. On the other hand, existing big data processing tools have performance issues in HPC environments and are not general-purpose. In this paper, we propose and evaluate PyCOMPSs, a task-based programming model for Python, as an excellent solution for distributed big data processing in HPC infrastructures. Among other useful features, PyCOMPSs offers a highly productive general-purpose programming model, is infrastructure-agnostic, and provides transparent data management with support for distributed storage systems. We show how two machine learning algorithms (Cascade SVM and K-means) can be developed with PyCOMPSs, and evaluate PyCOMPSs’ productivity based on these algorithms. Additionally, we evaluate PyCOMPSs’ performance on an HPC cluster using up to 1,536 cores and 320 million input vectors. Our results show that PyCOMPSs achieves performance and scalability similar to MPI in HPC infrastructures, while providing a much more productive interface that allows the easy development of data analytics algorithms.
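To make the task-based style concrete, the following is a minimal sketch of how a K-means iteration might be written with PyCOMPSs. It is not the paper's implementation. The `task` decorator and `compss_wait_on` are PyCOMPSs API names; the function names `partial_sum` and `kmeans`, the list-based data layout, and the sequential fallback used when PyCOMPSs is not installed are assumptions made for illustration only.

```python
try:
    # PyCOMPSs decorators, used when the library is available.
    from pycompss.api.task import task
    from pycompss.api.api import compss_wait_on
except ImportError:
    # Assumed stand-ins so the sketch also runs sequentially without PyCOMPSs.
    def task(**_kwargs):
        def decorator(func):
            return func
        return decorator

    def compss_wait_on(obj):
        return obj


def _sq_dist(p, c):
    """Squared Euclidean distance between two points given as lists."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


@task(returns=1)
def partial_sum(points, centers):
    """Hypothetical task: for one partition, assign every point to its
    nearest center and return per-center coordinate sums and counts."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for p in points:
        nearest = min(range(len(centers)), key=lambda k: _sq_dist(p, centers[k]))
        counts[nearest] += 1
        for d in range(dim):
            sums[nearest][d] += p[d]
    return sums, counts


def kmeans(partitions, centers, iterations=10):
    """Hypothetical driver: one task per partition, then merge the results."""
    dim = len(centers[0])
    for _ in range(iterations):
        # Each call may run as an independent task on a different node.
        partials = [partial_sum(part, centers) for part in partitions]
        # Synchronize: fetch the task results back into the main program.
        partials = compss_wait_on(partials)
        new_centers = []
        for k in range(len(centers)):
            total = sum(counts[k] for _, counts in partials)
            if total == 0:
                new_centers.append(centers[k])  # keep an empty cluster's center
            else:
                new_centers.append(
                    [sum(sums[k][d] for sums, _ in partials) / total
                     for d in range(dim)]
                )
        centers = new_centers
    return centers
```

In this sketch the main program reads as ordinary sequential Python; the decorator marks which function calls can be scheduled as tasks, and the single `compss_wait_on` call is the only explicit synchronization point.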