A Failure Recovery Solution For Transplanting High-Performance Data-Intensive Algorithms From The Cluster To The Cloud

2013 IEEE 15TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING AND COMMUNICATIONS & 2013 IEEE INTERNATIONAL CONFERENCE ON EMBEDDED AND UBIQUITOUS COMPUTING (HPCC_EUC)(2013)

Abstract
The computing cloud manages huge numbers of virtualized resources to provide uniquely beneficial computing paradigms for scientific research. A modern cloud can behave in a virtual context, much like a local homogeneous computer cluster, to deliver High Performance Computing (HPC) platforms that provide public users with access, cut purchase costs, and eliminate the maintenance burden of sophisticated hardware. For decades, most distributed scientific computing software has been designed to run on clusters. Research on how to transplant cluster-based programs and performance-tuning mechanisms onto the cloud platform has gathered momentum in recent years. This paper introduces a fault-tolerant approach that assures the reliability of virtual clusters on clouds where high-performance and data-intensive computing paradigms are deployed. We have solved the failure recovery issue for TCP connections containing MPI error handlers by exploiting and modeling the constraints of low-level distributed resources. The combined MPI and TCP environment can support software development for multiple parallel programming models, including asynchronous distributed computing based on MPI for scientific HPC and synchronous distributed computing for big data, such as MapReduce and Pregel. This paper sets out detailed MPI/TCP fault-tolerant mechanisms, including primitives and functions. These elements enable the systematic and hierarchical development of a globally optimized HPC on the cloud platform.
Keywords
High-Performance Computing, Data-Intensive Computing, Cloud Computing, Fault Tolerance