JENERGY: A Fault Tolerant Stateless Architecture for High Performance Computing

SOSE '15: Proceedings of the 2015 IEEE Symposium on Service-Oriented System Engineering (2015)

Abstract
Large-scale high performance computing (HPC) applications require thousands of nodes to run parallel scientific workloads. At this scale, compute nodes frequently experience hardware and software failures, network congestion, and disconnections. The resulting volatility reduces the mean time between failures (MTBF) of the whole system to hours or minutes. At such failure rates, traditional point-to-point transmission semantics can be ill-fitted and cumbersome to re-engineer to support distributed partial failures. In this paper, we propose an application-dependent network design that focuses on the sustainability of HPC applications, using packet-switching-inspired statistical multiplexing of semantic data tuples and decoupled computations. We report the design and implementation of a distributed tuple space that uses Cassandra and Zookeeper to provide tunable spatial and temporal redundancy without negative impact on application performance. We detail the failure scenarios that our system handles seamlessly and describe the advantages of Stateless Parallel Processing for HPC applications. We report preliminary results on performance, reliability, and overall application scalability. We found that our system can provide high levels of sustained performance while offering a reliable computing architecture that withstands a range of failure types without manual checkpoint-restart, in a portable and non-intrusive manner.
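The idea of decoupled, stateless computation over a tuple space can be illustrated with a small sketch. This is not the JENERGY implementation: plain in-memory dictionaries stand in for the Cassandra and Zookeeper backends, the lease-based re-issue of unfinished work is an assumed mechanism chosen for illustration, and every name (`TupleSpace`, `put`, `take`, `complete`, `run_demo`) is hypothetical.

```python
import time
from typing import Any, Dict, Optional, Tuple


class TupleSpace:
    """Toy in-memory tuple space (assumption: stands in for a replicated store)."""

    def __init__(self, lease_seconds: float) -> None:
        self.lease_seconds = lease_seconds
        self.pending: Dict[str, Any] = {}                # key -> payload, free to take
        self.leased: Dict[str, Tuple[Any, float]] = {}   # key -> (payload, lease expiry)
        self.results: Dict[str, Any] = {}                # key -> computed result

    def put(self, key: str, payload: Any) -> None:
        """Publish a work tuple; no worker is addressed directly."""
        self.pending[key] = payload

    def take(self, now: Optional[float] = None) -> Optional[Tuple[str, Any]]:
        """Lease one work tuple, first reclaiming tuples whose lease expired."""
        now = time.monotonic() if now is None else now
        expired = [k for k, (_, expiry) in self.leased.items() if expiry <= now]
        for key in expired:
            payload, _ = self.leased.pop(key)
            self.pending[key] = payload  # the previous holder is presumed failed
        if not self.pending:
            return None
        key = next(iter(self.pending))
        payload = self.pending.pop(key)
        self.leased[key] = (payload, now + self.lease_seconds)
        return key, payload

    def complete(self, key: str, result: Any) -> None:
        """Record a result; the first result wins, so late duplicates are harmless."""
        self.leased.pop(key, None)
        self.pending.pop(key, None)
        self.results.setdefault(key, result)


def run_demo() -> Tuple[Optional[Tuple[str, Any]], Dict[str, Any]]:
    space = TupleSpace(lease_seconds=5.0)
    for i in range(4):
        space.put(f"task-{i}", i)

    # Worker A takes one tuple at t=0 and then "crashes" without completing it.
    lost = space.take(now=0.0)

    # Worker B starts at t=10; the expired lease is reclaimed and the work is redone.
    while (item := space.take(now=10.0)) is not None:
        key, n = item
        space.complete(key, n * n)
    return lost, space.results
```

Because workers hold no state beyond the tuple they are processing, the crash of worker A needs no checkpoint or restart in this sketch: its tuple simply becomes available again once the lease runs out, and any surviving worker can pick it up.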
Keywords
fault tolerant computing, packet switching, parallel processing, Cassandra, JENERGY, MTBF, Zookeeper, decoupled computations, distributed tuple space, failure rates, fault tolerant stateless architecture, hardware failure, high-performance computing, large-scale HPC, mean time between failures, network congestion, network disconnections, packet-switching-inspired statistical multiplexing, parallel scientific applications, semantic data tuples, software failure, stateless parallel processing, tunable spatial redundancies, tunable temporal redundancies, volatility levels, Fault tolerance, Performance of systems, Sustainable extreme scale HPC architecture, scalability