Scalable and Highly Available Fault Resilient Programming Middleware for Exascale Computing

Takayuki Tozawa, Yoshio Tanaka

Semantic Scholar (2014)

Abstract
A hierarchical master-worker model is believed to be a promising programming paradigm for achieving weak scaling on exascale high-performance computers [1]. However, fault resiliency is one of the most important issues for exascale computing because the Mean Time Between Failures (MTBF) of such computers will be short [2]. We propose Falanx [3], a fault-resilient programming middleware for exascale computing that lets application programmers easily write MPI-based fault-resilient applications using a hierarchical master-worker model. To keep an execution flow going despite failures, Falanx consists of a data store (DS) and a resource management system (RMS). The DS preserves the data each application requires and prevents data loss due to failures. The RMS allocates the processes of each task, including data-parallel tasks, to computing nodes while avoiding failed nodes. In exascale computing environments, these components must be scalable and must themselves be implemented in a fault-resilient manner.
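
The division of labor between the DS and the RMS can be illustrated with a toy model. The abstract does not describe Falanx's actual interfaces, so everything below is an assumption made for illustration: the class names ToyDataStore and ToyResourceManager, their methods, the replication factor, and the node names are all hypothetical, and no MPI is involved. The sketch only shows the two ideas stated above: data survives a node failure because it is kept on more than one node, and new task processes are placed only on nodes that have not failed.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDataStore:
    """Hypothetical stand-in for the DS: keeps each value on several nodes."""
    replicas: int = 2
    _copies: dict = field(default_factory=dict)  # key -> {node: value}

    def put(self, key, value, nodes):
        # Store the value on the first `replicas` nodes of the given list.
        self._copies[key] = {n: value for n in nodes[: self.replicas]}

    def get(self, key, failed):
        # Return the value from any replica whose node has not failed.
        for node, value in self._copies[key].items():
            if node not in failed:
                return value
        raise KeyError(f"all replicas of {key!r} are lost")


@dataclass
class ToyResourceManager:
    """Hypothetical stand-in for the RMS: hands out healthy nodes only."""
    nodes: list
    failed: set = field(default_factory=set)

    def mark_failed(self, node):
        self.failed.add(node)

    def allocate(self, count):
        # Pick `count` nodes for a task, skipping nodes known to have failed.
        healthy = [n for n in self.nodes if n not in self.failed]
        if len(healthy) < count:
            raise RuntimeError("not enough healthy nodes")
        return healthy[:count]


rms = ToyResourceManager(nodes=["n0", "n1", "n2", "n3"])
ds = ToyDataStore(replicas=2)
ds.put("task-input", [1, 2, 3], nodes=rms.allocate(2))  # stored on n0 and n1

rms.mark_failed("n0")                    # simulate a node failure
workers = rms.allocate(2)                # the next task avoids n0
data = ds.get("task-input", rms.failed)  # still readable from the n1 replica
```

In this toy model both components are single in-memory objects, which is exactly what the abstract says is not acceptable at exascale: a real DS and RMS would themselves have to be distributed and replicated so that they scale and survive failures.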