A Highly Available Distributed Self-Scheduler For Exascale Computing

ICUIMC(2015)

Abstract
A hierarchical master-worker model is considered a promising programming paradigm for exascale high-performance computers. However, fault resiliency is one of the most important issues in exascale computing, because the Mean Time Between Failures (MTBF) is expected to be short. We propose a fault-resilient middleware suite for exascale computing environments. In this paper, we design a highly available distributed self-scheduler that serves as the resource management system for the proposed middleware suite. The scheduler consists of multiple processes in order to achieve scalability, fault resiliency, and persistence. We also develop a prototype of the middleware using Apache ZooKeeper and Apache Cassandra. Experiments with the prototype show that the proposed distributed self-scheduler provides the desired fault resiliency for an application program developed with the middleware, and that the scheduler itself is also fault resilient. We also confirm that the overhead caused by distributed processing can be reduced and that the scheduler can scale.
Keywords
fault resilience, exascale computing, scheduler, availability