CoLoR: Co-Located Rescuers for Fault Tolerance in HPC Systems

2018 IEEE 24th International Conference on Parallel and Distributed Systems (ICPADS), 2018

Abstract
As HPC systems grow in scale, the frequency of system-wide failures is expected to increase. The performance of Coordinated Checkpoint/Restart (C/R), the traditional fault tolerance technique, degrades under high failure rates because of frequent global rollbacks, which are themselves susceptible to failures. We propose CoLoR, a fault tolerance scheme that i) requires only the failing process to recover, ii) overlaps re-execution with restart, and iii) avoids the cumulative effect of successive failures. Our theoretical analysis reveals that such a scheme results in a lower expected completion time than coordinated C/R. We also provide a proof-of-concept implementation in MPI using receiver-based message logging and co-located rescuer (CoLoR) processes, and evaluate its performance on several HPC benchmarks. Our experimental results, combined with observations from the theoretical analysis, show that CoLoR can outperform both traditional C/R and replication over a large range of system sizes, without using extra logger nodes.
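The claim that coordinated C/R degrades as failure rates rise can be illustrated with a small Monte Carlo sketch. This is not the paper's analytical model and does not model CoLoR itself. It only simulates a coordinated C/R baseline under assumptions made up for illustration: exponentially distributed failures, a system failure rate that grows linearly with node count, a fixed checkpoint interval, and a restart that can itself be interrupted by a failure. All function names, parameters, and numeric values below are hypothetical.

```python
import random


def simulate_coordinated_cr(work, interval, ckpt_cost, restart_cost,
                            node_mtbf, nodes, rng):
    """Simulate one run under coordinated C/R and return its wall-clock time.

    Assumption: failures are exponential with system rate nodes / node_mtbf,
    and any failure forces a global rollback to the last checkpoint.
    """
    system_rate = nodes / node_mtbf
    clock = 0.0
    done = 0.0  # work already protected by a completed checkpoint
    next_failure = rng.expovariate(system_rate)
    while done < work:
        segment = min(interval, work - done)
        need = segment + ckpt_cost  # compute the segment, then checkpoint it
        if clock + need <= next_failure:
            clock += need
            done += segment
            continue
        # A failure hits before the checkpoint completes: the segment is lost.
        clock = next_failure
        next_failure = clock + rng.expovariate(system_rate)
        # The restart is also exposed to failures; retry until it completes.
        while clock + restart_cost > next_failure:
            clock = next_failure
            next_failure = clock + rng.expovariate(system_rate)
        clock += restart_cost
    return clock


def mean_completion_time(nodes, trials, rng, work=100.0, interval=10.0,
                         ckpt_cost=1.0, restart_cost=1.0, node_mtbf=50000.0):
    """Average the simulated completion time over several independent runs."""
    total = 0.0
    for _ in range(trials):
        total += simulate_coordinated_cr(work, interval, ckpt_cost,
                                         restart_cost, node_mtbf, nodes, rng)
    return total / trials


if __name__ == "__main__":
    rng = random.Random(0)
    for nodes in (100, 1000, 10000):
        mean = mean_completion_time(nodes, trials=200, rng=rng)
        print(f"nodes={nodes:6d}  mean completion time={mean:8.1f} h")
```

With these illustrative values the failure-free run takes 110 hours (ten 10-hour segments plus ten 1-hour checkpoints). Raising the node count shortens the system mean time between failures, so more segments are lost to rollbacks and more restarts are interrupted, which is the degradation the abstract describes.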
Keywords
Color, Checkpointing, Fault tolerance, Fault tolerant systems, Receivers, Benchmark testing, Analytical models