Fault-tolerant finite-element multigrid algorithms with hierarchically compressed asynchronous checkpointing

Parallel Computing (2015)

Abstract
Fault-tolerant and robust multigrid methods. Hierarchical finite element compression. Asynchronous checkpointing with local restart.

We analyse novel fault tolerance schemes for data loss in multigrid solvers, which essentially combine ideas of checkpoint-restart with algorithm-based fault tolerance. To improve efficiency compared to conventional global checkpointing, we exploit the inherent data compression of the multigrid hierarchy, and relax the synchronicity requirement through a local-failure local-recovery approach. We experimentally identify the root cause of convergence degradation in the presence of data loss using smoothness considerations. Our resulting schemes form a family of techniques that can be tailored to the expected error probability of (future) large-scale machines. A performance model gives further insight into the benefits and applicability of our techniques.
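To make the idea of "inherent data compression of the multigrid hierarchy" concrete, the following is a minimal sketch and not the scheme analysed in the paper. It assumes a 1D uniform grid on [0, 1], uses plain injection onto a coarser level as the lossy checkpoint, and restores by linear interpolation. The function names `compress` and `decompress`, the grid size, and the number of levels are all hypothetical choices made for illustration.

```python
import numpy as np


def compress(u_fine, levels):
    """Keep every 2**levels-th nodal value (injection onto a coarser grid).

    Assumption: u_fine has 2**k + 1 entries with k >= levels.
    """
    return u_fine[:: 2 ** levels].copy()


def decompress(u_coarse, n_fine):
    """Linearly interpolate coarse nodal values back to n_fine points on [0, 1]."""
    x_coarse = np.linspace(0.0, 1.0, u_coarse.size)
    x_fine = np.linspace(0.0, 1.0, n_fine)
    return np.interp(x_fine, x_coarse, u_coarse)


# Smooth stand-in for a multigrid iterate (illustrative data, not from the paper).
n_fine = 2 ** 10 + 1
x = np.linspace(0.0, 1.0, n_fine)
u = np.sin(np.pi * x)

# Lossy checkpoint three levels below the finest grid, then a simulated restore.
checkpoint = compress(u, levels=3)
restored = decompress(checkpoint, n_fine)

ratio = u.size / checkpoint.size
max_err = np.max(np.abs(u - restored))
print(f"stored {checkpoint.size} of {u.size} values, ratio {ratio:.2f}")
print(f"max restore error {max_err:.2e}")
```

In this 1D setting each coarsening step roughly halves the stored data; on uniformly refined grids in d dimensions the factor per level would be about 2^d. The restore is accurate here only because the test vector is smooth, which is consistent with the abstract's remark that smoothness governs how data loss affects convergence.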
Keywords
Fault tolerance, Resilience, Multigrid, Checkpoint-restart, Robust iterative solvers, High-performance computing