Toward Resilient Task Parallel PDE Solvers.

user-5d7f3c40530c708f991f6404(2017)

Abstract
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the US Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000. SAND NO. 2011-XXXXP …
▪ Future systems are expected to be less reliable …
▪ Anecdotes indicate that the majority of application failures …
▪ Existing Checkpoint-Restart approach is not a proportional response …
▪ Checkpoint (5.2 MB/core) has to be done …
Traditional C/R or runtime-based offline techniques are …
▪ Software framework to augment existing apps with resilience …
▪ The remaining processes stay alive with isolated process/node failure …
▪ Roll-back, roll-forward, asynchronous, algorithm specific, etc …
1. Process recovery: Recover failures without promoting to job failures …
– Implicit coordination: create consistent checkpoints without …