Strategies For Fault Tolerance In Multicomponent Applications

PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE (ICCS)(2011)

Abstract
This paper discusses ongoing work with the Integrated Plasma Simulator (IPS), a framework for coupled multiphysics simulations of plasmas, to allow simulations to run through the loss of nodes on which the simulation is executing. While many different techniques are available to improve the fault tolerance of computational science applications on high-performance computing (HPC) systems, checkpoint/restart (C/R) remains virtually the only one that sees widespread use in practice. Our focus here is to augment the traditional C/R approach with additional techniques that provide a more localized and tailored response to faults. These techniques rely on the ability to restart failed tasks individually and on information external to the application itself to guide decision-making, in many cases avoiding the need to stop and restart the entire simulation. This capability involves several features within the IPS framework. It also leverages the Fault Tolerance Backplane, a publish/subscribe event service that disseminates fault-related information throughout HPC systems, to obtain information from the Reliability, Availability and Serviceability (RAS) subsystem of the HPC system. This work is described in the context of Cray XT-series computer systems for concreteness, but is applicable to other environments as well. As part of the analysis of this work, we discuss the requirements for generalizing this approach to other complex simulation applications beyond the Integrated Plasma Simulator.
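
As a rough illustration of the idea in the abstract (restarting an individual failed task, guided by externally published fault information, instead of restarting the whole simulation), the following is a minimal Python sketch. It is not the IPS or Fault Tolerance Backplane API: `EventBus`, `TaskManager`, `NodeFailure`, the `node_failed` topic, and the node names are all hypothetical names chosen for this example, and the event service here is a toy in-process stand-in.

```python
from collections import defaultdict


class EventBus:
    """Toy in-process publish/subscribe service (hypothetical; not the FTB API)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, payload):
        for callback in self._subscribers[topic]:
            callback(payload)


class NodeFailure(Exception):
    """Raised by a task when the node it was running on is lost."""


class TaskManager:
    """Runs one task at a time and retries it on another node after a failure."""

    def __init__(self, nodes, bus, max_retries=2):
        self.healthy = set(nodes)
        self.max_retries = max_retries
        # External fault reports remove nodes before any task is placed on them.
        bus.subscribe("node_failed", self.healthy.discard)

    def run(self, task):
        for _ in range(self.max_retries + 1):
            if not self.healthy:
                break
            node = min(self.healthy)  # deterministic choice, for the sketch only
            try:
                return task(node)
            except NodeFailure:
                # Localized response: drop the bad node and retry this task only.
                self.healthy.discard(node)
        # Nothing local worked; the caller would fall back to checkpoint/restart.
        raise RuntimeError("task could not be completed on any healthy node")


bus = EventBus()
manager = TaskManager(["nid001", "nid002", "nid003"], bus)

# An external, RAS-style notification arrives before the task is launched.
bus.publish("node_failed", "nid001")


def flaky_task(node):
    # Simulate a node that dies while the task is running.
    if node == "nid002":
        raise NodeFailure(node)
    return f"ran on {node}"


result = manager.run(flaky_task)
```

In this sketch the task never lands on `nid001` because of the published event, fails once on `nid002`, and is then restarted on `nid003` without touching any other task. Only when no healthy node remains, or the retry budget is exhausted, does the manager give up, which is the point at which a conventional full checkpoint/restart would take over.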
Keywords
application fault tolerance, computational science, multiphysics framework