Granularity and the cost of error recovery in resilient AMR scientific applications.

SC(2016)

引用 11|浏览86
暂无评分
摘要
Supercomputing platforms are expected to have larger failure rates in the future because of scaling and power concerns. The memory and performance impact may vary with error types and failure modes. Therefore, localized recovery schemes will be important for scientific computations, including failure modes where application intervention is suitable for recovery. We present a resiliency methodology for applications using structured adaptive mesh refinement, where failure modes map to granularities within the application for detection and correction. This approach also enables parameterization of cost for differentiated recovery. The cost model is built with tuning parameters that can be used to customize the strategy for different failure rates in different computing environments. We also show that this approach can make recovery cost proportional to the failure rate.
更多
查看译文
关键词
Scientific computing,High performance computing
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要