Integrating process, control-flow, and data resiliency layers using a hybrid Fenix/Kokkos approach

2022 IEEE International Conference on Cluster Computing (CLUSTER), 2022

Abstract
Integrating recent advancements in resilient algorithms and techniques into existing codes is a singular challenge in fault tolerance, in part due to the underlying complexity of implementing resilience in the first place, but also due to the difficulty of integrating the functionality of a standalone new strategy with the preexisting resilience layers of an application. We propose that the answer is not to build integrated solutions for users, but to build runtimes designed to integrate into a larger comprehensive resilience system and thereby enable the necessary jump to multi-layered recovery. Our work designs, implements, and verifies one such comprehensive system of runtimes. Utilizing Fenix, a process resilience tool designed with integration into preexisting resilience systems as a priority, we update Kokkos Resilience and the use pattern of VeloC to support application-level integration of resilience runtimes. Our work shows that designing integrable systems rather than integrated systems allows for user-designed optimization and upgrading of resilience techniques while maintaining the simplicity and performance of all-in-one resilience solutions. More application-specific choice in resilience strategies allows for better long-term flexibility, performance, and, importantly, simplicity.
Keywords
Fault Tolerance, Resilience, Checkpointing, MPI-ULFM, Kokkos, Fenix, HPC