Exploit Approximation to Support Fault Resiliency in MPI-based Applications

DSN-W (2023)

Abstract
Approximate applications offer scalability and intrinsic fault resilience, making them well suited to execution in HPC scenarios. Fault resilience, in particular, is becoming increasingly relevant because the growing size of HPC clusters implies a higher fault frequency. To exploit fault resilience properties, however, the communication middleware must be able to handle faults and limit their impact on the execution. This requirement is often not met, with MPI being one of the most notable examples of missing fault support. In this work, we leverage the Legio framework to enable fault resilience properties in applications without changes to their code. We focus our analysis on the accuracy losses caused by fault management, and we propose a set of solutions to circumvent them. The experimental campaign shows that a transparent integration yields some results, but maximum accuracy is reached by making the application fault-aware.
Keywords
Approximate Computing, HPC, MPI, Fault Tolerance, ULFM