Automatic Algorithm-Based Fault Tolerance (AABFT) of Stencil Computations

2023 32nd International Conference on Parallel Architectures and Compilation Techniques (PACT 2023)

Abstract
In this work, we study tolerance of transient errors, such as those caused by cosmic radiation or by hardware component aging and degradation, using Algorithm-Based Fault Tolerance (ABFT). ABFT methods typically add a small amount of extra computation in the form of invariant checksums which, by definition, should not change as the program executes. By computing and monitoring these checksums, errors can be detected as differences in the checksum values. However, applying ABFT is challenging for two key reasons: (1) it requires careful manual analysis of the input program, and (2) the checksum computations must then be carried out efficiently enough for the protection to be worthwhile. Prior work has shown how to apply ABFT schemes with low overhead for a variety of input programs. Here, we focus on a subclass of programs called stencil applications, an important class of computations found widely across scientific computing domains. We propose a new compilation scheme that automatically analyzes the input program and generates the checksum computations. To the best of our knowledge, this is the first work to do so in a compiler. We show that low-overhead code can be easily generated and provide a preliminary evaluation of the tradeoff between performance and effectiveness.
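To make the checksum idea concrete, here is a minimal hand-written sketch in Python. It is not the compilation scheme proposed in the paper. It assumes a 1D three-point stencil with periodic (wrap-around) boundaries, for which the sum of the output equals the sum of the weights times the sum of the input. All names (`stencil_step`, `predicted_checksum`, `checksum_ok`) and the tolerance values are hypothetical, chosen only for illustration.

```python
import math


def stencil_step(u, w):
    """Apply one step of a 3-point stencil with periodic boundaries (an assumption)."""
    n = len(u)
    return [w[0] * u[(i - 1) % n] + w[1] * u[i] + w[2] * u[(i + 1) % n]
            for i in range(n)]


def predicted_checksum(u, w):
    """Predict sum(output) from the input alone.

    With periodic boundaries every input cell is read exactly once per weight,
    so sum(output) == sum(w) * sum(input).
    """
    return sum(w) * math.fsum(u)


def checksum_ok(u_next, expected, rel_tol=1e-9):
    """Compare the observed output sum against the prediction.

    A tolerance is used because floating-point rounding can perturb the sums.
    """
    return math.isclose(math.fsum(u_next), expected,
                        rel_tol=rel_tol, abs_tol=1e-12)


weights = (0.25, 0.5, 0.25)                # illustrative weights
u = [float(i % 7) for i in range(64)]      # illustrative input grid

expected = predicted_checksum(u, weights)
u_next = stencil_step(u, weights)
clean_ok = checksum_ok(u_next, expected)   # fault-free run passes the check

# Simulate a transient error by perturbing one output cell.
corrupted = list(u_next)
corrupted[10] += 1.0
fault_detected = not checksum_ok(corrupted, expected)
```

In this sketch the check costs one extra reduction per stencil step, which is small relative to the stencil itself. Non-periodic boundaries would need additional correction terms, and a perturbation smaller than the chosen tolerance would go undetected, which is one form of the tradeoff between performance and effectiveness.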
Keywords
fault-tolerance, program transformations, polyhedral compilation, stencils