Checkpoint/Restart for CUDA Kernels.

SC-W '23: Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis (2023)

Abstract
In HPC clusters, it has become common to employ Checkpoint/Restart (C/R), that is, saving the execution state of applications in order to restore their computational progress at a later point in time. The benefits of this technique for clusters include more flexibility when reacting to changing workloads and increased fault tolerance. While many clusters already benefit from C/R tools for traditional CPU applications, there is a lack of comparable tools enabling preemptive and transparent C/R for heterogeneous computing, where applications execute partly on accelerator devices, such as GPUs. This is despite the increasing use of GPUs as accelerators in High-Performance Computing (HPC) clusters. Therefore, we propose a novel C/R tool that enables saving the execution state of CUDA kernels, thus allowing preemptive C/R of applications without the need to wait for kernels to finish. We show that full-featured C/R for NVIDIA GPUs is possible despite the proprietary nature of the hardware and software of these devices.