A model of checkpoint behavior for applications that have I/O

The Journal of Supercomputing(2022)

引用 1|浏览8
暂无评分
摘要
Due to the increase and complexity of computer systems, reducing the overhead of fault tolerance techniques has become important in recent years. One technique in fault tolerance is checkpointing, which saves a snapshot with the information that has been computed up to a specific moment, suspending the execution of the application, consuming I/O resources and network bandwidth. Characterizing the files that are generated when performing the checkpoint of a parallel application is useful to determine the resources consumed and their impact on the I/O system. It is also important to characterize the application that performs checkpoints, and one of these characteristics is whether the application does I/O. In this paper, we present a model of checkpoint behavior for parallel applications that performs I/O; this depends on the application and on other factors such as the number of processes, the mapping of processes and the type of I/O used. These characteristics will also influence scalability, the resources consumed and their impact on the IO system. Our model describes the behavior of the checkpoint size based on the characteristics of the system and the type (or model) of I/O used, such as the number I/O aggregator processes, the buffering size utilized by the two-phase I/O optimization technique and components of collective file I/O operations. The BT benchmark and FLASH I/O are analyzed under different configurations of aggregator processes and buffer size to explain our approach. The model can be useful when selecting what type of checkpoint configuration is more appropriate according to the applications’ characteristics and resources available. Thus, the user will be able to know how much storage space the checkpoint consumes and how much the application consumes, in order to establish policies that help improve the distribution of resources.
更多
查看译文
关键词
Checkpoint,I/O applications,Storage,HPC,Fault tolerance
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要