Exploration of Lossy Compression for Application-Level Checkpoint/Restart

2015 IEEE International Parallel and Distributed Processing Symposium (2015)

Abstract
The scale of high performance computing (HPC) systems is growing exponentially, potentially causing a prohibitive reduction in mean time between failures (MTBF), while the overall I/O performance of parallel file systems will lag far behind the increase in scale. As such, there have been various attempts to decrease checkpoint overhead, one of which is to apply compression techniques to checkpoint files. Most existing techniques focus on lossless compression, but their compression rates, and thus their effectiveness, remain rather limited. Instead, we propose a lossy compression technique based on wavelet transformation for checkpoints and explore its impact on application results. Experimental application of our lossy compression technique to a production climate application, NICAM, shows that the overall checkpoint time, including compression, is reduced by 81%, while the relative error remains fairly constant at approximately 1.2% on average across all variables of the compressed physical quantities, compared with the original checkpoint without compression.
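To make the idea of wavelet-based lossy checkpoint compression concrete, the following is a minimal sketch and not the authors' implementation. The abstract only states that the technique is based on wavelet transformation; the single-level Haar wavelet, the uniform one-byte quantization of the detail coefficients, the use of zlib, and all function names (`haar_forward`, `compress`, `decompress`, `mean_relative_error`) are assumptions chosen for illustration.

```python
import struct
import zlib


def haar_forward(values):
    """One level of the Haar transform on an even-length list of floats."""
    pairs = range(0, len(values), 2)
    averages = [(values[i] + values[i + 1]) / 2.0 for i in pairs]
    details = [(values[i] - values[i + 1]) / 2.0 for i in pairs]
    return averages, details


def haar_inverse(averages, details):
    """Invert haar_forward: each (average, detail) pair yields two values."""
    values = []
    for a, d in zip(averages, details):
        values.extend((a + d, a - d))
    return values


def compress(values, levels=128):
    """Lossy-compress an even-length list of floats (assumes levels <= 256)."""
    averages, details = haar_forward(values)
    lo, hi = min(details), max(details)
    step = (hi - lo) / levels if hi > lo else 1.0
    # Lossy step (illustrative assumption): each detail coefficient is
    # replaced by a one-byte index into `levels` uniform bins.
    indices = bytes(min(int((d - lo) / step), levels - 1) for d in details)
    header = struct.pack("<Idd", len(averages), lo, step)
    body = struct.pack(f"<{len(averages)}d", *averages) + indices
    # Lossless back end applied to the transformed, quantized data.
    return zlib.compress(header + body)


def decompress(blob):
    """Restore an approximation of the values passed to compress()."""
    raw = zlib.decompress(blob)
    n, lo, step = struct.unpack_from("<Idd", raw, 0)
    offset = struct.calcsize("<Idd")
    averages = struct.unpack_from(f"<{n}d", raw, offset)
    indices = raw[offset + 8 * n:]
    # Each detail coefficient is reconstructed at the midpoint of its bin.
    details = [lo + (i + 0.5) * step for i in indices]
    return haar_inverse(averages, details)


def mean_relative_error(original, restored):
    """Mean of |o - r| / |o| over the nonzero original values."""
    errors = [abs(o - r) / abs(o) for o, r in zip(original, restored) if o != 0.0]
    return sum(errors) / len(errors)
```

In this sketch the averages are kept at full precision and only the detail coefficients are quantized, so the reconstruction error of each value is bounded by half a quantization bin. For a smooth field, whose neighbouring values differ little, the detail coefficients span a narrow range and the bins are correspondingly fine. Calling `decompress(compress(field))` and passing the result to `mean_relative_error` gives a quick way to compare the error against the size of the compressed blob.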
Keywords
fault tolerance,checkpoint,lossy compression