NUMARCK: Machine Learning Algorithm for Resiliency and Checkpointing

SC '14: International Conference for High Performance Computing, Networking, Storage and Analysis, New Orleans, Louisiana, November 2014

Abstract
Data checkpointing is an important fault tolerance technique in High Performance Computing (HPC) systems. As HPC systems move towards exascale, the storage space and time costs of checkpointing threaten to overwhelm not only the simulation but also the post-simulation data analysis. One common practice to address this problem is to apply compression algorithms to reduce the data size. However, traditional lossless compression techniques that look for repeated patterns are ineffective for scientific data, in which high-precision values are used and common patterns are therefore rare. This paper exploits the fact that in many scientific applications, the relative changes in data values from one simulation iteration to the next do not differ significantly from each other. Thus, capturing the distribution of relative changes in data instead of storing the data itself allows us to incorporate the temporal dimension of the data and learn the evolving distribution of the changes. We show that an order of magnitude data reduction becomes achievable within guaranteed user-defined error bounds for each data point. We propose NUMARCK, Northwestern University Machine learning Algorithm for Resiliency and ChecKpointing, which makes use of the emerging distributions of data changes between consecutive simulation iterations and encodes them into an indexing space that can be concisely represented. We evaluate NUMARCK using two production scientific simulations, FLASH and CMIP5, and demonstrate superior performance in terms of compression ratio and compression accuracy. More importantly, our algorithm allows users to specify the maximum tolerable error on a per-point basis, while compressing the data by an order of magnitude.
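The idea of encoding relative changes into a small index space can be illustrated with a minimal Python sketch. This is not the NUMARCK implementation: fixed equal-width bins stand in here for the learned distribution the abstract describes, and the function names (`encode_changes`, `decode_changes`), the `index_bits` parameter, and the sentinel scheme for incompressible points are all assumptions made for illustration. In this sketch the error bound is enforced on the change ratio, so each reconstructed value differs from the true value by at most `error_bound * abs(previous value)`.

```python
import math

def encode_changes(prev, cur, error_bound, index_bits=8):
    """Encode `cur` as bin indices of its relative change from `prev`.

    Hypothetical scheme: equal-width bins of width 2*error_bound centred
    on multiples of the width. Points that cannot be represented within
    the bound are flagged with a sentinel index and stored exactly.
    """
    width = 2.0 * error_bound
    sentinel = (1 << index_bits) - 1      # reserved: "stored exactly"
    k_max = (sentinel - 1) // 2           # usable bins k in [-k_max, k_max]
    indices, exact = [], {}
    for i, (p, c) in enumerate(zip(prev, cur)):
        if p != 0.0:
            q = (c - p) / p / width       # change ratio in units of bin width
            if math.isfinite(q) and abs(q) < k_max + 1:
                k = round(q)
                approx = p * (1.0 + k * width)
                # Verify the bound on the reconstructed value itself.
                if abs(k) <= k_max and abs(approx - c) <= error_bound * abs(p):
                    indices.append(k + k_max)
                    continue
        indices.append(sentinel)          # zero base value or change too large
        exact[i] = c
    return indices, exact

def decode_changes(prev, indices, exact, error_bound, index_bits=8):
    """Rebuild the current iteration from the previous one plus indices."""
    width = 2.0 * error_bound
    sentinel = (1 << index_bits) - 1
    k_max = (sentinel - 1) // 2
    out = []
    for i, (p, idx) in enumerate(zip(prev, indices)):
        if idx == sentinel:
            out.append(exact[i])
        else:
            out.append(p * (1.0 + (idx - k_max) * width))
    return out

prev = [1.0, 2.0, 0.0, 100.0, 5.0]
cur = [1.0005, 2.1, 3.0, 100.0, 50.0]
indices, exact = encode_changes(prev, cur, error_bound=0.001)
restored = decode_changes(prev, indices, exact, error_bound=0.001)
```

With 8-bit indices, each compressible double-precision value shrinks from 64 bits to 8 before any further entropy coding, while the two points with a zero base value or a very large change fall back to exact storage.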
Keywords
checkpointing, data analysis, iterative methods, learning (artificial intelligence), parallel processing, software fault tolerance, HPC system, NUMARCK, Northwestern University machine learning algorithm for resiliency and checkpointing, fault tolerance technique, high performance computing, simulation iteration