Reducing False Node Failure Predictions in HPC

2019 IEEE 26th International Conference on High Performance Computing, Data, and Analytics (HiPC), 2019

Abstract
Future HPC applications must be able to scale to thousands of compute nodes while running for several days. Longer runtimes and higher node counts raise the probability of hardware failures that may interrupt computations. Scientists must therefore protect their simulations against hardware failures. This is typically done using frequent checkpoint/restart, which may have significant overheads. Consequently, the frequency at which checkpoints are taken should be minimized. Predicting hardware failures ahead of time is a promising approach to this problem, but open issues remain, such as false alarms at large scale. In this paper, we introduce the probability of unnecessarily triggering checkpoints (UC) to evaluate the quality of node-level failure predictors for checkpointing large-scale applications. We use this metric to show that current predictors suffer from too many false alarms at large node counts. Further, we propose a new failure predictor that chains several machine learning classifiers to make predictions with minimal false alarms. We aim for extremely low false positive rates to guarantee that no unnecessary checkpoints will be performed even for very large node counts. Our experiments, based on real system traces from a large production cluster, show that our predictor achieves a lead-up time of four minutes, a recall of 0.7302, a false positive rate of 0.0004, a precision of 0.9944, and a probability of unnecessary checkpoints (UC) of 0.00011 for 1024 nodes.
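The abstract quotes recall, false positive rate, and precision, and describes a predictor that chains several classifiers. As a rough illustration only, the Python sketch below computes those three standard metrics from confusion-matrix counts and shows one possible chaining rule, in which an alarm is raised only when every stage flags the sample. The abstract does not say how the classifiers are chained or how UC is computed, so the chaining rule, the function names `rates` and `chained_alarm`, and the sample fields `temp` and `errors` are all assumptions made for this example, not the authors' method.

```python
from typing import Callable, Dict, Sequence


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def rates(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard confusion-matrix metrics of the kind quoted in the abstract."""
    return {
        "recall": _ratio(tp, tp + fn),               # share of real failures predicted
        "false_positive_rate": _ratio(fp, fp + tn),  # share of healthy samples flagged
        "precision": _ratio(tp, tp + fp),            # share of alarms that were real
    }


def chained_alarm(stages: Sequence[Callable[[dict], bool]], sample: dict) -> bool:
    """Hypothetical chaining rule: alarm only if every stage flags the sample.

    This is an assumption for illustration; the paper may chain differently.
    """
    return all(stage(sample) for stage in stages)


# Hypothetical stages: simple threshold tests standing in for trained classifiers.
example_stages = [
    lambda s: s["temp"] > 80,   # assumed field: node temperature
    lambda s: s["errors"] > 5,  # assumed field: recent error count
]

example_metrics = rates(tp=70, fp=1, tn=999, fn=30)
example_alarm = chained_alarm(example_stages, {"temp": 85, "errors": 2})
```

Requiring agreement from every stage can only remove alarms, never add them, so a chain of this kind trades some recall for fewer false positives. That is the direction the abstract describes, although the actual design may differ.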
Keywords
failure prediction,false positives,resilience,fault tolerance