Desh: deep learning for system health prediction of lead times to failure in HPC.

HPDC (2018)

Citations: 103 | Views: 354
Abstract
Today's large-scale supercomputers encounter faults on a daily basis. Exascale systems are likely to experience even higher fault rates due to increased component count and density. Triggering resilience techniques remains a challenge due to the absence of well-defined failure indicators. System logs consist of unstructured text that obscures the essential system health information they contain. In this context, efficient failure prediction via log mining can enable proactive recovery mechanisms to increase reliability. This work aims to predict node failures in supercomputing systems using long short-term memory (LSTM) networks, a type of recurrent neural network (RNN). Our framework, Desh (Deep Learning for System Health), diagnoses and predicts failures with short lead times. Desh identifies failure indicators with enhanced training and classification, so that it applies generically to logs from operating systems and software components without the need to modify any of them. Desh uses a novel three-phase deep learning approach to (1) train to recognize chains of log events leading to a failure, (2) re-train chain recognition on events augmented with expected lead times to failure, and (3) predict lead times during testing/inference deployment, that is, which specific node will fail and in how many minutes. Desh achieves an average lead time of up to 3 minutes, with at least 85% recall and 83% accuracy, leaving time to take proactive actions on failing nodes, such as migrating computation to healthy nodes.
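To make the notion of a chain of log events "augmented with expected lead times to failure" concrete, the following is a minimal sketch of how such labels could be derived from timestamped logs. It is not the authors' implementation: the log records, node names, event phrases, the `FAILURE_PHRASE` marker, and the `label_lead_times` helper are all hypothetical and chosen only for illustration. The LSTM training itself is not shown.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Hypothetical log records: (timestamp, node id, event phrase). All values are made up.
LOG = [
    ("2018-01-01 10:00:00", "node-a", "link error"),
    ("2018-01-01 10:00:40", "node-b", "link error"),
    ("2018-01-01 10:01:30", "node-a", "machine check"),
    ("2018-01-01 10:03:00", "node-a", "node down"),
]

# Hypothetical phrase that marks the terminal failure event of a chain.
FAILURE_PHRASE = "node down"


def label_lead_times(records, failure_phrase):
    """Group events per node and, for each node whose chain ends in a failure,
    attach to every earlier event its lead time to that failure in minutes."""
    chains = {}
    # ISO-style timestamp strings sort chronologically.
    for ts, node, phrase in sorted(records):
        chains.setdefault(node, []).append((datetime.strptime(ts, FMT), phrase))

    labeled = {}
    for node, events in chains.items():
        failure_times = [t for t, p in events if p == failure_phrase]
        if not failure_times:
            continue  # no failure observed on this node, so no lead times to label
        t_fail = failure_times[0]
        labeled[node] = [
            (p, (t_fail - t).total_seconds() / 60.0)
            for t, p in events
            if t < t_fail
        ]
    return labeled


if __name__ == "__main__":
    for node, chain in label_lead_times(LOG, FAILURE_PHRASE).items():
        print(node, chain)
```

In this sketch only `node-a` yields a labeled chain, since `node-b` never logs the failure phrase. Pairs of event phrase and lead time like these are the kind of input a sequence model could be re-trained on in the second phase described above.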
Keywords
LSTM, Failure Prediction, Log Mining, HPC, Node Failures, Lead Times, Anomaly Detection, Deep Learning