Live forensics for HPC systems: a case study on distributed storage systems

The International Conference for High Performance Computing, Networking, Storage, and Analysis (2020)

Abstract
Large-scale high-performance computing systems frequently experience a wide range of failure modes, such as reliability failures (e.g., hang or crash) and resource overload-related failures (e.g., congestion collapse), impacting systems and applications. Despite the adverse effects of these failures, current systems do not provide methodologies for proactively detecting, localizing, and diagnosing failures. We present Kaleidoscope, a near real-time failure detection and diagnosis framework consisting of hierarchical domain-guided machine learning models that identify the failing components and the corresponding failure mode, and point to the most likely cause of the failure in near real time (within one minute of failure occurrence). Kaleidoscope has been deployed on the Blue Waters supercomputer and evaluated with more than two years of production telemetry data. Our evaluation shows that Kaleidoscope successfully localized 99.3% and pinpointed the root causes of 95.8% of 843 real-world production issues, with less than 0.01% runtime overhead.
Keywords
kaleidoscope,real-time failure detection,diagnosis framework,hierarchical domain-guided machine learning models,corresponding failure mode,failure occurrence,live forensics,HPC systems,distributed storage systems,large-scale high-performance computing systems,failure modes,reliability failures,overload-related failures,congestion collapse,diagnosing failures,blue waters supercomputer