ServerRCA: Root Cause Analysis for Server Failure using Operating System Logs

2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE), 2023

Abstract
As the information technology industry has grown, servers have become essential infrastructure for enterprises, and a server failure can cause significant economic losses. Root cause analysis (RCA) of server failures is therefore essential for improving server reliability. However, existing RCA approaches are limited by coarse analysis granularity, difficulty adapting to new environments, and constraints on data acquisition. To overcome these limitations, we propose ServerRCA, an automated solution that uses operating system (OS) logs, which are detailed and easily accessible, for accurate and efficient root cause analysis of server failures. First, ServerRCA applies log parsing to transform raw logs into log templates. Next, a hierarchical matching approach leverages the hierarchical structure of fault logs to accurately identify fault events. A human-in-the-loop feedback mechanism further improves ServerRCA's ability to identify fault events. Finally, ServerRCA constructs the fault propagation chain from the identified fault events. Extensive experiments on real server failures demonstrate the effectiveness of ServerRCA, which achieves significant improvements in F1-score, HR@1, and HR@3 over the compared methods. Our work contributes to the automated RCA of server failures using OS logs and provides a novel framework for accurate fault event identification in server failure analysis.
Keywords
Server Failure, Root Cause Analysis, Operating System Logs, Log Analysis
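
The log parsing step mentioned in the abstract, which turns raw logs into log templates, can be illustrated with a minimal sketch. This is not ServerRCA's parser: the abstract does not say which parsing method is used, so the regex masking rules, the placeholder tokens (`<HEX>`, `<IP>`, `<NUM>`), the function names, and the sample log lines below are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical masking rules, applied in order. Real log parsers usually
# learn templates from data; fixed regexes are only a stand-in here.
MASKS = [
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),   # hexadecimal values
    (re.compile(r"\b\d+(?:\.\d+){3}\b"), "<IP>"),   # dotted IPv4-like values
    (re.compile(r"\b\d+\b"), "<NUM>"),              # standalone integers
]

def to_template(line: str) -> str:
    """Replace variable fields in a raw log line with placeholder tokens."""
    for pattern, token in MASKS:
        line = pattern.sub(token, line)
    return line

def template_counts(lines):
    """Count how many raw lines collapse onto each template."""
    return Counter(to_template(line) for line in lines)

if __name__ == "__main__":
    # Made-up sample lines, not taken from the paper's dataset.
    sample = [
        "disk sda1 read error at sector 1024",
        "disk sda1 read error at sector 2048",
        "link down on 10.0.0.5 code 0x1f",
    ]
    for template, count in template_counts(sample).items():
        print(count, template)
```

Lines that differ only in variable fields collapse onto the same template, so later steps such as fault event matching can work on a small set of templates rather than on every raw line.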