Automated Root Causing of Cloud Incidents using In-Context Learning with GPT-4
CoRR (2024)
Abstract
Root Cause Analysis (RCA) plays a pivotal role in the incident diagnosis
process for cloud services, requiring on-call engineers to identify the primary
issues and implement corrective actions to prevent future recurrences.
Improving the incident RCA process is vital for minimizing service downtime,
customer impact and manual toil. Recent advances in artificial intelligence
have introduced state-of-the-art Large Language Models (LLMs) like GPT-4, which
have proven effective in tackling various AIOps problems, ranging from code
authoring to incident management. Nonetheless, the GPT-4 model's immense size
presents challenges when trying to fine-tune it on user data because of the
significant GPU resource demand and the necessity for continuous model
fine-tuning as new data emerges. To address the high cost of
fine-tuning LLMs, we propose an in-context learning approach for automated root
causing, which eliminates the need for fine-tuning. We conduct an extensive study
of over 100,000 production incidents, comparing several large language models
using multiple metrics. The results reveal that our in-context learning
approach outperforms the previous fine-tuned large language models such as
GPT-3 by an average of 24.8% across all metrics, with an impressive 49.7%
improvement over the zero-shot model. Moreover, human evaluation involving
actual incident owners demonstrates its superiority over the fine-tuned model,
achieving a 43.5% improvement in correctness and an 8.7% enhancement in
readability. The impressive results demonstrate the viability of utilizing a
vanilla GPT model for the RCA task, thereby avoiding the high computational and
maintenance costs associated with a fine-tuned model.
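To make the in-context learning idea concrete, the following is a minimal sketch of one plausible realization: retrieve historical incidents similar to a new incident and assemble them into a few-shot prompt for a vanilla LLM, with no fine-tuning involved. The function names (`retrieve_similar`, `build_prompt`), the toy lexical similarity measure, and the sample incidents are illustrative assumptions, not the paper's actual implementation (which operates on real production incident data and would likely use learned embeddings for retrieval).

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Toy lexical similarity; a real system would use semantic embeddings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def retrieve_similar(query: str, corpus: list[dict], k: int = 2) -> list[dict]:
    """Return the k historical incidents most similar to the new incident."""
    ranked = sorted(corpus, key=lambda ex: similarity(query, ex["title"]),
                    reverse=True)
    return ranked[:k]


def build_prompt(query: str, examples: list[dict]) -> str:
    """Assemble a few-shot prompt: retrieved (incident, root cause) pairs
    followed by the new incident, ready to send to an off-the-shelf LLM."""
    shots = "\n\n".join(
        f"Incident: {ex['title']}\nRoot cause: {ex['root_cause']}"
        for ex in examples
    )
    return f"{shots}\n\nIncident: {query}\nRoot cause:"


# Hypothetical historical incident corpus for illustration only.
corpus = [
    {"title": "API latency spike in region X",
     "root_cause": "Connection pool exhaustion"},
    {"title": "Storage write failures on cluster B",
     "root_cause": "Disk quota exceeded"},
    {"title": "API latency spike after deployment",
     "root_cause": "Unindexed database query in new code"},
]

new_incident = "API latency spike in region Y"
prompt = build_prompt(new_incident, retrieve_similar(new_incident, corpus))
print(prompt)
```

The prompt ends with an open "Root cause:" slot, so the model completes it using the retrieved examples as demonstrations; this is the sense in which the approach sidesteps fine-tuning entirely.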