Knowledge-aware Alert Aggregation in Large-scale Cloud Systems: a Hybrid Approach
arxiv(2024)
摘要
Due to the scale and complexity of cloud systems, a system failure would
trigger an "alert storm", i.e., massive correlated alerts. Although these
alerts can be traced back to a few root causes, the overwhelming number makes
it infeasible for manual handling. Alert aggregation is thus critical to help
engineers concentrate on the root cause and facilitate failure resolution.
Existing methods typically utilize semantic similarity-based methods or
statistical methods to aggregate alerts. However, semantic similarity-based
methods overlook the causal rationale of alerts, while statistical methods can
hardly handle infrequent alerts.
To tackle these limitations, we introduce leveraging external knowledge,
i.e., Standard Operation Procedure (SOP) of alerts as a supplement. We propose
COLA, a novel hybrid approach based on correlation mining and LLM (Large
Language Model) reasoning for online alert aggregation. The correlation mining
module effectively captures the temporal and spatial relations between alerts,
measuring their correlations in an efficient manner. Subsequently, only
uncertain pairs with low confidence are forwarded to the LLM reasoning module
for detailed analysis. This hybrid design harnesses both statistical evidence
for frequent alerts and the reasoning capabilities of computationally intensive
LLMs, ensuring the overall efficiency of COLA in handling large volumes of
alerts in practical scenarios. We evaluate COLA on three datasets collected
from the production environment of a large-scale cloud platform. The
experimental results show COLA achieves F1-scores from 0.901 to 0.930,
outperforming state-of-the-art methods and achieving comparable efficiency. We
also share our experience in deploying COLA in our real-world cloud system,
Cloud X.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要