Fighting The Fog Of War: Automated Incident Detection For Cloud Systems

PROCEEDINGS OF THE 2021 USENIX ANNUAL TECHNICAL CONFERENCE(2021)

引用 21|浏览58
暂无评分
摘要
Incidents and outages dramatically degrade the availability of large-scale cloud computing systems such as AWS, Azure, and GCP. In current incident response practice, each team has only a partial view of the entire system, which makes the detection of incidents like fighting in the "fog of war". As a result, prolonged mitigation time and more financial loss are incurred. In this work, we propose an automatic incident detection system, namely Warden, as a part of the Incident Management (IcM) platform. Warden collects alerts from different services and detects the occurrence of incidents from a global perspective. For each detected potential incident, Warden notifies relevant on-call engineers so that they could properly prioritize their tasks and initiate cross-team collaboration. We implemented and deployed Warden in the IcM platform of Azure. Our evaluation results based on data collected in an 18-month period from 26 major services show that Warden is effective and outperforms the baseline methods. For the majority of successfully detected incidents (similar to 68%), Warden is faster than human, and this is particularly the case for the incidents that take long time to detect manually.
更多
查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要