FluxRank: A Widely-Deployable Framework to Automatically Localizing Root Cause Machines for Software Service Failure Mitigation

2019 IEEE 30th International Symposium on Software Reliability Engineering (ISSRE)(2019)

引用 23|浏览35
暂无评分
摘要
The failures of software service directly affect user experiences and service revenue. Thus operators monitor both service-level KPIs (e.g., response time) and machine-level KPIs (e.g., CPU usage) on each machine underlying the service. When a service fails, the operators must localize the root cause machines, and mitigate the failure as quickly as possible. Existing approaches have limited application due to the difficulty to obtain the required additional measurement data. As a result, failure localization is largely manual and very time-consuming. This paper presents FluxRank, a widely-deployable framework that can automatically and accurately localize the root cause machines, so that some actions can be triggered to mitigate the service failure. Our evaluation using historical cases from five real services (with tens of thousands of machines) of a top search company shows that the root cause machines are ranked top 1 (top 3) for 55 (66) cases out of 70 cases. Comparing to existing approaches, FluxRank cuts the localization time by more than 80% on average. FluxRank has been deployed online at one Internet service and six banking services for three months, and correctly localized the root cause machines as the top 1 for 55 cases out of 59 cases.
更多
查看译文
关键词
KPI,failure mitigation,recommendation
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要