ShapleyIQ: Influence Quantification by Shapley Values for Performance Debugging of Microservices

PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON ARCHITECTURAL SUPPORT FOR PROGRAMMING LANGUAGES AND OPERATING SYSTEMS, ASPLOS 2023, VOL 4(2023)

引用 0|浏览4
暂无评分
摘要
Years of experience in operating large-scale microservice systems strengthens our belief that their individual components, with inevitable anomalies, still demand a quantification of the influences on the end-to-end performance indicators. On a causal graph that represents the complex dependencies between the system components, the scatteredly detected anomalies, even when they look similar, could have different implications with contrastive remedial actions. To this end, we design ShapleyIQ, an online monitoring and diagnosis service that can effectively improve the system stability. It is guided by rigorous analysis on Shapley values for a causal graph. Notably, a new property on splitting invariance addresses the challenging exponential computation complexity problem of generic Shapley values by decomposition. This service has been deployed on a core infrastructure system on Alibaba Cloud, for over a year with more than 15,000 operations for 86 services across 2,546 machines. Since then, it has drastically improved the DevOps efficiency, and the system failures have been significantly reduced by 83.3%. We also conduct an offline evaluation on an open source microservice system TrainTicket, which is to pinpoint the root causes of the performance issues in hindsight. Extensive experiments and test cases show that our system can achieve 97.3% accuracy in identifying the top-1 root causes for these datasets, which significantly outperforms baseline algorithms by at least 28.7% in absolute difference.
更多
查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要