Scouts: Improving the Diagnosis Process Through Domain-customized Incident Routing

SIGCOMM '20: Annual conference of the ACM Special Interest Group on Data Communication on the applications, technologies, architectures, and protocols for computer communication Virtual Event USA August, 2020(2020)

引用 20|浏览239
暂无评分
摘要
Incident routing is critical for maintaining service level objectives in the cloud: the time-to-diagnosis can increase by 10x due to mis-routings. Properly routing incidents is challenging because of the complexity of today's data center (DC) applications and their dependencies. For instance, an application running on a VM might rely on a functioning host-server, remote-storage service, and virtual and physical network components. It is hard for any one team, rule-based system, or even machine learning solution to fully learn the complexity and solve the incident routing problem. We propose a different approach using per-team Scouts. Each teams' Scout acts as its gate-keeper --- it routes relevant incidents to the team and routes-away unrelated ones. We solve the problem through a collection of these Scouts. Our PhyNet Scout alone --- currently deployed in production --- reduces the time-to-mitigation of 65% of mis-routed incidents in our dataset.
更多
查看译文
关键词
Data center networks, Machine learning, Diagnosis
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要