PreFix: Switch Failure Prediction in Datacenter Networks.

POMACS(2018)

引用 111|浏览271
暂无评分
摘要
In modern datacenter networks (DCNs), failures of network devices are the norm rather than the exception, and many research efforts have focused on dealing with failures after they happen. In this paper, we take a different approach by predicting failures, thus the operators can intervene and "fix" the potential failures before they happen. Specifically, in our proposed system, named PreFix, we aim to determine during runtime whether a switch failure will happen in the near future. The prediction is based on the measurements of the current switch system status and historical switch hardware failure cases that have been carefully labelled by network operators. Our key observation is that failures of the same switch model share some common syslog patterns before failures occur, and we can apply machine learning methods to extract the common patterns for predicting switch failures. Our novel set of features (message template sequence, frequency, seasonality and surge) for machine learning can efficiently deal with the challenges of noises, sample imbalance, and computation overhead. We evaluated PreFix on a data set collected from 9397 switches (3 different switch models) deployed in more than 20 datacenters owned by a top global search engine in a 2-year period. PreFix achieved an average of 61.81% recall and 1.84x10 -5 false positive ratio, outperforming the other failure prediction methods for computers and ISP devices.
更多
查看译文
关键词
datacenter, failure prediction, machine learning, operations
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要