Multi-task Hierarchical Classification for Disk Failure Prediction in Online Service Systems

KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2022)

Abstract
Disk failure is one of the most common threats to the reliability of online service systems. Many disk failure prediction techniques have been developed to predict failures before they occur, allowing proactive steps to be taken to minimize service disruption and increase service reliability. Existing approaches to disk failure prediction do not differentiate among the various types of disk failure. In industrial practice, however, different product teams in large-scale online service systems such as Microsoft 365 treat distinct types of disk failure as different prediction tasks. For example, the hardware operations team is concerned with physical disk errors, while the database service team focuses on I/O delay. In this paper, we propose MTHC (Multi-Task Hierarchical Classification) to enhance the performance of disk failure prediction for each task via multi-task learning. In addition, MTHC introduces a novel hierarchy-aware mechanism to deal with data imbalance, a severe problem in disk failure prediction. We show that MTHC can be easily used to enhance most state-of-the-art disk failure prediction models. Our experiments on both industrial and public datasets demonstrate that disk failure prediction models enhanced by MTHC perform much better than the same models without MTHC. Furthermore, our experiments show that the hierarchy-aware mechanism underlying MTHC alleviates the data imbalance problem and thus improves the practical performance of various disk failure prediction models. More encouragingly, MTHC has been successfully applied to Microsoft 365 online service systems, where it reduces the number of virtual machine interruptions by 10% per month on average.