Predicting Uncorrectable Memory Errors from the Correctable Error History: No Free Predictors in the Field.

International Symposium on Memory Systems (MEMSYS), 2021

Abstract
Uncorrectable memory errors are a major cause of hardware failures in datacenters and result in server crashes. In this paper, we address the problem of predicting uncorrectable errors (UEs) from historical correctable error (CE) information. We first establish a new UE prediction framework that infers the latent faulty status of memory from CE observations and correlates the inferred status with UE occurrences for prediction. We then design several predictors based on different memory fault modes and examine their performance on four datasets of memory errors from contemporary servers in the datacenters of three top-tier technology companies. Whereas existing work studies UE prediction in a single environment only, this is the first comparative study of prediction performance across datasets from different environments. Through the cross-dataset study, we demonstrate that predictors that perform relatively well in some environments do not perform well in others. Prediction performance is strongly affected by the characteristics of each environment, and there are no free predictors that are universally applicable. Finally, to capture the characteristics specific to each environment, we propose a learning process that induces a weighted ensemble of the predictors from the data, and we show that the learned ensemble predictor consistently outperforms the individual predictors within each dataset.
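The abstract does not describe how the ensemble weights are learned, so the following is only a minimal sketch of the general idea of a weighted ensemble of predictors. It assumes each predictor emits a 0/1 vote per DIMM and, purely for illustration, weights each predictor by its F1 score on labelled training data. The function names `fit_weights` and `ensemble_predict`, the F1-based weighting, and the 0.5 threshold are all assumptions and are not taken from the paper.

```python
from typing import List, Sequence


def fit_weights(votes: Sequence[Sequence[int]], labels: Sequence[int]) -> List[float]:
    """Assumed weighting rule: weight each predictor by its F1 score on
    labelled training DIMMs, normalised so the weights sum to 1.

    votes[i][j] is predictor j's 0/1 prediction for DIMM i;
    labels[i] is 1 if a UE was later observed on DIMM i, else 0.
    """
    n_predictors = len(votes[0])
    f1_scores = []
    for j in range(n_predictors):
        tp = sum(1 for v, y in zip(votes, labels) if v[j] == 1 and y == 1)
        fp = sum(1 for v, y in zip(votes, labels) if v[j] == 1 and y == 0)
        fn = sum(1 for v, y in zip(votes, labels) if v[j] == 0 and y == 1)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    total = sum(f1_scores)
    if total == 0:
        # No predictor is informative on this data: fall back to equal weights.
        return [1.0 / n_predictors] * n_predictors
    return [s / total for s in f1_scores]


def ensemble_predict(vote: Sequence[int], weights: Sequence[float],
                     threshold: float = 0.5) -> int:
    """Predict a UE (1) when the weighted sum of the votes reaches the threshold."""
    score = sum(w * v for w, v in zip(weights, vote))
    return 1 if score >= threshold else 0


# Toy data: 4 DIMMs, 2 hypothetical predictors.
train_votes = [[1, 0], [1, 1], [0, 1], [0, 0]]
train_labels = [1, 1, 0, 0]
weights = fit_weights(train_votes, train_labels)
```

Because the weights are fitted separately on each dataset, a predictor that works well in one environment can dominate there and be down-weighted elsewhere, which is the motivation the abstract gives for learning the ensemble from the data.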