Silent Data Errors: Sources, Detection, and Modeling

2023 IEEE 41st VLSI Test Symposium (VTS)(2023)

引用 2|浏览22
暂无评分
摘要
Chip manufacturers and hyperscalers are becoming increasingly aware of the problem posed by Silent Data Errors (SDE) and are taking steps to address it. Major computing facilities operators like Meta and Google have emphasized the critical role of SDEs in today’s microprocessors. Numerous studies in the literature have highlighted the severity of this issue, especially in datacenter applications operating at large scales. These errors can lead to data loss and require a significant amount of time and effort to resolve through debugging engineering efforts, which can take months to complete. In this paper, we provide an overview of the issue of SDEs, including an explanation of the problem and the current methods used to address it, as well as gaps that still exist in addressing the issue. We also discuss the different sources of SDEs, including post-manufacturing testing failures, voltage and timing marginalities, and hard-to-detect faults. The paper emphasizes the impact of timing marginalities as a significant source of SDEs. Finally, our spotlight points to the architecture and system dimensions of the problem: we describe the challenges of measuring the true (still unknown) rates of SDE from CPUs, and emphasize on the role of detailed microarchitectural simulation models for this purpose. We present data on the severity of SDEs and their predicted rates under various operating conditions, sources of faults, and technology fabrication nodes.
更多
查看译文
关键词
Silent data corruptions,microprocessors,timing marginalities,voltage failures,microarchitectural simulation,microarchitectural modeling,fault injection,failure rates
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要