A Systematic Study of DDR4 DRAM Faults in the Field.

HPCA (2023)

Abstract
This paper presents a study of DDR4 DRAM faults in a large fleet of commodity servers, covering several billion memory device-hours of data. The goal of this study is to understand faults in DDR4 DRAM devices to measure the efficacy of existing hardware resilience techniques and aid in designing more resilient systems for future large-scale systems. The study has several key findings about the fault characteristics of DDR4 DRAMs and adds several novel insights about system reliability to the existing literature. Specifically, the data show sixteen unique fault modes in the DDR4 DRAM under study, including several that have not been previously reported. Over 45% of the faults that occurred affected multiple DRAM bits. The time-to-failure characteristics of faults internal to the DRAM die differ from those external to the DRAM die. We also examine faults from multiple DRAM vendors, finding that fault rates vary by more than 1.34x among vendors. Finally, we use the data to compare chipkill ECC and an ECC that covers a DDR5 "bounded fault." Given the fault rates in this data, a bounded fault ECC increases the rate of faults that cause uncorrectable errors by up to 5.71 FIT per DRAM device compared to chipkill ECC.
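To put the closing figure in perspective, recall that 1 FIT is one failure per billion device-hours. The sketch below converts a per-device FIT delta into an expected number of additional events per year across a fleet. The fleet size of one million DRAM devices and the helper name `expected_events_per_year` are illustrative assumptions, not values from the paper, and 5.71 FIT is the abstract's stated upper bound ("up to"), so the result is a ceiling under those assumptions rather than a measured outcome.

```python
# Rough arithmetic sketch: FIT delta -> expected extra events per year.
# Assumptions (not from the paper): fleet size, 8760-hour year.

HOURS_PER_YEAR = 24 * 365        # 8760 hours, ignoring leap years
FIT_DEVICE_HOURS = 1e9           # 1 FIT = 1 failure per 10^9 device-hours


def expected_events_per_year(fit_per_device: float, num_devices: int) -> float:
    """Expected number of events per year for a fleet at a given FIT rate."""
    device_hours = num_devices * HOURS_PER_YEAR
    return fit_per_device * device_hours / FIT_DEVICE_HOURS


if __name__ == "__main__":
    extra_fit = 5.71             # upper bound quoted in the abstract
    fleet_devices = 1_000_000    # hypothetical fleet size
    extra = expected_events_per_year(extra_fit, fleet_devices)
    print(f"Up to about {extra:.1f} additional uncorrectable-error faults per year")
```

Because the relationship is linear, the estimate scales directly with fleet size: ten times as many devices gives ten times the expected count.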