CARE: Coordinated Augmentation for Elastic Resilience on DRAM Errors in Data Centers

2021 27th IEEE International Symposium on High-Performance Computer Architecture (HPCA 2021)

Abstract
As computation density and memory capacity continue to grow, DRAM errors have become the leading cause of server crashes and system failures in modern data centers. Although many techniques have been proposed to mitigate their impact on system reliability, these solutions either incur significant overhead in performance, power, and memory capacity or require modifying multiple system components; hence, they are impractical to implement or deploy. This paper proposes CARE, a novel error tolerance framework for efficient and elastic resilience against DRAM errors. It introduces a cache-like structure in the memory controller for dynamic error tracking and proactive resilience enhancement, achieving high error tolerance economically and practically. Experimental results show that, with around 58 KB of area overhead in the memory controller, CARE achieves near-Chipkill reliability without any memory capacity penalty and incurs negligible performance overhead relative to baseline SEC-DED systems. CARE provides an attractive alternative for enhancing reliability in data centers.
Key words
DRAM errors, memory controller, page retirement, data centers