Boosting Correlated Failure Repair in SSD Data Centers

Junmei Chen, Zongpeng Li, Qifu Tyler Sun, Ne Wang, Lina Su

IEEE Internet of Things Journal (2023)

Abstract
Current data centers rely on failure protection mechanisms to ensure data reliability. However, recent research indicates that failures within the same node or rack are common in data centers that use flash-based solid-state drives (SSDs) as the primary storage medium. Such correlated failures make it hard for traditional protection mechanisms to achieve both high reliability and good repair performance. To this end, we propose a product erasure code (PECode) that cooperatively encodes the data blocks of multiple stripes to generate intra-stripe and inter-stripe parity blocks. We then design a multi-stripe cooperative repair algorithm (MSCRepair). MSCRepair first creates a failure distribution matrix (FDM) that represents the distribution of failed blocks across nodes and racks, and then conducts FDM-guided repair to minimize cross-rack traffic upon correlated failures. We prove that MSCRepair achieves the least cross-rack repair traffic at the cost of a longer repair time. We further propose a correlated failure repair scheduling algorithm for MSCRepair, which reduces the repair time by balancing the load and delivering data over links with higher bandwidth. We evaluate MSCRepair through both large-scale simulations and real experiments. Compared with state-of-the-art alternatives, MSCRepair reduces cross-rack traffic by 19.6% to 49.9% and the recovery time of correlated failures by 16.2% to 51.4%.
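As an illustration of the failure distribution matrix idea, the following is a minimal sketch and not the paper's definition. It assumes the FDM is a racks-by-nodes matrix whose entries count failed blocks; the names `build_fdm` and `failures_per_rack`, and the representation of a failed block as a `(rack, node)` pair, are hypothetical.

```python
from typing import Iterable, List, Tuple


def build_fdm(
    failed_blocks: Iterable[Tuple[int, int]],
    num_racks: int,
    nodes_per_rack: int,
) -> List[List[int]]:
    """Count failed blocks per (rack, node).

    Assumption: rows are racks, columns are nodes within a rack, and each
    entry is the number of failed blocks stored on that node.
    """
    fdm = [[0] * nodes_per_rack for _ in range(num_racks)]
    for rack, node in failed_blocks:
        fdm[rack][node] += 1
    return fdm


def failures_per_rack(fdm: List[List[int]]) -> List[int]:
    """Sum each row to see how failures concentrate in racks."""
    return [sum(row) for row in fdm]


# Hypothetical example: three failed blocks in rack 0, one in rack 2.
failed = [(0, 0), (0, 0), (0, 1), (2, 3)]
fdm = build_fdm(failed, num_racks=3, nodes_per_rack=4)
rack_totals = failures_per_rack(fdm)
```

Under these assumptions, rows with large totals mark racks where failures are correlated, which is the kind of information a repair algorithm could use when deciding where to reconstruct blocks so that less data crosses rack boundaries.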
Keywords
Erasure coding, correlated failures, heterogeneous network, SSD-based, repair link