Evaluation of Distributed Recovery in Large-Scale Storage Systems

HPDC (2004)

Abstract
Storage clusters consisting of thousands of disk drives are now being used for both their large capacity and their high throughput. However, their reliability is far worse than that of smaller storage systems because of the increased number of storage nodes. RAID technology is no longer sufficient to guarantee the high data reliability such systems need, because disk rebuild time lengthens as disk capacity grows. In this paper, we present FAst Recovery Mechanism (FARM), a distributed recovery approach that exploits excess disk capacity and reduces data recovery time. FARM works in concert with replication and erasure-coding redundancy schemes to dramatically lower the probability of data loss in large-scale storage systems. By simulating system behavior under disk failures, we have examined essential factors that influence system reliability, performance, and cost: failure detection, disk bandwidth usage for recovery, disk space utilization, disk drive replacement, and system scale. Our results show the reliability improvement from FARM and demonstrate the impact of these factors on system reliability. Using our techniques, system designers will be better able to build multi-petabyte storage systems with much higher reliability at lower cost than previously possible.
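The abstract's two timing claims (rebuild time grows with disk capacity, and distributing recovery shortens it) can be illustrated with back-of-the-envelope arithmetic. The sketch below is not the paper's simulator or FARM itself. It assumes a fixed per-disk rebuild bandwidth, perfectly balanced parallel recovery, and no network or source-disk bottleneck. The function names and all parameter values are hypothetical and chosen only for illustration.

```python
def single_disk_rebuild_hours(capacity_gb, rebuild_mb_per_s):
    """Hours to copy a whole failed disk onto one spare at a fixed rebuild rate.

    Uses decimal units (1 GB = 1000 MB).
    """
    return capacity_gb * 1000 / rebuild_mb_per_s / 3600


def distributed_rebuild_hours(capacity_gb, rebuild_mb_per_s, n_targets):
    """Idealized hours when the lost data is rebuilt in parallel onto n_targets disks.

    Assumes perfect load balance, so each target receives capacity / n_targets
    at the same per-disk rebuild rate. Real systems will be slower than this.
    """
    if n_targets < 1:
        raise ValueError("n_targets must be at least 1")
    return single_disk_rebuild_hours(capacity_gb, rebuild_mb_per_s) / n_targets


if __name__ == "__main__":
    # Hypothetical parameters, not taken from the paper.
    for capacity_gb in (250, 500, 1000):
        single = single_disk_rebuild_hours(capacity_gb, rebuild_mb_per_s=10)
        spread = distributed_rebuild_hours(capacity_gb, rebuild_mb_per_s=10, n_targets=50)
        print(f"{capacity_gb} GB: single spare {single:.1f} h, 50 targets {spread:.2f} h")
```

Under these assumptions the single-spare rebuild time scales linearly with capacity, while the distributed figure shrinks in proportion to the number of disks sharing the recovery work. A shorter rebuild means a shorter window in which a second failure can cause data loss.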
Keywords
excess disk capacity, disk drive, large-scale storage systems, influence system reliability, disk capacity, disk space utilization, disk bandwidth usage, higher reliability, disk failure, necessary high data reliability, disk drive replacement, redundancy, distributed processing, application software, throughput, system performance, erasure code, RAID, RAID technology, computational modeling, high throughput, system design, internet, storage system, bandwidth, reliability engineering