Network Aware Reliability Analysis for Distributed Storage Systems.

Symposium on Reliable Distributed Systems Proceedings (2016)

Abstract
It is hard to measure the reliability of a large distributed storage system, since it is influenced by low-probability failure events that occur over time. Nevertheless, it is critical to be able to predict reliability in order to plan, deploy and operate the system. Existing approaches suffer from unrealistic assumptions regarding network bandwidth. This paper introduces a new framework that combines simulation and an analytic model to estimate durability for large distributed cloud storage systems. Our approach is the first that takes into account network bandwidth, with a focus on the cumulative effect of simultaneous failures on repair time. Using our framework, we evaluate the trade-offs between durability, network and storage costs for the OpenStack Swift object store, comparing various system configurations and resiliency schemes, including replication and erasure coding. In particular, we show that when accounting for the cumulative effect of simultaneous failures, estimates of the probability of data loss can vary by two to four orders of magnitude.
Keywords
Erasure codes,Repair bandwidth,Estimation,Analytic model,Simulation,Durability