Fail-Slow at Scale

Haryadi S. Gunawi, Riza O. Suminto, Russell Sears, Casey Golliher, Swaminathan Sundararaman, Xing Lin, Tim Emami, Weiguang Sheng, Nematollah Bidokhti, Caitie McCaffrey, Deepthi Srinivasan, Biswaranjan Panda, Andrew Baptist, Gary Grider, Parks M. Fields, Kevin Harms, Robert B. Ross, Andree Jacobson, Robert Ricci, Kirk Webb, Peter Alvaro, H. Birali Runesha, Mingzhe Hao, Huaicheng Li

ACM Transactions on Storage (2018)

Abstract
Fail-slow hardware is an under-studied failure mode. We present a study of 114 reports of fail-slow hardware incidents, collected from large-scale cluster deployments in 14 institutions. We show that all hardware types, such as disk, SSD, CPU, memory, and network components, can exhibit performance faults. We make several important observations: faults convert from one form to another, the cascading root causes and impacts can be long, and fail-slow faults can have varying symptoms. From this study, we make suggestions to vendors, operators, and systems designers.