Failure Recovery: When the Cure Is Worse Than the Disease.

HotOS'13: Proceedings of the 14th USENIX conference on Hot Topics in Operating Systems (2013)

Abstract
Cloud services inevitably fail: machines lose power, networks become disconnected, pesky software bugs cause sporadic crashes, and so on. Unfortunately, failure recovery itself is often faulty; e.g., recovery can accidentally and recursively replicate small failures to other machines until the entire cloud service fails in a catastrophic outage, amplifying a small cold into a contagious deadly plague! We propose that failure recovery should be engineered foremost according to the maxim of primum non nocere, that it "does no harm." Accordingly, we must consider the system holistically when failure occurs and recover only when observed activity safely allows for it.
Keywords
failure recovery, small failure, cloud service, entire cloud service, small cold, catastrophic outage, observed activity, pesky software bug, primum non nocere, sporadic crash