Engineering a Failure Detection Service for Widely Distributed Systems

msra(2005)

Abstract
Unreliable failure detectors are recognized as important building blocks for implementing fault-tolerant distributed systems. Further, there has been much discussion of how to equip them with sophisticated features that allow for adaptation, flexible use, scalability, and quality-of-service enforcement. Despite that, we are not aware of any real distributed system that uses a sophisticated failure detection service. In fact, most deployed systems use the trivial failure detection scheme provided by the underlying communication technologies (e.g., TCP/IP timeouts). We believe that this state of affairs is due to two main reasons: i) there is no widely supported failure detection service API that incorporates these advanced features in a suitable way; and ii) the benefits of using a sophisticated failure detection service are not clearly understood. This paper targets the first issue by proposing a failure detection service that addresses the main needs of widely distributed systems and implements state-of-the-art failure detection mechanisms. Moreover, to improve the usability of the service, we took special care in the design of its programming interface.