Fingerpointing: Just-in-Time Problem Diagnosis for Distributed Systems


Distributed systems are growing in both size and complexity. As a result, when a component fails, it can be difficult to determine which part of the system failed, let alone the cause of the failure. As the cost of downtime in production systems increases, it becomes economically important for problems to be discovered and repaired quickly. Recent research has explored the creation of tools that attempt to automatically determine which component in a system failed. The resulting tools and algorithms are able to successfully implicate a faulty component in many situations. However, at this time, these tools see little, if any, use in monitoring production systems. A likely reason why systems administrators have not embraced tools for problem diagnosis is that existing tools cannot provide timely notification of problems. These tools collect data from the observed system continually, but perform analysis only a posteriori, after a failure has already been detected by other means. Offline analysis, while useful for the evaluation of fingerpointing algorithms, is unlikely to excite operators in industry’s data centers. This thesis describes and evaluates FPT, a framework for online fingerpointing. FPT emphasizes flexibility while working to keep processing overheads low. The primary goal of this work is to explore the feasibility of just-in-time problem diagnosis. Additionally, it is hoped that FPT’s flexibility will aid future research in problem diagnosis.