Root Cause Analysis In Microservice Using Neural Granger Causal Discovery
CoRR (2024)
Abstract
In recent years, microservices have gained widespread adoption in IT
operations due to their scalability, maintainability, and flexibility. However,
when a system malfunctions, the complex relationships among microservices make
it challenging for site reliability engineers (SREs) to pinpoint the root
cause. Previous research employed structure learning methods (e.g., the
PC-algorithm) to establish causal relationships and derive root causes from
causal graphs. Nevertheless, these methods ignore the temporal order of time
series data and fail to leverage the rich information inherent in temporal
relationships. For instance, a sudden spike in CPU utilization can increase
latency in other microservices, but the CPU anomaly occurs before the latency
increase rather than simultaneously with it. The PC-algorithm fails to capture
such lagged dependencies. To address these challenges, we propose
RUN, a novel approach for root cause analysis using neural Granger causal
discovery with contrastive learning. RUN enhances the backbone encoder by
integrating contextual information from time series, and leverages a time
series forecasting model to conduct neural Granger causal discovery. In
addition, RUN incorporates PageRank with a personalization vector to
efficiently recommend the top-k root causes. Extensive experiments on
synthetic and real-world microservice-based datasets demonstrate that RUN
noticeably outperforms state-of-the-art root cause analysis methods.
Moreover, we provide an analysis scenario for the sock-shop case to showcase
the practicality and efficacy of RUN in microservice-based applications. Our
code is publicly available at https://github.com/zmlin1998/RUN.
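
The two ideas the abstract combines, lag-aware (Granger) causal discovery and PageRank with a personalization vector, can be illustrated with a minimal sketch. This is not RUN's implementation: it swaps the neural forecasting model for a plain linear least-squares Granger test, and the metric names, the 0.5 threshold, the reversed edge direction for the walk, and the personalization weights are all assumptions chosen for illustration.

```python
import numpy as np

def lagged(x, max_lag):
    """Stack x[t-1], ..., x[t-max_lag] as columns for t = max_lag..T-1."""
    T = len(x)
    return np.column_stack([x[max_lag - k : T - k] for k in range(1, max_lag + 1)])

def granger_gain(cause, effect, max_lag=3):
    """Relative drop in residual sum of squares when lagged `cause` is added
    to an autoregressive model of `effect` (linear Granger, not neural)."""
    y = effect[max_lag:]
    ones = np.ones((len(y), 1))
    restricted = np.hstack([ones, lagged(effect, max_lag)])
    full = np.hstack([restricted, lagged(cause, max_lag)])

    def rss(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        r = y - X @ beta
        return float(r @ r)

    r0, r1 = rss(restricted), rss(full)
    return (r0 - r1) / r0

def personalized_pagerank(adj, personalization, damping=0.85, iters=100):
    """Power iteration; adj[i, j] > 0 lets the walker step from node i to j."""
    p = personalization / personalization.sum()
    row_sums = adj.sum(axis=1, keepdims=True)
    safe = np.where(row_sums == 0, 1.0, row_sums)
    # Nodes without outgoing edges restart according to p.
    trans = np.where(row_sums > 0, adj / safe, p)
    rank = p.copy()
    for _ in range(iters):
        rank = (1 - damping) * p + damping * (rank @ trans)
    return rank

# Synthetic metrics (hypothetical names): cpu leads latency by two steps.
rng = np.random.default_rng(0)
T = 500
cpu = rng.normal(size=T)
latency = 0.1 * rng.normal(size=T)
latency[2:] += 0.9 * cpu[:-2]
orders = rng.normal(size=T)  # unrelated metric

names = ["cpu", "latency", "orders"]
series = [cpu, latency, orders]
n = len(series)
gain = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j:
            gain[i, j] = granger_gain(series[i], series[j])

# causal[i, j] > 0 means "i Granger-causes j"; 0.5 is an arbitrary cutoff.
causal = np.where(gain > 0.5, gain, 0.0)

# Assumption: walk from effects back to causes, starting mostly at the
# metric where the alert fired (latency).
walk_adj = causal.T
personalization = np.array([0.2, 0.6, 0.2])
rank = personalized_pagerank(walk_adj, personalization)
top_k = [names[i] for i in np.argsort(-rank)[:2]]
print(top_k)
```

Because the test regresses each metric on the past values of the others, it can detect that the CPU anomaly precedes the latency increase, which a method that only looks at simultaneous dependencies would miss.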