CRISP: Critical Path Analysis of Large-Scale Microservice Architectures

Zhizhou Zhang, Murali Krishna Ramanathan, Prithvi Raj, Abhishek Parwal, Timothy Sherwood, Milind Chabbi

USENIX Annual Technical Conference (USENIX ATC), 2022


Abstract

Microservice architectures have become the lifeblood of modern service-oriented software systems. Remote Procedure Calls (RPCs) among microservices are deeply nested, asynchronous, and large in number, which makes it very hard to identify the underlying service(s) that contribute to the overall end-to-end latency experienced by a top-level request. State-of-the-art RPC tracing tools collect a deluge of data but provide little insight. We need sophisticated tools to bubble up signals from a myriad of RPC traces to assist developers in identifying optimization opportunities, pinpointing common bottlenecks, setting appropriate timeouts, diagnosing error conditions, and planning and managing compute capacity, to name a few. In this paper, we present CRISP, a tool to perform critical path analysis (CPA) over a large number of traces collected from RPCs in microservices environments. CRISP provides three critical performance analysis capabilities: a) a top-down CPA of any specific endpoint, which is tailored for service owners to drill down into the root causes of latency issues, b) a bottom-up CPA over all endpoints in the system, tailored for infrastructure and performance engineers, to bubble up those (interior) APIs that have a high impact across many endpoints, and c) on-the-fly anomaly detection to alert on potential problems. We have applied CRISP's capabilities to Uber's entire backend system, composed of roughly 40K endpoints that cater to real-time requests from more than 100 million active daily users worldwide. Using the critical path as the basis of performance analysis has a) helped us identify five performance issues and optimization opportunities across two business-critical microservices, b) guided our future hardware choices to reduce end-to-end latencies, and c) reduced the false positives in anomaly detection by up to 50% while speeding up training and inference by up to 28x and up to 67x, respectively, over the state of the art.
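To make the idea of a critical path through an RPC trace concrete, the following is a minimal sketch and not CRISP's actual implementation, which the abstract does not describe. It assumes a trace is a tree of spans with start and end timestamps, that every child span lies within its parent's interval, and that span names are unique. The `Span` class and the `critical_path` function are hypothetical names introduced only for this illustration. The walk goes backward from the end of a span, follows the child that finishes last before the current time cursor, and charges any uncovered time to the parent itself.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical span record: one RPC with its child RPCs."""
    name: str
    start: float
    end: float
    children: list = field(default_factory=list)


def critical_path(span, on_path=None):
    """Return {span name: time spent on the critical path}.

    Assumes children lie within the parent's [start, end] interval.
    """
    if on_path is None:
        on_path = {}
    cursor = span.end          # walk backward in time from the span's end
    self_time = 0.0
    for child in sorted(span.children, key=lambda c: c.end, reverse=True):
        if child.end > cursor:
            # This child overlaps a later-finishing sibling that is
            # already on the path, so it did not gate the parent.
            continue
        self_time += cursor - child.end   # gap where the parent itself ran
        critical_path(child, on_path)
        cursor = child.start
    self_time += cursor - span.start      # time before the first path child
    on_path[span.name] = on_path.get(span.name, 0.0) + self_time
    return on_path


# Illustrative trace: A and B run in parallel under root; C runs inside B.
trace = Span("root", 0, 100, [
    Span("A", 10, 40),
    Span("B", 20, 90, [Span("C", 30, 80)]),
])
result = critical_path(trace)
print(result)
```

In this made-up trace, `A` never appears in the result because it overlaps the longer-running `B`, so speeding up `A` would not shorten the request. The per-span times sum to the root span's duration, which is the property that makes the critical path a useful basis for attributing end-to-end latency.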
