Fail through the Cracks: Cross-System Interaction Failures in Modern Cloud Systems

EuroSys '23: Proceedings of the Eighteenth European Conference on Computer Systems (2023)

Abstract
Modern cloud systems are orchestrations of independent and interacting (sub-)systems, each specializing in important services (e.g., data processing, storage, resource management). Hence, cloud system reliability is affected not only by the reliability of each individual system, but also by the interplay between these systems. We observe that many recent production incidents of cloud systems manifest through interactions across system boundaries. However, there is a lack of systematic understanding of this emerging mode of failure, which we term cross-system interaction failures (or CSI failures). This hinders the development of better design, integration practices, and new tooling. In this paper, we discuss cross-system interaction failures based on analyses of (1) 11 CSI-failure-induced cloud incidents of Google, Azure, and AWS, and (2) 120 CSI failure cases of seven widely co-deployed open-source systems. We focus on understanding discrepancies between interacting systems as the root causes of CSI failures; CSI failures cannot be understood by analyzing a single system in isolation. This paper draws attention to this emerging failure mode, provides a comprehensive understanding of CSI failure patterns, and discusses potential approaches for mitigation. We advocate for cross-system testing and verification and demonstrate its potential by cross-testing the Spark-Hive data plane and exposing 15 new discrepancies.
Keywords
Cross-system interaction, failure study, root cause analysis, cloud system