How to Manage Change-Induced Incidents? Lessons from the Study of Incident Life Cycle

2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE), 2023

Abstract
In online service systems, software changes cause a majority of incidents (i.e., unplanned interruptions and outages). Managing change-induced incidents efficiently is crucial for ensuring the reliability and availability of online service systems. Understanding these incidents can help improve change-induced incident management. The task is challenging because the life cycle of change-induced incidents is complicated by diverse change deployment and incident resolution procedures. Detailed records of the incidents and changes, together with a comprehensive analysis, are needed to gain an in-depth understanding. In this paper, we conduct a qualitative and quantitative study of 231 change-induced incidents in a real-world, large-scale online service system. Detailed change tickets and incident timelines in the post-mortems provide extensive information about the incident life cycle, enabling us to understand each incident in depth. Based on the data, we give a generic model of the complicated life cycle of change-induced incidents. Following the model, we systematically study the whole life cycle of these incidents, including the introduction and resolution stages, and identify what affects the efficiency of resolution. We obtain 9 major findings from our study. Based on the findings, we discuss existing techniques and promising future directions for improving change-induced incident management.
Keywords
incident management,software change,online service system,empirical study,life cycle