Understanding and Mitigating Hardware Failures in Deep Learning Training Accelerator Systems

Yi He, Mike Hutton, Steven Chan, Robert de Gruijl, Rama Govindaraju, Nishant Patil, Yanjing Li

Proceedings of the 50th Annual International Symposium on Computer Architecture (ISCA 2023), 2023

Abstract
Deep neural network (DNN) training workloads are increasingly susceptible to hardware failures in datacenters. For example, Google experienced "mysterious, difficult to identify problems" in their TPU training systems due to hardware failures [7]. Although these particular problems were subsequently corrected through significant effort, they have raised the urgency of addressing the growing challenges that hardware failures pose to many DNN training workloads. In this paper, we present the first in-depth resilience study targeting DNN training workloads and hardware failures that occur in the logic portion of deep learning (DL) accelerator systems. We developed a fault injection framework to accurately simulate the effects of various hardware failures based on the design of an industrial DL accelerator, and conducted more than 2.9M experiments (more than 490K node-hours) using representative workloads. Based on our experiments, we present (1) a comprehensive characterization of hardware failure effects, (2) a fundamental understanding of how hardware failures propagate in training devices and interact with training workloads, and (3) the necessary conditions that must be satisfied for these failures to eventually cause unexpected training outcomes. The insights obtained from our study enabled us to develop ultra-lightweight software techniques to mitigate hardware failures. Our techniques require 24 to 32 lines of code change and introduce 0.003% to 0.025% performance overhead for various representative workloads. Our observations and techniques are generally applicable to mitigating various hardware failures in DL training accelerator systems.
Keywords
Deep learning accelerator systems, neural network training, resilience, reliability, hardware failures, silent data corruption