Quantifying the Impact of Memory Errors in Deep Learning

2019 IEEE International Conference on Cluster Computing (CLUSTER)

Abstract
The use of deep learning (DL) on HPC resources has become common as scientists explore and exploit DL methods to solve domain problems. At the same time, in the coming exascale computing era, a high error rate is expected to be problematic for most HPC applications. The impact of errors on DL applications, especially DL training, remains unclear given their stochastic nature. In this paper, we focus on understanding DL training applications on HPC in the presence of silent data corruption. Specifically, we design and perform a quantification study with three representative applications, manually injecting silent data corruption errors (SDCs) across the design space and comparing training results with the error-free baseline. The results show that only 0.61-1.76% of SDCs cause training failures, and, taking the SDC rate of modern hardware into account, the actual chance of a failure is one in thousands to millions of executions. With this quantitatively measured impact, computing centers can make rational design decisions based on their application portfolio, their acceptable failure rate, and financial constraints; for example, they can determine their confidence in the correctness of training results produced on processors without error correction code (ECC) RAM. We also discover that 75-90% of the SDCs that cause catastrophic errors can be easily detected via the training loss in the next iteration. We therefore propose an error-aware software solution to correct catastrophic errors, which has significantly lower time and space overhead than algorithm-based fault tolerance (ABFT) and ECC.
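The abstract does not spell out how the error-aware correction is implemented. The following is a minimal sketch, under our own assumptions, of how loss-based SDC detection with rollback could look in a PyTorch training loop; the class LossSpikeGuard and its parameters spike_factor and warmup are hypothetical names introduced here for illustration, not taken from the paper.

```python
import copy
import torch
import torch.nn as nn

# Hypothetical loss-spike detector: flags an iteration whose loss jumps far
# above the running average of recent losses. The paper reports that most
# catastrophic SDCs surface as such a spike within one iteration.
class LossSpikeGuard:
    def __init__(self, spike_factor=10.0, warmup=20):
        self.spike_factor = spike_factor  # multiple of the running mean treated as a spike
        self.warmup = warmup              # iterations observed before flagging anything
        self.count = 0
        self.running_mean = 0.0

    def is_spike(self, loss_value):
        self.count += 1
        if self.count <= self.warmup:
            # build a baseline before making any decisions
            self.running_mean += (loss_value - self.running_mean) / self.count
            return False
        spike = loss_value > self.spike_factor * max(self.running_mean, 1e-8)
        if not spike:
            self.running_mean += (loss_value - self.running_mean) / self.count
        return spike


# Toy model and synthetic data, purely for illustration.
model = nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()
guard = LossSpikeGuard()
snapshot = copy.deepcopy(model.state_dict())  # lightweight in-memory checkpoint

for step in range(200):
    x = torch.randn(32, 16)
    y = torch.randn(32, 1)
    loss = criterion(model(x), y)
    if guard.is_spike(loss.item()):
        # Suspected catastrophic SDC: restore the last known-good weights
        # and skip this update instead of propagating the corruption.
        model.load_state_dict(snapshot)
        continue
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    snapshot = copy.deepcopy(model.state_dict())  # refresh the known-good snapshot
```

Compared with ABFT checksums or ECC hardware, this kind of software check only adds one scalar comparison per iteration plus the cost of keeping a single weight snapshot, which is consistent with the low time and space overhead the abstract claims.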
Keywords
error correction code RAM,error-aware software solution,memory errors,deep learning,HPC resources,silent data corruption errors,training failures,high performance computing