SparseFT: Sparsity-aware Fault Tolerance for Reliable CNN Inference on GPUs

2023 32ND INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURES AND COMPILATION TECHNIQUES, PACT(2023)

引用 0|浏览0
暂无评分
摘要
Graphics Processing Units (GPUs), while offering exceptional performance for CNN inference tasks, are susceptible to both transient and permanent hardware faults due to the integration of numerous processing elements and advancements in technology scaling. This paper proposes a novel and cost-effective fault mitigation technique, called Sparsity-aware Fault Tolerance (SparseFT), to ensure reliable CNN inference on GPUs. SparseFT leverages inherent sparsity in the activation maps to detect and correct errors on the processing elements without hardware redundancy. By exploiting the characteristic of dot-products, where multiplications with zero operands are ineffectual, SparseFT dynamically duplicates an effectual computation (i.e., a multiplication with non-zero operands) to the processing element initially assigned to the ineffectual one. It then compares the duplicated computation results to detect errors. Experimental results demonstrate that SparseFT achieves more than 97% error detection coverage with less than 1% performance overhead for the state-of-the-art CNN models.
更多
查看译文
关键词
Reliability,CNN,GPU,Sparsity
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要