Investigating and Reducing the Architectural Impact of Transient Faults in Special Function Units for GPUs

Journal of Electronic Testing(2024)

引用 0|浏览0
暂无评分
摘要
Ensuring the reliability of GPUs and their internal components is paramount, especially in safety-critical domains like autonomous machines and self-driving cars. These cutting-edge applications heavily rely on GPUs to implement complex algorithms due to their implicit programming flexibility and parallelism, which is crucial for efficient operation. However, as integration technologies advance, there is a growing concern regarding the potential increase in fault sensitivity of the internal components of current GPU generations. In particular, Special Function Unit (SFU) cores inside GPUs are used in multimedia, High-Performance Computing, and neural network training. Despite their frequent usage and critical role in several domains, reliability evaluations on SFUs and the development of effective mitigation solutions have yet to be studied and remain unexplored. This work evaluates the impact of transient faults in the main hardware structures of SFUs in GPUs. In addition, we analyze the main overhead costs and benefits of developing selective-hardening mechanisms for SFUs. We focus on evaluating and analyzing two SFU architectures for GPUs (’fused’ and ’modular’) and their relations to energy, area, and reliability impact on parallel applications. The experiments resort to fine-grain fault injection campaigns on an RTL GPU model (FlexGripPlus) instrumented with both SFUs. The results on both SFU architectures indicate that fused SFUs (in commercial-grade devices) require lower area overhead (about 27
更多
查看译文
关键词
Graphics processing units (GPUs),Fault-tolerance,Reliability evaluation,Special function unit (SFU),T-Stream core
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要