Characterizing Soft-Error Resiliency in Arm's Ethos-U55 Embedded Machine Learning Accelerator
arxiv(2024)
摘要
As Neural Processing Units (NPU) or accelerators are increasingly deployed in
a variety of applications including safety critical applications such as
autonomous vehicle, and medical imaging, it is critical to understand the
fault-tolerance nature of the NPUs. We present a reliability study of Arm's
Ethos-U55, an important industrial-scale NPU being utilised in embedded and IoT
applications. We perform large scale RTL-level fault injections to characterize
Ethos-U55 against the Automotive Safety Integrity Level D (ASIL-D) resiliency
standard commonly used for safety-critical applications such as autonomous
vehicles. We show that, under soft errors, all four configurations of the NPU
fall short of the required level of resiliency for a variety of neural networks
running on the NPU. We show that it is possible to meet the ASIL-D level
resiliency without resorting to conventional strategies like Dual Core Lock
Step (DCLS) that has an area overhead of 100
protection, where hardware structures are selectively protected (e.g.,
duplicated, hardened) based on their sensitivity to soft errors and their
silicon areas. To identify the optimal configuration that minimizes the area
overhead while meeting the ASIL-D standard, the main challenge is the large
search space associated with the time-consuming RTL simulation. To address this
challenge, we present a statistical analysis tool that is validated against Arm
silicon and that allows us to quickly navigate hundreds of billions of fault
sites without exhaustive RTL fault injections. We show that by carefully
duplicating a small fraction of the functional blocks and hardening the Flops
in other blocks meets the ASIL-D safety standard while introducing an area
overhead of only 38
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要