A Characterization of Soft-error Sensitivity in Data-parallel and Model-parallel Distributed Deep Learning

Journal of Parallel and Distributed Computing(2024)

引用 0|浏览4
暂无评分
摘要
The latest advances in artificial intelligence deep learning models are unprecedented. A wide spectrum of application areas is now thriving thanks to available massive training datasets and gigantic complex neural network models. Those two characteristics demand outstanding computing power that only advanced computing platforms can provide. Therefore, distributed deep learning has become a necessity in capitalizing on the potential of cutting-edge artificial intelligence. Two basic schemes have emerged in distributed learning. First, the data-parallel approach, which aims at dividing the training dataset into multiple computing nodes. Second, the model-parallel approach, which splits layers of a model into several computing nodes. Each scheme has its upsides and downsides, particularly when running on large machines that are susceptible to soft errors. Those errors occur as a consequence of several factors involved in the manufacturing process of current electronic components of supercomputers. On many occasions, those errors are expressed as bit flips that do not cause the whole system to crash, but generate wrong numerical results in computations. To study the effect of soft error on different approaches for distributed learning, we leverage checkpoint alteration, a technique that injects bit flips on checkpoint files. It allows researchers to understand the effect of soft errors on applications that produce checkpoint files in HDF5 format. This paper uses the popular deep learning PyTorch tool on two distributed-learning platforms: one for data-parallel training and one for model-parallel training. We use well-known deep learning models with popular training datasets to provide a picture of how soft errors challenge the training phase of a deep learning model.
更多
查看译文
关键词
Deep learning,resilience,checkpoint,neural networks,fault injection
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要