Characterizing Impacts of Storage Faults on HPC Applications: A Methodology and Insights

2021 IEEE International Conference on Cluster Computing (CLUSTER), 2021

Abstract
In recent years, the increasing complexity of scientific simulations and the emerging demand for training large artificial intelligence models have required massive and fast data access, pushing high-performance computing (HPC) platforms to adopt more advanced storage infrastructure such as solid-state disks (SSDs). While SSDs offer high-performance I/O, the reliability challenges faced by the ...
Keywords
Training, Fault diagnosis, Solid modeling, Fault tolerance, File systems, Computational modeling, Fault tolerant systems