Towards resilient EU HPC systems: a blueprint

Proceedings of the 16th ACM International Conference on Computing Frontiers(2019)

引用 8|浏览13
暂无评分
摘要
In high-performance computing (HPC) a single tightly-coupled job may execute for days on thousands of servers. Since a server failure typically leads to cascading effects on the whole job, requiring redundancy and/or aggressive checkpointing to prevent the whole job from failing. This has an adverse impact on the system performance and resource usage; which limits the ability to scale to larger systems. System resiliency is therefore one of the most important Exascale requirements and challenges.
更多
查看译文
关键词
resilient eu hpc systems,blueprint
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要