Understanding Practical Tradeoffs in HPC Checkpoint-Scheduling Policies.

IEEE Transactions on Dependable and Secure Computing(2018)

引用 18|浏览56
暂无评分
摘要
As the scale of High-Performance Computing (HPC) clusters continues to grow, their increasing failure rates and energy consumption levels are emerging as serious design concerns. Efficiently running systems at such large scales critically relies on deploying effective, practical methods for fault tolerance while having a good understanding of their respective performance and energy overheads. The ...
更多
查看译文
关键词
Checkpointing,Fault tolerance,Fault tolerant systems,Writing,Shape,Optimized production technology,Runtime
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要