Investigating the Interplay between Energy Efficiency and Resilience in High Performance Computing

International Parallel & Distributed Processing Symposium (2015)

Abstract
Energy efficiency and resilience are two crucial challenges that HPC systems must overcome to reach exascale. While each has been studied extensively on its own, little has been done to understand the interplay between energy efficiency and resilience in HPC systems. Decreasing the supply voltage associated with a given operating frequency for processors and other CMOS-based components can significantly reduce power consumption. However, this often raises system failure rates and consequently increases application execution time. In this work, we present an energy-saving undervolting approach that leverages mainstream resilience techniques to tolerate the increased failures caused by undervolting. Our strategy is directed by analytic models, which capture the impact of undervolting and the interplay between energy efficiency and resilience. Experimental results on a power-aware cluster demonstrate that our approach can save up to 12.1% energy compared to the baseline, and conserve up to 9.1% more energy than a state-of-the-art DVFS solution.
Keywords
energy, resilience, failures, undervolting, HPC
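
The trade-off described in the abstract can be illustrated with a small back-of-the-envelope sketch. It is not the paper's analytic model: it assumes the textbook CMOS dynamic-power relation (power proportional to capacitance times voltage squared times frequency), ignores static power, and uses a first-order estimate in which every failure adds a fixed recovery cost. All function names, parameters, and numbers below are hypothetical and chosen only for illustration.

```python
def dynamic_power(c_eff, voltage, freq):
    """Textbook CMOS dynamic power: P = C * V^2 * f (static power ignored)."""
    return c_eff * voltage ** 2 * freq


def expected_runtime(base_time, failure_rate, recovery_cost):
    """First-order runtime estimate (an assumption, not the paper's model).

    failure_rate is in failures per second; each failure is assumed to add
    recovery_cost seconds of restart and recomputation.
    """
    expected_failures = failure_rate * base_time
    return base_time + expected_failures * recovery_cost


def energy(c_eff, voltage, freq, base_time, failure_rate, recovery_cost):
    """Energy = power * expected runtime under the assumptions above."""
    power = dynamic_power(c_eff, voltage, freq)
    return power * expected_runtime(base_time, failure_rate, recovery_cost)


# Hypothetical operating points: same frequency, 10% lower supply voltage,
# with a non-zero failure rate appearing only in the undervolted case.
nominal = energy(c_eff=1.0, voltage=1.0, freq=1.0,
                 base_time=1000.0, failure_rate=0.0, recovery_cost=100.0)
undervolted = energy(c_eff=1.0, voltage=0.9, freq=1.0,
                     base_time=1000.0, failure_rate=1e-4, recovery_cost=100.0)

energy_ratio = undervolted / nominal
print(f"undervolted / nominal energy: {energy_ratio:.4f}")
```

Under these made-up inputs the quadratic power reduction outweighs the extra runtime spent recovering from failures. With a higher failure rate or a larger recovery cost the ratio rises above 1, which is the regime where undervolting stops paying off.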