Fault Tolerant Stencil Computation on Cloud-based GPU Spot Instances

IEEE Transactions on Cloud Computing (2019)

Abstract

This paper describes a fault-tolerant framework for distributed stencil computation on cloud-based GPU clusters. It uses pipelining to overlap data movement with computation in the halo region, and it parallelizes data movement within the GPUs. Instead of running stencil codes on traditional clusters and supercomputers, the computation is performed on the Amazon Web Services (AWS) GPU cloud and uses spot instances to improve cost-efficiency. The implementation relies on a low-cost fault-tolerance mechanism to handle the possible termination of spot instances. Coupled with a price bidding module, our stencil framework optimizes not only for performance but also for cost. Experimental results show that our framework outperforms state-of-the-art solutions, achieving a peak of 25 TFLOPS for 2-D decomposition running on 512 nodes. We also show that the use of spot instances yields good cost-efficiency, increasing the average TFLOPS/USD from 132 to 360.
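
The overlap of halo data movement with computation mentioned in the abstract can be illustrated with a small sketch. This is not the paper's implementation: it is a sequential, pure-Python toy that assumes a 5-point Jacobi-style stencil, a halo width of one cell, and hypothetical helper names (`split_cells`, `overlapped_step`, `reference_step`). It only shows why the split works: interior cells never read the halo, so they can be updated while the halo exchange is still in flight, and only the boundary cells have to wait for it.

```python
def stencil_at(grid, i, j):
    """5-point update: average of the four direct neighbours (assumed stencil)."""
    return 0.25 * (grid[i - 1][j] + grid[i + 1][j] + grid[i][j - 1] + grid[i][j + 1])


def update_cells(grid, out, cells):
    """Apply the stencil to the listed cells, reading `grid` and writing `out`."""
    for i, j in cells:
        out[i][j] = stencil_at(grid, i, j)


def split_cells(n_rows, n_cols):
    """Split the owned cells of a padded tile (halo width 1) into two groups.

    Owned cells have indices 1..n-2 in each dimension. Boundary cells are
    adjacent to the halo; interior cells read only owned cells.
    """
    interior, boundary = [], []
    for i in range(1, n_rows - 1):
        for j in range(1, n_cols - 1):
            touches_halo = i in (1, n_rows - 2) or j in (1, n_cols - 2)
            (boundary if touches_halo else interior).append((i, j))
    return interior, boundary


def overlapped_step(tile, fresh_halo):
    """One step with the interior computed before the halo data is applied.

    `fresh_halo` maps halo coordinates to the values received from neighbours
    (a stand-in for an asynchronous exchange, which is not modelled here).
    """
    cur = [row[:] for row in tile]
    out = [row[:] for row in tile]
    interior, boundary = split_cells(len(cur), len(cur[0]))
    # Phase 1: interior cells do not depend on the halo, so on real hardware
    # this work could run while the exchange is in progress.
    update_cells(cur, out, interior)
    # Phase 2: the halo has "arrived"; only now can boundary cells be updated.
    for (i, j), value in fresh_halo.items():
        cur[i][j] = value
        out[i][j] = value
    update_cells(cur, out, boundary)
    return out


def reference_step(tile, fresh_halo):
    """Non-overlapped baseline: wait for the halo, then update every owned cell."""
    cur = [row[:] for row in tile]
    for (i, j), value in fresh_halo.items():
        cur[i][j] = value
    out = [row[:] for row in cur]
    interior, boundary = split_cells(len(cur), len(cur[0]))
    update_cells(cur, out, interior + boundary)
    return out
```

Both functions produce the same tile for the same inputs; the difference is only in when the halo values are needed. In a GPU setting the two phases would typically map to separate kernels or streams, which this sketch does not attempt to model.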

Keywords

Graphics processing units,Cloud computing,Kernel,Delays,Fault tolerance,Fault tolerant systems,Pipeline processing
