KubeGPU: efficient sharing and isolation mechanisms for GPU resource management in container cloud

Wenfeng Shen, Zhengsen Liu, Yunjie Tan, Zhaokai Luo,Zhou Lei

JOURNAL OF SUPERCOMPUTING(2022)

引用 0|浏览5
暂无评分
摘要
With the increasing number of new containerized applications, such as high performance and deep learning applications, started to reply on GPU, efficiently supporting GPU in container cloud becomes essential. While GPU sharing has been extensively studied for VM, limited work has been done for containers. Existing works only use a single specific GPU virtualization technique to deploy containers, like GPU pass-through or API forwarding, and lack remote GPU virtualization optimization. The limitations lead to low system throughput and container performance degradation due to the dynamic and heterogeneous nature of container resource requirement and GPU virtualization technique, and the problem of communication overhead and resource racing. Therefore, we designed and implemented KubeGPU, which extends Kubernetes to enable GPU sharing with adaptive share strategy. Adaptive sharing strategy gives KubeGPU the ability to make a dynamic choice of GPU virtualization to deploy containers according to available GPU resources and containers’ configuration parameters such as GPU resource requirement in order to achieve a good container performance and system throughput. Besides that, network-aware scheduling approach and fine-grained allocation of remote GPU resources are proposed to optimize remote GPU virtualization. Finally, using representative real-world workloads for HPC and deep learning, we demonstrate the superiority of KubeGPU compared to other existing works, and the effectiveness of KubeGPU in minimizing communication overhead and eliminating remote GPU resource racing.
更多
查看译文
关键词
Cloud computing,GPU virtualization,Resource scheduling,Container cloud,Kubernets
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要