Olympian: Scheduling GPU Usage in a Deep Neural Network Model Serving System.

Middleware '18: 19th International Middleware Conference, Rennes, France, December 2018

Abstract
Deep neural networks (DNNs) are emerging as important drivers for GPU (Graphics Processing Unit) usage. Cloud offerings now routinely include GPU-capable VMs, and GPUs are used for training and testing DNNs. A popular way to run inference (or testing) tasks with DNNs is to use middleware called a serving system; TensorFlow-Serving (TF-Serving) is an example of a DNN serving system. In this paper, we consider the problem of carefully scheduling multiple concurrent DNNs in a serving system on a single GPU to achieve fairness or service-differentiation objectives, a capability crucial to cloud-based TF-Serving offerings. In scheduling DNNs, we face two challenges: how to schedule, and switch between, different DNN jobs at low overhead; and how to account for their GPU usage. Our system, Olympian, extends TF-Serving to enable fair sharing of a GPU across multiple concurrent large DNNs at low overhead, a capability TF-Serving by itself cannot achieve. Specifically, Olympian can run concurrent instances of several large DNN models such as Inception, ResNet, GoogLeNet, AlexNet, and VGG, providing each with an equal share of the GPU while interleaving them at timescales of 1-2 ms and incurring an overhead of less than 2%. It achieves this by leveraging the predictability of GPU computations to profile GPU resource usage models offline, then using these models to switch between DNNs at low overhead.
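To make the scheduling idea in the abstract concrete, the sketch below illustrates one plausible reading of it: per-op GPU costs are profiled offline, and a scheduler then interleaves models at millisecond granularity by always resuming the model that has accumulated the least GPU time (max-min fair sharing). This is not the authors' implementation; the names Job, op_costs_ms, QUANTUM_MS, and schedule are hypothetical, and the real system dispatches ops to the GPU rather than simulating their cost.

```python
# Hedged sketch of fair GPU sharing driven by offline-profiled per-op costs.
# Assumed/illustrative throughout; not Olympian's actual code.
import heapq
from dataclasses import dataclass, field
from typing import List

QUANTUM_MS = 1.5  # target interleaving granularity (the paper reports 1-2 ms)

@dataclass(order=True)
class Job:
    gpu_time_ms: float                               # accumulated GPU usage; the fairness key
    name: str = field(compare=False)
    op_costs_ms: List[float] = field(compare=False)  # offline-profiled per-op GPU costs
    next_op: int = field(default=0, compare=False)

    def done(self) -> bool:
        return self.next_op >= len(self.op_costs_ms)

def schedule(jobs: List[Job]) -> None:
    """Interleave jobs at ~QUANTUM_MS granularity, always resuming the
    job that has received the least GPU time so far (max-min fairness)."""
    heap = list(jobs)
    heapq.heapify(heap)
    while heap:
        job = heapq.heappop(heap)  # least-served job runs next
        spent = 0.0
        while not job.done():      # fill one quantum with ops
            spent += job.op_costs_ms[job.next_op]
            job.next_op += 1       # a real system would enqueue this op on the GPU
            if spent >= QUANTUM_MS:
                break
        job.gpu_time_ms += spent   # charge the job for its usage
        if not job.done():
            heapq.heappush(heap, job)

# Example: two models with different per-op cost profiles receive
# near-equal accumulated GPU time while both are runnable.
jobs = [Job(0.0, "inception", [0.3] * 40), Job(0.0, "vgg", [0.5] * 60)]
schedule(jobs)
print({j.name: round(j.gpu_time_ms, 1) for j in jobs})
```

The key design point this sketch captures is why offline profiling matters: knowing each op's cost in advance lets the scheduler stop filling a quantum at the right op boundary, which is what makes 1-2 ms interleaving possible without preempting the GPU mid-kernel.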
Keywords
GPU Scheduling, Deep Neural Network, Serving System, Multiple Concurrent Requests, Predictable Latency, Scheduling Algorithms