iGniter: Interference-Aware GPU Resource Provisioning for Predictable DNN Inference in the Cloud
IEEE Transactions on Parallel and Distributed Systems (2023)
Abstract
GPUs are essential to accelerating latency-sensitive deep neural network (DNN) inference workloads in cloud datacenters. To fully utilize GPU resources, spatial sharing of GPUs among co-located DNN inference workloads becomes increasingly compelling. However, GPU sharing inevitably brings severe performance interference among co-located inference workloads, as motivated by an empirical measurement study of DNN inference on EC2 GPU instances. While existing works on guaranteeing inference performance service level objectives (SLOs) focus on either temporal sharing of GPUs or reactive GPU resource scaling and inference migration techniques, how to proactively mitigate such severe performance interference has received comparatively little attention. In this paper, we propose iGniter, an interference-aware GPU resource provisioning framework for cost-efficiently achieving predictable DNN inference in the cloud. iGniter comprises two key components: (1) a lightweight DNN inference performance model, which leverages practically accessible system and workload metrics to capture the performance interference; and (2) a cost-efficient GPU resource provisioning strategy that jointly optimizes the GPU resource allocation and adaptive batching based on our inference performance model, with the aim of achieving predictable performance of DNN inference workloads. We implement a prototype of iGniter based on the NVIDIA Triton inference server hosted on EC2 GPU instances. Extensive prototype experiments on four representative DNN models and datasets demonstrate that iGniter can guarantee the performance SLOs of DNN inference workloads with practically acceptable runtime overhead, while reducing the monetary cost by up to 25% in comparison to state-of-the-art GPU resource provisioning strategies.
Keywords
Cloud-based DNN inference, predictable performance, GPU resource provisioning, performance interference