IPA: Inference Pipeline Adaptation to Achieve High Accuracy and Cost-Efficiency
CoRR (2023)
Abstract
Efficiently optimizing multi-model inference pipelines for fast, accurate,
and cost-effective inference is a crucial challenge in machine learning
production systems, given their tight end-to-end latency requirements. To
simplify the exploration of the vast and intricate trade-off space of latency,
accuracy, and cost in inference pipelines, providers frequently opt to consider
only one of them. However, the challenge lies in reconciling latency, accuracy,
and cost trade-offs. To address this challenge and propose a solution to
cost trade-offs. To address this challenge and propose a solution to
efficiently manage model variants in inference pipelines, we present IPA, an
online deep learning Inference Pipeline Adaptation system that efficiently
leverages model variants for each deep learning task. Model variants are
different versions of pre-trained models for the same deep learning task with
variations in resource requirements, latency, and accuracy. IPA dynamically
configures batch size, replication, and model variants to optimize accuracy,
minimize costs, and meet user-defined latency Service Level Agreements (SLAs)
using Integer Programming. It supports multi-objective settings for achieving
different trade-offs between accuracy and cost objectives while remaining
adaptable to varying workloads and dynamic traffic patterns. Navigating a wider
variety of configurations allows IPA to achieve better trade-offs between
cost and accuracy objectives compared to existing methods. Extensive
experiments in a Kubernetes implementation with five real-world inference
pipelines demonstrate that IPA improves end-to-end accuracy by up to 21% with a
minimal cost increase. The code and data for replication are available at
https://github.com/reconfigurable-ml-pipeline/ipa.
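The abstract states that IPA uses Integer Programming to jointly pick model variants, batch sizes, and replication under a latency SLA. The sketch below illustrates only the variant-selection slice of that idea with a brute-force search over hypothetical per-stage variants (all names and numbers are made up for illustration, not taken from the paper, and the real system solves a richer Integer Program):

```python
from itertools import product

# Hypothetical model variants per pipeline stage:
# (name, accuracy, latency_ms, cost_per_replica). Illustrative numbers only.
STAGES = [
    [("resnet18", 0.70, 30, 1), ("resnet50", 0.76, 60, 2), ("resnet152", 0.78, 120, 4)],
    [("bert-tiny", 0.80, 20, 1), ("bert-base", 0.88, 80, 3)],
]

def best_config(sla_ms, budget):
    """Pick one variant per stage to maximize summed accuracy,
    subject to end-to-end latency <= sla_ms and total cost <= budget.
    IPA formulates this (plus batch size and replication) as an
    Integer Program; brute force stands in for the solver here."""
    best = None
    for combo in product(*STAGES):
        latency = sum(v[2] for v in combo)
        cost = sum(v[3] for v in combo)
        if latency > sla_ms or cost > budget:
            continue  # violates the SLA or the cost budget
        acc = sum(v[1] for v in combo)
        key = (acc, -cost)  # maximize accuracy, break ties by lower cost
        if best is None or key > best[0]:
            best = (key, combo)
    return None if best is None else [v[0] for v in best[1]]

# A looser SLA admits more accurate (costlier) variants:
print(best_config(sla_ms=150, budget=5))  # ['resnet50', 'bert-base']
print(best_config(sla_ms=60, budget=2))   # ['resnet18', 'bert-tiny']
```

Exhaustive search is exponential in the number of stages, which is why the paper turns to an Integer Programming formulation for the full configuration space.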
Keywords
inference pipeline adaptation, high accuracy, cost-efficiency