Towards Efficient and Reliable LLM Serving: A Real-World Workload Study
CoRR (2024)
Abstract
Large language models (LLMs), especially Generative Pretrained Transformer
(GPT) models, have advanced significantly in industry in recent years.
However, these models' broader development faces considerable challenges due to
high operational and deployment costs. This has led to active research in
improving the hardware efficiency of LLMs. Yet, the characteristics of
real-world LLM workloads are often overlooked in current optimizations of LLM
serving systems. In this work, we find that the absence of reliable workload
data for evaluating LLM serving systems impacts the quality of service (QoS)
and reliability in industrial deployments. This paper introduces the first
real-world trace dataset of LLM serving workloads, detailing user, system, and
LLM behaviors. We analyze this trace, highlighting its burstiness and its
request and response distributions, with a focus on the reliability of GPT
services. Based
on this, we have developed a benchmark suite that reflects our dataset's
workload patterns, enabling performance evaluation of serving systems. This
suite captures the core patterns of workload distributions, allowing for
precise scaling of the workload dataset to match system sizes. Our evaluation
uncovers a previously unrecognized vulnerability of LLM serving systems to
short-term burstiness, particularly in common workload scenarios. We observe
that GPU memory limitations, caused by the fluctuating nature of burstiness,
lead to significant performance degradation in existing LLM serving systems.
Beyond benchmarking, understanding these patterns is valuable for optimizing
LLM workload management, enabling elastic hardware resource adjustments to
varying workloads. We will make the dataset and benchmark suite publicly
available to encourage further research.
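The benchmark suite described above derives synthetic request streams that preserve the trace's burstiness while scaling to different system sizes. The paper's actual generation method is not specified in this abstract; as a minimal sketch of one common approach, the snippet below models bursty arrivals with Gamma-distributed inter-arrival times, where a coefficient of variation (CV) above 1 produces short-term bursts and CV = 1 reduces to Poisson arrivals. The function name and parameters are illustrative assumptions, not the authors' API.

```python
import random

def bursty_arrivals(n_requests, mean_rate, cv, seed=0):
    """Illustrative workload generator (not the paper's method).

    Draws inter-arrival gaps from a Gamma distribution whose mean
    is 1/mean_rate and whose coefficient of variation is `cv`.
    cv > 1 yields bursty traffic; cv == 1 is a Poisson process.
    """
    rng = random.Random(seed)
    shape = 1.0 / (cv * cv)            # Gamma shape k = 1 / CV^2
    scale = 1.0 / (mean_rate * shape)  # keeps mean gap = 1 / mean_rate
    t, times = 0.0, []
    for _ in range(n_requests):
        t += rng.gammavariate(shape, scale)
        times.append(t)
    return times

# Scaling to a larger system: raise mean_rate while holding cv fixed,
# so the burstiness signature of the workload is preserved.
arrivals = bursty_arrivals(1000, mean_rate=20.0, cv=4.0)
```

Keeping CV fixed while scaling the rate is what lets a trace-derived pattern stress larger deployments: the mean load grows, but the short-term fluctuations that exhaust GPU memory remain proportionally as severe.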