The Design and Implementation of Low-Latency Prediction Serving Systems

Daniel Crankshaw (2019)

Abstract
Author(s): Crankshaw, Daniel | Advisor(s): Gonzalez, Joseph E

Machine learning is being deployed in a growing number of applications that demand real-time, accurate, and cost-efficient predictions under heavy query load. These applications employ a variety of machine learning frameworks and models, often composing several models within the same application. However, most machine learning frameworks and systems are optimized for model training, not deployment.

In this thesis, I discuss three prediction serving systems designed to meet the needs of modern interactive machine learning applications. The key idea in this work is a decoupled, layered design that interposes systems on top of training frameworks to build low-latency, scalable serving systems. Velox introduced this decoupled architecture to enable fast online learning and model personalization in response to feedback. Clipper generalized this architecture to be framework-agnostic and introduced a set of optimizations that reduce and bound prediction latency and improve prediction throughput, accuracy, and robustness without modifying the underlying machine learning frameworks. InferLine provisions and manages the individual stages of prediction pipelines to minimize cost while meeting end-to-end tail latency constraints.
Keywords
Systems architecture, Latency (engineering), Robustness (computer science), Personalization, Scalability, Throughput, Clipper (electronics), Computer architecture, Computer science