Towards high-performance prediction serving systems

NIPS ML Systems Workshop (2017)

Many Machine Learning (ML) frameworks such as Google TensorFlow [3], Facebook Caffe2 [2], Scikit-learn [5], or Microsoft's Internal ML Toolkit (IMLT) allow data scientists to declaratively author sequences of transformations to train models from large-scale multi-dimensional input datasets. Internally, these sequences are represented as Directed Acyclic Graphs (DAGs) of operators comprising data transformations and featurizers (e.g., string tokenization, hashing) and ML models (e.g., neural networks, linear models). When trained DAGs are served for prediction, the full set of operators is deployed together to massage and featurize the raw input data points before ML model scoring. Training and prediction DAGs, however, have different system characteristics: for instance, ML models at training time have to scale over large datasets, whereas, once trained, they behave like any other featurizer or data transformation; furthermore, prediction DAGs are often surfaced for direct user access and therefore require low latency, high throughput, and high predictability. Specifically, prediction systems have three main performance requirements in order to be usable by consumers and profitable for ML-as-a-service providers: (R1) latency must be minimal (on the order of milliseconds) and predictable, because scoring is often one segment of more complex services (e.g., smartphone or web applications) that may provide a Service Level Agreement (SLA); (R2) resource usage, such as memory and CPU, must be small to save operational costs; and (R3) throughput must be high to handle as many concurrent requests as possible. Existing …
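To make the DAG-of-operators notion concrete, the following is a minimal sketch (not the paper's system, and independent of any of the named frameworks): a prediction pipeline chaining a string tokenizer, a hashing featurizer, and a linear model, deployed as a whole at scoring time. All function names, the feature dimensionality, and the toy weights are illustrative assumptions.

```python
def tokenize(text):
    # Data transformation operator: split raw input into tokens.
    return text.lower().split()

def hash_featurize(tokens, n_features=8):
    # Featurizer operator: feature hashing into a fixed-size vector.
    vec = [0.0] * n_features
    for t in tokens:
        vec[hash(t) % n_features] += 1.0
    return vec

def linear_model(features, weights):
    # ML model operator: once trained, it scores just like any other
    # transformation in the DAG.
    return sum(f * w for f, w in zip(features, weights))

def predict(text, weights):
    # The trained DAG served for prediction: the full operator chain
    # runs on each raw input point before model scoring.
    return linear_model(hash_featurize(tokenize(text)), weights)

weights = [0.5] * 8  # stand-in for trained model parameters
score = predict("high throughput low latency", weights)
```

In a real serving system each operator in this chain contributes to the end-to-end latency budget (R1), which is why the paper distinguishes prediction DAGs from their training-time counterparts.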