cedar: Composable and Optimized Machine Learning Input Data Pipelines
CoRR (2024)
Abstract
The input data pipeline is an essential component of each machine learning
(ML) training job. It is responsible for reading massive amounts of training
data, processing batches of samples using complex transformations, and
loading them onto training nodes at low latency and high throughput. Performant
input data systems are becoming increasingly critical, driven by skyrocketing
data volumes and training throughput demands. Unfortunately, current input data
systems cannot fully leverage key performance optimizations, resulting in
hugely inefficient infrastructures that require significant resources – or
worse – underutilize expensive accelerators.
To address these demands, we present cedar, a programming model and framework
that allows users to easily build, optimize, and execute input data pipelines.
cedar presents an easy-to-use programming interface, allowing users to define
input data pipelines using composable operators that support arbitrary ML
frameworks and libraries. Meanwhile, cedar transparently applies a complex and
extensible set of optimization techniques (e.g., offloading, caching,
prefetching, fusion, and reordering). It then orchestrates processing across a
customizable set of local and distributed compute resources in order to
maximize processing performance and efficiency, all without user input. On
average across six diverse input data pipelines, cedar achieves 2.49x, 1.87x,
2.18x, and 2.74x higher performance compared to tf.data, tf.data service, Ray
Data, and PyTorch's DataLoader, respectively.
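To make the idea of composable input-pipeline operators concrete, the sketch below shows a minimal pipeline abstraction in the spirit described by the abstract: users chain operators such as map and batch, and the framework composes them lazily over a data source. The class and method names here are illustrative assumptions for exposition, not cedar's actual API.

```python
from typing import Any, Callable, Iterable, Iterator

class Pipeline:
    """A toy composable input-data pipeline (illustrative, not cedar's API)."""

    def __init__(self, source: Iterable[Any]):
        self._source = source
        # Each operator transforms one iterator into another; composing them
        # lazily is what lets a framework reorder, fuse, or offload stages.
        self._ops: list[Callable[[Iterator[Any]], Iterator[Any]]] = []

    def map(self, fn: Callable[[Any], Any]) -> "Pipeline":
        # Per-sample transformation (e.g., decode or augment a sample).
        self._ops.append(lambda it: (fn(x) for x in it))
        return self

    def batch(self, size: int) -> "Pipeline":
        # Group consecutive samples into lists of at most `size`.
        def _batch(it: Iterator[Any]) -> Iterator[list[Any]]:
            buf: list[Any] = []
            for x in it:
                buf.append(x)
                if len(buf) == size:
                    yield buf
                    buf = []
            if buf:
                yield buf
        self._ops.append(_batch)
        return self

    def __iter__(self) -> Iterator[Any]:
        it: Iterator[Any] = iter(self._source)
        for op in self._ops:
            it = op(it)
        return it

# Usage: scale raw integer "samples", then batch them for a training step.
pipe = Pipeline(range(6)).map(lambda x: x * 2).batch(4)
batches = list(pipe)  # [[0, 2, 4, 6], [8, 10]]
```

Because each stage is an independent iterator transformation, an optimizer could in principle reorder or fuse stages, or run some of them on remote workers, without changing the user-facing pipeline definition, which is the property the abstract attributes to cedar.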