DistMind: Efficient Resource Disaggregation for Deep Learning Workloads

IEEE/ACM Transactions on Networking (2024)

Abstract
Deep learning (DL) systems suffer from low resource utilization due to 1) the monolithic server model that tightly couples compute and memory, and 2) limited sharing between different inference applications, and across inference and training, because of strict service level objectives (SLOs). To address this problem, we present DistMind, a disaggregated DL system that enables efficient multiplexing of DL applications with near-optimal resource utilization. DistMind decouples compute from host memory and exposes the abstractions of a GPU pool and a memory pool, each of which can be independently provisioned. The key challenge is to dynamically allocate GPU resources to different applications based on their real-time demands while meeting strict SLOs. We tackle this challenge by exploiting the power of high-speed 100 Gbps networks and designing three-stage pipelining, cache-aware load balancing, and DNN-aware sharding mechanisms based on the characteristics of DL workloads, to achieve millisecond-scale application loading overhead and improve system efficiency. We have implemented a prototype of DistMind and integrated it with PyTorch. Experimental results on AWS EC2 show that DistMind achieves near 100% resource utilization, and compared with NVIDIA MPS and Ray, DistMind improves throughput by up to 279% and reduces inference latency by up to 94%.
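To make the cache-aware load balancing idea concrete, below is a minimal, hypothetical Python sketch (not the authors' implementation): requests are routed to GPU workers that already hold the requested model in their cache, and only fall back to the least-loaded cold GPU (which would then fetch model shards from the memory pool) on a miss. All class and function names here are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class GpuWorker:
    """A GPU in the disaggregated GPU pool, tracking models resident in its cache."""
    worker_id: int
    cached_models: set = field(default_factory=set)
    inflight: int = 0  # requests currently queued on this GPU

class CacheAwareBalancer:
    """Prefer warm GPUs that already cache the requested model; otherwise pick
    the least-loaded GPU and mark that it must load the model from the memory pool."""

    def __init__(self, workers):
        self.workers = workers

    def route(self, model_name: str) -> GpuWorker:
        warm = [w for w in self.workers if model_name in w.cached_models]
        candidates = warm if warm else self.workers
        target = min(candidates, key=lambda w: w.inflight)
        if model_name not in target.cached_models:
            # Cache miss: in DistMind the worker would fetch DNN shards from the
            # memory pool and overlap loading with compute (not modeled here).
            target.cached_models.add(model_name)
        target.inflight += 1
        return target

# Usage: two GPUs, with "resnet50" already cached on GPU 0.
workers = [GpuWorker(0, {"resnet50"}), GpuWorker(1)]
balancer = CacheAwareBalancer(workers)
print(balancer.route("resnet50").worker_id)   # -> 0, warm cache preferred
print(balancer.route("bert-base").worker_id)  # -> 1, least-loaded cold GPU
```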
Keywords
Machine learning systems, resource disaggregation, resource management, scheduling, deep learning