Ray-based Elastic Distributed Data Parallel Framework with Distributed Data Cache

Haoran Lin, Xinwei Qin, Shuang Qiu, Yi Sun, Zekun Yin, Weiguo Liu

IPDPS Workshops (2023)

Abstract
With the development of large-scale machine learning, distributed data parallelism has become the de facto standard strategy for model training. However, when training models with distributed data parallelism on large-scale clusters, unexpected factors may cause training tasks to fail. Thus, a high-performance, scalable, yet fault-tolerant distributed training framework is urgently needed. Most commonly used open-source distributed training frameworks (e.g., PyTorch) do not fully meet this need. In this paper, we have designed an elastic distributed training framework based on Ray (a high-performance distributed framework). Our framework takes advantage of Ray's fault-tolerant object store, scalability, and stateful actors. In our framework, training tasks are not terminated when the number of training processes changes. Moreover, we have designed an elastic distributed data cache using Ray's object store and provided an efficient dataloader (called the elastic dataloader). Performance evaluation shows that the elastic dataloader is more than 2 times faster than PyTorch's DataLoader on a cluster equipped with 10 Gigabit Ethernet.
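The abstract describes caching training data in Ray's object store and serving it through stateful actors. Below is a minimal, hypothetical sketch of that general idea, not the authors' implementation: the ShardCache actor and load_shard helper are illustrative names only, and the real elastic dataloader would additionally handle sharding across workers, eviction, and elasticity.

```python
# Minimal sketch (assumptions noted above): cache dataset shards in Ray's
# object store and serve object references through a stateful actor.
import numpy as np
import ray

ray.init()


def load_shard(shard_id: int) -> np.ndarray:
    """Stand-in for reading one dataset shard from remote/shared storage."""
    rng = np.random.default_rng(shard_id)
    return rng.standard_normal((1024, 32)).astype(np.float32)


@ray.remote
class ShardCache:
    """Stateful actor that keeps object-store references to cached shards."""

    def __init__(self):
        self.refs = {}  # shard_id -> ObjectRef living in Ray's object store

    def get(self, shard_id: int):
        if shard_id not in self.refs:
            # ray.put stores the shard once; workers later fetch it by reference
            # instead of re-reading it from slow remote storage.
            self.refs[shard_id] = ray.put(load_shard(shard_id))
        return self.refs[shard_id]


cache = ShardCache.remote()

# A training worker asks the cache actor for a shard reference, then fetches
# the data from the object store (shared memory for co-located workers).
shard_ref = ray.get(cache.get.remote(0))  # the actor returns an ObjectRef
shard = ray.get(shard_ref)                # resolve the reference to the array
print(shard.shape)
```

Because objects placed in Ray's object store are shared via node-local shared memory, repeated reads of a cached shard by workers on the same node avoid redundant network or disk I/O, which is the kind of saving the reported speedup over PyTorch's DataLoader on 10 Gigabit Ethernet suggests.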
Keywords
distributed data parallel, elastic, distributed data cache, Ray