iRDMA: Efficient Use of RDMA in Distributed Deep Learning Systems

2017 IEEE 19th International Conference on High Performance Computing and Communications; IEEE 15th International Conference on Smart City; IEEE 3rd International Conference on Data Science and Systems (HPCC/SmartCity/DSS)

Abstract
Distributed deep learning systems place stringent requirements on communication bandwidth when training models on large volumes of input data under user-time constraints. Communication takes place mainly between a cluster of worker nodes, which process the training data, and parameter servers, which maintain the global model. For fast convergence, worker nodes and parameter servers must frequently exchange billions of parameters to broadcast updates quickly and minimize staleness. Demand for bandwidth grows even higher when dedicated GPUs are introduced into the computation. While RDMA-capable networks have great potential to provide sufficiently high bandwidth, running them over TCP/IP or tying them to particular programming models, such as MPI, limits their ability to break the bandwidth bottleneck. In this work we propose iRDMA, an RDMA-based parameter-server architecture optimized for high-performance network environments and supporting both GPU- and CPU-based training. It uses native asynchronous RDMA verbs to achieve network line speed while minimizing communication processing costs on both the worker and parameter-server sides. Furthermore, iRDMA exposes the parameter server as a POSIX-compatible file API, enabling convenient load balancing and fault tolerance as well as ease of use. We have implemented iRDMA on IBM's deep learning platform. Experimental results show that our design helps deep learning applications, including image recognition and language classification, achieve near-linear improvements in convergence speed and training-accuracy acceleration as distributed computing resources are added. From a systems perspective, iRDMA efficiently utilizes about 95% of the bandwidth of fast networks to synchronize models among distributed training processes.
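To make the first mechanism concrete, below is a minimal sketch of the asynchronous verbs pattern the abstract refers to, written in C against libibverbs. This is not iRDMA's actual code: the queue pair, memory region, remote address, and rkey are placeholders assumed to have been created and exchanged out of band during connection setup.

/* Sketch only: post a one-sided RDMA WRITE of `len` bytes from a registered
 * local buffer to a remote parameter buffer. ibv_post_send() merely enqueues
 * the work request; the NIC executes it asynchronously, so the caller can
 * keep computing gradients while the transfer proceeds. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int post_param_update(struct ibv_qp *qp, struct ibv_mr *mr,
                      void *local_buf, uint32_t len,
                      uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = (uintptr_t)local_buf;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.opcode              = IBV_WR_RDMA_WRITE;   /* one-sided write */
    wr.send_flags          = IBV_SEND_SIGNALED;   /* request a completion */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;
    return ibv_post_send(qp, &wr, &bad_wr);
}

/* Reap completions without blocking; overlapping this polling with training
 * computation is what lets a verbs-based design approach line speed. */
int drain_completions(struct ibv_cq *cq)
{
    struct ibv_wc wc[16];
    int n = ibv_poll_cq(cq, 16, wc);
    for (int i = 0; i < n; i++)
        if (wc[i].status != IBV_WC_SUCCESS)
            return -1;  /* real code would retry or trigger failover */
    return n;
}

The second mechanism, the POSIX-compatible file API, would let training processes move model state with ordinary file calls. The path below is invented purely for illustration; the paper does not specify a naming scheme.

/* Hypothetical usage: parameters exposed as a file and read with plain POSIX
 * calls, so existing tooling and checkpointing logic work unchanged. */
#include <fcntl.h>
#include <unistd.h>

int read_model(float *params, size_t n)
{
    int fd = open("/irdma/models/current.params", O_RDONLY); /* invented path */
    if (fd < 0)
        return -1;
    ssize_t got = read(fd, params, n * sizeof(float));
    close(fd);
    return got == (ssize_t)(n * sizeof(float)) ? 0 : -1;
}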
Keywords
RDMA, deep learning, network