A generic communication scheduler for distributed DNN training acceleration
Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP), pp. 16–29 (2019)
- One mini-batch travels through the DNN model layer by layer and generates a loss; this process is called forward propagation (FP).
- Data parallelism is a popular strategy for scaling DNN training across many devices
- It partitions the dataset across multiple compute devices (“workers”), while each worker holds the same model parameters.
- In DNN training, all-reduce computes the sum of gradients across workers, and each worker updates its parameters locally.
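The all-reduce semantics above can be sketched in a few lines. This is a toy simulation of the behavior, not actual NCCL or MPI; all function names and numbers are illustrative:

```python
# Minimal simulation of all-reduce semantics: each worker contributes its
# local gradient, all workers end up with the element-wise sum, and each
# then applies the same SGD update to its own parameter copy.

def all_reduce_sum(worker_grads):
    """Return the element-wise sum of per-worker gradient lists."""
    return [sum(vals) for vals in zip(*worker_grads)]

def sgd_step(params, summed_grad, lr, num_workers):
    # Each worker averages the summed gradient and updates locally.
    return [p - lr * (g / num_workers) for p, g in zip(params, summed_grad)]

# Two workers with identical parameters but different local gradients.
params = [1.0, 2.0]
grads = [[1.0, 2.0],   # worker 0
         [3.0, 4.0]]   # worker 1
summed = all_reduce_sum(grads)                      # [4.0, 6.0]
new_params = sgd_step(params, summed, lr=0.5, num_workers=2)
print(new_params)                                   # [0.0, 0.5]
```

Because every worker sees the same summed gradient, all replicas stay in sync without ever exchanging parameters directly.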
- This is shown in Figure 1: since push0 and push1 both compete for upload bandwidth, push1 is executed before push0 in the default schedule, and pull1 may likewise be executed before pull0.
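The scheduling fix for this FIFO artifact can be sketched with a priority queue keyed by layer index. The queue class and everything beyond the paper's push0/push1 example are hypothetical illustration:

```python
import heapq

# Sketch of priority scheduling: communication ops for earlier layers
# (needed first by the next forward pass) are sent before later ones,
# regardless of the order in which backward propagation emitted them.

class Scheduler:
    def __init__(self):
        self._q = []
        self._seq = 0  # tie-breaker to keep insertion order stable

    def enqueue(self, layer, op):
        # Smaller layer index -> higher priority (popped first).
        heapq.heappush(self._q, (layer, self._seq, op))
        self._seq += 1

    def run_all(self):
        order = []
        while self._q:
            _, _, op = heapq.heappop(self._q)
            order.append(op)
        return order

s = Scheduler()
# Backward propagation emits gradients from the last layer to the first,
# so push1 is enqueued before push0 ...
s.enqueue(1, "push1")
s.enqueue(0, "push0")
# ... but the scheduler sends push0 first, unblocking layer 0's pull and
# the next iteration's forward propagation sooner.
print(s.run_all())  # ['push0', 'push1']
```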
- We present an analysis of how different system setups may impact the final performance and the choice of system parameters, and propose an auto-tuning algorithm based on Bayesian Optimization.
- The performance gap from the optimum is bounded.
- We identify the key system parameters, i.e., partition size and credit size, and design Bayesian Optimization-based auto-tuning.
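The two knobs can be illustrated with a toy sketch of partitioning and credit-based admission. All sizes and function names here are illustrative assumptions, not the paper's implementation:

```python
# Partition size: large tensors are split into chunks so that a
# high-priority chunk never waits behind an entire low-priority tensor.
# Credit size: a byte budget bounding how much data is in flight at once.

def partition(tensor_bytes, part_size):
    """Split a tensor into chunks of at most part_size bytes."""
    chunks = []
    while tensor_bytes > 0:
        chunks.append(min(part_size, tensor_bytes))
        tensor_bytes -= chunks[-1]
    return chunks

def admit(chunks, credit):
    """Admit chunks in order until the in-flight byte budget is used up."""
    in_flight, admitted = 0, []
    for c in chunks:
        if in_flight + c > credit:
            break
        in_flight += c
        admitted.append(c)
    return admitted

chunks = partition(10, part_size=4)   # [4, 4, 2]
print(chunks)
print(admit(chunks, credit=8))        # [4, 4]: third chunk waits for credit
```

Both knobs trade pipelining against overhead, which is why their best values depend on the setup and are worth tuning automatically.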
- After forward propagation (FP), the gradients are calculated from the last layer to the first layer, and this process is called backward propagation (BP)
- The gradients are used to update model parameters based on some optimization algorithm, e.g., Stochastic Gradient Descent (SGD)
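The FP/BP/SGD loop described above can be made concrete with a one-parameter worked example (purely illustrative):

```python
# One-parameter training loop: forward propagation computes a loss,
# backward propagation computes the gradient, and SGD updates the weight.

w = 3.0          # model parameter
x, y = 2.0, 4.0  # one training sample (input, target)
lr = 0.05        # learning rate

for step in range(3):
    pred = w * x                 # forward propagation
    loss = (pred - y) ** 2       # squared-error loss
    grad = 2 * (pred - y) * x    # backward propagation: dloss/dw
    w -= lr * grad               # SGD update
    print(f"step {step}: loss={loss:.3f}, w={w:.3f}")
```

The loss shrinks each step (4.000, 1.440, 0.518) as w moves toward the value 2.0 that fits the sample exactly.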
- ByteScheduler achieves up to 196% end-to-end speedup
- Communication scheduling (§2.2) operates on the computation-communication dependency DAG.
- From closest to the user down to the lowest level, ML frameworks and communication stacks comprise: 1) user code that declares DNN models, 2) the framework frontend with a high-level API (e.g., a Python frontend), 3) the framework engine, 4) a message-level communication library, and 5) the TCP or RDMA stack.
- Implementing the scheduler in the message-level communication library would be a good choice in a clean-slate design.
- In the TCP or RDMA stack (e.g., sockets or verbs), application-level information such as message priorities is lost.
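One way to retain that information without modifying the framework engine is to intercept communication calls at the frontend, e.g., via monkey patching. A hypothetical sketch, where `framework` and its `push` API are stand-ins rather than a real library:

```python
# Hypothetical sketch of frontend interception: wrap the communication
# call so each tensor push carries its name and priority past a scheduler
# hook, without touching the engine or the TCP/RDMA stack below.

class framework:                       # stand-in for a real framework module
    @staticmethod
    def push(tensor, name):
        return f"sent {name}"

_original_push = framework.push
intercepted = []

def scheduled_push(tensor, name, priority=0):
    # Application-level info (name, priority) is still visible here;
    # below the socket/verbs layer it would already be lost.
    intercepted.append((name, priority))
    return _original_push(tensor, name)

framework.push = scheduled_push        # monkey patch

framework.push([0.1, 0.2], "grad_layer0", priority=0)
print(intercepted)                     # [('grad_layer0', 0)]
```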
- Testbed setup: the testbed has 16 physical machines, each with 64 CPU cores, 256 GB memory, 8 Tesla V100 GPUs without NVLink, and 100 Gbps bandwidth between any two servers via Mellanox CX-5 single-port NICs.
- Benchmarks: ByteScheduler was run in 8 different setups: MXNet PS, MXNet NCCL, TensorFlow PS, and PyTorch NCCL, each with TCP or RDMA.
- Results are shown for 5 of these setups: MXNet PS TCP, MXNet PS RDMA, TensorFlow PS TCP, MXNet NCCL RDMA, and PyTorch NCCL TCP.
- Discussion and future work: dynamic partition size and credit size. The authors use auto-tuning to search for the best parameters, i.e., partition size and credit size, at the beginning of training, and assume their values stay constant throughout training.
- The authors may further allow dynamic partition and credit sizes over the course of training, by continually searching for the best values using newly profiled results.
- The authors may use different partition and credit sizes for different layers in the DNN.
- Both improvements may incur significantly more search costs.
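The auto-tuning idea can be sketched with a much-simplified stand-in for the paper's Bayesian-Optimization tuner: profile training speed for each (partition size, credit size) candidate and keep the best pair. Real BO would model the speed surface and sample only promising candidates instead of sweeping them all; the synthetic `profile` function and every constant below are illustrative assumptions:

```python
# Simplified stand-in for BO-based auto-tuning: exhaustively profile a
# small grid of (partition size, credit size) pairs and keep the fastest.

def profile(part_mb, credit_mb):
    # Synthetic speed surface peaking at partition 8 MB, credit 16 MB.
    return -((part_mb - 8) ** 2) - 0.25 * ((credit_mb - 16) ** 2)

def tune():
    best, best_speed = None, float("-inf")
    for part in [1, 2, 4, 8, 16, 32]:
        for credit in [2, 4, 8, 16, 32, 64]:
            speed = profile(part, credit)
            if speed > best_speed:
                best, best_speed = (part, credit), speed
    return best

print(tune())  # (8, 16) on this synthetic surface
```

This also shows why per-layer or time-varying parameters raise the search cost: each new dimension multiplies the number of candidates to profile.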
- The authors' current implementation supports TensorFlow, PyTorch and MXNet. ByteScheduler is a generic communication scheduler for distributed DNN training acceleration.
- The authors have open-sourced the implementation and expect the community to add support for more existing and future frameworks.
- Table 1: Best partition size (MB) and credit size (MB)
- Speeding up communication in DNN training: existing approaches include (1) speeding up individual messages using RDMA or NCCL; (2) compressing data transmission, such as gradient quantization [9, 37] and sparse parameter synchronization; (3) optimizing the communication approach, e.g., Stale Synchronous Parallel for PS and different all-reduce algorithms [10, 14]; (4) minimizing network flow completion time using flow control or Coflow scheduling. These works mainly focus on accelerating a single communication operation, and are orthogonal and complementary to ByteScheduler.
- Overlapping communication with computation: most DNN frameworks, e.g., TensorFlow, PyTorch, MXNet and Poseidon, support overlapping communication with backward propagation. On top of this, P3 further attempts to overlap communication with forward propagation by layer partitioning and scheduling on the MXNet PS architecture. TicTac proposes a similar idea but shows a much smaller training speedup (less than 20%) on TensorFlow PS TCP; the authors suspect this is due to the global barrier (§2.3).
- Deep Neural Networks (DNNs) have been extensively used for a wide range of applications, such as computer vision.
- The work of Yanghua Peng, Yangrui Chen and Chuan Wu was supported in part by a ByteDance Research Collaboration Project.
- Yanghua Peng is also supported by SOSP 2019 Student Scholarship from the ACM Special Interest Group in Operating Systems
- 2019. ByteScheduler Appendix. https://www.dropbox.com/s/smoq6xd6pr7av81/bytescheduler_appendix.pdf?dl=0.
- 2019. ByteScheduler Source Code. https://github.com/bytedance/byteps.
- 2019. MLPerf Training v0.6 Results. https://mlperf.org/training-results-0-6/.
- 2019. NVIDIA Collective Communications Library (NCCL). https://developer.nvidia.com/nccl.
- 2019. TensorFlow Grappler. https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/grappler.
- Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A System for Large-Scale Machine Learning. In Proceedings of USENIX Symposium on Operating Systems Design and Implementation (OSDI).
- Alham Fikri Aji and Kenneth Heafield. 2017. Sparse Communication for Distributed Gradient Descent. arXiv preprint arXiv:1704.05021 (2017).
- Omid Alipourfard, Hongqiang Harry Liu, Jianshu Chen, Shivaram Venkataraman, Minlan Yu, and Ming Zhang. 2017. Cherrypick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics. In Proceedings of USENIX Symposium on Networked Systems Design and Implementation (NSDI).
- Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. 2017. QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding. In Proceedings of Advances in Neural Information Processing Systems (NIPS).
- Ammar Ahmad Awan, Ching-Hsiang Chu, Hari Subramoni, and Dhabaleswar K Panda. 2018. Optimized Broadcast for Deep Learning Workloads on Dense-GPU Infiniband Clusters: MPI or NCCL?. In Proceedings of the 25th European MPI Users’ Group Meeting.
- Eric Brochu, Vlad M Cora, and Nando De Freitas. 2010. A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning. arXiv preprint arXiv:1012.2599 (2010).
- Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2016. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems. In Proceedings of NIPS Workshop on Machine Learning Systems.
- Mosharaf Chowdhury, Yuan Zhong, and Ion Stoica. 2014. Efficient Coflow Scheduling with Varys. In Proceedings of ACM Special Interest Group on Data Communication (SIGCOMM).
- Jeff Daily, Abhinav Vishnu, Charles Siegel, Thomas Warfel, and Vinay Amatya. 2018. GossipGraD: Scalable Deep Learning using Gossip Communication based Asynchronous Gradient Descent. arXiv preprint arXiv:1803.05880 (2018).
- Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. 2012. Large Scale Distributed Deep Networks. In Proceedings of Advances in Neural Information Processing Systems (NIPS).
- Songyun Duan, Vamsidhar Thummala, and Shivnath Babu. 2009. Tuning Database Configuration Parameters with iTuned. In Proceedings of Very Large Data Bases (VLDB) Endowment.
- Juncheng Gu, Mosharaf Chowdhury, Kang G Shin, Yibo Zhu, Myeongjae Jeon, Junjie Qian, Hongqiang Liu, and Chuanxiong Guo. 2019. Tiresias: A GPU Cluster Manager for Distributed Deep Learning. In Proceedings of USENIX Symposium on Networked Systems Design and Implementation (NSDI).
- Sayed Hadi Hashemi, Sangeetha Abdu Jyothi, and Roy H Campbell. 2019. TicTac: Accelerating Distributed Deep Learning with Communication Scheduling. In Proceedings of Systems and Machine Learning (SysML).
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Qirong Ho, James Cipar, Henggang Cui, Seunghak Lee, Jin Kyu Kim, Phillip B Gibbons, Garth A Gibson, Greg Ganger, and Eric P Xing. 2013. More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server. In Proceedings of Advances in Neural Information Processing Systems (NIPS).
-  Anand Jayarajan, Jinliang Wei, Garth Gibson, Alexandra Fedorova, and Gennady Pekhimenko. 2019. Priority-Based Parameter Propagation for Distributed DNN Training. In Proceedings of Systems and Machine Learning (SysML).
- Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. In Proceedings of the 22nd ACM International Conference on Multimedia.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of Advances in Neural Information Processing Systems (NIPS).
- Mu Li, David G Andersen, Jun Woo Park, Alexander J Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J Shekita, and Bor-Yiing Su. 2014. Scaling Distributed Machine Learning with the Parameter Server. In Proceedings of USENIX Symposium on Operating Systems Design and Implementation (OSDI).
-  Jiuxing Liu, Jiesheng Wu, and Dhabaleswar K Panda. 2004. High Performance RDMA-based MPI Implementation over InfiniBand. International Journal of Parallel Programming (2004).
-  Luo Mai, Chuntao Hong, and Paolo Costa. 2015. Optimizing Network Performance in Distributed Machine Learning. In Proceedings of USENIX Workshop on Hot Topics in Cloud Computing (HotCloud).
- Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, et al. 2016. MLlib: Machine Learning in Apache Spark. Journal of Machine Learning Research (2016).
-  Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic Differentiation in PyTorch. In Proceedings of NIPS Autodiff Workshop.
-  Yanghua Peng, Yixin Bao, Yangrui Chen, Chuan Wu, and Chuanxiong Guo. 2018. Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters. In Proceedings of the 13th ACM European Conference on Computer Systems (EuroSys).
-  Sebastian Ruder. 2016. An Overview of Gradient Descent Optimization Algorithms. arXiv preprint arXiv:1609.04747 (2016).
-  Frank Seide and Amit Agarwal. 2016. CNTK: Microsoft’s Open-Source Deep-Learning Toolkit. In Proceedings of ACM International Conference on Knowledge Discovery and Data Mining (KDD).
-  Alexander Sergeev and Mike Del Balso. 2018. Horovod: Fast and Easy Distributed Deep Learning in TensorFlow. arXiv preprint arXiv:1802.05799 (2018).
-  Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556 (2014).
-  Jasper Snoek, Hugo Larochelle, and Ryan P Adams. 2012. Practical Bayesian Optimization of Machine Learning Algorithms. In Proceedings of Advances in Neural Information Processing Systems (NIPS).
-  Evan R Sparks, Ameet Talwalkar, Virginia Smith, Jey Kottalam, Xinghao Pan, Joseph Gonzalez, Michael J Franklin, Michael I Jordan, and Tim Kraska. 2013. MLI: An API for Distributed Machine Learning. In Proceedings of IEEE International Conference on Data Mining (ICDM).
-  Minjie Wang, Chien-chin Huang, and Jinyang Li. 2019. Supporting Very Large Models Using Automatic Dataflow Graph Partitioning. In Proceedings of the 14th ACM European Conference on Computer Systems (EuroSys).
-  Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. 2017. Terngrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning. In Proceedings of Advances in Neural Information Processing Systems (NIPS).
- Wikipedia. 2019. Monkey Patch. https://en.wikipedia.org/wiki/Monkey_patch.
- Hao Zhang, Zeyu Zheng, Shizhen Xu, Wei Dai, Qirong Ho, Xiaodan Liang, Zhiting Hu, Jinliang Wei, Pengtao Xie, and Eric P Xing. 2017. Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters. In Proceedings of USENIX Annual Technical Conference (USENIX ATC).