A generic communication scheduler for distributed DNN training acceleration

Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP), pp. 16-29, 2019


Abstract

We present ByteScheduler, a generic communication scheduler for distributed DNN training acceleration. ByteScheduler is based on our principled analysis that partitioning and rearranging the tensor transmissions can result in optimal results in theory and good performance in real-world even with scheduling overhead. To make ByteScheduler ...

Introduction
  • One mini-batch travels through the DNN model layer-by-layer and generates a loss
  • This process is called forward propagation (FP).
  • Data parallelism is a popular strategy for scaling DNN training across many devices
  • It partitions the dataset onto multiple compute devices (“workers”), where each worker shares the same model parameters.
  • In DNN training, all-reduce computes the sum of gradients across workers, and each worker updates its parameters locally.
  • As shown in Figure 1 of the paper, since push0 and push1 both require upload bandwidth, push1 gets executed before push0 by default, and pull1 could likewise be executed before pull0, even though the next iteration's forward propagation needs layer 0 first (see the scheduling sketch below)
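The scheduling issue above can be made concrete with a minimal sketch (hypothetical names, not ByteScheduler's actual API): pending push/pull operations are drained from a priority queue keyed by layer index, so that layer 0, which the next iteration's forward propagation consumes first, is sent before later layers even though backward propagation produced its gradient last.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class CommOp:
    # Lower layer index = higher priority: forward propagation of the next
    # iteration consumes layer 0's parameters first.
    priority: int
    name: str = field(compare=False)

class PriorityScheduler:
    """Minimal sketch of priority-ordered communication scheduling."""
    def __init__(self):
        self._queue = []

    def enqueue(self, op: CommOp):
        heapq.heappush(self._queue, op)

    def drain(self):
        # Issue operations in priority order instead of the FIFO order
        # in which backward propagation generated them.
        while self._queue:
            op = heapq.heappop(self._queue)
            print(f"sending {op.name} (layer {op.priority})")

if __name__ == "__main__":
    sched = PriorityScheduler()
    # Backward propagation produces gradients from the last layer to the
    # first, so push1 is enqueued before push0 ...
    sched.enqueue(CommOp(priority=1, name="push1"))
    sched.enqueue(CommOp(priority=0, name="push0"))
    # ... but the scheduler still sends push0 first.
    sched.drain()
```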
Highlights
  • We present the analysis of how different system setups may impact the final performance and system parameter choices, and propose an auto-tuning algorithm based on Bayesian Optimization
  • The performance gap from the optimum is bounded.
  • We identify key system parameters, i.e., partition size and credit size, and design Bayesian Optimization-based auto-tuning (see the sketch after this list)
  • After forward propagation (FP), the gradients are calculated from the last layer to the first layer, and this process is called backward propagation (BP)
  • The gradients are used to update model parameters based on some optimization algorithm, e.g., Stochastic Gradient Descent (SGD)
  • ByteScheduler achieves up to 196% end-to-end speedup
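The Bayesian Optimization-based auto-tuning highlighted above can be sketched roughly as follows, assuming scikit-optimize is available. `profile_throughput` is a hypothetical hook; in practice it would run a few training iterations with the candidate partition and credit sizes and return the measured throughput, but here it is a synthetic response surface so the sketch runs end to end. This is an illustration, not the authors' implementation.

```python
# Minimal sketch of BO-based search over partition size and credit size,
# assuming scikit-optimize (skopt) is installed.
from skopt import gp_minimize
from skopt.space import Real

def profile_throughput(partition_size_mb: float, credit_size_mb: float) -> float:
    """Stand-in for profiling: in practice, run a few training iterations
    with these parameters and return the measured images/sec."""
    # Synthetic response surface used only so the sketch runs end to end.
    return 1000.0 - (partition_size_mb - 8.0) ** 2 - (credit_size_mb - 16.0) ** 2

def objective(params):
    partition_size_mb, credit_size_mb = params
    # gp_minimize minimizes, so negate the throughput we want to maximize.
    return -profile_throughput(partition_size_mb, credit_size_mb)

search_space = [
    Real(1.0, 64.0, name="partition_size_mb"),
    Real(1.0, 64.0, name="credit_size_mb"),
]

result = gp_minimize(objective, search_space, n_calls=15, random_state=0)
best_partition, best_credit = result.x
print(f"best partition size: {best_partition:.1f} MB, "
      f"best credit size: {best_credit:.1f} MB")
```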
Methods
  • From the level closest to the user down to the lowest level, ML frameworks and communication stacks comprise: 1) user code that declares DNN models, 2) the framework frontend with high-level APIs (e.g., the Python frontend), 3) the framework engine, 4) the message-level communication library, and 5) the TCP or RDMA stack.
  • Implementing the scheduler in the message-level communication library would be a good choice in a clean-slate design.
  • Below the message-level library, at the TCP or RDMA stack (e.g., sockets and verbs), application-level information such as message priorities is lost; ByteScheduler therefore hooks in at a higher layer, wrapping each framework's communication operations (see the sketch below)
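One way to make such a wrapper concrete is sketched below: each framework-issued push/pull/all-reduce is wrapped into a task object that the scheduler can partition and release by priority. The class and method names (`CommTask`, `partition`, `start`) are illustrative and are not claimed to match the released code.

```python
# Hypothetical sketch of wrapping framework-level communication operations so
# that one scheduler can serve MXNet, TensorFlow and PyTorch alike.
from typing import Callable, List

class CommTask:
    """A unit of communication (push, pull, or all-reduce of one tensor)."""

    def __init__(self, tensor_id: str, nbytes: int, priority: int,
                 send_fn: Callable[[str, int], None]):
        self.tensor_id = tensor_id
        self.nbytes = nbytes
        self.priority = priority        # e.g., the index of the producing layer
        self._send_fn = send_fn         # the engine's own push/pull/all-reduce

    def partition(self, partition_bytes: int) -> List["CommTask"]:
        """Split a large tensor into bounded-size sub-tasks that the
        scheduler can interleave with higher-priority traffic."""
        parts, offset = [], 0
        while offset < self.nbytes:
            size = min(partition_bytes, self.nbytes - offset)
            parts.append(CommTask(f"{self.tensor_id}[{offset}:{offset + size}]",
                                  size, self.priority, self._send_fn))
            offset += size
        return parts

    def start(self) -> None:
        """Hand this (sub-)tensor to the underlying communication library."""
        self._send_fn(self.tensor_id, self.nbytes)

if __name__ == "__main__":
    # Wrap a 100 MB gradient from layer 3 and split it into 4 MB pieces.
    task = CommTask("grad/layer3", 100 << 20, priority=3,
                    send_fn=lambda name, n: print(f"send {name}: {n} bytes"))
    for sub in task.partition(4 << 20):
        sub.start()
```

Because the scheduler only manipulates these wrapped tasks, the same partitioning and priority logic can serve MXNet, TensorFlow and PyTorch without modifying the underlying communication library.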
Results
  • Testbed setup: the authors' testbed has 16 physical machines, each with 64 CPU cores, 256GB memory, 8 Tesla V100 GPUs without NVLink, and 100Gbps bandwidth between any two servers using Mellanox CX-5 single-port NICs.
  • The authors ran ByteScheduler in 8 different setups: MXNet PS, MXNet NCCL, TensorFlow PS, and PyTorch NCCL, each with either TCP or RDMA.
  • The authors show results for only 5 of these setups: MXNet PS TCP, MXNet PS RDMA, TensorFlow PS TCP, MXNet NCCL RDMA, and PyTorch NCCL TCP
Conclusion
  • Discussion and future directions: dynamic partition size and credit size. The authors use auto-tuning to search for the best parameters, i.e., partition size and credit size, at the beginning of training, and assume their values stay constant throughout the training.
  • The authors may further allow dynamic partition size and credit size over the training course, by periodically re-searching for the best values using newly profiled results (a hypothetical sketch of such a loop follows this list).
  • The authors may use different partition and credit sizes for different layers in the DNN.
  • Both improvements may incur significantly more search costs.
  • The authors' current implementation supports TensorFlow, PyTorch and MXNet. ByteScheduler is a generic communication scheduler for distributed DNN training acceleration.
  • The authors have open-sourced the implementation and expect that the community will add support to more existing and future frameworks
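The dynamic-tuning direction mentioned in the first bullet above could, for example, take the shape of the loop below. Everything here is a hypothetical stand-in (the training step and the re-search are synthetic so the sketch is self-contained); it is not part of the released implementation, and a real version could reuse the Bayesian Optimization search sketched earlier.

```python
# Hypothetical sketch of dynamic re-tuning over the training course: every
# `retune_every` iterations, re-evaluate partition/credit sizes using newly
# profiled throughput.
import random

def run_one_iteration(partition_mb: float, credit_mb: float) -> float:
    """Stand-in training step returning a noisy throughput sample (images/sec)."""
    return (1000.0 - (partition_mb - 8.0) ** 2 - (credit_mb - 16.0) ** 2
            + random.uniform(-5.0, 5.0))

def retune(current, recent_throughput):
    """Stand-in re-search: profile a nearby candidate and keep whichever
    setting performs better over a short window."""
    candidate = (max(1.0, current[0] + random.uniform(-2.0, 2.0)),
                 max(1.0, current[1] + random.uniform(-2.0, 2.0)))
    cand_throughput = sum(run_one_iteration(*candidate) for _ in range(10)) / 10
    return candidate if cand_throughput > recent_throughput else current

def training_loop(num_iters: int = 5000, retune_every: int = 1000):
    params = (4.0, 8.0)                  # values from the initial search
    window = []
    for it in range(1, num_iters + 1):
        window.append(run_one_iteration(*params))
        if it % retune_every == 0:
            recent = sum(window[-retune_every:]) / retune_every
            params = retune(params, recent)
    return params

if __name__ == "__main__":
    print("final (partition MB, credit MB):", training_loop())
```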
Tables
  • Table 1: Best partition size (MB) and credit size (MB)
Related work
  • Speed up communication in DNN training. Existing approaches include: (1) speeding up individual messages by using RDMA [25] or NCCL [4]; (2) compressing data transmission, such as gradient quantization [9, 37] and sparse parameter synchronization [7]; (3) optimizing the communication approach, e.g., Stale Synchronous Parallel [20] for PS and different all-reduce algorithms [10, 14]; (4) minimizing network flow completion time by using flow control [26] or Coflow scheduling [13]. These works mainly focus on accelerating a single communication operation, and are orthogonal and complementary to ByteScheduler.
  • Overlap communication with computation. Most DNN frameworks, e.g., TensorFlow, PyTorch, MXNet and Poseidon [39], support overlapping communication with backward propagation. On top of this, P3 [21] further attempts to overlap communication with forward propagation by layer partitioning and scheduling on the MXNet PS architecture. TicTac [18] proposes a similar idea but shows a much smaller training speedup (less than 20%) based on TensorFlow PS TCP. The authors suspect that this is due to the global barrier (§2.3).
Funding
  • The work of Yanghua Peng, Yangrui Chen and Chuan Wu was supported in part by a ByteDance Research Collaboration Project
  • Yanghua Peng is also supported by SOSP 2019 Student Scholarship from the ACM Special Interest Group in Operating Systems
References
  • [1] 2019. ByteScheduler Appendix. https://www.dropbox.com/s/smoq6xd6pr7av81/bytescheduler_appendix.pdf?dl=0.
  • [2] 2019. ByteScheduler Source Code. https://github.com/bytedance/byteps.
  • [3] 2019. MLPerf Training v0.6 Results. https://mlperf.org/training-results0-6/.
  • [4] 2019. NVIDIA Collective Communications Library (NCCL). https://developer.nvidia.com/nccl.
  • [5] 2019. TensorFlow Grappler. https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/grappler.
  • [6] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A System for Large-Scale Machine Learning. In Proceedings of USENIX Symposium on Operating Systems Design and Implementation (OSDI).
  • [7] Alham Fikri Aji and Kenneth Heafield. 2017. Sparse Communication for Distributed Gradient Descent. arXiv preprint arXiv:1704.05021 (2017).
  • [8] Omid Alipourfard, Hongqiang Harry Liu, Jianshu Chen, Shivaram Venkataraman, Minlan Yu, and Ming Zhang. 2017. CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics. In Proceedings of USENIX Symposium on Networked Systems Design and Implementation (NSDI).
  • [9] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. 2017. QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding. In Proceedings of Advances in Neural Information Processing Systems (NIPS).
  • [10] Ammar Ahmad Awan, Ching-Hsiang Chu, Hari Subramoni, and Dhabaleswar K Panda. 2018. Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL? In Proceedings of the 25th European MPI Users' Group Meeting.
  • [11] Eric Brochu, Vlad M Cora, and Nando De Freitas. 2010. A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning. arXiv preprint arXiv:1012.2599 (2010).
  • [12] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2016. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems. In Proceedings of NIPS Workshop on Machine Learning Systems.
  • [13] Mosharaf Chowdhury, Yuan Zhong, and Ion Stoica. 2014. Efficient Coflow Scheduling with Varys. In Proceedings of ACM Special Interest Group on Data Communication (SIGCOMM).
  • [14] Jeff Daily, Abhinav Vishnu, Charles Siegel, Thomas Warfel, and Vinay Amatya. 2018. GossipGraD: Scalable Deep Learning using Gossip Communication based Asynchronous Gradient Descent. arXiv preprint arXiv:1803.05880 (2018).
  • [15] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. 2012. Large Scale Distributed Deep Networks. In Proceedings of Advances in Neural Information Processing Systems (NIPS).
  • [16] Songyun Duan, Vamsidhar Thummala, and Shivnath Babu. 2009. Tuning Database Configuration Parameters with iTuned. In Proceedings of the Very Large Data Bases (VLDB) Endowment.
  • [17] Juncheng Gu, Mosharaf Chowdhury, Kang G Shin, Yibo Zhu, Myeongjae Jeon, Junjie Qian, Hongqiang Liu, and Chuanxiong Guo. 2019. Tiresias: A GPU Cluster Manager for Distributed Deep Learning. In Proceedings of USENIX Symposium on Networked Systems Design and Implementation (NSDI).
  • [18] Sayed Hadi Hashemi, Sangeetha Abdu Jyothi, and Roy H Campbell. 2019. TicTac: Accelerating Distributed Deep Learning with Communication Scheduling. In Proceedings of Systems and Machine Learning (SysML).
  • [19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [20] Qirong Ho, James Cipar, Henggang Cui, Seunghak Lee, Jin Kyu Kim, Phillip B Gibbons, Garth A Gibson, Greg Ganger, and Eric P Xing. 2013. More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server. In Proceedings of Advances in Neural Information Processing Systems (NIPS).
  • [21] Anand Jayarajan, Jinliang Wei, Garth Gibson, Alexandra Fedorova, and Gennady Pekhimenko. 2019. Priority-Based Parameter Propagation for Distributed DNN Training. In Proceedings of Systems and Machine Learning (SysML).
  • [22] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. In Proceedings of the 22nd ACM International Conference on Multimedia.
  • [23] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of Advances in Neural Information Processing Systems (NIPS).
  • [24] Mu Li, David G Andersen, Jun Woo Park, Alexander J Smola, Amr Ahmed, et al. 2014. Scaling Distributed Machine Learning with the Parameter Server. In Proceedings of USENIX Symposium on Operating Systems Design and Implementation (OSDI).
  • [25] Jiuxing Liu, Jiesheng Wu, and Dhabaleswar K Panda. 2004. High Performance RDMA-based MPI Implementation over InfiniBand. International Journal of Parallel Programming (2004).
  • [26] Luo Mai, Chuntao Hong, and Paolo Costa. 2015. Optimizing Network Performance in Distributed Machine Learning. In Proceedings of USENIX Workshop on Hot Topics in Cloud Computing (HotCloud).
  • [27] Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, et al. 2016. MLlib: Machine Learning in Apache Spark. Journal of Machine Learning Research (2016).
  • [28] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic Differentiation in PyTorch. In Proceedings of NIPS Autodiff Workshop.
  • [29] Yanghua Peng, Yixin Bao, Yangrui Chen, Chuan Wu, and Chuanxiong Guo. 2018. Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters. In Proceedings of the 13th ACM European Conference on Computer Systems (EuroSys).
  • [30] Sebastian Ruder. 2016. An Overview of Gradient Descent Optimization Algorithms. arXiv preprint arXiv:1609.04747 (2016).
  • [31] Frank Seide and Amit Agarwal. 2016. CNTK: Microsoft's Open-Source Deep-Learning Toolkit. In Proceedings of ACM International Conference on Knowledge Discovery and Data Mining (KDD).
  • [32] Alexander Sergeev and Mike Del Balso. 2018. Horovod: Fast and Easy Distributed Deep Learning in TensorFlow. arXiv preprint arXiv:1802.05799 (2018).
  • [33] Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556 (2014).
  • [34] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. 2012. Practical Bayesian Optimization of Machine Learning Algorithms. In Proceedings of Advances in Neural Information Processing Systems (NIPS).
  • [35] Evan R Sparks, Ameet Talwalkar, Virginia Smith, Jey Kottalam, Xinghao Pan, Joseph Gonzalez, Michael J Franklin, Michael I Jordan, and Tim Kraska. 2013. MLI: An API for Distributed Machine Learning. In Proceedings of IEEE International Conference on Data Mining (ICDM).
  • [36] Minjie Wang, Chien-chin Huang, and Jinyang Li. 2019. Supporting Very Large Models Using Automatic Dataflow Graph Partitioning. In Proceedings of the 14th ACM European Conference on Computer Systems (EuroSys).
  • [37] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. 2017. TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning. In Proceedings of Advances in Neural Information Processing Systems (NIPS).
  • [38] Wikipedia. 2019. Monkey Patch. https://en.wikipedia.org/wiki/Monkey_patch.
  • [39] Hao Zhang, Zeyu Zheng, Shizhen Xu, Wei Dai, Qirong Ho, Xiaodan Liang, Zhiting Hu, Jinliang Wei, Pengtao Xie, and Eric P Xing. 2017. Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters. In Proceedings of USENIX Annual Technical Conference (USENIX ATC).