Federated Learning over Wireless Fading Channels

IEEE Transactions on Wireless Communications, no. 5 (2020): 3546-3557

Abstract

We study federated machine learning at the wireless network edge, where limited power wireless devices, each with its own dataset, build a joint model with the help of a remote parameter server (PS). We consider a bandwidth-limited fading multiple access channel (MAC) from the wireless devices to the PS, and propose various techniques to ...

Introduction
  • As dataset sizes and model complexity grow, distributed machine learning (ML) is becoming the only viable alternative to centralized ML, in which the entire dataset is gathered at a central server and a joint model is trained there.
  • Federated learning (FL) has been proposed as a privacy-preserving distributed ML scheme, in which each device participates in training using only its locally available data, with the help of a parameter server (PS) [1].
  • The goal is to minimize a loss function of the form F(θ) = (1/|B|) Σu∈B f(θ, u), where θ ∈ Rd denotes the model parameters to be optimized, B is the training dataset of size |B| consisting of data samples and their labels, and f(·) is the loss function defined by the learning task (a toy numerical version of this objective is sketched below).
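The following minimal Python sketch makes the objective above concrete for a toy linear model with squared loss; the model, data, and dimensions are illustrative assumptions rather than the paper's setup.

```python
import numpy as np

# Toy version of F(theta) = (1/|B|) * sum over u in B of f(theta, u),
# assuming a linear predictor with squared per-sample loss (illustrative only).

def sample_loss(theta, x, y):
    # f(theta, u) for a single labelled sample u = (x, y)
    return 0.5 * (x @ theta - y) ** 2

def empirical_loss(theta, X, Y):
    # F(theta): average per-sample loss over the local dataset B
    return np.mean([sample_loss(theta, x, y) for x, y in zip(X, Y)])

def local_gradient(theta, X, Y):
    # Gradient of F(theta); in DSGD each device computes such an estimate on its
    # own local data and communicates it towards the parameter server.
    return X.T @ (X @ theta - Y) / len(Y)

# Small example with d = 5 parameters and |B| = 100 samples.
rng = np.random.default_rng(0)
X, Y = rng.standard_normal((100, 5)), rng.standard_normal(100)
theta = np.zeros(5)
print(empirical_loss(theta, X, Y), np.linalg.norm(local_gradient(theta, X, Y)))
```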
Highlights
  • As dataset sizes and model complexity grow, distributed machine learning (ML) is becoming the only viable alternative to centralized ML, in which the entire dataset is gathered at a central server and a joint model is trained there
  • It is evident that the analog schemes again perform significantly better than the digital distributed stochastic gradient descent (D-DSGD) scheme, and compressed analog DSGD (CA-DSGD) improves noticeably upon ESA-DSGD and ECESA-DSGD in terms of convergence speed and accuracy
  • We have studied distributed ML at the wireless edge, where M devices, each with limited transmit power and its own local dataset, communicate with the parameter server (PS) over a bandwidth-limited fading multiple access channel (MAC) to minimize a loss function by performing DSGD
  • At each iteration of the proposed D-DSGD scheme, one device is selected based on the channel states; the selected device quantizes its gradient estimate and transmits the quantized bits to the PS using a capacity-achieving channel code
  • We have proposed the CA-DSGD scheme, where each device employs gradient sparsification with error accumulation, followed by a linear projection that reduces the typically very large parameter vector dimension to the limited channel bandwidth (a sketch of this compression step follows this list)
  • Numerical results have shown significant improvements in the performance of CA-DSGD compared to D-DSGD and the state-of-the-art analog schemes
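As a rough illustration of the CA-DSGD compression step referenced in the highlights above, the sketch below applies error accumulation, top-k sparsification, and a fixed pseudo-random linear projection. The dimensions, the top-k rule, and the projection matrix are assumptions for illustration; the paper's full scheme additionally covers power allocation over the fading MAC and recovery at the PS (e.g., via approximate message passing).

```python
import numpy as np

def ca_dsgd_compress(grad, error_acc, k, A):
    """One device's compression step (sketch): error accumulation, top-k
    sparsification, then projection down to the available channel bandwidth."""
    g = grad + error_acc                    # add back previously discarded mass
    top_k = np.argsort(np.abs(g))[-k:]      # indices of the k largest-magnitude entries
    sparse = np.zeros_like(g)
    sparse[top_k] = g[top_k]                # keep only those entries
    new_error = g - sparse                  # carried over to the next iteration
    return A @ sparse, new_error            # s-dimensional vector to transmit over the MAC

# Hypothetical sizes: d = 7850 parameters, s = 1000 channel uses, k = 500 kept entries.
d, s, k = 7850, 1000, 500
A = np.random.default_rng(0).standard_normal((s, d)) / np.sqrt(s)  # projection known to the PS
grad, error = np.random.default_rng(1).standard_normal(d), np.zeros(d)
tx, error = ca_dsgd_compress(grad, error, k, A)
print(tx.shape)  # (1000,)
```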
Results
  • The final test accuracies of different schemes under consideration are presented in Table I.
  • As can be seen, CA-DSGD performs significantly better than both ESA-DSGD and ECESA-DSGD.
  • It is evident that the analog schemes again perform significantly better than the D-DSGD scheme, and CA-DSGD improves noticeably upon ESA-DSGD and ECESA-DSGD in terms of convergence speed and accuracy
Conclusion
  • The authors have studied distributed ML at the wireless edge, where M devices with limited transmit power and datasets communicate with the PS over a bandwidth-limited fading MAC to minimize a loss function by performing DSGD.
  • The authors studied the alternative analog transmission approach, considered recently in [25]–[27], which does not employ quantization or channel coding, and instead exploits the superposition property of the wireless MAC rather than orthogonalizing the transmissions from different devices (a toy illustration of this superposition follows the list).
  • The authors have proposed the CA-DSGD scheme, where each device employs gradient sparsification with error accumulation followed by linear projection to reduce the typically very large parameter vector dimension to the limited channel bandwidth.
  • Numerical results have shown significant improvements in the performance of CA-DSGD compared to D-DSGD and the state-of-the-art analog schemes
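To make the superposition property mentioned in the conclusion concrete, the toy sketch below simulates the noisy sum the PS would observe when all devices transmit their analog signals simultaneously; the unit channel gains, noise level, and sizes are illustrative assumptions, whereas the paper treats a fading MAC with per-device power constraints.

```python
import numpy as np

rng = np.random.default_rng(2)
M, s = 10, 1000                                # number of devices and channel uses
device_signals = rng.standard_normal((M, s))   # what each device puts on the channel
noise = 0.1 * rng.standard_normal(s)           # receiver noise at the PS

# Superposition: the MAC delivers the sum of all simultaneous transmissions.
received = device_signals.sum(axis=0) + noise
aggregate_estimate = received / M              # PS's estimate of the average signal

print(np.linalg.norm(aggregate_estimate - device_signals.mean(axis=0)))
```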
Tables
  • Table 1: Final test accuracy comparison following a training period of NT = 100 time slots
Funding
  • This work was supported in part by the European Research Council (ERC) Starting Grant BEACON (grant agreement no. 725731)
Study subjects and analysis
Here we compare the performance of the presented wireless edge learning schemes on the task of image classification. We run experiments on the MNIST dataset [30] with 60000 training and 10000 test samples, and train a single-layer neural network with d = 7850 parameters using the ADAM optimizer [31]. To model the data distribution across the devices, we assume that B training data samples, selected at random, are assigned to each device at the beginning of training; in this case, with high probability, the datasets of any two devices will have a non-empty intersection, i.e., the local datasets are not independent across the devices.
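A minimal sketch of this random data assignment, with the number of devices M and the per-device dataset size B set to arbitrary illustrative values (the paper's exact values are not restated here):

```python
import numpy as np

rng = np.random.default_rng(3)
num_train, M, B = 60000, 20, 1000   # MNIST training set size; M and B are illustrative

# Each device receives B training indices drawn at random from the full training set,
# so the local datasets of different devices can overlap.
local_indices = [rng.choice(num_train, size=B, replace=False) for _ in range(M)]

overlap = len(set(local_indices[0]) & set(local_indices[1]))
print(f"samples shared by devices 0 and 1: {overlap}")
```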

References
  • [1] J. Konecny, H. B. McMahan, F. X. Yu, P. Richtarik, A. T. Suresh, and D. Bacon, “Federated learning: Strategies for improving communication efficiency,” arXiv:1610.05492v2 [cs.LG], Oct. 2017.
  • [2] B. McMahan and D. Ramage, “Federated learning: Collaborative machine learning without centralized training data,” [Online]. Available: https://ai.googleblog.com/2017/04/federated-learning-collaborative.html, Apr. 2017.
  • [3] J. Konecny and P. Richtarik, “Randomized distributed mean estimation: Accuracy vs communication,” arXiv:1611.07555 [cs.DC], Nov. 2016.
  • [4] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in AISTATS, 2017.
  • [5] V. Smith, C.-K. Chiang, M. Sanjabi, and A. S. Talwalkar, “Federated multi-task learning,” in Proc. Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 2017.
  • [6] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep learning with limited numerical precision,” in ICML, Jul. 2015.
  • [7] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, “1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs,” in INTERSPEECH, Singapore, Sep. 2014, pp. 1058–1062.
  • [8] D. Alistarh, D. Grubic, J. Z. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-efficient SGD via randomized quantization and encoding,” in NIPS, Long Beach, CA, Dec. 2017, pp. 1709–1720.
  • [9] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li, “TernGrad: Ternary gradients to reduce communication in distributed deep learning,” arXiv:1705.07878v6 [cs.LG], Dec. 2017.
  • [10] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, “DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients,” arXiv:1606.06160v3 [cs.NE], Feb. 2018.
  • [11] H. Wang, S. Sievert, S. Liu, Z. Charles, D. Papailiopoulos, and S. Wright, “ATOMO: Communication-efficient learning via atomic sparsification,” arXiv:1806.04090v2 [stat.ML], Jun. 2018.
  • [12] J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar, “signSGD: Compressed optimisation for non-convex problems,” arXiv:1802.04434v3 [cs.LG], Aug. 2018.
  • [13] B. Li, W. Wen, J. Mao, S. Li, Y. Chen, and H. Li, “Running sparse and low-precision neural network: When algorithm meets hardware,” in Proc. Asia and South Pacific Design Automation Conference (ASP-DAC), Jeju, South Korea, Jan. 2018.
  • [14] N. Strom, “Scalable distributed DNN training using commodity GPU cloud computing,” in INTERSPEECH, 2015.
  • [15] A. F. Aji and K. Heafield, “Sparse communication for distributed gradient descent,” arXiv:1704.05021v2 [cs.CL], Jul. 2017.
  • [16] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, “Deep gradient compression: Reducing the communication bandwidth for distributed training,” arXiv:1712.01887v2 [cs.CV], Feb. 2018.
  • [17] X. Sun, X. Ren, S. Ma, and H. Wang, “meProp: Sparsified back propagation for accelerated deep learning with reduced overfitting,” arXiv:1706.06197v4 [cs.LG], Oct. 2017.
  • [18] F. Sattler, S. Wiedemann, K. Muller, and W. Samek, “Sparse binary compression: Towards distributed deep learning with minimal communication,” arXiv:1805.08768v1 [cs.LG], May 2018.
  • [19] C. Renggli, D. Alistarh, T. Hoefler, and M. Aghagolzadeh, “SparCML: High-performance sparse communication for machine learning,” arXiv:1802.08021v2 [cs.DC], Oct. 2018.
  • [20] D. Alistarh, T. Hoefler, M. Johansson, S. Khirirat, N. Konstantinov, and C. Renggli, “The convergence of sparsified gradient methods,” arXiv:1809.10505v1 [cs.LG], Sep. 2018.
  • [21] Y. Tsuzuku, H. Imachi, and T. Akiba, “Variance-based gradient compression for efficient distributed deep learning,” arXiv:1802.06058v2 [cs.LG], Feb. 2018.
  • [22] S. U. Stich, “Local SGD converges fast and communicates little,” arXiv:1805.09767v2 [math.OC], Jun. 2018.
  • [23] T. Lin, S. U. Stich, and M. Jaggi, “Don’t use large mini-batches, use local SGD,” arXiv:1808.07217v3 [cs.LG], Oct. 2018.
  • [24] T. Chen, G. B. Giannakis, T. Sun, and W. Yin, “LAG: Lazily aggregated gradient for communication-efficient distributed learning,” arXiv:1805.09965 [stat.ML], May 2018.
  • [25] M. M. Amiri and D. Gündüz, “Machine learning at the wireless edge: Distributed stochastic gradient descent over-the-air,” arXiv:1901.00844 [cs.DC], Jan. 2019.
  • [26] G. Zhu, Y. Wang, and K. Huang, “Low-latency broadband analog aggregation for federated edge learning,” arXiv:1812.11494 [cs.IT], Jan. 2019.
  • [27] K. Yang, T. Jiang, Y. Shi, and Z. Ding, “Federated learning via over-the-air computation,” arXiv:1812.11750 [cs.LG], Jan. 2019.
  • [28] D. Tse and P. Viswanath, Fundamentals of Wireless Communication. Cambridge, UK: Cambridge University Press, 2005.
  • [29] D. L. Donoho, A. Maleki, and A. Montanari, “Message-passing algorithms for compressed sensing,” Proc. Nat. Acad. Sci. USA, vol. 106, no. 45, pp. 18914–18919, Nov. 2009.
  • [30] Y. LeCun, C. Cortes, and C. Burges, “The MNIST database of handwritten digits,” http://yann.lecun.com/exdb/mnist/, 1998.
  • [31] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv:1412.6980v9 [cs.LG], Jan. 2017.