Model Fusion with Kullback-Leibler Divergence

    Sebastian Claici
    Soumya Ghosh

    ICML 2020.

    Keywords:
    Machine Learning, out of distribution, variational inference, Approximate Merging of Posteriors with Symmetry, assignment problem

    Abstract:

    We propose a method to fuse posterior distributions learned from heterogeneous datasets. Our algorithm relies on a mean field assumption for both the fused model and the individual dataset posteriors and proceeds using a simple assign-and-average approach. The components of the dataset posteriors are assigned to the proposed global model components.
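
    To make the assign-and-average idea concrete, the following is a minimal sketch for fusing the components of two mean-field (diagonal-Gaussian) posteriors: components are matched by solving an assignment problem over pairwise KL costs with the Hungarian algorithm, and matched parameters are then averaged. The function names (kl_diag_gauss, fuse_two) and the simple averaging update are illustrative stand-ins, not the paper's released code.

        import numpy as np
        from scipy.optimize import linear_sum_assignment

        def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
            """KL(q || p) between diagonal Gaussians, summed over dimensions."""
            return 0.5 * np.sum(
                np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
            )

        def fuse_two(local, global_):
            """Assign each local component (mu, var) to a global component by
            minimizing the total KL cost, then average the matched parameters
            as a stand-in for the paper's KL-based update."""
            cost = np.array([
                [kl_diag_gauss(mu_l, var_l, mu_g, var_g) for (mu_g, var_g) in global_]
                for (mu_l, var_l) in local
            ])
            rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
            fused = list(global_)
            for i, k in zip(rows, cols):
                mu_l, var_l = local[i]
                mu_g, var_g = fused[k]
                fused[k] = ((mu_l + mu_g) / 2.0, (var_l + var_g) / 2.0)
            return fused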

    Introduction
    • The authors study model fusion, the problem of learning a unified global model from a collection of pre-trained local models.
    • Any one hospital may be able to use its patient data to train a model aiding diagnosis or treatment, but due to limited data and skewed patient populations the resulting model may not be effective
    • To overcome this issue, a group of hospitals could in principle collaborate to produce a stronger model by pooling their data, but it is typically not permissible to share individual patient information between institutions.
    • Some key aspects that distinguish federated learning from classical distributed learning are (1) constraints on the frequency of communication and (2) heterogeneity of the local datasets
    Highlights
    • In this paper, we study model fusion, the problem of learning a unified global model from a collection of pre-trained local models
    • We present a Bayesian approach to model fusion, in which local models trained on individual datasets are combined to learn a single global model
    • Model fusion extracts a global model in a single shot: Local models are combined into the global model by solving a single optimization problem, and the learning procedure is complete
    • Working with mean field approximations and exponential family distributions leads to a feasible algorithm while staying applicable to a wide array of practical scenarios, as illustrated by the examples in §5 (see the closed-form KL expression after this list)
    • Mixed norm regularization dynamically adjusts the dimensionality of our fused model
    • Model fusion in the Bayesian setting equips the fused model with uncertainty estimates, which are valuable for detecting out-of-distribution samples that are not captured by any of the individual local models, as in Figure 1
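
    The feasibility claim above rests on the matching costs being available in closed form under the mean field assumption. For two diagonal-Gaussian components this is standard Gaussian KL algebra (not a formula quoted from the paper):

        \[
        \mathrm{KL}\bigl(\mathcal{N}(\mu_q, \operatorname{diag}(\sigma_q^2)) \,\|\, \mathcal{N}(\mu_p, \operatorname{diag}(\sigma_p^2))\bigr)
        = \tfrac{1}{2} \sum_{d=1}^{D} \left( \log\frac{\sigma_{p,d}^2}{\sigma_{q,d}^2}
        + \frac{\sigma_{q,d}^2 + (\mu_{q,d} - \mu_{p,d})^2}{\sigma_{p,d}^2} - 1 \right)
        \]
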
    Results
    • The authors consider the problem of fusing Gaussian mixture models with arbitrary means and covariances.
    • The authors' goal is to estimate true data-generating mixture components by fusing local posterior approximations.
    • To simulate instances of the heterogeneous fusion problem, when generating each local dataset the authors sample a random subset of the global mixture components and perturb it with Gaussian noise, introducing heterogeneity in both model size and parameters (a small simulation sketch follows this list).
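
    A minimal sketch of this kind of setup, assuming a global Gaussian mixture with diagonal covariances; the sizes and noise scales below are placeholders, not the paper's experimental settings:

        import numpy as np

        rng = np.random.default_rng(0)
        D, K_global, num_local = 2, 10, 5          # dimension, global components, local models
        global_means = rng.normal(scale=5.0, size=(K_global, D))
        global_vars = np.ones((K_global, D))

        local_models = []
        for _ in range(num_local):
            # Each local model sees only a random subset of the global components...
            K_local = int(rng.integers(3, K_global + 1))
            idx = rng.choice(K_global, size=K_local, replace=False)
            # ...and its parameter estimates are perturbed with Gaussian noise,
            # mimicking heterogeneity in both model size and parameters.
            means = global_means[idx] + rng.normal(scale=0.3, size=(K_local, D))
            vars_ = global_vars[idx] * np.exp(rng.normal(scale=0.1, size=(K_local, D)))
            local_models.append((means, vars_))
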
    Conclusion
    • Federated learning techniques vary in complexity and communication overhead: at one extreme, some approaches pass information back and forth between entities until they reach a consensus on the global model.
    • Working with mean field approximations and exponential family distributions leads to a feasible algorithm while staying applicable to a wide array of practical scenarios, as illustrated by the examples in §5.
    • This setup allows the method to use information about the full local distributions, rather than the point estimates used in previous work (Yurochkin et al., 2019a).
    • Model fusion in the Bayesian setting equips the fused model with uncertainty estimates, which are valuable for detecting out-of-distribution samples that are not captured by any of the individual local models, as in Figure 1 (a small uncertainty-scoring sketch follows this list)
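
    As a rough illustration of how uncertainty from a fused posterior can be used for out-of-distribution detection, the sketch below scores an input by the predictive entropy of a linear softmax classifier whose weights are drawn from a factorized Gaussian posterior. The model, the posterior parameters, and the threshold are all placeholders for illustration, not the paper's architecture or settings.

        import numpy as np

        rng = np.random.default_rng(0)

        def softmax(z):
            z = z - z.max()
            e = np.exp(z)
            return e / e.sum()

        def predictive_entropy(x, w_mean, w_var, num_samples=200):
            """Monte Carlo estimate of the entropy of the posterior-averaged
            predictive distribution p(y | x) = E_w[softmax(x @ w)]."""
            probs = np.zeros(w_mean.shape[1])
            for _ in range(num_samples):
                w = w_mean + np.sqrt(w_var) * rng.standard_normal(w_mean.shape)
                probs += softmax(x @ w)
            probs /= num_samples
            return float(-np.sum(probs * np.log(probs + 1e-12)))

        D, C = 4, 3                              # toy input dimension and class count
        w_mean = rng.normal(size=(D, C))         # stand-in for fused posterior means
        w_var = 0.1 * np.ones((D, C))            # stand-in for fused posterior variances

        x = rng.normal(size=D)
        score = predictive_entropy(x, w_mean, w_var)
        is_ood = score > 0.9                     # threshold would be tuned in practice
        print(score, is_ood)
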
    Tables
    • Table 1: MoCap labeling quality comparison
    • Table 2: Comparison of local and fused BNNs
    Related work
    • This paper develops model fusion techniques for approximate Bayesian inference, combining parametric approximations to an intractable posterior distribution. While we primarily focus on mean-field variational inference (VI), owing to its popularity, our methods are equally applicable to Laplace approximations (Bishop, 2006), assumed density filtering (Opper, 1998), and expectation propagation (Minka, 2001), all of which learn a parametric approximation to the posterior. Variational inference approximates the true posterior with a tractable distribution by minimizing the KL divergence between the variational approximation and the true posterior. In contrast with Markov chain Monte Carlo methods, VI relies on optimization and can thus exploit advances in stochastic gradient methods, allowing VI-based inference algorithms to scale to large datasets and to models with many parameters, such as the Bayesian neural networks (BNNs) (Neal, 1995) considered in this work.
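
      To make the criterion explicit, VI fits a tractable family member q by minimizing the KL divergence to the posterior, equivalently maximizing the evidence lower bound; this is the standard formulation rather than notation specific to this paper:

        \[
        q^{\star} = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}\bigl(q(\theta) \,\|\, p(\theta \mid \mathcal{D})\bigr)
                  = \arg\max_{q \in \mathcal{Q}} \; \mathbb{E}_{q}\bigl[\log p(\mathcal{D} \mid \theta)\bigr] - \mathrm{KL}\bigl(q(\theta) \,\|\, p(\theta)\bigr)
        \]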

      Distributed posterior inference has been actively studied in the literature (Hasenclever et al., 2017; Broderick et al., 2013; Bui et al., 2018; Srivastava et al., 2015; Bardenet et al., 2017). As with distributed optimization, however, the goal is typically to achieve computational speedups, leading to approaches ill-suited for model fusion due to the high number of communication rounds required for convergence and the assumption that the datasets are homogeneous. Moreover, the inherent permutation invariance of many high-utility models (e.g., topic models, mixture models, HMMs, and BNNs) is ignored by prior distributed Bayesian learning methods, as it is of minor importance when many communication rounds are permissible. In contrast, our model fusion formulation requires careful consideration of the permutation structure, as we show in the subsequent section (a schematic of such a permutation-aware objective follows this paragraph). Aggregation of Bayesian posteriors respecting permutation structure was considered in Campbell & How (2014), but their method is limited to homogeneous data and requires combinatorial optimization except in a few special cases. Subsequent work relaxes the homogeneity constraint and proposes a greedy streaming approach for Dirichlet process mixture models (Campbell et al., 2015).
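
      Schematically, such an objective matches component i of local posterior j to a global component k through assignment variables P^{(j)} and minimizes the total divergence of matched pairs. The notation below is a sketch consistent with the assign-and-average description above, not the paper's exact formulation:

        \[
        \min_{\tilde{q},\, \{P^{(j)}\}} \; \sum_{j=1}^{J} \sum_{i=1}^{L_j} \sum_{k=1}^{K} P^{(j)}_{ik} \, \mathrm{KL}\bigl(q^{(j)}_{i} \,\|\, \tilde{q}_{k}\bigr)
        \quad \text{s.t.} \quad P^{(j)}_{ik} \in \{0,1\}, \;\; \textstyle\sum_{k} P^{(j)}_{ik} = 1,
        \]

      where q^{(j)}_i denotes the i-th mean-field component of the j-th local posterior and \tilde{q}_k the k-th component of the global model.
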
    Funding
    • Justin Solomon and the MIT Geometric Data Processing group acknowledge the generous support of Army Research Office grants W911NF1710068 and W911NF2010168, of Air Force Office of Scientific Research award FA9550-19-1-031, of National Science Foundation grant IIS-1838071, from the MIT–IBM Watson AI Laboratory, from the Toyota–CSAIL Joint Research Center, from a gift from Adobe Systems, and from the Skoltech–MIT Next Generation Program.
    References
    • Attias, H. A variational Bayesian framework for graphical models. In Advances in Neural Information Processing Systems, pp. 209–215, 2000.
    • Banerjee, A., Dhillon, I. S., Ghosh, J., and Sra, S. Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research, 6:1345–1382, 2005.
    • Bardenet, R., Doucet, A., and Holmes, C. On Markov chain Monte Carlo methods for tall data. Journal of Machine Learning Research, 18(1):1515–1557, 2017.
    • Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006.
    • Blei, D. M. and Jordan, M. I. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1):121–143, 2006.
    • Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
    • Broderick, T., Boyd, N., Wibisono, A., Wilson, A. C., and Jordan, M. I. Streaming variational Bayes. In Advances in Neural Information Processing Systems, pp. 1727–1735, 2013.
    • Bui, T. D., Nguyen, C. V., Swaroop, S., and Turner, R. E. Partitioned variational inference: A unified framework encompassing federated and continual learning. arXiv preprint arXiv:1811.11206, 2018.
    • Campbell, T. and How, J. P. Approximate decentralized Bayesian inference. arXiv preprint arXiv:1403.7471, 2014.
    • Campbell, T., Straub, J., Fisher III, J. W., and How, J. P. Streaming, distributed variational inference for Bayesian nonparametrics. In Advances in Neural Information Processing Systems, pp. 280–288, 2015.
    • Carli, F. P., Ning, L., and Georgiou, T. T. Convex clustering via optimal mass transport. arXiv preprint arXiv:1307.5459, 2013.
    • Ferguson, T. S. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, pp. 209–230, 1973.
    • Fox, E. B., Sudderth, E. B., Jordan, M. I., and Willsky, A. S. An HDP-HMM for systems with state persistence. In International Conference on Machine Learning, pp. 312–319. ACM, 2008.
    • Fox, E. B., Hughes, M. C., Sudderth, E. B., and Jordan, M. I. Joint modeling of multiple time series via the beta process with application to motion capture segmentation. The Annals of Applied Statistics, pp. 1281–1313, 2014.
    • Ghosh, S., Yao, J., and Doshi-Velez, F. Structured variational learning of Bayesian neural networks with horseshoe priors. In International Conference on Machine Learning, pp. 1744–1753, 2018.
    • Ghosh, S., Yao, J., and Doshi-Velez, F. Model selection in Bayesian neural networks via horseshoe priors. Journal of Machine Learning Research, 20(182):1–46, 2019.
    • Hasenclever, L., Webb, S., Lienart, T., Vollmer, S., Lakshminarayanan, B., Blundell, C., and Teh, Y. W. Distributed Bayesian learning with stochastic natural gradient expectation propagation and the posterior server. Journal of Machine Learning Research, 18(1):3744–3780, 2017.
    • Kuhn, H. W. The Hungarian method for the assignment problem. Naval Research Logistics (NRL), 2(1-2):83–97, 1955.
    • Minka, T. P. Expectation propagation for approximate Bayesian inference. In Conference on Uncertainty in Artificial Intelligence, pp. 362–369, 2001.
    • Monteiller, P., Claici, S., Chien, E., Mirzazadeh, F., Solomon, J. M., and Yurochkin, M. Alleviating label switching with optimal transport. In Advances in Neural Information Processing Systems, pp. 13612–13622, 2019.
    • Neal, R. M. Bayesian Learning for Neural Networks. PhD thesis, University of Toronto, 1995.
    • Nguyen, X. Convergence of latent mixing measures in finite and infinite mixture models. The Annals of Statistics, 41(1):370–400, 2013.
    • Nguyen, X. Posterior contraction of the population polytope in finite admixture models. Bernoulli, 21(1):618–646, 2015.
    • Opper, M. A Bayesian approach to on-line learning. On-line Learning in Neural Networks, pp. 363–378, 1998.
    • Rand, W. M. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850, 1971.
    • Smirnov, D., Fisher, M., Kim, V. G., Zhang, R., and Solomon, J. Deep parametric shape predictions using distance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 561–570, 2020.
    • Srivastava, S., Cevher, V., Dinh, Q., and Dunson, D. WASP: Scalable Bayes via barycenters of subset posteriors. In Artificial Intelligence and Statistics, pp. 912–920, 2015.
    • Vinh, N. X., Epps, J., and Bailey, J. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11:2837–2854, 2010.
    • Yurochkin, M., Agarwal, M., Ghosh, S., Greenewald, K., and Hoang, N. Statistical model aggregation via parameter matching. In Advances in Neural Information Processing Systems, pp. 10954–10964, 2019a.
    • Yurochkin, M., Agarwal, M., Ghosh, S., Greenewald, K., Hoang, N., and Khazaeni, Y. Bayesian nonparametric federated learning of neural networks. In International Conference on Machine Learning, pp. 7252–7261, 2019b.