Meta-Learning PAC-Bayes Priors in Model Averaging

National Conference on Artificial Intelligence (AAAI), 2020.

Keywords:
multiple model, distribution dependent prior, frequentist model, model selection, Mean Squared Prediction Error
Weibo:
Moral-Benito (2015) already pointed out the concern: “From a pure empirical viewpoint, model uncertainty represents a concern because estimates may well depend on the particular model considered.” Combining multiple models to reduce model uncertainty is therefore very desirable.

Abstract:

Model uncertainty has become one of the most important problems in both academia and industry. In this paper, we mainly consider the scenario in which we have a common model set used for model averaging, instead of selecting a single final model via a model selection procedure, to account for model uncertainty and to improve r…

Introduction
  • It is very common in practice that the distributions generating the observed data are described more adequately by multiple models.
  • The authors propose a specific risk bound under these settings and two data-based methods for adjusting the priors in the PAC-Bayes framework.
  • In Section 2, an upper bound for the averaged model is established, together with a practical algorithm that uses historical data to obtain a better prior.
Highlights
  • It is very common in practice that the distributions generating the observed data are described more adequately by multiple models
  • The selection of one particular model may lead to riskier decisions since it ignores model uncertainty
  • Moral-Benito (2015) already pointed out the concern: “From a pure empirical viewpoint, model uncertainty represents a concern because estimates may well depend on the particular model considered.” Combining multiple models to reduce model uncertainty is therefore very desirable
  • We propose a specific risk bound under our settings and two data-based methods for adjusting the priors in the PAC-Bayes framework
  • When there is no historical data, Section 3 proposes another method, a sequential batch sampling algorithm, to adjust the prior step by step
  • Each task Ti is used to obtain a candidate prior ξi by any Bayesian procedure, for example, by minimizing the PAC-Bayes bound introduced in Lemma 1 with a non-informative prior (see the sketch after this list)
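The following is a minimal Python sketch of this prior-adjustment idea under simplifying assumptions: the common model set is a fixed finite list of candidate models, the per-task posterior is the Gibbs-style closed-form minimizer of empirical risk plus a KL penalty (a generic PAC-Bayes surrogate, not necessarily the exact bound of Lemma 1), and the meta-learned prior is the plain average of the per-task posteriors. All function names and numbers are illustrative, not the authors' Algorithm 1.

```python
import numpy as np

def pac_bayes_posterior(losses, prior, lam=1.0):
    """Gibbs-style posterior over a finite candidate-model set.

    Minimizing  E_{h~xi}[empirical risk] + KL(xi || prior)/lam  over
    distributions xi has the closed form
    xi_k  proportional to  prior_k * exp(-lam * losses_k),
    where losses[k] is the empirical risk of candidate model k.
    """
    log_w = np.log(prior) - lam * np.asarray(losses)
    log_w -= log_w.max()                      # numerical stability
    w = np.exp(log_w)
    return w / w.sum()

def meta_learn_prior(task_losses, lam=1.0):
    """Aggregate per-task posteriors into a prior for a new task.

    task_losses[i][k] is the empirical risk of candidate model k on
    historical task Ti.  Each task starts from a non-informative
    (uniform) prior; averaging the task posteriors is one simple
    aggregation choice, used here only for illustration.
    """
    n_models = len(task_losses[0])
    uniform = np.full(n_models, 1.0 / n_models)
    posteriors = [pac_bayes_posterior(l, uniform, lam) for l in task_losses]
    return np.mean(posteriors, axis=0)

# Toy usage: 3 historical tasks, 4 candidate models.
task_losses = [[0.9, 0.3, 0.5, 0.8],
               [0.7, 0.2, 0.6, 0.9],
               [0.8, 0.4, 0.4, 0.7]]
print(meta_learn_prior(task_losses, lam=5.0))
# Models with consistently low risk across tasks receive more prior mass.
```

In words: tasks that consistently favor the same candidate models concentrate the prior on those models, and that prior is then handed to the new task as ξ0.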
Results
  • When there is no historical data, Section 3 proposes another method, a sequential batch sampling algorithm, to adjust the prior step by step.
  • The goal of the learning task is to find an optimal distribution ξ, the posterior over h, which minimizes the expected risk R(ξ, D) := E_{h∼ξ} E_{(x,y)∼D} L(h, x, y); the prediction is made by ŷ = E_{h∼ξ} h(x).
  • The choice of ξ balances a tradeoff between the empirical risk R(ξ, S) and a regularization term measuring divergence from the prior, so if the prior ξ0 is far from the true optimal model distribution ξ∗, the posterior ξ will be poor (see the displayed bound after this list).
  • For each sample task Ti, a sample set Si with ni samples is generated from an unknown distribution Di. When there is no ambiguity, the authors use the notation ξ(ξ0, S) to denote the posterior under the prior ξ0 after observing the sample set S.
  • Each task Ti is used to obtain a candidate prior ξi by any Bayesian procedure, for example, by minimizing the PAC-Bayes bound introduced in Lemma 1 with a non-informative prior.
  • If Algorithm 1 is used to obtain the prior from multiple tasks, a corresponding risk bound is obtained by combining Lemmas 1 and 2.
  • The authors then discuss how to adjust the prior over models when there is no information from extra similar tasks.
  • The learner can sample the data according to the prior distribution in the current round.
  • Compared with processing the whole data set at once, this procedure of adjusting the prior leads to a smaller upper bound.
  • Step 2 of the sequential algorithm: get the posterior ξ1 from the first batch B1 by minimizing the risk bound with a non-informative prior.
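The tradeoff described in the bullets above can be made concrete with a standard PAC-Bayes bound for a loss bounded in [0, 1]; this is a generic member of the family, shown only for illustration, and the exact form and constants of the paper's Lemma 1 may differ. With probability at least 1 − δ over a sample S of size n, simultaneously for all posteriors ξ,

$$
R(\xi, D) \;\le\; R(\xi, S) \;+\; \sqrt{\frac{\mathrm{KL}(\xi \,\|\, \xi_0) + \ln(2\sqrt{n}/\delta)}{2n}},
\qquad
\hat{y} = \mathbb{E}_{h \sim \xi}\,[h(x)].
$$

The KL term is the regularization that pulls the posterior ξ toward the prior ξ0; a prior far from the optimal ξ∗ therefore inflates the bound, which is what motivates learning a better prior from historical tasks or from earlier batches of the current data.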
Conclusion
  • Step 5 of the sequential algorithm: get the posterior ξi from batch Bi by minimizing the risk bound with the prior ξi−1 (a code sketch of this sequential procedure follows this list).
  • In the simulations, the new-task data {(xi, yi)} is generated from the linear model yi = 1 + xiᵀβ + σεi, where εi ∼ N(0, 1). (Figure 3 compares RBM, SOIL and HDR in Example 3.)
  • Different priors lead to similar results since the current data has the dominant influence.
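Below is a minimal Python sketch of the sequential batch-sampling idea, under the same simplifying assumptions as the earlier snippet: a finite set of candidate regression models and a Gibbs-style posterior used as a generic PAC-Bayes surrogate rather than the paper's exact bound. The data-generating coefficients, batch layout, λ, and all names are illustrative, not the paper's Example settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative new-task data from a linear model y = 1 + x'beta + sigma*eps.
n, d, sigma = 250, 5, 1.0
beta = np.array([1.0, 0.5, 0.0, 0.0, -0.5])
X = rng.normal(size=(n, d))
y = 1.0 + X @ beta + sigma * rng.normal(size=n)

def fit_predict(train_idx, cols):
    """Least-squares fit using covariate subset `cols`; predict on all points."""
    A = np.column_stack([np.ones(len(train_idx)), X[train_idx][:, cols]])
    coef, *_ = np.linalg.lstsq(A, y[train_idx], rcond=None)
    return np.column_stack([np.ones(n), X[:, cols]]) @ coef

def gibbs_posterior(losses, prior, lam=5.0):
    """Closed-form minimizer of  E_xi[empirical risk] + KL(xi || prior)/lam."""
    log_w = np.log(prior) - lam * losses
    log_w -= log_w.max()
    w = np.exp(log_w)
    return w / w.sum()

candidate_cols = [list(range(k + 1)) for k in range(d)]   # nested candidate models
batches = np.array_split(rng.permutation(n), 5)

# Candidate models are fit once on the first batch; the remaining batches are
# processed sequentially, each round's posterior becoming the next round's prior.
preds = [fit_predict(batches[0], cols) for cols in candidate_cols]
prior = np.full(len(candidate_cols), 1.0 / len(candidate_cols))  # non-informative start
for batch in batches[1:]:
    losses = np.array([np.mean((p[batch] - y[batch]) ** 2) for p in preds])
    prior = gibbs_posterior(losses, prior)

print("final model weights:", np.round(prior, 3))
y_hat = sum(w * p for w, p in zip(prior, preds))   # model-averaged prediction
```

The loop realizes the pattern described in the algorithm steps above: the first posterior is obtained under a non-informative prior, and each later batch reuses the previous posterior as its prior before the final model-averaged prediction is formed.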
Tables
  • Table 1: Simulation settings of Example 1
  • Table 2: Comparison among RBM, SOIL and SBS for Model 1 of Example 1
  • Table 3: Comparison among RBM, SOIL and SBS for Model 2 of Example 1
  • Table 4: Comparison among RBM, SOIL and SBS for Model 3 of Example 1
  • Table 5: Comparison among RBM, SOIL and SBS of Example 2
  • Table 6: Comparison among RBM, SOIL and HDR in real data
  • Table 7: Comparisons of different learning methods on 20 test tasks of classification
Reference
  • [2018] Alquier, P., and Guedj, B. 2018. Simpler PAC-Bayesian bounds for hostile data. Machine Learning 107(5):887–902.
  • [2016] Alquier, P.; Ridgway, J.; and Chopin, N. 2016. On the properties of variational approximations of Gibbs posteriors. The Journal of Machine Learning Research 17(1):8374–8414.
  • [2018] Amit, R., and Meir, R. 2018. Meta-learning by adjusting priors based on extended PAC-Bayes theory. In International Conference on Machine Learning, 205–214.
  • [2004] Catoni, O. 2004. Statistical learning theory and stochastic optimization: Ecole d'Ete de Probabilites de Saint-Flour XXXI-2001. Springer.
  • [2007] Catoni, O. 2007. PAC-Bayesian supervised classification: the thermodynamics of statistical learning. IMS.
  • [2016] Catoni, O. 2016. PAC-Bayesian bounds for the Gram matrix and least squares regression with a random design. arXiv preprint arXiv:1603.05229.
  • [2018] Dziugaite, G. K., and Roy, D. M. 2018. Data-dependent PAC-Bayes priors via differential privacy. In Advances in Neural Information Processing Systems, 8430–8441.
  • [2017] Finn, C.; Abbeel, P.; and Levine, S. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, 1126–1135. JMLR.org.
  • [2016] Grunwald, P. D., and Mehta, N. A. 2016. Fast rates with unbounded losses. arXiv preprint arXiv:1605.00252.
  • [2013] Guedj, B.; Alquier, P.; et al. 2013. PAC-Bayesian estimation and prediction in sparse additive models. Electronic Journal of Statistics 7:264–291.
  • [2007] Hansen, B. E. 2007. Least squares model averaging. Econometrica 75(4):1175–1189.
  • [2003] Hjort, N. L., and Claeskens, G. 2003. Frequentist model average estimators. Journal of the American Statistical Association 98(464):879–899.
  • [2008] Huang, J.; Ma, S.; and Zhang, C. H. 2008. Adaptive lasso for sparse high-dimensional regression. Statistica Sinica 18(4):1603–1618.
  • [1978] Leamer, E. E. 1978. Specification Searches. New York: Wiley.
  • [1998] LeCun, Y. 1998. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
  • [2006] Leung, G., and Barron, A. R. 2006. Information theory and mixing least-squares regressions. IEEE Transactions on Information Theory 52(8):3396–3410.
  • [2013] Lever, G.; Laviolette, F.; and Shawe-Taylor, J. 2013. Tighter PAC-Bayes bounds through distribution-dependent priors. Theoretical Computer Science 473(2):4–28.
  • [2017] Li, Z.; Zhou, F.; Chen, F.; and Li, H. 2017. Meta-SGD: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835.
  • [2019] Lugosi, G.; Mendelson, S.; et al. 2019.
  • [1999] McAllester, D. A. 1999. PAC-Bayesian model averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, 164–170.
  • [2015] Moral-Benito, E. 2015. Model averaging in economics: An overview. Journal of Economic Surveys 29(1):46–75.
  • [2016] Oneto, L.; Anguita, D.; and Ridella, S. 2016. PAC-Bayesian analysis of distribution dependent priors: Tighter risk bounds and stability analysis. Pattern Recognition Letters 80:200–207.
  • [1995] Raftery, A. E. 1995. Bayesian model selection in social research. Sociological Methodology 25:111–163.
  • [1996] Raftery, A. E. 1996. Approximate Bayes factors and accounting for model uncertainty in generalised linear models. Biometrika 83(2):251–266.
  • [2018] Rivasplata, O.; Szepesvari, C.; Shawe-Taylor, J. S.; Parrado-Hernandez, E.; and Sun, S. 2018. PAC-Bayes bounds for stable algorithms with instance-dependent priors. In Advances in Neural Information Processing Systems, 9214–9224.
  • [2002] Seeger, M. 2002. PAC-Bayesian generalisation error bounds for Gaussian process classification. Journal of Machine Learning Research 3(2):233–269.
  • [2010] Seldin, Y., and Tishby, N. 2010. PAC-Bayesian analysis of co-clustering and beyond. Journal of Machine Learning Research 11(Dec):3595–3646.
  • [1954] Tuddenham, R. D., and Snyder, M. M. 1954. Physical growth of California boys and girls from birth to eighteen years. Publications in Child Development 1:183–364.
  • [2009] Wang, H.; Zhang, X.; and Zou, G. 2009. Frequentist model averaging estimation: a review. Journal of Systems Science and Complexity 22(4):732.
  • [2000] Yang, Y. 2000. Combining different procedures for adaptive regression. Journal of Multivariate Analysis 74(1):135–161.
  • [2001] Yang, Y. 2001. Adaptive regression by mixing. Journal of the American Statistical Association 96(454):574–588.
  • [2016] Ye, C.; Yang, Y.; and Yang, Y. 2016. Sparsity oriented importance learning for high-dimensional linear regression. Journal of the American Statistical Association (2):1–16.
  • [2018] Zhou, Q.; Ernst, P. A.; Morgan, K. L.; Rubin, D. B.; and Zhang, A. 2018. Sequential rerandomization. Biometrika 105(3):745–752.