# Meta-Learning PAC-Bayes Priors in Model Averaging

National Conference on Artificial Intelligence (AAAI), 2020.

Keywords:

multiple model; distribution-dependent prior; frequentist model; model selection; Mean Squared Prediction Error

Abstract:

Nowadays, model uncertainty has become one of the most important problems in both academia and industry. In this paper, we mainly consider the scenario in which a common model set is used for model averaging, instead of selecting a single final model via a model selection procedure, to account for model uncertainty and to improve r…

Introduction

- It is very common in practice that the distributions generating the observed data are described more adequately by multiple models.
- The authors propose a specific risk bound under the settings and two data-based methods for adjusting the priors in the PAC-Bayes framework.
- In Section 2, an upper bound for the averaging model is established, together with a practical algorithm that uses historical data to obtain a better prior.

Highlights

- It is very common in practice that the distributions generating the observed data are described more adequately by multiple models
- The selection of one particular model may lead to riskier decisions since it ignores the model uncertainty
- Moral-Benito (2015) already pointed out this concern: “From a pure empirical viewpoint, model uncertainty represents a concern because estimates may well depend on the particular model considered.” Combining multiple models to reduce the model uncertainty is therefore very desirable
- We propose a specific risk bound under our settings and two data-based methods for adjusting the priors in the PAC-Bayes framework
- In case there is no historical data, Section 3 proposes another method, a sequential batch sampling algorithm, to adjust the prior step by step
- The candidate priors ξi are obtained from each task Ti by any Bayesian procedure, for example by minimizing the PAC bound introduced in Lemma 1 with a non-informative prior
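The multi-task step above can be sketched as follows. Over a finite model set, the minimizer of a PAC-Bayes bound of the form R̂(ξ, S) + KL(ξ‖ξ0)/λ is the Gibbs posterior ξ(h) ∝ ξ0(h) exp(−λ R̂(h, S)); both the Gibbs form and the plain averaging of the candidate priors are illustrative assumptions, since the paper's Algorithm 1 is not reproduced here.

```python
import numpy as np

def gibbs_posterior(emp_risks, prior, lam):
    """Gibbs posterior xi(h) ∝ prior(h) * exp(-lam * emp_risk(h)) over a
    finite model set -- the classical minimizer of a PAC-Bayes bound of
    the form R_hat(xi, S) + KL(xi || prior) / lam."""
    log_w = np.log(prior) - lam * emp_risks
    log_w -= log_w.max()              # subtract max for numerical stability
    w = np.exp(log_w)
    return w / w.sum()

def meta_prior_from_tasks(task_emp_risks, lam):
    """For each historical task T_i, compute a candidate prior xi_i from a
    non-informative (uniform) start, then aggregate.  Plain averaging is
    an assumption here; the paper's Algorithm 1 may combine the
    candidates differently."""
    n_models = task_emp_risks[0].shape[0]
    uniform = np.full(n_models, 1.0 / n_models)   # non-informative prior
    candidates = [gibbs_posterior(r, uniform, lam) for r in task_emp_risks]
    return np.mean(candidates, axis=0)
```

Given per-model empirical risks on a few historical tasks, the aggregated prior places more mass on models that are consistently low-risk across tasks, which is exactly the information a data-dependent prior is meant to carry into the new task.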

Results

- In case there is no historical data, Section 3 proposes another method, a sequential batch sampling algorithm, to adjust the prior step by step.
- The goal of the learning task is to find an optimal distribution ξ, the posterior over h, which minimizes the expected risk R(ξ, D) := Eh∼ξ E(x,y)∼D L(h, x, y); the prediction is then made by ŷ = Eh∼ξ h(x).
- Since the choice of ξ balances the tradeoff between the empirical risk R(ξ, S) and the regularization term, the resulting posterior ξ will be poor if the prior ξ0 is far from the true optimal model distribution ξ∗.
- For each sample task Ti, a sample set Si with ni samples is generated from an unknown distribution Di. Without ambiguity, the authors use the notation ξ(ξ0, S) to denote the posterior under the prior ξ0 after observing the sample set S.
- The candidate priors ξi are obtained from each task Ti by any Bayesian procedure, for example by minimizing the PAC bound introduced in Lemma 1 with a non-informative prior.
- If Algorithm 1 is used to obtain the prior from multiple tasks, the following theorem can be derived by combining Lemmas 1 and 2.
- The authors will discuss how to adjust the prior of models if there is no information from extra similar tasks.
- The learner can sample the data according to the prior distribution in the current round.
- Compared with dealing with the whole data at once, this procedure of adjusting prior leads to a smaller upper bound.
- Get the posterior ξ1 based on the sample set B1 by minimizing the risk bound with a non-informative prior.
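The bullets above (a finite model set, empirical risks, a KL-regularized posterior, and the averaged prediction ŷ = Eh∼ξ h(x)) can be illustrated with a minimal numerical sketch. The three candidate coefficient vectors, the data-generating model, and the tradeoff level `lam` below are invented for illustration, not taken from the paper's experiments.

```python
import numpy as np

# hypothetical finite model set: three linear predictors h_k(x) = x @ beta_k
betas = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.9, 0.1])]

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.0, 0.0]) + 0.1 * rng.normal(size=50)  # true model = betas[0]

# empirical risk R_hat(h, S) of each candidate model under squared-error loss
emp_risk = np.array([np.mean((y - X @ b) ** 2) for b in betas])

# posterior balancing empirical risk against KL to a uniform prior:
# xi(h) ∝ prior(h) * exp(-lam * R_hat(h)); lam is an assumed tradeoff level
prior = np.full(len(betas), 1.0 / len(betas))
lam = 20.0
log_w = np.log(prior) - lam * emp_risk
xi = np.exp(log_w - log_w.max())      # shift by max for numerical stability
xi /= xi.sum()

# model-averaged prediction  y_hat = E_{h~xi} h(x)
x_new = np.array([1.0, -1.0])
y_hat = sum(w * float(x_new @ b) for w, b in zip(xi, betas))
```

With enough data, the posterior concentrates on the low-risk model, so the averaged prediction behaves like the best single model while still hedging across the set when the evidence is weak.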

Conclusion

- Get the posterior ξi based on the sample set Bi by minimizing the risk bound with the prior ξi−1.
- The new-task data {(xi, yi)} are generated from the linear model yi = 1 + xiᵀβ + σεi, where εi ∼ N(0, 1). (Figure 3 compares RBM, SOIL and HDR in Example 3.)
- Different priors lead to similar results, since the current data has the dominant influence.
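The sequential steps above ("get the posterior from batch Bi with prior ξi−1") suggest the following loop. The Gibbs-posterior update and the `risk_fn` callback are illustrative stand-ins for minimizing the paper's actual risk bound on each batch.

```python
import numpy as np

def gibbs_update(emp_risks, prior, lam):
    """One bound-minimization step, approximated by a Gibbs posterior."""
    log_w = np.log(prior) - lam * emp_risks
    log_w -= log_w.max()              # numerical stability
    w = np.exp(log_w)
    return w / w.sum()

def sequential_batch_prior(batches, risk_fn, n_models, lam):
    """Sequential batch sampling sketch: start from a non-informative
    prior over n_models candidates, then let the posterior obtained from
    batch B_i serve as the prior for batch B_{i+1}."""
    xi = np.full(n_models, 1.0 / n_models)
    for batch in batches:
        xi = gibbs_update(risk_fn(batch), xi, lam)  # posterior -> next prior
    return xi
```

Because each batch only refines the previous posterior, the prior entering each round already reflects the data seen so far, which is the intuition behind the claim that this procedure yields a smaller upper bound than processing all the data at once.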


- Table 1: Simulation settings of Example 1
- Table 2: Comparison among RBM, SOIL and SBS for Model 1 of Example 1
- Table 3: Comparison among RBM, SOIL and SBS for Model 2 of Example 1
- Table 4: Comparison among RBM, SOIL and SBS for Model 3 of Example 1
- Table 5: Comparison among RBM, SOIL and SBS of Example 2
- Table 6: Comparison among RBM, SOIL and HDR in real data
- Table 7: Comparisons of different learning methods on 20 test tasks of classification

References

- [2018] Alquier, P., and Guedj, B. 2018. Simpler PACBayesian bounds for hostile data. Machine Learning 107(5):887–902.
- [2016] Alquier, P.; Ridgway, J.; and Chopin, N. 2016. On the properties of variational approximations of Gibbs posteriors. The Journal of Machine Learning Research 17(1):8374– 8414.
- [2018] Amit, R., and Meir, R. 2018. Meta-learning by adjusting priors based on extended PAC-Bayes theory. In International Conference on Machine Learning, 205–214.
- [2004] Catoni, O. 2004. Statistical learning theory and stochastic optimization: Ecole d'Eté de Probabilités de Saint-Flour XXXI-2001. Springer.
- [2007] Catoni, O. 2007. PAC-Bayesian supervised classification: the thermodynamics of statistical learning. IMS.
- [2016] Catoni, O. 2016. PAC-Bayesian bounds for the Gram matrix and least squares regression with a random design. arXiv preprint arXiv:1603.05229.
- [2018] Dziugaite, G. K., and Roy, D. M. 2018. Data-dependent PAC-Bayes priors via differential privacy. In Advances in Neural Information Processing Systems, 8430–8441.
- [2017] Finn, C.; Abbeel, P.; and Levine, S. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, 1126–1135. JMLR.org.
- [2016] Grunwald, P. D., and Mehta, N. A. 2016. Fast rates with unbounded losses. arXiv preprint arXiv:1605.00252.
- [2013] Guedj, B.; Alquier, P.; et al. 2013. PAC-Bayesian estimation and prediction in sparse additive models. Electronic Journal of Statistics 7:264–291.
- [2007] Hansen, B. E. 2007. Least squares model averaging. Econometrica 75(4):1175–1189.
- [2003] Hjort, N. L., and Claeskens, G. 2003. Frequentist model average estimators. Journal of the American Statistical Association 98(464):879–899.
- [2008] Huang, J.; Ma, S.; and Zhang, C. H. 2008. Adaptive lasso for sparse high-dimensional regression. Stat Sin 18(4):1603–1618.
- [1978] Leamer, E. E. 1978. Specification searches. New York: Wiley.
- [1998] LeCun, Y. 1998. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
- [2006] Leung, G., and Barron, A. R. 2006. Information theory and mixing least-squares regressions. IEEE Transactions on Information Theory 52(8):3396–3410.
- [2013] Lever, G.; Laviolette, F.; and Shawe-Taylor, J. 2013. Tighter PAC-Bayes bounds through distribution-dependent priors. Theoretical Computer Science 473(2):4–28.
- [2017] Li, Z.; Zhou, F.; Chen, F.; and Li, H. 2017. Metasgd: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835.
- [2019] Lugosi, G.; Mendelson, S.; et al. 2019.
- [1999] McAllester, D. A. 1999. PAC-Bayesian model averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, 164–170.
- [2015] Moral-Benito, E. 2015. Model averaging in economics: An overview. Journal of Economic Surveys 29(1):46–75.
- [2016] Oneto, L.; Anguita, D.; and Ridella, S. 2016. PAC-Bayesian analysis of distribution-dependent priors: Tighter risk bounds and stability analysis. Pattern Recognition Letters 80:200–207.
- [1995] Raftery, A. E. 1995. Bayesian model selection in social research. Sociological Methodology 25(25):111–163.
- [1996] Raftery, A. E. 1996. Approximate Bayes factors and accounting for model uncertainty in generalised linear models. Biometrika 83(2):251–266.
- [2018] Rivasplata, O.; Szepesvari, C.; Shawe-Taylor, J. S.; Parrado-Hernandez, E.; and Sun, S. 2018. PAC-Bayes bounds for stable algorithms with instance-dependent priors. In Advances in Neural Information Processing Systems, 9214–9224.
- [2002] Seeger, M. 2002. PAC-Bayesian generalisation error bounds for Gaussian process classification. Journal of Machine Learning Research 3(2):233–269.
- [2010] Seldin, Y., and Tishby, N. 2010. PAC-Bayesian analysis of co-clustering and beyond. Journal of Machine Learning Research 11(Dec):3595–3646.
- [1954] Tuddenham, R. D., and Snyder, M. M. 1954. Physical growth of California boys and girls from birth to eighteen years. Publications in Child Development 1:183–364.
- [2009] Wang, H.; Zhang, X.; and Zou, G. 2009. Frequentist model averaging estimation: a review. Journal of Systems Science and Complexity 22(4):732.
- [2000] Yang, Y. 2000. Combining different procedures for adaptive regression. Journal of multivariate analysis 74(1):135–161.
- [2001] Yang, Y. 2001. Adaptive regression by mixing. Journal of the American Statistical Association 96(454):574– 588.
- [2016] Ye, C.; Yang, Y.; and Yang, Y. 2016. Sparsity oriented importance learning for high-dimensional linear regression. Journal of the American Statistical Association (2):1–16.
- [2018] Zhou, Q.; Ernst, P. A.; Morgan, K. L.; Rubin, D. B.; and Zhang, A. 2018. Sequential rerandomization. Biometrika 105(3):745–752.
