Scalable Bayesian Optimization Using Deep Neural Networks

International Conference on Machine Learning, 2015.

Cited by: 495
Keywords:
Gaussian processes, maximum a posteriori, Ontario Graduate Scholarship, acquisition function, probabilistic model

Abstract:

Bayesian optimization is an effective methodology for the global optimization of functions with expensive evaluations. It relies on querying a distribution over functions defined by a relatively cheap surrogate model. An accurate model for this distribution over functions is critical to the effectiveness of the approach, and is typically ...

Introduction
  • The field of machine learning has seen unprecedented growth due to a new wealth of data, increases in computational power, new algorithms, and a plethora of exciting new applications.
  • The growing complexity of machine learning models inevitably comes with the introduction of additional hyperparameters
  • These range from design decisions such as the shape of a neural network architecture, to optimization parameters such as learning rates, to regularization hyperparameters such as weight decay.
  • Proper setting of these hyperparameters is critical for performance on difficult problems.
  • Bayesian optimization proceeds by performing a proxy optimization over an acquisition function, defined with respect to the surrogate model's predictive distribution, in order to determine the next input to evaluate; a minimal sketch of this loop follows this list.
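
    A minimal sketch of that loop, assuming only a generic surrogate that returns a predictive mean and standard deviation, and using the standard expected-improvement (EI) acquisition as a concrete example. The names here (bayes_opt, expected_improvement, fit_surrogate) are illustrative and not taken from the paper's code:

        import numpy as np
        from scipy.stats import norm

        def expected_improvement(mu, sigma, best_y):
            # EI for minimization: expected amount by which a candidate improves on the incumbent best_y.
            sigma = np.maximum(sigma, 1e-9)
            z = (best_y - mu) / sigma
            return (best_y - mu) * norm.cdf(z) + sigma * norm.pdf(z)

        def bayes_opt(objective, fit_surrogate, bounds, n_init=5, n_iter=20, n_cand=5000, seed=0):
            # bounds: (dim, 2) array; fit_surrogate(X, y) must return a callable X_new -> (mu, sigma).
            rng = np.random.default_rng(seed)
            X = rng.uniform(bounds[:, 0], bounds[:, 1], size=(n_init, bounds.shape[0]))
            y = np.array([objective(x) for x in X])
            for _ in range(n_iter):
                predict = fit_surrogate(X, y)                    # cheap model of the expensive objective
                cand = rng.uniform(bounds[:, 0], bounds[:, 1], size=(n_cand, bounds.shape[0]))
                mu, sigma = predict(cand)
                x_next = cand[np.argmax(expected_improvement(mu, sigma, y.min()))]  # proxy optimization
                X, y = np.vstack([X, x_next]), np.append(y, objective(x_next))
            return X[np.argmin(y)], y.min()

    Any probabilistic model exposing a predictive mean and uncertainty can be supplied as fit_surrogate, which is why the choice of surrogate is the central design question addressed by the paper.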
Highlights
  • Recently, the field of machine learning has seen unprecedented growth due to a new wealth of data, increases in computational power, new algorithms, and a plethora of exciting new applications
  • Although it may seem that we are merely moving the problem of setting the hyperparameters of the model being tuned to setting them for the tuner itself, we show that, for a suitable set of design choices, it is possible to create a robust, scalable, and effective Bayesian optimization system that generalizes across many global optimization problems
  • We demonstrate the effectiveness of Deep Networks for Global Optimization on a number of difficult problems, including benchmark problems for Bayesian optimization, convolutional neural networks for object recognition, and multi-modal neural language models for image caption generation
  • Bayesian optimization relies on the construction of a probabilistic model that defines a distribution over objective functions from the input space to the objective of interest
  • We introduced Deep Networks for Global Optimization (DNGO), which enables efficient optimization of noisy, expensive black-box functions
  • While this model maintains desirable properties of Gaussian processes, such as tractability and principled management of uncertainty, it greatly improves scalability, from cubic to linear in the number of observations; a simplified sketch of this construction follows this list
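
    DNGO achieves this scaling by placing a Bayesian linear regressor on the features produced by a deep network's last hidden layer (adaptive basis regression). Below is a simplified sketch, with fixed random tanh features standing in for the trained network and fixed regression hyperparameters alpha and beta; the actual DNGO model trains the network on the observed data and treats these hyperparameters more carefully:

        import numpy as np

        def make_features(dim_in, dim_feat=50, seed=0):
            # Stand-in for the last hidden layer of a trained network (random tanh features).
            rng = np.random.default_rng(seed)
            W, b = rng.normal(size=(dim_in, dim_feat)), rng.normal(size=dim_feat)
            return lambda X: np.tanh(np.atleast_2d(X) @ W + b)

        def fit_bayes_linear(features, X, y, alpha=1.0, beta=100.0):
            # Bayesian linear regression on D basis functions: fitting costs O(N D^2 + D^3) and
            # each prediction O(D^2) -- linear in the number of observations N, in contrast to
            # the O(N^3) cost of exact GP inference.
            Phi = features(X)                                    # N x D design matrix
            A_inv = np.linalg.inv(beta * Phi.T @ Phi + alpha * np.eye(Phi.shape[1]))
            m = beta * A_inv @ Phi.T @ y                         # posterior mean of the output weights
            def predict(X_new):
                P = features(X_new)
                var = 1.0 / beta + np.einsum('nd,dk,nk->n', P, A_inv, P)
                return P @ m, np.sqrt(var)
            return predict

    Wrapping this as fit_surrogate = lambda X, y: fit_bayes_linear(make_features(X.shape[1]), X, y) lets it slot directly into the loop sketched under Introduction.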
Methods
  • To demonstrate the effectiveness of the approach, the authors compare DNGO to the scalable model-based optimization methods SMAC and TPE, as well as the input-warped Gaussian process method of Snoek et al. (2014), on the benchmark set of continuous problems from the HPOLib package (Eggensperger et al., 2013).
  • As Table 1 shows, DNGO significantly outperforms SMAC and TPE, and is competitive with the Gaussian process approach
  • This shows that, despite vast improvements in scalability, DNGO retains the statistical efficiency of the Gaussian process method in terms of the number of evaluations required to find the minimum.
  • In this experiment, the authors explore the effectiveness of DNGO on a practical and expensive problem where highly parallel evaluation is necessary to make progress in a reasonable amount of time (one way to propose points under such parallelism is sketched after this list).
  • The authors optimize the hyperparameters of the log-bilinear model (LBL) from Kiros et al. (2014) to maximize the BLEU score for image caption generation on the Microsoft COCO dataset (a sample generated caption: “A person riding a wave in the ocean.”).
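
    One common way to keep many workers busy, sketched here for illustration only and not necessarily the authors' exact scheme: outcomes of pending evaluations are sampled ("fantasized") from the surrogate's predictive distribution, and the acquisition is averaged over those fantasies before the next point is chosen. The function below assumes the fit_surrogate and acquisition interfaces from the earlier sketches.

        import numpy as np

        def parallel_suggest(fit_surrogate, acquisition, X_obs, y_obs, X_pending, candidates,
                             n_fantasies=10, seed=0):
            # Average the acquisition over fantasized outcomes of pending jobs so that
            # concurrently proposed points remain distinct and informative.
            rng = np.random.default_rng(seed)
            mu_p, sigma_p = fit_surrogate(X_obs, y_obs)(X_pending)
            scores = np.zeros(len(candidates))
            for _ in range(n_fantasies):
                y_fant = rng.normal(mu_p, sigma_p)                   # fantasize pending outcomes
                X_aug = np.vstack([X_obs, X_pending])
                y_aug = np.concatenate([y_obs, y_fant])
                mu, sigma = fit_surrogate(X_aug, y_aug)(candidates)  # refit on augmented data
                scores += acquisition(mu, sigma, y_aug.min())
            return candidates[np.argmax(scores / n_fantasies)]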
Results
  • The authors find hyperparameter settings that are competitive with the state of the art, achieving test errors of 6.37% and 27.4% on CIFAR-10 and CIFAR-100, respectively, and BLEU scores of 25.1 and 26.7 on the Microsoft COCO 2014 dataset using a single model and a 3-model ensemble, respectively.
Conclusion
  • The authors introduced deep networks for global optimization, or DNGO, which enables efficient optimization of noisy, expensive black-box functions.
  • While this model maintains desirable properties of the GP such as tractability and principled management of uncertainty, it greatly improves its scalability from cubic to linear as a function of the number of observations.
  • One promising line of future work, explored for example by Nickson et al. (2014), is to apply a similar methodology with a sparse Gaussian process as the underlying probabilistic model (Snelson & Ghahramani, 2005; Titsias, 2009; Hensman et al., 2013); the relevant complexity figures are summarized after this list.
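
    For context, the standard training-cost figures behind this comparison (textbook complexity results, not numbers taken from the paper), with N observations, D basis functions as in DNGO, and M inducing points:

        \begin{aligned}
        \text{exact GP:} \quad & \mathcal{O}(N^{3}) \\
        \text{Bayesian linear regression on } D \text{ features:} \quad & \mathcal{O}(N D^{2} + D^{3}) \\
        \text{sparse GP with } M \text{ inducing points:} \quad & \mathcal{O}(N M^{2})
        \end{aligned}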
Summary
  • Objectives:

    The goal of this work is to develop a method for scaling Bayesian optimization, while still maintaining its desirable flexibility and characterization of uncertainty.
Tables
  • Table 1: Evaluation of DNGO on global optimization benchmark problems versus scalable (TPE, SMAC) and non-scalable (Spearmint) Bayesian optimization methods. All problems are minimization problems. For each problem, each method was run 10 times to produce error bars.
  • Table 2: Image caption generation results using BLEU-4 on the Microsoft COCO 2014 test set. Regularized and ensembled LSTM results are reported in Zaremba et al. (2015). The baseline LBL tuned by a human expert and the Soft- and Hard-Attention models are reported in Xu et al. (2015). Ensembling our top models from the optimization further improves results significantly. We noticed multiple distinct local optima in the hyperparameter space, which may explain the dramatic improvement from ensembling a small subset of models.
  • Table 3: Our convolutional neural network architecture. This architecture was chosen to be maximally generic. Each convolution layer is followed by a ReLU nonlinearity.
  • Table 4: We use our algorithm to optimize validation-set error as a function of various hyperparameters of a convolutional neural network. We report the test errors of the models with the optimal hyperparameter configurations, as compared to current state-of-the-art results.
Funding
  • This work was supported by the Director, Office of Science, Office of Advanced Scientific Computing Research, Applied Mathematics program of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231
  • This work used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231
  • We would like to acknowledge the NERSC systems staff, in particular Helen He and Harvey Wasserman, for providing us with generous access to the Babbage Xeon Phi testbed. The image caption generation computations in this paper were run on the Odyssey cluster supported by the FAS Division of Science, Research Computing Group at Harvard University
  • This work was partially funded by NSF IIS-1421780, the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Canadian Institute for Advanced Research (CIFAR)
References
  • Bardenet, R., Brendel, M., Kégl, B., and Sebag, M. Collaborative hyperparameter tuning. In ICML, 2013.
  • Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281–305, 2012.
  • Bergstra, J. S., Bardenet, R., Bengio, Y., and Kégl, B. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, 2011.
  • Bishop, C. M. Pattern Recognition and Machine Learning. Springer-Verlag New York, Inc., 2006.
  • Brochu, E., Brochu, T., and de Freitas, N. A Bayesian interactive optimization approach to procedural animation design. In ACM SIGGRAPH/Eurographics Symposium on Computer Animation, 2010.
  • Bull, A. D. Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, (3-4):2879–2904, 2011.
  • Buntine, W. L. and Weigend, A. S. Bayesian back-propagation. Complex Systems, 5(6):603–643, 1991.
  • Calandra, R., Peters, J., Rasmussen, C. E., and Deisenroth, M. P. Manifold Gaussian processes for regression. arXiv preprint arXiv:1402.5876, 2014a.
  • Calandra, R., Peters, J., Seyfarth, A., and Deisenroth, M. P. An experimental evaluation of Bayesian optimization on bipedal locomotion. In International Conference on Robotics and Automation, 2014b.
  • Carvalho, C. M., Polson, N. G., and Scott, J. G. Handling sparsity via the horseshoe. In Artificial Intelligence and Statistics, 2009.
  • De Freitas, J. F. Bayesian methods for neural networks. PhD thesis, Trinity College, University of Cambridge, 2003.
  • de Freitas, N., Smola, A. J., and Zoghi, M. Exponential regret bounds for Gaussian process bandits with deterministic observations. In ICML, 2012.
  • Djolonga, J., Krause, A., and Cevher, V. High dimensional Gaussian process bandits. In Advances in Neural Information Processing Systems, 2013.
  • Eggensperger, K., Feurer, M., Hutter, F., Bergstra, J., Snoek, J., Hoos, H., and Leyton-Brown, K. Towards an empirical foundation for assessing Bayesian optimization of hyperparameters. In NIPS Workshop on Bayesian Optimization in Theory and Practice, 2013.
  • Feurer, M., Springenberg, T., and Hutter, F. Initializing Bayesian hyperparameter optimization via meta-learning. In AAAI Conference on Artificial Intelligence, 2015.
  • Garnett, R., Osborne, M. A., and Roberts, S. J. Bayesian optimization for sensor set selection. In International Conference on Information Processing in Sensor Networks, 2010.
  • Gelbart, M. A., Snoek, J., and Adams, R. P. Bayesian optimization with unknown constraints. In Uncertainty in Artificial Intelligence, 2014.
  • Ginsbourger, D. and Riche, R. L. Dealing with asynchronicity in parallel Gaussian process based global optimization. http://hal.archives-ouvertes.fr/hal-00507632, 2010.
  • Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A. C., and Bengio, Y. Maxout networks. In ICML, 2013.
  • Gramacy, R. B. and Lee, H. K. H. Optimization under unknown constraints. arXiv:1004.4027, 2010.
  • Hensman, J., Fusi, N., and Lawrence, N. Gaussian processes for big data. In Uncertainty in Artificial Intelligence, 2013.
  • Hinton, G. E. and van Camp, D. Keeping neural networks simple by minimizing the description length of the weights. In ACM Conference on Computational Learning Theory, 1993.
  • Hinton, G. E. and Salakhutdinov, R. Using deep belief nets to learn covariance kernels for Gaussian processes. In Advances in Neural Information Processing Systems, pp. 1249–1256, 2008.
  • Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
  • Hoffman, M., Brochu, E., and de Freitas, N. Portfolio allocation for Bayesian optimization. In Uncertainty in Artificial Intelligence, 2011.
  • Hutter, F., Hoos, H. H., and Leyton-Brown, K. Sequential model-based optimization for general algorithm configuration. In Learning and Intelligent Optimization 5, 2011.
  • Jones, D. R. A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21, 2001.
  • Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
  • Kiros, R., Salakhutdinov, R., and Zemel, R. S. Multimodal neural language models. In ICML, 2014.
  • Krause, A. and Ong, C. S. Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems, 2011.
  • Kushner, H. J. A new method for locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering, 86, 1964.
  • Lázaro-Gredilla, M. and Figueiras-Vidal, A. R. Marginalized neural network mixtures for large-scale regression. IEEE Transactions on Neural Networks, 21(8):1345–1351, 2010.
  • Lee, C.-Y., Xie, S., Gallagher, P., Zhang, Z., and Tu, Z. Deeply supervised nets. In Deep Learning and Representation Learning Workshop, NIPS, 2014.
  • Lin, M., Chen, Q., and Yan, S. Network in network. CoRR, abs/1312.4400, 2013. URL http://arxiv.org/abs/1312.4400.
  • Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: Common objects in context. In ECCV, pp. 740–755, 2014.
  • Lizotte, D. Practical Bayesian Optimization. PhD thesis, University of Alberta, Edmonton, Alberta, 2008.
  • MacKay, D. J. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.
  • Mahendran, N., Wang, Z., Hamze, F., and de Freitas, N. Adaptive MCMC with Bayesian optimization. In Artificial Intelligence and Statistics, 2012.
  • Mnih, A. and Gregor, K. Neural variational inference and learning in belief networks. In ICML, 2014.
  • Mockus, J., Tiesis, V., and Zilinskas, A. The application of Bayesian methods for seeking the extremum. Towards Global Optimization, 2, 1978.
  • Neal, R. Slice sampling. Annals of Statistics, 31:705–767, 2000.
  • Neal, R. M. Bayesian learning for neural networks. PhD thesis, University of Toronto, 1995.
  • Nickson, T., Osborne, M. A., Reece, S., and Roberts, S. Automated machine learning using stochastic algorithm tuning. NIPS Workshop on Bayesian Optimization, 2014.
  • Osborne, M. A., Garnett, R., and Roberts, S. J. Gaussian processes for global optimization. In Learning and Intelligent Optimization, 2009.
  • Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and variational inference in deep latent Gaussian models. In ICML, 2014.
  • Snelson, E. and Ghahramani, Z. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pp. 1257–1264, 2005.
  • Snoek, J. Bayesian Optimization and Semiparametric Models with Applications to Assistive Technology. PhD thesis, University of Toronto, Toronto, Canada, 2013.
  • Snoek, J., Larochelle, H., and Adams, R. P. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, 2012.
  • Snoek, J., Swersky, K., Zemel, R. S., and Adams, R. P. Input warping for Bayesian optimization of non-stationary functions. In ICML, 2014.
  • Springenberg, J. T., Dosovitskiy, A., Brox, T., and Riedmiller, M. A. Striving for simplicity: The all convolutional net. CoRR, abs/1412.6806, 2014. URL http://arxiv.org/abs/1412.6806.
  • Srinivas, N., Krause, A., Kakade, S., and Seeger, M. Gaussian process optimization in the bandit setting: no regret and experimental design. In ICML, 2010.
  • Swersky, K., Snoek, J., and Adams, R. P. Multi-task Bayesian optimization. In Advances in Neural Information Processing Systems, 2013.
  • Titsias, M. K. Variational learning of inducing variables in sparse Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pp. 567–574, 2009.
  • Wan, L., Zeiler, M. D., Zhang, S., LeCun, Y., and Fergus, R. Regularization of neural networks using DropConnect. In ICML, 2013.
  • Wang, Z., Zoghi, M., Hutter, F., Matheson, D., and de Freitas, N. Bayesian optimization in high dimensions via random embeddings. In IJCAI, 2013.
  • Williams, C. K. I. Computing with infinite networks. In Advances in Neural Information Processing Systems, 1996.
  • Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., and Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044v2, 2015.
  • Zaremba, W., Sutskever, I., and Vinyals, O. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2015.