# Scalable Bayesian Optimization Using Deep Neural Networks

International Conference on Machine Learning, 2015.


Keywords:

Gaussian processes, maximum a posteriori, Ontario Graduate Scholarship, acquisition function, probabilistic model

Abstract:

Bayesian optimization is an effective methodology for the global optimization of functions with expensive evaluations. It relies on querying a distribution over functions defined by a relatively cheap surrogate model. An accurate model for this distribution over functions is critical to the effectiveness of the approach, and is typically …


Introduction

- The field of machine learning has seen unprecedented growth due to a new wealth of data, increases in computational power, new algorithms, and a plethora of exciting new applications.
- The growing complexity of machine learning models inevitably comes with the introduction of additional hyperparameters
- These range from design decisions such as the shape of a neural network architecture, to optimization parameters such as learning rates, to regularization hyperparameters such as weight decay.
- Proper setting of these hyperparameters is critical for performance on difficult problems.
- Bayesian optimization proceeds by performing a proxy optimization over an acquisition function in order to determine the next input to evaluate.
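The proxy optimization described above can be sketched end-to-end: a minimal Bayesian optimization loop that fits a Gaussian process surrogate to past evaluations and selects the next point by maximizing expected improvement (EI) over a candidate grid. This is a toy illustration, not the paper's implementation; the kernel length-scale, noise level, grid size, and function names are placeholders.

```python
import numpy as np
from math import erf

def rbf_kernel(A, B, ls=0.3):
    # Squared-exponential kernel between the rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    # Exact GP regression posterior at candidates Xs; O(N^3) in the
    # number of observations N -- the cost DNGO is designed to avoid.
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    Ks = rbf_kernel(Xs, X)
    mu = Ks @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, y_best):
    # EI for minimization: E[max(y_best - f, 0)] under the posterior.
    z = (y_best - mu) / sigma
    Phi = 0.5 * (1.0 + np.vectorize(erf)(z / np.sqrt(2.0)))
    phi = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return sigma * (z * Phi + phi)

def bayes_opt(f, bounds, n_init=3, n_iter=10, seed=0):
    # The BO loop: fit surrogate, maximize acquisition, evaluate, repeat.
    rng = np.random.default_rng(seed)
    X = rng.uniform(bounds[0], bounds[1], size=(n_init, 1))
    y = np.array([f(x[0]) for x in X])
    cand = np.linspace(bounds[0], bounds[1], 200)[:, None]
    for _ in range(n_iter):
        mu, sigma = gp_posterior(X, y, cand)
        x_next = cand[np.argmax(expected_improvement(mu, sigma, y.min()))]
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next[0]))
    return X[np.argmin(y)], y.min()
```

On a smooth one-dimensional objective such as `lambda x: (x - 0.3) ** 2`, the loop homes in on the minimizer within a handful of evaluations, which is the whole point: each evaluation of the true objective is assumed expensive, so the surrogate carries the search.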

Highlights

- Recently, the field of machine learning has seen unprecedented growth due to a new wealth of data, increases in computational power, new algorithms, and a plethora of exciting new applications
- While it may seem that we are merely moving the problem of setting the hyperparameters of the model being tuned to setting them for the tuner itself, we show that for a suitable set of design choices it is possible to create a robust, scalable, and effective Bayesian optimization system that generalizes across many global optimization problems
- We demonstrate the effectiveness of Deep Networks for Global Optimization on a number of difficult problems, including benchmark problems for Bayesian optimization, convolutional neural networks for object recognition, and multi-modal neural language models for image caption generation
- Bayesian optimization relies on the construction of a probabilistic model that defines a distribution over objective functions from the input space to the objective of interest
- We introduced deep networks for global optimization (DNGO), which enables efficient optimization of noisy, expensive black-box functions
- While this model maintains desirable properties of the Gaussian process such as tractability and principled management of uncertainty, it greatly improves scalability from cubic to linear as a function of the number of observations
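The cubic-to-linear claim can be made concrete with a sketch of the underlying idea: treat the network's final hidden layer as D basis functions and place a Bayesian linear regression over the output weights, so prediction costs O(N·D² + D³), linear in the number of observations N for fixed D, versus the O(N³) of an exact GP. The helper name, the hand-built features, and the fixed `alpha`/`beta` hyperparameters below are illustrative stand-ins (the paper learns the features with a neural network and treats such hyperparameters more carefully).

```python
import numpy as np

def blr_predict(Phi, y, Phi_star, alpha=1.0, beta=100.0):
    # Bayesian linear regression on D basis functions (in DNGO, the
    # last-hidden-layer activations of a trained neural net).
    # Cost: O(N*D^2 + D^3) -- linear in the number of observations N.
    D = Phi.shape[1]
    A = alpha * np.eye(D) + beta * Phi.T @ Phi       # weight precision, (D, D)
    m = beta * np.linalg.solve(A, Phi.T @ y)         # posterior mean weights
    mu = Phi_star @ m                                # predictive mean
    V = np.linalg.solve(A, Phi_star.T)               # A^{-1} phi*, (D, M)
    var = 1.0 / beta + np.sum(Phi_star * V.T, axis=1)  # predictive variance
    return mu, var

# Toy check with hand-built features [1, x]: recovers y = 1 + 2x.
x = np.linspace(0.0, 1.0, 50)
Phi = np.stack([np.ones_like(x), x], axis=1)
mu, var = blr_predict(Phi, 1.0 + 2.0 * x, np.array([[1.0, 0.5]]))
```

Because the expensive factorization involves only the D×D matrix `A`, adding observations only grows the O(N·D²) accumulation of `Phi.T @ Phi`, which is what makes the massively parallel evaluation described later in this summary practical.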

Methods


- To demonstrate the effectiveness of the approach, the authors compare DNGO to the scalable model-based optimization variants SMAC and TPE, as well as to the input-warped Gaussian process method of Snoek et al. (2014), on the benchmark set of continuous problems from the HPOLib package (Eggensperger et al., 2013).
- As Table 1 shows, DNGO significantly outperforms SMAC and TPE, and is competitive with the Gaussian process approach.
- This shows that, despite vast improvements in scalability, DNGO retains the statistical efficiency of the Gaussian process method in terms of the number of evaluations required to find the minimum.
- In this experiment, the authors explore the effectiveness of DNGO on a practical and expensive problem where highly parallel evaluation is necessary to make progress in a reasonable amount of time.
- We optimize the hyperparameters of the log-bilinear model (LBL) from Kiros et al. (2014) to maximize the BLEU score on a validation set.

Results

- The authors find hyperparameter settings that are competitive with the state of the art: test errors of 6.37% and 27.4% on CIFAR-10 and CIFAR-100, respectively, and BLEU scores of 25.1 and 26.7 on the Microsoft COCO 2014 dataset using a single model and a 3-model ensemble.

Conclusion

- The authors introduced deep networks for global optimization, or DNGO, which enables efficient optimization of noisy, expensive black-box functions.
- While this model maintains desirable properties of the GP such as tractability and principled management of uncertainty, it greatly improves its scalability from cubic to linear as a function of the number of observations.
- One promising line of work, exemplified by Nickson et al. (2014), is to pursue a similar methodology by instead employing a sparse Gaussian process as the underlying probabilistic model (Snelson & Ghahramani, 2005; Titsias, 2009; Hensman et al., 2013)

Summary

## Objectives:

The goal of this work is to develop a method for scaling Bayesian optimization while still maintaining its desirable flexibility and characterization of uncertainty.

- Table1: Evaluation of DNGO on global optimization benchmark problems versus scalable (TPE, SMAC) and non-scalable (Spearmint) Bayesian optimization methods. All problems are minimization problems. For each problem, each method was run 10 times to produce error bars
- Table2: Image caption generation results using BLEU-4 on the Microsoft COCO 2014 test set. Regularized and ensembled LSTM results are reported in Zaremba et al. (2015). The baseline LBL tuned by a human expert and the Soft and Hard Attention models are reported in Xu et al. (2015). We see that ensembling our top models resulting from the optimization further improves results significantly. We noticed that there were distinct multiple local optima in the hyperparameter space, which may explain the dramatic improvement from ensembling a small subset of models
- Table3: Our convolutional neural network architecture. The architecture was chosen to be maximally generic. Each convolution layer is followed by a ReLU nonlinearity
- Table4: We use our algorithm to optimize validation set error as a function of various hyperparameters of a convolutional neural network. We report the test errors of the models with the optimal hyperparameter configurations, as compared to current state-of-the-art results

Funding

- This work was supported by the Director, Office of Science, Office of Advanced Scientific Computing Research, Applied Mathematics program of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231
- This work used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231
- We would like to acknowledge the NERSC systems staff, in particular Helen He and Harvey Wasserman, for providing us with generous access to the Babbage Xeon Phi testbed. The image caption generation computations in this paper were run on the Odyssey cluster supported by the FAS Division of Science, Research Computing Group at Harvard University
- This work was partially funded by NSF IIS-1421780, the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Canadian Institute for Advanced Research (CIFAR)

References

- Bardenet, R., Brendel, M., Kegl, B., and Sebag, M. Collaborative hyperparameter tuning. In ICML, 2013.
- Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281– 305, 2012.
- Bergstra, J. S., Bardenet, R., Bengio, Y., and Kegl, B. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems. 2011.
- Bishop, C. M. Pattern Recognition and Machine Learning. Springer-Verlag New York, Inc., 2006.
- Brochu, E., Brochu, T., and de Freitas, N. A Bayesian interactive optimization approach to procedural animation design. In ACM SIGGRAPH/Eurographics Symposium on Computer Animation, 2010.
- Bull, A. D. Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, 12:2879–2904, 2011.
- Buntine, W. L. and Weigend, A. S. Bayesian back-propagation. Complex systems, 5(6):603–643, 1991.
- Calandra, R., Peters, J., Rasmussen, C. E., and Deisenroth, M. P. Manifold Gaussian processes for regression. preprint arXiv:1402.5876, 2014a.
- Calandra, R., Peters, J., Seyfarth, A., and Deisenroth, M. P. An experimental evaluation of Bayesian optimization on bipedal locomotion. In International Conference on Robotics and Automation, 2014b.
- Carvalho, C. M., Polson, N. G., and Scott, J. G. Handling sparsity via the horseshoe. In Artificial Intelligence and Statistics, 2009.
- De Freitas, J. F. Bayesian methods for neural networks. PhD thesis, Trinity College, University of Cambridge, 2003.
- de Freitas, N., Smola, A. J., and Zoghi, M. Exponential regret bounds for Gaussian process bandits with deterministic observations. In ICML, 2012.
- Djolonga, J., Krause, A., and Cevher, V. High dimensional Gaussian process bandits. In Advances in Neural Information Processing Systems, 2013.
- Eggensperger, K., Feurer, M., Hutter, F., Bergstra, J., Snoek, J., Hoos, H., and Leyton-Brown, K. Towards an empirical foundation for assessing Bayesian optimization of hyperparameters. In NIPS Workshop on Bayesian Optimization in Theory and Practice, 2013.
- Feurer, M., Springenberg, T., and Hutter, F. Initializing Bayesian hyperparameter optimization via meta-learning. In AAAI Conference on Artificial Intelligence, 2015.
- Garnett, R., Osborne, M. A., and Roberts, S. J. Bayesian optimization for sensor set selection. In International Conference on Information Processing in Sensor Networks, 2010.
- Gelbart, M. A., Snoek, J., and Adams, R. P. Bayesian optimization with unknown constraints. In Uncertainty in Artificial Intelligence, 2014.
- Ginsbourger, D. and Riche, R. L. Dealing with asynchronicity in parallel Gaussian process based global optimization. http://hal.archives-ouvertes.fr/hal-00507632, 2010.
- Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A. C., and Bengio, Y. Maxout networks. In ICML, 2013.
- Gramacy, R. B. and Lee, H. K. H. Optimization under unknown constraints, 2010. arXiv:1004.4027.
- Hensman, J., Fusi, N., and Lawrence, N. Gaussian processes for big data. In Uncertainty in Artificial Intelligence, 2013.
- Hinton, G. E. and van Camp, D. Keeping neural networks simple by minimizing the description length of the weights. In ACM Conference on Computational Learning Theory, 1993.
- Hinton, G. E. and Salakhutdinov, R. Using deep belief nets to learn covariance kernels for Gaussian processes. In Advances in neural information processing systems, pp. 1249– 1256, 2008.
- Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
- Hoffman, M., Brochu, E., and de Freitas, N. Portfolio allocation for Bayesian optimization. In Uncertainty in Artificial Intelligence, 2011.
- Hutter, F., Hoos, H. H., and Leyton-Brown, K. Sequential modelbased optimization for general algorithm configuration. In Learning and Intelligent Optimization 5, 2011.
- Jones, D. R. A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21, 2001.
- Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
- Kiros, R., Salakhutdinov, R., and Zemel, R. S. Multimodal neural language models. In ICML, 2014.
- Krause, A. and Ong, C. S. Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems, 2011.
- Kushner, H. J. A new method for locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering, 86, 1964.
- Lazaro-Gredilla, M. and Figueiras-Vidal, A. R. Marginalized neural network mixtures for large-scale regression. Neural Networks, IEEE Transactions on, 21(8):1345–1351, 2010.
- Lee, C.-Y., Xie, S., Gallagher, P., Zhang, Z., and Tu, Z. Deeply supervised nets. In Deep Learning and Representation Learning Workshop, NIPS, 2014.
- Lin, M., Chen, Q., and Yan, S. Network in network. CoRR, abs/1312.4400, 2013. URL http://arxiv.org/abs/1312.4400.
- Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollar, P., and Zitnick, C. L. Microsoft COCO: Common objects in context. In ECCV 2014, pp. 740–755.
- Lizotte, D. Practical Bayesian Optimization. PhD thesis, University of Alberta, Edmonton, Alberta, 2008.
- MacKay, D. J. A practical Bayesian framework for backpropagation networks. Neural computation, 4(3):448–472, 1992.
- Mahendran, N., Wang, Z., Hamze, F., and de Freitas, N. Adaptive MCMC with Bayesian optimization. In Artificial Intelligence and Statistics, 2012.
- Mnih, A. and Gregor, K. Neural variational inference and learning in belief networks. In ICML, 2014.
- Mockus, J., Tiesis, V., and Zilinskas, A. The application of Bayesian methods for seeking the extremum. Towards Global Optimization, 2, 1978.
- Neal, R. Slice sampling. Annals of Statistics, 31:705–767, 2003.
- Neal, R. M. Bayesian learning for neural networks. PhD thesis, University of Toronto, 1995.
- Nickson, T., Osborne, M. A., Reece, S., and Roberts, S. Automated machine learning using stochastic algorithm tuning. NIPS Workshop on Bayesian Optimization, 2014.
- Osborne, M. A., Garnett, R., and Roberts, S. J. Gaussian processes for global optimization. In Learning and Intelligent Optimization, 2009.
- Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and variational inference in deep latent Gaussian models. In ICML, 2014.
- Snelson, E. and Ghahramani, Z. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pp. 1257–1264, 2005.
- Snoek, J. Bayesian Optimization and Semiparametric Models with Applications to Assistive Technology. PhD thesis, University of Toronto, Toronto, Canada, 2013.
- Snoek, J., Larochelle, H., and Adams, R. P. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, 2012.
- Snoek, J., Swersky, K., Zemel, R. S., and Adams, R. P. Input warping for Bayesian optimization of non-stationary functions. In ICML, 2014.
- Springenberg, J. T., Dosovitskiy, A., Brox, T., and Riedmiller, M. A. Striving for simplicity: The all convolutional net. CoRR, abs/1412.6806, 2014. URL http://arxiv.org/abs/1412.6806.
- Srinivas, N., Krause, A., Kakade, S., and Seeger, M. Gaussian process optimization in the bandit setting: no regret and experimental design. In ICML, 2010.
- Swersky, K., Snoek, J., and Adams, R. P. Multi-task Bayesian optimization. In Advances in Neural Information Processing Systems, 2013.
- Titsias, M. K. Variational learning of inducing variables in sparse Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pp. 567–574, 2009.
- Wan, L., Zeiler, M. D., Zhang, S., LeCun, Y., and Fergus, R. Regularization of neural networks using dropconnect. In ICML, 2013.
- Wang, Z., Zoghi, M., Hutter, F., Matheson, D., and de Freitas, N. Bayesian optimization in high dimensions via random embeddings. In IJCAI, 2013.
- Williams, C. K. I. Computing with infinite networks. In Advances in Neural Information Processing Systems, 1996.
- Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., and Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044v2, 2015.
- Zaremba, W., Sutskever, I., and Vinyals, O. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2015.
