# Unbounded Bayesian Optimization via Regularization

JMLR Workshop and Conference Proceedings, pp. 1168-1176, 2016.

Keywords:

Markov chain Monte Carlo, hyperparameter tuning, objective function, Bayesian optimization framework, efficient global optimization

Abstract:

Bayesian optimization has recently emerged as a powerful and flexible tool in machine learning for hyperparameter tuning and more generally for the efficient global optimization of expensive black box functions. The established practice requires a user-defined bounded domain, which is assumed to contain the global optimizer. However, when...

Introduction

- Since the technique was introduced over 50 years ago, Bayesian optimization has been applied to optimize black box objective functions in many different application domains.
- The current state of the art requires the user to prescribe a bounded domain within which to search for the optimum.
- Setting these bounds, often done arbitrarily, is one of the main difficulties hindering the broader use of Bayesian optimization as a standard framework for hyperparameter tuning.
- This obstacle was raised at the NIPS 2014 Workshop on Bayesian optimization as one of the open challenges in the field.
- The second of the proposed methods is a regularization method that is practical and easy to implement in any existing Bayesian optimization toolbox based on Gaussian process priors over objective functions.

Highlights

- Since the technique was introduced over 50 years ago, Bayesian optimization has been applied to optimize black box objective functions in many different application domains.
- RS: As an additional benchmark on the neural network tuning tasks, we considered a random selection strategy that sampled uniformly within the user-defined bounding box.
- When compared to the standard EI, EI-H boasts over 20% improvement in accuracy on the multi-layered perceptron and almost 10% on the convolutional neural network.
- We propose a versatile new approach to Bayesian optimization which is not limited to a search within a bounding box.
- Our method fits seamlessly within the current Bayesian optimization framework, and can be readily used with any acquisition function which is induced by a Gaussian process.
- We emphasize that in this work we have addressed one of the challenges that must be overcome toward the development of a practical Bayesian optimization tool for hyper-parameter tuning and efficient global optimization in general
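The random selection (RS) benchmark described above can be illustrated with a minimal sketch. The function names, the seed handling, and the maximization convention are assumptions made for illustration; they are not taken from the paper's code.

```python
import random


def sample_uniform_in_box(lower, upper, rng=random):
    """Draw one point uniformly at random from the axis-aligned box [lower, upper]."""
    if len(lower) != len(upper):
        raise ValueError("lower and upper must have the same length")
    return [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]


def random_selection(objective, lower, upper, n_evals, seed=0):
    """Evaluate `objective` at n_evals uniform samples; return the best (x, value).

    Assumes the objective is to be maximized (hypothetical convention).
    """
    rng = random.Random(seed)
    best_x, best_y = None, float("-inf")
    for _ in range(n_evals):
        x = sample_uniform_in_box(lower, upper, rng)
        y = objective(x)
        if y > best_y:
            best_x, best_y = x, y
    return best_x, best_y
```

Because every sample stays inside the user-defined box, this strategy can never find an optimum that lies outside it, which is what makes it a useful reference point for the unbounded methods.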

Methods

- The authors evaluate the proposed methods and show that they achieve the desirable behaviour on two synthetic benchmarking functions, and a simple task of tuning the stochastic gradient descent and regularization parameters used in training a multi-layered perceptron (MLP) and a convolutional neural network (CNN) on the MNIST dataset.

Experimental protocol.
- Experiments were repeated to report and compare the mean and standard error of the algorithms: the synthetic experiments were repeated 40 times, while the MNIST experiments were repeated 25 and 20 times for the MLP and the CNN, respectively.
- All algorithms were implemented in the pybo framework, available on GitHub, and are labelled in the figures as follows. EI: vanilla expected improvement with hyperparameter marginalization via MCMC.
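For readers unfamiliar with the EI label, the following is a minimal sketch of the standard closed-form expected improvement under a Gaussian posterior, together with a simple average over hyperparameter samples standing in for MCMC marginalization. It does not use the pybo API; the function names and the maximization convention are assumptions for illustration.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu, sigma, best):
    """EI (maximization) at a point with posterior mean mu and std sigma.

    `best` is the incumbent: the largest objective value observed so far.
    """
    if sigma <= 0.0:
        # No posterior uncertainty: improvement is deterministic.
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    return (mu - best) * normal_cdf(z) + sigma * normal_pdf(z)


def marginalized_ei(posteriors, best):
    """Average EI over (mu, sigma) pairs, one pair per hyperparameter sample."""
    return sum(expected_improvement(m, s, best) for m, s in posteriors) / len(posteriors)
```

In practice each `(mu, sigma)` pair would come from a Gaussian process conditioned on one MCMC sample of the kernel hyperparameters; here they are simply passed in as numbers.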

Results

- The Hartmann tests show that the volume doubling heuristic is a good baseline method, and the plateaus suggest that this method warrants further study of scheduling strategies, perhaps adaptive ones.
- Although volume doubling is less effective than EI-H as the dimensionality increases, it is an improvement over standard EI in all cases.
- When compared to the standard EI, EI-H boasts over 20% improvement in accuracy on the MLP and almost 10% on the CNN.
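The volume doubling heuristic mentioned above can be sketched as a single box-expansion step. This is a minimal illustration under stated assumptions: the box is expanded isotropically about its center, and how often the step is applied (the schedule) is left to the caller, since the summary does not specify either detail.

```python
def box_volume(lower, upper):
    """Volume of the axis-aligned box [lower, upper]."""
    volume = 1.0
    for lo, hi in zip(lower, upper):
        volume *= hi - lo
    return volume


def double_volume(lower, upper):
    """Return a box with the same center and twice the volume.

    Each side is scaled by 2 ** (1 / d), so the product of the d side
    lengths (the volume) is multiplied by exactly 2.
    """
    d = len(lower)
    scale = 2.0 ** (1.0 / d)
    new_lower, new_upper = [], []
    for lo, hi in zip(lower, upper):
        center = 0.5 * (lo + hi)
        half_width = 0.5 * (hi - lo) * scale
        new_lower.append(center - half_width)
        new_upper.append(center + half_width)
    return new_lower, new_upper
```

Note that the per-side growth factor shrinks toward 1 as the dimension d increases, which is one plausible reason a fixed doubling schedule may expand slowly in higher dimensions.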

Conclusion

- The authors propose a versatile new approach to Bayesian optimization which is not limited to a search within a bounding box.
- Given an initial bounding box that does not include the optimum, the authors have demonstrated that the approach can expand its region of interest and achieve greater function values.
- The authors emphasize that this work addresses one of the challenges that must be overcome toward the development of a practical Bayesian optimization tool for hyper-parameter tuning and efficient global optimization in general.


Related work

- Although the notion of using a non-trivial Gaussian process prior mean is not new, it is usually expected to encode domain expert knowledge or known structure in the response surface. To the best of the authors’ knowledge, only one recent work has considered using the prior mean as a regularization term and it was primarily to avoid selecting points along boundaries and in corners of the bounding box [Snoek et al., 2015].

In this work we demonstrate that a regularizing prior mean can be used to carry out Bayesian optimization without a rigid bounded domain. We compare this regularized approach to a volume doubling baseline. While the regularized algorithms exhibit a much more homogeneous search behaviour (i.e. boundaries and corners are not disproportionately favoured), the volume doubling baseline performs very well in practice.
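To make the idea of a regularizing prior mean concrete, here is a minimal sketch of two penalty shapes that could serve as a nonstationary prior mean when maximizing. The functional forms, parameter names (`center`, `widths`, `radius`, `beta`), and sign convention are assumptions for illustration only; this summary does not state the exact forms used in the paper.

```python
import math


def quadratic_prior_mean(x, center, widths):
    """Hypothetical quadratic regularizer: 0 at `center`, decreasing with distance.

    `widths` sets a per-dimension length scale for the penalty.
    """
    return -sum(((xi - ci) / wi) ** 2 for xi, ci, wi in zip(x, center, widths))


def hinge_quadratic_prior_mean(x, center, radius, beta=1.0):
    """Hypothetical hinge variant: flat (zero) within `radius` of `center`,
    quadratic penalty on the excess distance outside that ball."""
    dist = math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, center)))
    if dist <= radius:
        return 0.0
    return -(((dist - radius) / (beta * radius)) ** 2)
```

Either function would be added to the Gaussian process as its prior mean, so that points far from the region of interest look less promising a priori without being excluded by a hard bound. The hinge form leaves the model unchanged inside the initial region, which is a design choice rather than a requirement.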

We begin with a brief review of Bayesian optimization with Gaussian processes in the next section, followed by an introduction to regularization via nonstationary prior means in Section 3, including visualizations that show that our proposed approach indeed ventures out of the initial user-defined bounding box. Section 4 reports our results on two synthetic benchmarking problems as well as two real hyperparameter tuning tasks, namely tuning the stochastic gradient descent optimizer of two neural network architectures on the MNIST handwritten digit recognition task.

Funding

- Introduces a new alternative method and compares it to a volume doubling baseline on two common synthetic benchmarking test functions.
- Demonstrates that a regularizing prior mean can be used to carry out Bayesian optimization without a rigid bounded domain.
- Describes the probabilistic model used to represent the prior belief about the unknown objective f, and how to update this belief given observations D_n.

Reference

- S. Agrawal and N. Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, 2013.
- J. Bergstra and Y. Bengio. Random search for hyperparameter optimization. Journal of Machine Learning Research, 13:281–305, 2012.
- J. Bergstra, R. Bardenet, Y. Bengio, and B. Kegl. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, pages 2546–2554, 2011.
- A. D. Bull. Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, 12:2879–2904, 2011.
- O. Chapelle and L. Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pages 2249–2257, 2011.
- N. Hansen and A. Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evol. Comput., 9(2):159–195, 2001.
- P. Hennig and C. J. Schuler. Entropy search for information-efficient global optimization. The Journal of Machine Learning Research, pages 1809–1837, 2012.
- J. M. Hernandez-Lobato, M. W. Hoffman, and Z. Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems. 2014.
- M. W. Hoffman, B. Shahriari, and N. de Freitas. On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning. In AI and Statistics, pages 365–374, 2014.
- D. R. Jones. A taxonomy of global optimization methods based on response surfaces. J. of Global Optimization, 21(4):345–383, 2001.
- D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. J. of Global Optimization, 13(4):455–492, 1998.
- E. Kaufmann, N. Korda, and R. Munos. Thompson sampling: An asymptotically optimal finite-time analysis. In Algorithmic Learning Theory, volume 7568 of Lecture Notes in Computer Science, pages 199–213. Springer Berlin Heidelberg, 2012.
- H. J. Kushner. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Fluids Engineering, 86 (1):97–106, 1964.
- N. Mahendran, Z. Wang, F. Hamze, and N. de Freitas. Adaptive MCMC with Bayesian optimization. Journal of Machine Learning Research - Proceedings Track, 22:751–760, 2012.
- C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.
- S. L. Scott. A modern Bayesian look at the multiarmed bandit. Applied Stochastic Models in Business and Industry, 26(6):639–658, 2010.
- B. Shahriari, Z. Wang, M. W. Hoffman, A. Bouchard-Côté, and N. de Freitas. An entropy search portfolio. In NIPS workshop on Bayesian Optimization, 2014.
- B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.
- J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959, 2012.
- J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. Patwary, M. Ali, R. P. Adams, et al. Scalable Bayesian optimization using deep neural networks. International Conference on Machine Learning, 2015.
- N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning, pages 1015–1022, 2010.
- K. Swersky, J. Snoek, and R. P. Adams. Multi-task Bayesian optimization. In Advances in Neural Information Processing Systems, pages 2004–2012, 2013.
- W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
- J. Villemonteix, E. Vazquez, and E. Walter. An informational approach to the global optimization of expensive-to-evaluate functions. J. of Global Optimization, 44(4):509–534, 2009.
