# Scalable gradient-based tuning of continuous regularization hyperparameters

International Conference on Machine Learning, pp. 2952-2960, 2016.

Keywords:

gradient-based hyperparameter selection, neural network modeling, hyperparameter selection, training run, random search

Abstract:

Hyperparameter selection generally relies on running multiple full training trials, with selection based on validation set performance. We propose a gradient-based approach for locally adjusting hyperparameters during training of the model. Hyperparameters are adjusted so as to make the model parameter gradients, and hence updates, more...

Introduction

- Specifying and training artificial neural networks requires several design choices that are often not trivial to make.
- There are a number of automated methods (Bergstra et al, 2011; Snoek et al, 2012), all of which rely on multiple complete training runs with varied but fixed hyperparameters, with hyperparameter selection based on validation set performance.
- Although effective, these methods are expensive, as the user needs to run multiple full training runs.
- In many practical applications such an approach is too tedious and time-consuming. It would be useful to have a method that could automatically find acceptable hyperparameter values in one training run, even if the user did not have a strong intuition about good values to try.

Highlights

- Specifying and training artificial neural networks requires several design choices that are often not trivial to make.
- There are a number of automated methods (Bergstra et al, 2011; Snoek et al, 2012), all of which rely on multiple complete training runs with varied but fixed hyperparameters, with hyperparameter selection based on validation set performance.
- These methods are expensive, as the user needs to run multiple full training runs.
- Although the proposed method could work in principle for any continuous hyperparameter, we have focused on tuning regularization hyperparameters.
- In the hyperparameter trajectory plots, the initial regularization hyperparameter value is denoted with a star, while the final value is marked with a square.
- We experimented with tuning regularization hyperparameters when training different model structures on the MNIST and SVHN datasets. The T1 − T2 model consistently managed to improve on the initial levels of additive noise and L2 weight penalty.

Methods

- The authors propose a method, T1 − T2, for tuning continuous hyperparameters of a model using the gradient of the performance of the model on a separate validation set T2.
- When training a neural network model, the authors minimize an objective function that depends on the training set, the model weights, and the hyperparameters that determine the strength of possible regularization terms.
- In this objective, C1(·) and Ω(·) are the cost and regularization penalty terms, T1 is the training data set, θ = {Wl, bl} is the set of elementary parameters, comprising the weights and biases of each layer, λ denotes the hyperparameters that determine the strength of regularization, and η1 is a learning rate.
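
The alternating scheme described above can be sketched on a toy problem. The following is a minimal illustration, not the authors' implementation: it assumes a one-parameter linear model with squared-error cost, a single L2 penalty λ, plain gradient descent, and a hyperparameter gradient taken through a single parameter update. The names `t1_t2_train`, `eta1`, and `eta2`, the toy data, and the step sizes are all hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def t1_t2_train(t1, t2, lam=1.0, eta1=0.05, eta2=0.5, steps=200):
    """Toy T1-T2 loop for the model y = w * x with penalty (lam / 2) * w**2.

    Assumption: the hyperparameter gradient is taken through a single
    parameter update (one-step look-ahead), as a simplification.
    """
    w = 0.0
    for _ in range(steps):
        # Step on T1: gradient of the regularized training cost w.r.t. w.
        grad_w = mean([(w * x - y) * x for x, y in t1]) + lam * w
        w_new = w - eta1 * grad_w

        # Sensitivity of the updated weight to lam:
        # d(w_new)/d(lam) = -eta1 * d(grad_w)/d(lam) = -eta1 * w.
        dw_new_dlam = -eta1 * w

        # Step on T2: gradient of the unregularized validation cost at w_new,
        # chained through the update to obtain a gradient w.r.t. lam.
        grad_val = mean([(w_new * x - y) * x for x, y in t2])
        grad_lam = grad_val * dw_new_dlam

        lam = max(0.0, lam - eta2 * grad_lam)  # keep the penalty non-negative
        w = w_new
    return w, lam


def validation_cost(w, t2):
    return mean([0.5 * (w * x - y) ** 2 for x, y in t2])


# Hypothetical noise-free data generated from y = 2 * x.
T1 = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
T2 = [(1.5, 3.0), (2.5, 5.0)]

w_final, lam_final = t1_t2_train(T1, T2)
```

Starting from a deliberately large penalty (`lam=1.0`), the validation gradient pushes λ toward zero in this example, since the toy data contain no noise for the penalty to suppress.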

Results

- Figure 1 illustrates the resulting hyperparameter changes during T1 − T2 training.
- To see how the T1−T2 method behaves, the authors visualized trajectories of hyperparameter values during training in the hyperparameter cost space.
- For each point in the two-dimensional hyperparameter space, the authors compute the corresponding test cost without T1 − T2.
- The background of the figures corresponds to grid search on the two-dimensional hyperparameter interval.
- As can be seen from the figure, all runs converge to a reasonable set of hyperparameters irrespective of the starting value, gradually moving to a point of lower log-likelihood.
- Note that because the optimal learning rate for each hyperparameter direction is unknown, hyperparameters will change the most along directions with either a larger local gradient or a higher relative learning rate.
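
The grid-search background described above (one run with fixed hyperparameters per grid point, no T1 − T2) can be illustrated with a small sketch. It is a toy under stated assumptions, not the authors' experiment: a one-parameter linear model, an L2 penalty `lam`, and additive Gaussian input noise with standard deviation `sigma`, whose expected effect on a linear least-squares cost is an extra penalty of `sigma**2 * w**2 / 2`. The data, grid values, and function names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def fit_weight(train, lam, sigma):
    """Closed-form minimizer of the expected regularized cost for y = w * x.

    For this linear model, additive Gaussian input noise with standard
    deviation sigma adds sigma**2 * w**2 / 2 to the expected squared error,
    so it acts like an extra L2 penalty.
    """
    xy = mean([x * y for x, y in train])
    xx = mean([x * x for x, _ in train])
    return xy / (xx + lam + sigma ** 2)


def held_out_cost(w, data):
    return mean([0.5 * (w * x - y) ** 2 for x, y in data])


# Hypothetical data: the training targets are offset, the test targets are not.
TRAIN = [(1.0, 2.5), (2.0, 4.5), (3.0, 6.5), (4.0, 8.5)]
TEST = [(1.5, 3.0), (2.5, 5.0)]

LAMBDAS = [0.0, 0.25, 0.5, 0.75, 1.0]  # L2 penalty values
SIGMAS = [0.0, 0.25, 0.5, 0.75]        # input-noise standard deviations

# One complete fit with fixed hyperparameters per grid point.
grid = {
    (lam, sigma): held_out_cost(fit_weight(TRAIN, lam, sigma), TEST)
    for lam in LAMBDAS
    for sigma in SIGMAS
}
best_lam, best_sigma = min(grid, key=grid.get)
```

Each entry of `grid` corresponds to one point of the background surface. The cost of this approach grows with the product of the grid sizes, which is the expense that tuning within a single run aims to avoid.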

Conclusion

- The authors have proposed a method called T1 − T2 for gradient-based automatic tuning of continuous hyperparameters during training, based on the performance of the model on a separate validation set.
- While the T1 − T2 method is helpful for minimizing the objective function on the validation set, as illustrated in Figure 7, a set of hyperparameters that minimizes a continuous objective such as cross-entropy might not be optimal for the classification error.
- It may be worthwhile to try objective functions that approximate the classification error better, and to try the method on unsupervised objectives.
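
The gap between cross-entropy and classification error noted above can be seen in a small numeric example. The probabilities below are made up for illustration, and the binary decision threshold of 0.5 is an assumption.

```python
import math


def cross_entropy(probs_true_class):
    """Mean negative log-probability assigned to the correct class."""
    return -sum(math.log(p) for p in probs_true_class) / len(probs_true_class)


def error_rate(probs_true_class):
    """Fraction of binary predictions where the correct class gets p <= 0.5."""
    wrong = sum(1 for p in probs_true_class if p <= 0.5)
    return wrong / len(probs_true_class)


# Hypothetical probabilities assigned to the correct class on three samples.
cautious = [0.51, 0.51, 0.51]   # barely right every time
confident = [0.99, 0.99, 0.45]  # very sure twice, wrong once

ce_cautious, err_cautious = cross_entropy(cautious), error_rate(cautious)
ce_confident, err_confident = cross_entropy(confident), error_rate(confident)
```

Here the `confident` predictions have the lower cross-entropy but the higher error rate, so a procedure that selects hyperparameters by cross-entropy alone would prefer them.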

Funding

- Jelena Luketina and Tapani Raiko were funded by the Academy of Finland
- Mathias Berglund was funded by the HICT doctoral education network

Study subjects and analysis

For MNIST we tried various network sizes, from a shallow 1000 × 1000 × 1000 network to a deep 4000 × 2000 × 1000 × 500 × 250 network. The training set T1 had 55 000 samples, and the validation set T2 had 5 000 samples. The split between T1 and T2 was made using a different random seed in each of the experiments to avoid overfitting to a particular subset of the training set.
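
A seeded split of this kind might look like the sketch below. It only illustrates the idea of drawing a different T1/T2 partition per experiment; the function name `split_indices` and the seed values are hypothetical, and the total of 60 000 is simply the sum of the two sizes quoted above.

```python
import random


def split_indices(n_total, n_validation, seed):
    """Shuffle sample indices with a given seed and split them into T1 and T2."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    return indices[n_validation:], indices[:n_validation]


# Sizes quoted above: 55 000 training and 5 000 validation samples.
t1_a, t2_a = split_indices(60000, 5000, seed=0)
t1_b, t2_b = split_indices(60000, 5000, seed=1)
```

Using a private `random.Random(seed)` instance keeps each split reproducible without touching the global random state.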

For SVHN, global contrast normalization was used as the only preprocessing step. Out of 73 257 training samples, we picked a random 65 000 samples for T1 and the remaining 8 257 samples for T2. None of the SVHN experiments used tied hyperparameters, i.e. each layer was parametrized with a separate hyperparameter, which was tuned independently.

To test on CIFAR-10 with convolutional networks, we used 45 000 samples for T1 and 5 000 samples for T2. The data was preprocessed using global contrast normalization and whitening.

- Test error after one run with T1 − T2 compared to a rerun in which the final hyperparameter values from T1 − T2 training are used as fixed hyperparameters for a new run (left: MNIST, middle: SVHN, right: CIFAR-10). The correlation indicates that T1 − T2 is also useful for finding approximate hyperparameters for training without an adaptive hyperparameter method.
- Classification error on the validation set versus the test set at the end of T1 − T2 training for MNIST (left), SVHN (middle), and CIFAR-10 (right). For MNIST there is no apparent structure, but all the results lie in a region of low error. The results for the other two datasets correlate strongly, suggesting that validation set performance is still indicative of test set performance.
- ... right) shows the validation error compared to the final test error of a model trained with T1 − T2. We do not observe overfitting, with validation performance being strongly indicative of the test set performance. For MNIST all the results cluster tightly in the region of low error, hence there is no apparent structure. It should be noted, though, that these experiments had at most 20 hyperparameters, making overfitting to the validation set unlikely.
- Grid search results on a pair of hyperparameters (no tuning with T1 − T2). Figures on the right represent the test error at the end of training as a function of hyperparameters. Figures on the left represent the test log-likelihood at the end of training as a function of hyperparameters. We can see that the set of hyperparameters minimizing test log-likelihood is different from the set of hyperparameters minimizing test classification error.

References

- Bengio, Y. (2000). Gradient-based optimization of hyperparameters. Neural computation, 12(8), 1889–1900.
- Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. J. Mach. Learn. Res., 13, 281–305.
- Bergstra, J. S., Bardenet, R., Bengio, Y., and Kégl, B. (2011). Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems 24, pages 2546–2554.
- Chen, D. and Hagan, M. T. (1999). Optimal use of regularization and cross-validation in neural network modeling. In International Joint Conference on Neural Networks, pages 1275–1289.
- Dahl, G. E., Sainath, T. N., and Hinton, G. E. (2013). Improving deep neural networks for LVCSR using rectified linear units and dropout. In ICASSP, pages 8609–8613.
- Desjardins, G., Simonyan, K., Pascanu, R., and Kavukcuoglu, K. (2015). Natural neural networks. In Advances in Neural Information Processing Systems.
- Foo, C.-s., Do, C. B., and Ng, A. (2008). Efficient multiple hyperparameter learning for log-linear models. In Advances in neural information processing systems (NIPS), pages 377–384.
- Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
- Kingma, D. and Ba, J. (2015). Adam: A method for stochastic optimization. In the International Conference on Learning Representations (ICLR), San Diego. arXiv:1412.6980.
- Krizhevsky, A. (2009). Learning multiple layers of features from tiny images.
- Larsen, J., Svarer, C., Andersen, L. N., and Hansen, L. K. (1998). Adaptive regularization in neural network modeling. In Neural Networks: Tricks of the Trade, pages 113–132. Springer.
- LeCun, Y., Cortes, C., and Burges, C. J. (1998). The MNIST database of handwritten digits.
- Maclaurin, D., Duvenaud, D., and Adams, R. P. (2015). Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning.
- Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. (2011). Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 4.
- Pearlmutter, B. A. (1994). Fast Exact Multiplication by the Hessian. Neural Computation, pages 147–160.
- Raiko, T., Valpola, H., and LeCun, Y. (2012). Deep learning made easier by linear transformations in perceptrons. In International Conference on Artificial Intelligence and Statistics, pages 924–932.
- Rasmus, A., Valpola, H., Honkala, M., Berglund, M., and Raiko, T. (2015). Semi-supervised learning with ladder network. Neural Information Processing Systems.
- Schraudolph, N. N. (2002). Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 14(7), 1723–1738.
- Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian Optimization of Machine Learning Algorithms. ArXiv e-prints.
- Springenberg, J. T., Dosovitskiy, A., Brox, T., and Riedmiller, M. (2014). Striving for Simplicity: The All Convolutional Net.
- Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), 1929–1958.
- Team, T. T. D. (2016). Theano: A Python framework for fast computation of mathematical expressions.
- Vatanen, T., Raiko, T., Valpola, H., and LeCun, Y. (2013). Pushing stochastic gradient towards second-order methods–backpropagation learning with transformations in nonlinearities. In Neural Information Processing, pages 442–449. Springer.
- Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research.
- Wang, S. I. and Manning, C. D. (2013). Fast dropout training. In Proceedings of the 30th International Conference on Machine Learning (ICML).
- Xu, B., Wang, N., and Li, M. (2015). Empirical evaluation of rectified activations in convolutional network. CoRR.
