# Understanding and Simplifying One-Shot Architecture Search

ICML, pp. 549-558, 2018.

Keywords:

search space, efficient architecture search, neural architecture search, search method, weight sharing

Abstract:

There is growing interest in automating neural network architecture design. Existing architecture search methods can be computationally expensive, requiring thousands of different architectures to be trained from scratch. Recent work has explored weight sharing across models to amortize the cost of training. Although previous methods redu…

Introduction

- Designing neural networks is a labor-intensive process that requires a large amount of trial and error by experts.
- Zoph et al. (2017) show that one can find an architecture that simultaneously achieves state-of-the-art performance on the CIFAR-10, ImageNet, and COCO datasets.
- These search methods are incredibly resource-hungry.
- Zoph et al. (2017) used 450 GPUs for four days in order to run a single experiment.
- They proposed an RL-based approach in which a neural network controller enumerates a set of architectures to evaluate.
- The weights of the controller are subsequently updated based on the validation accuracies of the trained models.
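The loop described above can be sketched in miniature. In this sketch everything is hypothetical: a toy reward function stands in for a trained child model's validation accuracy, and independent per-slot softmax logits stand in for the RNN controller; only the REINFORCE-style update mirrors the method the bullets describe.

```python
import math
import random

random.seed(0)

# Toy search space: at each of 3 slots, pick one of 3 candidate operations.
NUM_SLOTS, NUM_OPS = 3, 3

# Hypothetical reward standing in for a trained child model's validation
# accuracy: pretend operation 2 is the best choice in every slot.
def reward(arch):
    return sum(1.0 for op in arch if op == 2) / NUM_SLOTS

# Stand-in for the RNN controller: independent softmax logits per slot.
logits = [[0.0] * NUM_OPS for _ in range(NUM_SLOTS)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def sample_architecture():
    arch = []
    for slot in range(NUM_SLOTS):
        probs = softmax(logits[slot])
        r, acc = random.random(), 0.0
        for op, p in enumerate(probs):
            acc += p
            if r <= acc:
                arch.append(op)
                break
        else:
            arch.append(NUM_OPS - 1)
    return arch

LR, BASELINE = 0.3, 0.5
for _ in range(500):
    arch = sample_architecture()
    advantage = reward(arch) - BASELINE  # "validation accuracy" minus baseline
    for slot, op in enumerate(arch):
        probs = softmax(logits[slot])
        for o in range(NUM_OPS):
            # REINFORCE: gradient of the log-probability of the sampled op.
            grad = (1.0 if o == op else 0.0) - probs[o]
            logits[slot][o] += LR * advantage * grad

# The controller's greedy architecture after training; it should drift
# toward operation 2 in each slot as the sampled rewards accumulate.
best = [row.index(max(row)) for row in logits]
```

In the real method each sampled architecture is trained before its validation accuracy is measured, which is exactly the cost that the one-shot approach amortizes.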

Highlights

- Designing neural networks is a labor-intensive process that requires a large amount of trial and error by experts
- A given convolutional layer might only be used for a subset of the architectures in the search space
- Comparing our sample of 10 top architectures (“One-Shot Top”) against 10 randomly sampled architectures, we find that the top architectures have better accuracies by around 0.5% absolute.
- We analyzed a class of efficient architecture search methods based on weight sharing in a model containing the entire search space of architectures
- The baseline All On trains the model with all paths turned on. This approach gets 0.1% higher accuracies, with the best models obtaining 96.5%.
- We explained how the fixed set of weights in a one-shot model can be used to predict the performance of stand-alone architectures, demonstrating that one-shot architecture search only needs gradient descent, not reinforcement learning or hypernetworks, to work well.

Methods

- L2 regularization is applied only to parts of the model that are used by the current architecture.
- Without this change, layers that are dropped out frequently are regularized more.
- While the accuracies of the best models decrease by only 5–10 percentage points when the authors switch from stand-alone to one-shot training, the accuracies of less promising architectures drop by as much as 60 percentage points.
- Why should the spread of one-shot model accuracies be so much larger than the spread of stand-alone model accuracies?
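The first two bullets can be sketched as follows. The path names, weight values, and coefficient below are hypothetical; the point is that the L2 penalty is computed only over the paths the sampled architecture actually uses, so a frequently dropped path is not decayed on steps where it receives no loss gradient.

```python
# Hypothetical one-shot model: each candidate path owns its own weights;
# an architecture is the subset of paths switched on for a training step.
weights = {
    "conv3x3": [0.5, -1.0, 2.0],
    "conv5x5": [1.5, 0.5],
    "maxpool": [],  # parameter-free op
}

def l2_penalty(active_paths, coeff=1e-2):
    """L2 penalty over only the paths used by the sampled architecture.

    If the penalty covered every path on every step, a path that is
    dropped often would be decayed on many steps where it receives no
    loss gradient, i.e. it would be regularized more than active paths.
    """
    total = 0.0
    for name in active_paths:
        total += sum(w * w for w in weights[name])
    return coeff * total

penalty = l2_penalty(["conv3x3", "maxpool"])  # only these paths are decayed
```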

Results

- The best models get up to 96.5% accuracy with around 41M parameters.
- The baseline All On trains the model with all paths turned on.
- This approach gets 0.1% higher accuracies, with the best models obtaining 96.5%.

Conclusion

- The authors analyzed a class of efficient architecture search methods based on weight sharing in a model containing the entire search space of architectures.
- The authors designed a training method and search space to address the fundamental challenges in making these methods work.
- Through this simplified lens, the authors explained how the fixed set of weights in a one-shot model can be used to predict the performance of stand-alone architectures, demonstrating that one-shot architecture search only needs gradient descent, not reinforcement learning or hypernetworks, to work well.
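The search procedure this implies can be sketched as follows. Everything here is a hypothetical stand-in: a lookup table of per-path scores plays the role of evaluating a candidate with the single fixed set of shared weights, so no per-candidate training occurs; only the top-ranked candidates would be retrained stand-alone.

```python
import random

random.seed(0)

# Hypothetical per-path "usefulness" scores standing in for the fixed
# one-shot weights; evaluating a candidate costs only a forward pass.
shared_scores = {"conv3x3": 0.9, "conv5x5": 0.6, "identity": 0.3, "maxpool": 0.8}
paths = sorted(shared_scores)

def one_shot_accuracy(arch):
    # Proxy score of an architecture under the fixed shared weights.
    return sum(shared_scores[p] for p in arch) / len(arch)

# Sample distinct candidate architectures (non-empty subsets of paths).
candidates = set()
while len(candidates) < 10:
    k = random.randint(1, len(paths))
    candidates.add(tuple(sorted(random.sample(paths, k))))

# Rank by one-shot score; only the top few are retrained stand-alone.
ranked = sorted(candidates, key=one_shot_accuracy, reverse=True)
top_architectures = ranked[:3]
```

The useful property the paper argues for is that this cheap ranking correlates with stand-alone accuracy well enough to pick the top candidates.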


- Table 1: Architecture search on CIFAR-10. We evaluate ten models and report the mean x and standard deviation y as x ± y.
- Table 2: Architecture search results on ImageNet.

Related work

- The use of meta-learning to improve machine learning has a long history (Schmidhuber, 1987; Hochreiter et al, 2001; Thrun & Pratt, 2012). Beyond architecture search, meta-learning has been used to optimize other components of learning algorithms such as update rules (Andrychowicz et al, 2016; Wichrowska et al, 2017; Bello et al, 2017) and activation functions (Ramachandran et al, 2017).

Our work is most closely related to SMASH (Brock et al, 2017), which in turn is motivated by NAS (Zoph & Le, 2016). In NAS, a neural network controller is used to search for good architectures. The training of the NAS controller requires a loop: The controller proposes child model architectures, which are trained and evaluated. The controller is then updated by policy gradient (Williams, 1992) to sample better architectures over time. Once the controller is done training, the best architectures are selected and trained longer to improve their accuracies. The main bottleneck of NAS is the training of the child model architectures; SMASH aims to amortize this cost. In SMASH, a hypernetwork is trained a priori to generate suitable weights for every child model architecture in the search space. The same fixed hypernetwork is then used to evaluate many different child model architectures.
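A minimal sketch of the SMASH idea described above, with everything hypothetical: a single random linear map stands in for the real hypernetwork (which is trained a priori), and a short binary vector stands in for the architecture encoding. The mechanism shown is only that one fixed network emits weights for many different child architectures.

```python
import random

random.seed(0)

# Hypothetical dimensions: a 4-bit architecture encoding and a 3-weight
# child model; a single linear map stands in for the real hypernetwork.
ENC_DIM, W_DIM = 4, 3

# Hypernetwork parameters (in SMASH these are trained a priori so that
# the generated weights are suitable for each encoded architecture).
H = [[random.uniform(-1.0, 1.0) for _ in range(ENC_DIM)] for _ in range(W_DIM)]

def generate_weights(encoding):
    # Child weights = H @ encoding; no per-architecture training needed.
    return [sum(H[i][j] * encoding[j] for j in range(ENC_DIM))
            for i in range(W_DIM)]

# The same fixed hypernetwork produces weights for different children.
w_a = generate_weights([1, 0, 1, 0])
w_b = generate_weights([0, 1, 1, 1])
```

The contrast with the one-shot approach in this paper is that here the hypernetwork is an extra component; the paper shows the shared weights themselves suffice.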


Experimental setup

To demonstrate its importance, we trained one-shot models with varying dropout rates. Following the setup described at the beginning of the section, each one-shot model was trained for 5,000 steps (113 epochs) using Synchronous SGD with 16 workers. The dropout rates in these experiments were kept constant throughout training
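The path-dropout setup above can be sketched as follows. The path names and rate are hypothetical, and a counter stands in for a gradient update; what the sketch shows is that each candidate path is dropped independently at a constant rate, so every path's shared weights are trained on roughly the same fraction of steps.

```python
import random

random.seed(0)

# Hypothetical candidate paths; the dropout rate is held constant for
# the whole run, as in the experiments described above.
PATHS = ["conv3x3", "conv5x5", "identity", "maxpool"]
DROP_RATE = 0.5

def sample_active_paths():
    """Drop each path independently; keep at least one so a step is valid."""
    active = [p for p in PATHS if random.random() > DROP_RATE]
    return active or [random.choice(PATHS)]

# Counters stand in for gradient updates to each path's shared weights.
step_counts = {p: 0 for p in PATHS}
for _ in range(5000):
    for p in sample_active_paths():
        step_counts[p] += 1
```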

Reference

- Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zheng, X. Tensorflow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI’16, pp. 265–283, Berkeley, CA, USA, 2016. USENIX Association. ISBN 978-1-93197133-1.
- Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M. W., Pfau, D., Schaul, T., and de Freitas, N. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pp. 3981–3989, 2016.
- Baker, B., Gupta, O., Naik, N., and Raskar, R. Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167, 2016.
- Bello, I., Zoph, B., Vasudevan, V., and Le, Q. V. Neural optimizer search with reinforcement learning. arXiv preprint arXiv:1709.07417, 2017.
- Bergstra, J. and Bengio, Y. Random search for hyperparameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.
- Bergstra, J., Yamins, D., and Cox, D. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In International Conference on Machine Learning, pp. 115–123, 2013.
- Bergstra, J. S., Bardenet, R., Bengio, Y., and Kégl, B. Algorithms for hyper-parameter optimization. In Advances in neural information processing systems, pp. 2546–2554, 2011.
- Brock, A., Lim, T., Ritchie, J. M., and Weston, N. SMASH: One-shot model architecture search through hypernetworks. arXiv preprint arXiv:1708.05344, 2017.
- Cai, H., Chen, T., Zhang, W., Yu, Y., and Wang, J. Reinforcement learning for architecture search by network transformation. arXiv preprint arXiv:1707.04873, 2017.
- Elsken, T., Metzen, J.-H., and Hutter, F. Simple and efficient architecture search for convolutional neural networks. arXiv preprint arXiv:1711.04528, 2017.
- Gordon, A., Eban, E., Nachum, O., Chen, B., Yang, T., and Choi, E. Morphnet: Fast & simple resource-constrained structure learning of deep networks. CoRR, abs/1711.06798, 2017. URL http://arxiv.org/abs/1711.06798.
- Hoffer, E., Hubara, I., and Soudry, D. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, pp. 1729–1739, 2017.
- Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
- Jaderberg, M., Dalibard, V., Osindero, S., Czarnecki, W. M., Donahue, J., Razavi, A., Vinyals, O., Green, T., Dunning, I., Simonyan, K., et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017.
- Jozefowicz, R., Zaremba, W., and Sutskever, I. An empirical exploration of recurrent network architectures. In International Conference on Machine Learning, pp. 2342–2350, 2015.
- Liu, C., Zoph, B., Shlens, J., Hua, W., Li, L.-J., Fei-Fei, L., Yuille, A., Huang, J., and Murphy, K. Progressive neural architecture search. arXiv preprint arXiv:1712.00559, 2017a.
- Liu, H., Simonyan, K., Vinyals, O., Fernando, C., and Kavukcuoglu, K. Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436, 2017b.
- Miikkulainen, R., Liang, J., Meyerson, E., Rawal, A., Fink, D., Francon, O., Raju, B., Navruzyan, A., Duffy, N., and Hodjat, B. Evolving deep neural networks. arXiv preprint arXiv:1703.00548, 2017.
- Pham, H., Guan, M. Y., Zoph, B., Le, Q. V., and Dean, J. Faster discovery of neural architectures by searching for paths in a large model. International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=ByQZjx-0-.
- Ramachandran, P., Zoph, B., and Le, Q. V. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.
- Real, E., Moore, S., Selle, A., Saxena, S., Suematsu, Y. L., Le, Q., and Kurakin, A. Large-scale evolution of image classifiers. arXiv preprint arXiv:1703.01041, 2017.
- Schmidhuber, J. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta... hook. PhD thesis, Technische Universität München, 1987.
- Snoek, J., Larochelle, H., and Adams, R. P. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pp. 2951–2959, 2012.
- Snoek, J., Rippel, O., Swersky, K., Kiros, R., Satish, N., Sundaram, N., Patwary, M., Prabhat, M., and Adams, R. Scalable bayesian optimization using deep neural networks. In International Conference on Machine Learning, pp. 2171–2180, 2015.
- Stanley, K. O. and Miikkulainen, R. Evolving neural networks through augmenting topologies. Evolutionary computation, 10(2):99–127, 2002.
- Thrun, S. and Pratt, L. Learning to learn. Springer Science & Business Media, 2012.
- Wichrowska, O., Maheswaranathan, N., Hoffman, M. W., Colmenarejo, S. G., Denil, M., de Freitas, N., and Sohl-Dickstein, J. Learned optimizers that scale and generalize. arXiv preprint arXiv:1703.04813, 2017.
- Xie, L. and Yuille, A. Genetic cnn. arXiv preprint arXiv:1703.01513, 2017.
- Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. In International Conference on Learning Representations, 2016.
- Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. Learning transferable architectures for scalable image recognition. arXiv preprint arXiv:1707.07012, 2017.
