# Efficient Neural Architecture Search via Parameter Sharing

ICML, pp. 4092-4101, 2018.

Abstract:

We propose Efficient Neural Architecture Search (ENAS), a fast and inexpensive approach for automatic model design. In ENAS, a controller learns to discover neural network architectures by searching for an optimal subgraph within a large computational graph. The controller is trained with policy gradient to select a subgraph that maximizes the expected reward on the validation set.

Introduction

- Neural architecture search (NAS) has been successfully applied to design model architectures for image classification and language models (Zoph & Le, 2017; Zoph et al, 2018; Cai et al, 2018; Liu et al, 2017; 2018).
- In NAS, a controller is trained in a loop: the controller first samples a candidate architecture, i.e. a child model, trains it to convergence, and measures its performance on the task of interest.
- The controller uses the performance as a guiding signal to find more promising architectures.
- This process is repeated for many iterations.
- The authors observe that the computational bottleneck of NAS is training each child model to convergence, only to measure its accuracy and then throw away all the trained weights.
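
The controller loop described above can be sketched schematically. This is a toy illustration with hypothetical names, not the authors' code: the expensive "train a child to convergence" step is replaced by a stand-in scoring function, and the controller is a plain probability table updated with a REINFORCE-style rule against a moving-average baseline.

```python
# Toy sketch of the NAS controller loop (illustrative only): sample a child
# architecture, score it, and use the score as a reward for a REINFORCE update.
import random

SEARCH_SPACE = ["conv3x3", "conv5x5", "maxpool", "identity"]

def sample_architecture(policy):
    """Controller step: sample one op per layer from a categorical policy."""
    arch = []
    for layer_probs in policy:
        ops, probs = zip(*layer_probs.items())
        arch.append(random.choices(ops, weights=probs)[0])
    return arch

def train_and_evaluate(arch):
    """Stand-in for training a child model to convergence and measuring its
    validation accuracy; in real NAS this step costs many GPU hours."""
    return sum(0.2 if op != "identity" else 0.1 for op in arch)

def reinforce_update(policy, arch, reward, baseline, lr=0.1):
    """REINFORCE: raise the probability of the sampled ops in proportion to
    the advantage (reward - baseline), then renormalize each layer."""
    advantage = reward - baseline
    for layer_probs, op in zip(policy, arch):
        layer_probs[op] = max(layer_probs[op] + lr * advantage, 1e-6)
        total = sum(layer_probs.values())
        for k in layer_probs:
            layer_probs[k] /= total

policy = [{op: 1.0 / len(SEARCH_SPACE) for op in SEARCH_SPACE} for _ in range(2)]
baseline = 0.0
for step in range(100):
    arch = sample_architecture(policy)
    reward = train_and_evaluate(arch)
    reinforce_update(policy, arch, reward, baseline)
    baseline = 0.9 * baseline + 0.1 * reward  # moving-average reward baseline
```

The key point of this sketch is the cost structure: every pass through the loop pays for `train_and_evaluate`, which is exactly the bottleneck ENAS removes by sharing weights across child models.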

Highlights

- Neural architecture search (NAS) has been successfully applied to design model architectures for image classification and language models (Zoph & Le, 2017; Zoph et al, 2018; Cai et al, 2018; Liu et al, 2017; 2018)
- On Penn Treebank, our method achieves a test perplexity of 56.3, which significantly outperforms NAS's test perplexity of 62.4 (Zoph & Le, 2017) and is on par with the existing state-of-the-art among Penn Treebank approaches that do not utilize post-training processing (56.0; Yang et al, 2018)
- The ENAS cell achieves a test perplexity of 56.3, which is on par with the existing state-of-the-art of 56.0 achieved by Mixture of Softmaxes (MoS) (Yang et al, 2018)
- ENAS finds a network architecture, which we visualize in Figure 7, and which achieves 4.23% test error. This test error is better than the 4.47% achieved by the second-best NAS model (Zoph & Le, 2017)
- We presented ENAS, a novel method that speeds up NAS by more than 1000x in terms of GPU hours
- We showed that ENAS works well on both the CIFAR-10 and Penn Treebank datasets
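
As background for the perplexity numbers quoted above: test perplexity is the exponential of the average per-token cross-entropy (negative log-likelihood) on the test set, so lower is better. A few lines make the definition concrete (our own illustrative helper, not from the paper):

```python
# Perplexity from per-token negative log-likelihoods (natural log).
import math

def perplexity(token_nlls):
    """token_nlls: one negative log-likelihood per test token."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model that assigns every token probability 1/e has perplexity e:
assert abs(perplexity([1.0, 1.0, 1.0]) - math.e) < 1e-9
```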

Methods

- Central to the idea of ENAS is the observation that all of the graphs which NAS ends up iterating over can be viewed as sub-graphs of a larger graph.
- The authors can represent NAS’s search space using a single directed acyclic graph (DAG).
- ENAS’s DAG is the superposition of all possible child models in the search space of NAS, where the nodes represent the local computations and the edges represent the flow of information. The authors illustrate this mechanism via a simple example recurrent cell with N = 4 computational nodes.
- Let x_t be the input signal for a recurrent cell, and h_{t−1} be the output from the previous time step.
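
A minimal sketch of this mechanism (our own illustration, not the paper's code): every possible edge (j → i) in the full DAG owns a shared weight matrix, a sampled child architecture picks one incoming edge and one activation per node, and different child cells index into the same shared parameters. The cell output here averages the "loose ends" (nodes whose output no later node consumed), following the paper's description.

```python
# Sampled subgraphs of a shared DAG defining recurrent cells with N = 4 nodes.
import numpy as np

N, H = 4, 8  # number of computational nodes, hidden size
rng = np.random.default_rng(0)

# Shared parameters: one matrix per possible edge in the full DAG,
# plus input/recurrent projections for node 0.
W = {(j, i): rng.standard_normal((H, H)) * 0.1
     for i in range(1, N) for j in range(i)}
W_x = rng.standard_normal((H, H)) * 0.1
W_h = rng.standard_normal((H, H)) * 0.1

ACT = {"tanh": np.tanh,
       "relu": lambda v: np.maximum(v, 0.0),
       "identity": lambda v: v}

def run_cell(sample, x_t, h_prev):
    """sample = [(prev_node, act_name), ...] for nodes 1..N-1.
    Node 0 combines the cell input x_t with the previous state h_{t-1};
    the cell output averages all nodes that no other node consumed."""
    nodes = [np.tanh(x_t @ W_x + h_prev @ W_h)]
    used = set()
    for i, (j, act) in enumerate(sample, start=1):
        nodes.append(ACT[act](nodes[j] @ W[(j, i)]))
        used.add(j)
    loose = [nodes[i] for i in range(N) if i not in used]
    return sum(loose) / len(loose)

x_t, h_prev = rng.standard_normal(H), rng.standard_normal(H)
# Two different child architectures reuse the SAME shared weights:
h_a = run_cell([(0, "tanh"), (1, "relu"), (1, "tanh")], x_t, h_prev)
h_b = run_cell([(0, "relu"), (0, "identity"), (2, "tanh")], x_t, h_prev)
```

Because both calls read from the same `W`, evaluating a new child architecture costs only a forward pass, which is the parameter-sharing insight that removes the train-to-convergence bottleneck.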

Results

- Running on a single Nvidia GTX 1080Ti GPU, ENAS finds a recurrent cell in about 10 hours.
- ENAS finds a network architecture, which the authors visualize in Figure 7, and which achieves 4.23% test error.
- This test error is better than the error of 4.47%, achieved by the second best NAS model (Zoph & Le, 2017).
- ENAS takes about 7 hours to find this architecture, reducing the number of GPU-hours by more than 50,000x compared to NAS

Conclusion

- NAS is an important advance that automates the process of designing neural networks.
- The authors presented ENAS, a novel method that speeds up NAS by more than 1000x, in terms of GPU hours.
- ENAS’s key contribution is the sharing of parameters across child models during the search for architectures.
- This insight is implemented by searching for a subgraph within a larger graph that incorporates architectures in a search space.
- The authors showed that ENAS works well on both CIFAR-10 and Penn Treebank datasets

Summary

## Objectives:

Since the goal of the work is to discover cell architectures, the authors only employ the standard training and test process on Penn Treebank, and do not utilize post-training techniques such as neural cache (Grave et al, 2017) and dynamic evaluation (Krause et al, 2017).

- Table1: Test perplexity on Penn Treebank of ENAS and other baselines. Abbreviations: RHN is Recurrent Highway Network; VD is Variational Dropout; WT is Weight Tying; ℓ2 is Weight Penalty; AWD is Averaged Weight Drop; MoC is Mixture of Contexts; MoS is Mixture of Softmaxes
- Table2: Classification errors of ENAS and baselines on CIFAR-10. In this table, the first block presents DenseNet, one of the state-of-the-art architectures designed by human experts. The second block presents approaches that design the entire network. The last block presents techniques that design modular cells which are combined to build the final network

Related work

There is a growing interest in improving the efficiency of NAS. Concurrent to our work are the promising ideas of using performance prediction (Baker et al, 2017b; Deng et al, 2017), using an iterative search method for architectures of growing complexity (Liu et al, 2017), and using hierarchical representations of architectures (Liu et al, 2018). Table 2 shows that ENAS is significantly more efficient than these other methods, in GPU hours.

ENAS’s design of sharing weights between architectures is inspired by the concept of weight inheritance in neural model evolution (Real et al, 2017; 2018). Additionally, ENAS’s choice of representing computations using a DAG is inspired by the concept of the stochastic computational graph (Schulman et al, 2015), which introduces nodes with stochastic outputs into a computational graph. ENAS utilizes such stochastic decisions to make discrete architectural choices that govern subsequent computations in the network, and trains the decision maker, i.e. the controller, with policy gradient.

References

- Baker, Bowen, Gupta, Otkrist, Naik, Nikhil, and Raskar, Ramesh. Designing neural network architectures using reinforcement learning. In ICLR, 2017a.
- Baker, Bowen, Gupta, Otkrist, Raskar, Ramesh, and Naik, Nikhil. Accelerating neural architecture search using performance prediction. Arxiv, 1705.10823, 2017b.
- Bello, Irwan, Pham, Hieu, Le, Quoc V., Norouzi, Mohammad, and Bengio, Samy. Neural combinatorial optimization with reinforcement learning. In ICLR Workshop, 2017a.
- Bello, Irwan, Zoph, Barret, Vasudevan, Vijay, and Le, Quoc V. Neural optimizer search with reinforcement learning. In ICML, 2017b.
- Brock, Andrew, Lim, Theodore, Ritchie, James M., and Weston, Nick. SMASH: one-shot model architecture search through hypernetworks. ICLR, 2018.
- Cai, Han, Chen, Tianyao, Zhang, Weinan, Yu, Yong., and Wang, Jun. Efficient architecture search by network transformation. In AAAI, 2018.
- Chollet, Francois. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017.
- Collins, Jasmine, Sohl-Dickstein, Jascha, and Sussillo, David. Capacity and trainability in recurrent neural networks. In ICLR, 2017.
- Deng, Boyang, Yan, Junjie, and Lin, Dahua. Peephole: Predicting network performance before training. Arxiv, 1712.03351, 2017.
- DeVries, Terrance and Taylor, Graham W. Improved regularization of convolutional neural networks with cutout. Arxiv, 1708.04552, 2017.
- Gal, Yarin and Ghahramani, Zoubin. A theoretically grounded application of dropout in recurrent neural networks. In NIPS, 2016.
- Gastaldi, Xavier. Shake-shake regularization of 3-branch residual networks. In ICLR Workshop Track, 2016.
- Grave, Edouard, Joulin, Armand, and Usunier, Nicolas. Improving neural language models with a continuous cache. In ICLR, 2017.
- Ha, David, Dai, Andrew, and Le, Quoc V. Hypernetworks. In ICLR, 2017.
- He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.
- He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. In CVPR, 2016a.
- He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Identity mappings in deep residual networks. In ECCV, 2016b.
- Hochreiter, Sepp and Schmidhuber, Jurgen. Long short-term memory. Neural Computation, 1997.
- Huang, Gao, Liu, Zhuang, van der Maaten, Laurens, and Weinberger, Kilian Q. Densely connected convolutional networks. In CVPR, 2016.
- Inan, Hakan, Khosravi, Khashayar, and Socher, Richard. Tying word vectors and word classifiers: a loss framework for language modeling. In ICLR, 2017.
- Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
- Kingma, Diederik P. and Ba, Jimmy Lei. Adam: A method for stochastic optimization. In ICLR, 2015.
- Krause, Ben, Kahembwe, Emmanuel, Murray, Iain, and Renals, Steve. Dynamic evaluation of neural sequence models. Arxiv, 1709.07432, 2017.
- Krizhevsky, Alex. Learning multiple layers of features from tiny images. Technical report, 2009.
- Larsson, Gustav, Maire, Michael, and Shakhnarovich, Gregory. Fractalnet: Ultra-deep neural networks without residuals. In ICLR, 2017.
- Lin, Min, Chen, Qiang, and Yan, Shuicheng. Network in network. Arxiv, 1312.4400, 2013.
- Liu, Chenxi, Zoph, Barret, Shlens, Jonathon, Hua, Wei, Li, Li-Jia, Fei-Fei, Li, Yuille, Alan, Huang, Jonathan, and Murphy, Kevin. Progressive neural architecture search. Arxiv, 1712.00559, 2017.
- Liu, Hanxiao, Simonyan, Karen, Vinyals, Oriol, Fernando, Chrisantha, and Kavukcuoglu, Koray. Hierarchical representations for efficient architecture search. In ICLR, 2018.
- Loshchilov, Ilya and Hutter, Frank. Sgdr: Stochastic gradient descent with warm restarts. In ICLR, 2017.
- Luong, Minh-Thang, Le, Quoc V., Sutskever, Ilya, Vinyals, Oriol, and Kaiser, Lukasz. Multi-task sequence to sequence learning. In ICLR, 2016.
- Marcus, Mitchell, Kim, Grace, Marcinkiewicz, Mary Ann, MacIntyre, Robert, Bies, Ann, Ferguson, Mark, Katz, Karen, and Schasberger, Britta. The penn treebank: Annotating predicate argument structure. In Proceedings of the Workshop on Human Language Technology, 1994.
- Melis, Gabor, Dyer, Chris, and Blunsom, Phil. On the state of the art of evaluation in neural language models. Arxiv, 1707.05589, 2017.
- Merity, Stephen, Keskar, Nitish Shirish, and Socher, Richard. Regularizing and optimizing LSTM language models. Arxiv, 1708.02182, 2017.
- Negrinho, Renato and Gordon, Geoff. Deeparchitect: Automatically designing and training deep architectures. In CVPR, 2017.
- Zoph, Barret, Yuret, Deniz, May, Jonathan, and Knight, Kevin. Transfer learning for low-resource neural machine translation. In EMNLP, 2016.
- Zoph, Barret, Vasudevan, Vijay, Shlens, Jonathon, and Le, Quoc V. Learning transferable architectures for scalable image recognition. In CVPR, 2018.
- Nesterov, Yurii. A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 1983.
- Razavian, Ali Sharif, Azizpour, Hossein, Sullivan, Josephine, and Carlsson, Stefan. Cnn features off-the-shelf: an astounding baseline for recognition. In CVPR, 2014.
- Real, Esteban, Moore, Sherry, Selle, Andrew, Saxena, Saurabh, Suematsu, Yutaka Leon, Tan, Jie, Le, Quoc, and Kurakin, Alex. Large-scale evolution of image classifiers. In ICML, 2017.
- Real, Esteban, Aggarwal, Alok, Huang, Yanping, and Le, Quoc V. Regularized evolution for image classifier architecture search. Arxiv, 1802.01548, 2018.
- Saxena, Shreyas and Verbeek, Jakob. Convolutional neural fabrics. In NIPS, 2016.
- Schulman, John, Heess, Nicolas, Weber, Theophane, and Abbeel, Pieter. Gradient estimation using stochastic computation graphs. In NIPS, 2015.
- Veniat, Tom and Denoyer, Ludovic. Learning time-efficient deep architectures with budgeted super networks. Arxiv, 1706.00046, 2017.
- Williams, Ronald J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.
- Yang, Zhilin, Dai, Zihang, Salakhutdinov, Ruslan, and Cohen, William. Breaking the softmax bottleneck: A highrank rnn language model. In ICLR, 2018.
- Zaremba, Wojciech, Sutskever, Ilya, and Vinyals, Oriol. Recurrent neural network regularization. Arxiv, 1409.2329, 2014.
- Zhong, Zhao, Yan, Junjie, and Liu, Cheng-Lin. Practical network blocks design with q-learning. AAAI, 2018.
- Zilly, Julian Georg, Srivastava, Rupesh Kumar, Koutník, Jan, and Schmidhuber, Jurgen. Recurrent highway networks. In ICML, 2017.
- Zoph, Barret and Le, Quoc V. Neural architecture search with reinforcement learning. In ICLR, 2017.
