# Faster Discovery of Neural Architectures by Searching for Paths in a Large Model

International Conference on Learning Representations (ICLR), 2018.

Keywords:

neural network architecture, Efficient Neural Architecture Search, standard NAS, directed acyclic graph, CIFAR-10

Abstract:

We propose Efficient Neural Architecture Search (ENAS), a faster and less expensive approach to automated model design than previous methods. In ENAS, a controller learns to discover neural network architectures by searching for an optimal path within a larger model. The controller is trained with policy gradient to select a path that maximizes the expected reward on a validation set.

Introduction

- Neural architecture search (NAS) has been applied successfully to design model architectures for image classification and language modeling (Zoph & Le, 2017; Baker et al., 2017a; Bello et al., 2017b; Zoph et al., 2017; Cai et al., 2017).
- In standard NAS (Zoph & Le, 2017; Baker et al., 2017a), an RNN controller is trained by policy gradient to search for a good architecture, which is essentially a computational graph.
- As illustrated in Figure 1, a neural network architecture can be found by taking a subset of edges in the large directed acyclic graph (DAG) that represents the search space.
- This design is advantageous because it enables sharing parameters among all architectures in the search space.
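The idea above can be illustrated with a minimal sketch (hypothetical names; not the paper's implementation): a DAG whose edges carry shared parameters, from which each sampled architecture is just a path, so every architecture indexes into the same weight table instead of being trained from scratch.

```python
import random

class SharedDAG:
    """A DAG over nodes 0..n-1 whose edges (i, j), i < j, hold shared weights."""

    def __init__(self, num_nodes, seed=0):
        self.rng = random.Random(seed)
        self.num_nodes = num_nodes
        # One shared weight per edge; all sampled architectures reuse these.
        self.weights = {
            (i, j): self.rng.gauss(0.0, 1.0)
            for i in range(num_nodes)
            for j in range(i + 1, num_nodes)
        }

    def sample_path(self):
        """Sample one architecture: a path from node 0 to the last node."""
        path, node = [], 0
        while node < self.num_nodes - 1:
            nxt = self.rng.randrange(node + 1, self.num_nodes)
            path.append((node, nxt))
            node = nxt
        return path

dag = SharedDAG(num_nodes=5)
a1, a2 = dag.sample_path(), dag.sample_path()
# Any edges the two architectures have in common point at the same weights.
shared_edges = set(a1) & set(a2)
```

Because both `a1` and `a2` are subsets of the same edge set, updating a weight while training one architecture immediately benefits every other architecture that uses that edge.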

Highlights

- Neural architecture search (NAS) has been applied successfully to design model architectures for image classification and language modeling (Zoph & Le, 2017; Baker et al., 2017a; Bello et al., 2017b; Zoph et al., 2017; Cai et al., 2017).
- The goal of this work is to remove the inefficiency of training every child model from scratch by enabling more sharing between the child models.
- We present an ablation study which shows the role of Efficient Neural Architecture Search (ENAS) in discovering novel architectures, as well as details regarding its efficiency.
- We presented ENAS, an alternative method to standard NAS that requires three orders of magnitude fewer resources × time.
- The key insight of our method is to share parameters across child models during architecture search. This insight is implemented by having the controller search for a path within a larger model.
- We demonstrate empirically that the method works well on both the CIFAR-10 and Penn Treebank datasets.

Methods

- The CIFAR-10 dataset (Krizhevsky, 2009) consists of 50,000 training images and 10,000 test images.
- In Section 3.3, the authors present a search space for convolutional architectures in which the controller makes decisions over skip connections and the mask over the channels.
- The authors explore two restricted versions of this search space: one where the controller only decides the mask over the channels, and one where the controller only decides the skip connections.
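The per-layer decisions described above can be sketched as follows (an illustrative toy, with hypothetical names and defaults, not the paper's code): for each layer the controller picks which earlier layers to connect via skip connections and a binary mask over the output channels, and each restricted search space fixes one of the two decisions.

```python
import random

def sample_decisions(num_layers, num_channels, rng,
                     search_skips=True, search_masks=True):
    """Sample one architecture as a list of per-layer decisions.

    Setting search_skips=False or search_masks=False yields the two
    restricted search spaces: the fixed decision falls back to a default
    (all skip connections, or all channels enabled).
    """
    decisions = []
    for layer in range(num_layers):
        # Decision (a): which earlier layers feed into this layer.
        skips = ([i for i in range(layer) if rng.random() < 0.5]
                 if search_skips else list(range(layer)))
        # Decision (b): binary mask over this layer's output channels.
        mask = ([rng.randint(0, 1) for _ in range(num_channels)]
                if search_masks else [1] * num_channels)
        decisions.append({"layer": layer, "skips": skips, "mask": mask})
    return decisions

rng = random.Random(0)
arch = sample_decisions(num_layers=4, num_channels=8, rng=rng)
skips_only = sample_decisions(4, 8, rng, search_masks=False)
```

In the `skips_only` variant every mask is all-ones, so the controller's search space collapses to skip-connection choices alone, mirroring the restricted spaces the authors study.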

Results

- Running on a single Nvidia GTX 1080Ti GPU, ENAS finds the recurrent cell in less than 10 hours.
- In the general search space, ENAS takes 15.6 hours to find a model that achieves a 4.23% error rate on CIFAR-10.
- This model outperforms all but one model reported by Zoph & Le (2017), while taking 30× less time and 800× less computing resources to discover.

Conclusion

- Neural Architecture Search (NAS) is an important advance that allows faster architecture design for neural networks.
- The authors presented ENAS, an alternative method to NAS that requires three orders of magnitude fewer resources × time.
- The key insight of the method is to share parameters across child models during architecture search.
- This insight is implemented by having the controller search for a path within a larger model.
- The authors demonstrate empirically that the method works well on both the CIFAR-10 and Penn Treebank datasets.
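The controller training described above relies on policy gradient (REINFORCE; Williams, 1992). A minimal illustrative sketch, not the paper's exact algorithm: the "controller" below is just a softmax over a few candidate paths, rewards are synthetic stand-ins for validation accuracy, and a moving-average baseline reduces gradient variance.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def train_controller(path_rewards, steps=300, lr=0.5, seed=0):
    """REINFORCE over a categorical policy on candidate paths."""
    rng = random.Random(seed)
    n = len(path_rewards)
    logits = [0.0] * n
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        a = rng.choices(range(n), weights=probs)[0]  # sample a path
        r = path_rewards[a]                          # "validation" reward
        baseline = 0.9 * baseline + 0.1 * r          # variance reduction
        # Policy-gradient update: d log p(a) / d logit_i = 1{i=a} - p_i.
        for i in range(n):
            grad = (1.0 if i == a else 0.0) - probs[i]
            logits[i] += lr * (r - baseline) * grad
    return softmax(logits)

# Path 2 yields the highest reward; the controller should learn to favor it.
probs = train_controller([0.1, 0.3, 0.9, 0.2])
```

In ENAS this controller update alternates with ordinary gradient training of the shared weights on architectures sampled from the controller, which is where the parameter sharing pays off.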


- Table 1: Test perplexity on Penn Treebank of ENAS and other approaches. VD = Variational Dropout; WT = Weight Tying; MC = Monte Carlo sampling.
- Table 2: Classification error rates of ENAS and other methods on CIFAR-10. The first block presents the state-of-the-art models, all of which are designed by human experts. The second block presents various approaches that design the entire network; ENAS outperforms all of these methods except NAS, which requires far more computing resources and time. The last block presents techniques that design modular cells which are used to build a large model; ENAS outperforms MicroNAS, which uses 32 GPUs to search, and achieves performance similar to NASNet-A.
- Table 3: Time and resources needed by different architecture search methods to find good architectures for CIFAR-10.

Related work

- There is growing interest in improving the efficiency of neural architecture search. Concurrent to our work are the promising ideas of using learning-curve prediction to skip bad models (Baker et al., 2017b), predicting the accuracy of models before training (Deng et al., 2017), using an iterative search method for architectures of growing complexity (Liu et al., 2017a; Elsken et al., 2017), and using hierarchical representations of architectures (Liu et al., 2017b). Our method is also inspired by the concept of weight inheritance in neuroevolution, which has been demonstrated to have positive effects at scale (Real et al., 2017).

Closely related to our method are other recent approaches that avoid training each architecture from scratch, such as convolutional neural fabrics (ConvFabrics; Saxena & Verbeek, 2016) and SMASH (Brock et al., 2017). These methods are more computationally efficient than standard NAS. However, the search space of ConvFabrics is not flexible enough to include novel architectures, e.g. architectures with arbitrary skip connections as in Zoph & Le (2017). Meanwhile, SMASH can design interesting architectures but requires a hypernetwork (Ha et al., 2017) to generate the weights, conditioned on the architectures. While a hypernetwork can efficiently rank different architectures, as shown in that paper, the real performance of each network differs from its performance with parameters generated by a hypernetwork. Such a discrepancy in SMASH can cause misleading signals for reinforcement learning. Even more closely related to our method is PathNet (Fernando et al., 2017), which uses evolution to search for a path inside a large model for transfer learning between Atari games.

Reference

- Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures using reinforcement learning. In ICLR, 2017a.
- Bowen Baker, Otkrist Gupta, Ramesh Raskar, and Nikhil Naik. Accelerating neural architecture search using performance prediction. arXiv, 1705.10823, 2017b.
- Irwan Bello, Hieu Pham, Quoc V. Le, Mohammad Norouzi, and Samy Bengio. Neural combinatorial optimization with reinforcement learning. In ICLR Workshop, 2017a.
- Irwan Bello, Barret Zoph, Vijay Vasudevan, and Quoc V. Le. Neural optimizer search with reinforcement learning. In ICML, 2017b.
- Léon Bottou. Une approche théorique de l'apprentissage connexionniste: applications à la reconnaissance de la parole. PhD thesis, 1991.
- Andrew Brock, Theodore Lim, James M. Ritchie, and Nick Weston. SMASH: One-shot model architecture search through hypernetworks. arXiv, 1708.05344, 2017.
- Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang. Reinforcement learning for architecture search by network transformation. arXiv, 1707.04873, 2017.
- Boyang Deng, Junjie Yan, and Dahua Lin. Peephole: Predicting network performance before training. arXiv, 1705.10823, 2017.
- Terrance DeVries and Graham W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv, 1708.04552, 2017.
- Thomas Elsken, Jan-Hendrik Metzen, and Frank Hutter. Simple and efficient architecture search for convolutional neural networks. arXiv, 1711.04528, 2017.
- Chrisantha Fernando, Dylan Banarse, Charles Blundell, Yori Zwols, David Ha, Andrei A. Rusu, Alexander Pritzel, and Daan Wierstra. PathNet: Evolution channels gradient descent in super neural networks. arXiv, 1701.08734, 2017.
- Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In NIPS, 2016.
- Xavier Gastaldi. Shake-shake regularization of 3-branch residual networks. In ICLR Workshop Track, 2016.
- David Ha, Andrew Dai, and Quoc V. Le. HyperNetworks. In ICLR, 2017.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
- Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997.
- Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
- Hakan Inan, Khashayar Khosravi, and Richard Socher. Tying word vectors and word classifiers: A loss framework for language modeling. In ICLR, 2017.
- Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
- Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
- Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. Dynamic evaluation of neural sequence models. arXiv, 1709.07432, 2017.
- Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
- Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. FractalNet: Ultra-deep neural networks without residuals. In ICLR, 2017.
- Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv, 1312.4400, 2013.
- Chenxi Liu, Barret Zoph, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. arXiv, 1712.03351, 2017a.
- Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, and Koray Kavukcuoglu. Hierarchical representations for efficient architecture search. arXiv, 1711.00436, 2017b.
- Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In ICLR, 2017.
- Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. The Penn Treebank: Annotating predicate argument structure. In Proceedings of the Workshop on Human Language Technology, 1994.
- Gábor Melis, Chris Dyer, and Phil Blunsom. On the state of the art of evaluation in neural language models. arXiv, 1707.05589, 2017.
- Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing LSTM language models. arXiv, 1708.02182, 2017.
- Renato Negrinho and Geoff Gordon. DeepArchitect: Automatically designing and training deep architectures. arXiv, 2017.
- Yurii E. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 1983.
- Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie Tan, Quoc Le, and Alex Kurakin. Large-scale evolution of image classifiers. In ICML, 2017.
- Shreyas Saxena and Jakob Verbeek. Convolutional neural fabrics. In NIPS, 2016.
- Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In CVPR, 2016.
- Tom Veniat and Ludovic Denoyer. Learning time-efficient deep architectures with budgeted super networks. arXiv, 1706.00046, 2017.
- Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.
- Ronald J. Williams and Jing Peng. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241–268, 1991.
- Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv, 1409.2329, 2014.
- Zhao Zhong, Junjie Yan, and Cheng-Lin Liu. Practical network blocks design with Q-learning. arXiv, 1708.05552, 2017.
- Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber. Recurrent highway networks. In ICML, 2017.
- Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. In ICLR, 2017.
- Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. arXiv, 1707.07012, 2017.
