Efficient Neural Architecture Search via Parameter Sharing

    Hieu Pham
    Melody Y. Guan
    Barret Zoph

    ICML, pp. 4092-4101, 2018.

    Keywords: test error, language model, model design, stochastic gradient descent, architecture search

    Abstract:

    We propose Efficient Neural Architecture Search (ENAS), a fast and inexpensive approach for automatic model design. In ENAS, a controller learns to discover neural network architectures by searching for an optimal subgraph within a large computational graph. The controller is trained with policy gradient to select a subgraph that maximizes the expected reward on a validation set.
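    The abstract's two training phases can be sketched as follows. This is a minimal illustration rather than the authors' code: the helper names (sample_subgraph, sgd_step_on_child, reward_on_validation, policy_gradient_step) and the loop counts are hypothetical stubs. What it shows is the alternation ENAS relies on: the shared parameters omega are updated by SGD on the cross-entropy loss of child models sampled from the controller, and the controller parameters theta are updated with policy gradient using a reward measured on the validation set.

      import random

      def sample_subgraph(theta):
          """Sample a child architecture (a sub-graph of the large graph) from the controller."""
          return [random.randrange(4) for _ in range(4)]   # toy: four categorical decisions

      def sgd_step_on_child(omega, arch, batch):
          """One cross-entropy SGD step for the child model defined by arch,
          applied directly to the shared parameters omega (stub)."""

      def reward_on_validation(omega, arch):
          """Validation reward of arch, computed with the shared parameters (stub)."""
          return random.random()

      def policy_gradient_step(theta, arch, advantage):
          """REINFORCE-style update of the controller parameters (stub)."""

      def train_enas(omega, theta, train_batches, epochs=150, controller_steps=50):
          baseline = 0.0
          for _ in range(epochs):
              for batch in train_batches:                   # phase 1: train shared weights omega
                  sgd_step_on_child(omega, sample_subgraph(theta), batch)
              for _ in range(controller_steps):             # phase 2: train the controller theta
                  arch = sample_subgraph(theta)
                  r = reward_on_validation(omega, arch)
                  baseline = 0.95 * baseline + 0.05 * r     # moving-average baseline
                  policy_gradient_step(theta, arch, r - baseline)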

    Introduction
    • Neural architecture search (NAS) has been successfully applied to design model architectures for image classification and language models (Zoph & Le, 2017; Zoph et al, 2018; Cai et al, 2018; Liu et al, 2017; 2018).
    • In NAS, a controller is trained in a loop: the controller first samples a candidate architecture, i.e. a child model, trains it to convergence, and measures its performance on the task of interest.
    • The controller uses the performance as a guiding signal to find more promising architectures.
    • This process is repeated for many iterations.
    • The authors observe that the computational bottleneck of NAS is the training of each child model to convergence, only to measure its accuracy while throwing away all the trained weights; a minimal sketch of this loop appears below.
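    The loop described above, and its bottleneck, can be made concrete with a toy REINFORCE controller over a one-decision search space. Everything here is illustrative and not from the paper: the search space has six fake choices, and train_child_to_convergence is a stand-in that returns a fake validation accuracy, whereas in real NAS that single call is the expensive from-scratch training whose weights are discarded once the accuracy has been read off.

      import numpy as np

      rng = np.random.default_rng(0)
      NUM_CHOICES = 6                      # toy search space: a single categorical decision
      theta = np.zeros(NUM_CHOICES)        # controller parameters (logits)
      lr, baseline = 0.1, 0.0

      def train_child_to_convergence(choice):
          """Stand-in for the expensive step: train a child model from scratch and return
          its validation accuracy.  In real NAS this costs hours of GPU time, and the
          trained weights are discarded afterwards."""
          return 0.80 + 0.02 * choice + 0.05 * rng.random()    # fake accuracy signal

      for step in range(200):
          probs = np.exp(theta - theta.max())
          probs /= probs.sum()                                 # softmax policy over choices
          choice = rng.choice(NUM_CHOICES, p=probs)            # sample a candidate architecture
          reward = train_child_to_convergence(choice)          # the computational bottleneck
          baseline = 0.95 * baseline + 0.05 * reward           # moving-average baseline
          grad_log_p = -probs
          grad_log_p[choice] += 1.0                            # d log pi(choice) / d theta
          theta += lr * (reward - baseline) * grad_log_p       # REINFORCE update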
    Highlights
    • Neural architecture search (NAS) has been successfully applied to design model architectures for image classification and language models (Zoph & Le, 2017; Zoph et al, 2018; Cai et al, 2018; Liu et al, 2017; 2018)
    • On Penn Treebank, our method achieves a test perplexity of 56.3, which significantly outperforms NAS's test perplexity of 62.4 (Zoph & Le, 2017) and is on par with the existing state-of-the-art among Penn Treebank approaches that do not utilize post-training processing (56.0; Yang et al, 2018)
    • The Efficient Neural Architecture Search cell achieves a test perplexity of 56.3, which is on par with the existing state-of-the-art of 56.0 achieved by Mixture of Softmaxes (MoS) (Yang et al, 2018)
    • Efficient Neural Architecture Search finds a network architecture, which we visualize in Figure 7, and which achieves 4.23% test error. This test error is better than the 4.47% error achieved by the second-best NAS model (Zoph & Le, 2017)
    • We presented Efficient Neural Architecture Search, a novel method that speeds up NAS by more than 1000x in terms of GPU hours
    • We showed that Efficient Neural Architecture Search works well on both the CIFAR-10 and Penn Treebank datasets
    Methods
    • Central to the idea of ENAS is the observation that all of the graphs which NAS ends up iterating over can be viewed as sub-graphs of a larger graph.
    • The authors can represent NAS’s search space using a single directed acyclic graph (DAG).
    • ENAS’s DAG is the superposition of all possible child models in a search space of NAS, where the nodes represent the local computations and the edges represent the flow of information. The authors illustrate this mechanism with a simple example recurrent cell with N = 4 computational nodes.
    • Let xt be the input signal for a recurrent cell, and ht−1 be the output from the previous time step; a toy sketch of this shared-weight cell sampling appears below.
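    A toy sketch of the shared-weight recurrent cell follows. The names (N, H, ACTS, W, sample_cell, run_cell) and sizes are hypothetical, node 0's activation is fixed to tanh for brevity, and batching and training are omitted; what it shows is the mechanism the bullets above describe: each (source node, target node) edge of the DAG owns one weight matrix that every sampled cell reuses, a sampled architecture picks one incoming edge and one activation per node, and the cell output averages the loose ends (nodes not used as inputs elsewhere).

      import numpy as np

      N = 4                                            # number of computational nodes in the cell
      H = 8                                            # toy hidden size
      ACTS = {0: np.tanh,                              # candidate activation functions
              1: lambda x: np.maximum(x, 0.0),         # ReLU
              2: lambda x: 1.0 / (1.0 + np.exp(-x)),   # sigmoid
              3: lambda x: x}                          # identity

      rng = np.random.default_rng(0)
      # Shared parameters: one matrix per (source j, target i) edge of the DAG.
      # Every sampled cell reuses these; this is the parameter sharing in ENAS.
      W = {(j, i): rng.normal(scale=0.1, size=(H, H))
           for i in range(1, N) for j in range(i)}
      W_x = rng.normal(scale=0.1, size=(H, H))         # input-to-node-0 weights (toy: x is H-dim)
      W_h = rng.normal(scale=0.1, size=(H, H))         # previous-state-to-node-0 weights

      def sample_cell():
          """For each node i > 0, pick a previous node j < i and an activation."""
          return [(int(rng.integers(0, i)), int(rng.integers(0, len(ACTS))))
                  for i in range(1, N)]

      def run_cell(arch, x_t, h_prev):
          """One step of the sampled recurrent cell, using only the shared weights."""
          h = [np.tanh(x_t @ W_x + h_prev @ W_h)]      # node 0 reads x_t and h_{t-1}
          for i, (j, act) in enumerate(arch, start=1):
              h.append(ACTS[act](h[j] @ W[(j, i)]))
          used = {j for j, _ in arch}                  # nodes consumed by other nodes
          loose = [h[i] for i in range(N) if i not in used]
          return sum(loose) / len(loose)               # cell output: average of loose ends

      x_t, h_prev = rng.normal(size=(H,)), rng.normal(size=(H,))
      h_t = run_cell(sample_cell(), x_t, h_prev)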
    Results
    • Running on a single Nvidia GTX 1080Ti GPU, ENAS finds a recurrent cell in about 10 hours.
    • ENAS finds a network architecture, which the authors visualize in Figure 7, and which achieves 4.23% test error.
    • This test error is better than the error of 4.47%, achieved by the second best NAS model (Zoph & Le, 2017).
    • ENAS takes about 7 hours to find this architecture, reducing the number of GPU-hours by more than 50,000x compared to NAS.
    Conclusion
    • NAS is an important advance that automates the design process of neural networks.
    • The authors presented ENAS, a novel method that speeds up NAS by more than 1000x, in terms of GPU hours.
    • ENAS’s key contribution is the sharing of parameters across child models during the search for architectures.
    • This insight is implemented by searching for a subgraph within a larger graph that subsumes all architectures in the search space.
    • The authors showed that ENAS works well on both the CIFAR-10 and Penn Treebank datasets.
    Summary
    • Objectives:

      Since the goal of the work is to discover cell architectures, the authors only employ the standard training and test process on Penn Treebank, and do not utilize post-training techniques such as neural cache (Grave et al, 2017) and dynamic evaluation (Krause et al, 2017).
    Tables
    • Table 1: Test perplexity on Penn Treebank of ENAS and other baselines. Abbreviations: RHN is Recurrent Highway Network; VD is Variational Dropout; WT is Weight Tying; ℓ2 is Weight Penalty; AWD is Averaged Weight Drop; MoC is Mixture of Contexts; MoS is Mixture of Softmaxes
    • Table 2: Classification errors of ENAS and baselines on CIFAR-10. In this table, the first block presents DenseNet, one of the state-of-the-art architectures designed by human experts. The second block presents approaches that design the entire network. The last block presents techniques that design modular cells which are combined to build the final network
    Related work
    • Related Work and Discussions

      There is a growing interest in improving the efficiency of NAS. Concurrent to our work are the promising ideas of using performance prediction (Baker et al, 2017b; Deng et al, 2017), using an iterative search method for architectures of growing complexity (Liu et al, 2017), and using hierarchical representations of architectures (Liu et al, 2018). Table 2 shows that ENAS is significantly more efficient than these other methods in terms of GPU hours.

      ENAS’s design of sharing weights between architectures is inspired by the concept of weight inheritance in neural model evolution (Real et al, 2017; 2018). Additionally, ENAS’s choice of representing computations using a DAG is inspired by the concept of stochastic computation graphs (Schulman et al, 2015), which introduces nodes with stochastic outputs into a computational graph. ENAS utilizes such stochastic decisions in a network to make discrete architectural decisions that govern subsequent computations in the network, and trains the decision maker, i.e. the controller, with policy gradient.
    Reference
    • Baker, Bowen, Gupta, Otkrist, Naik, Nikhil, and Raskar, Ramesh. Designing neural network architectures using reinforcement learning. In ICLR, 2017a.
    • Baker, Bowen, Gupta, Otkrist, Raskar, Ramesh, and Naik, Nikhil. Accelerating neural architecture search using performance prediction. Arxiv, 1705.10823, 2017b.
    • Bello, Irwan, Pham, Hieu, Le, Quoc V., Norouzi, Mohammad, and Bengio, Samy. Neural combinatorial optimization with reinforcement learning. In ICLR Workshop, 2017a.
    • Bello, Irwan, Zoph, Barret, Vasudevan, Vijay, and Le, Quoc V. Neural optimizer search with reinforcement learning. In ICML, 2017b.
    • Brock, Andrew, Lim, Theodore, Ritchie, James M., and Weston, Nick. SMASH: One-shot model architecture search through hypernetworks. In ICLR, 2018.
    • Cai, Han, Chen, Tianyao, Zhang, Weinan, Yu, Yong, and Wang, Jun. Efficient architecture search by network transformation. In AAAI, 2018.
    • Chollet, Francois. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017.
    • Collins, Jasmine, Sohl-Dickstein, Jascha, and Sussillo, David. Capacity and trainability in recurrent neural networks. In ICLR, 2017.
    • Deng, Boyang, Yan, Junjie, and Lin, Dahua. Peephole: Predicting network performance before training. Arxiv, 1712.03351, 2017.
    • DeVries, Terrance and Taylor, Graham W. Improved regularization of convolutional neural networks with cutout. Arxiv, 1708.04552, 2017.
    • Gal, Yarin and Ghahramani, Zoubin. A theoretically grounded application of dropout in recurrent neural networks. In NIPS, 2016.
    • Gastaldi, Xavier. Shake-shake regularization of 3-branch residual networks. In ICLR Workshop Track, 2016.
    • Grave, Edouard, Joulin, Armand, and Usunier, Nicolas. Improving neural language models with a continuous cache. In ICLR, 2017.
    • Ha, David, Dai, Andrew, and Le, Quoc V. Hypernetworks. In ICLR, 2017.
    • He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.
    • He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. In CVPR, 2016a.
    • He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Identity mappings in deep residual networks. In ECCV, 2016b.
    • Hochreiter, Sepp and Schmidhuber, Jurgen. Long short-term memory. Neural Computation, 1997.
    • Huang, Gao, Liu, Zhuang, van der Maaten, Laurens, and Weinberger, Kilian Q. Densely connected convolutional networks. In CVPR, 2016.
    • Inan, Hakan, Khosravi, Khashayar, and Socher, Richard. Tying word vectors and word classifiers: A loss framework for language modeling. In ICLR, 2017.
    • Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
    • Kingma, Diederik P. and Ba, Jimmy Lei. Adam: A method for stochastic optimization. In ICLR, 2015.
    • Krause, Ben, Kahembwe, Emmanuel, Murray, Iain, and Renals, Steve. Dynamic evaluation of neural sequence models. Arxiv, 1709.07432, 2017.
    • Krizhevsky, Alex. Learning multiple layers of features from tiny images. Technical report, 2009.
    • Larsson, Gustav, Maire, Michael, and Shakhnarovich, Gregory. FractalNet: Ultra-deep neural networks without residuals. In ICLR, 2017.
    • Lin, Min, Chen, Qiang, and Yan, Shuicheng. Network in network. Arxiv, 1312.4400, 2013.
    • Liu, Chenxi, Zoph, Barret, Shlens, Jonathon, Hua, Wei, Li, Li-Jia, Fei-Fei, Li, Yuille, Alan, Huang, Jonathan, and Murphy, Kevin. Progressive neural architecture search. Arxiv, 1712.00559, 2017.
    • Liu, Hanxiao, Simonyan, Karen, Vinyals, Oriol, Fernando, Chrisantha, and Kavukcuoglu, Koray. Hierarchical representations for efficient architecture search. In ICLR, 2018.
    • Loshchilov, Ilya and Hutter, Frank. SGDR: Stochastic gradient descent with warm restarts. In ICLR, 2017.
    • Luong, Minh-Thang, Le, Quoc V., Sutskever, Ilya, Vinyals, Oriol, and Kaiser, Lukasz. Multi-task sequence to sequence learning. In ICLR, 2016.
    • Marcus, Mitchell, Kim, Grace, Marcinkiewicz, Mary Ann, MacIntyre, Robert, Bies, Ann, Ferguson, Mark, Katz, Karen, and Schasberger, Britta. The Penn Treebank: Annotating predicate argument structure. In Proceedings of the Workshop on Human Language Technology, 1994.
    • Melis, Gabor, Dyer, Chris, and Blunsom, Phil. On the state of the art of evaluation in neural language models. Arxiv, 1707.05589, 2017.
    • Merity, Stephen, Keskar, Nitish Shirish, and Socher, Richard. Regularizing and optimizing LSTM language models. Arxiv, 1708.02182, 2017.
    • Negrinho, Renato and Gordon, Geoff. DeepArchitect: Automatically designing and training deep architectures. In CVPR, 2017.
    • Nesterov, Yurii. A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 1983.
    • Razavian, Ali Sharif, Azizpour, Hossein, Sullivan, Josephine, and Carlsson, Stefan. CNN features off-the-shelf: An astounding baseline for recognition. In CVPR, 2014.
    • Real, Esteban, Moore, Sherry, Selle, Andrew, Saxena, Saurabh, Suematsu, Yutaka Leon, Tan, Jie, Le, Quoc, and Kurakin, Alex. Large-scale evolution of image classifiers. In ICML, 2017.
    • Real, Esteban, Aggarwal, Alok, Huang, Yanping, and Le, Quoc V. Regularized evolution for image classifier architecture search. Arxiv, 1802.01548, 2018.
    • Saxena, Shreyas and Verbeek, Jakob. Convolutional neural fabrics. In NIPS, 2016.
    • Schulman, John, Heess, Nicolas, Weber, Theophane, and Abbeel, Pieter. Gradient estimation using stochastic computation graphs. In NIPS, 2015.
    • Veniat, Tom and Denoyer, Ludovic. Learning time-efficient deep architectures with budgeted super networks. Arxiv, 1706.00046, 2017.
    • Williams, Ronald J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.
    • Yang, Zhilin, Dai, Zihang, Salakhutdinov, Ruslan, and Cohen, William. Breaking the softmax bottleneck: A high-rank RNN language model. In ICLR, 2018.
    • Zaremba, Wojciech, Sutskever, Ilya, and Vinyals, Oriol. Recurrent neural network regularization. Arxiv, 1409.2329, 2014.
    • Zhong, Zhao, Yan, Junjie, and Liu, Cheng-Lin. Practical network blocks design with Q-learning. In AAAI, 2018.
    • Zilly, Julian Georg, Srivastava, Rupesh Kumar, Koutnik, Jan, and Schmidhuber, Jurgen. Recurrent highway networks. In ICML, 2017.
    • Zoph, Barret and Le, Quoc V. Neural architecture search with reinforcement learning. In ICLR, 2017.
    • Zoph, Barret, Yuret, Deniz, May, Jonathan, and Knight, Kevin. Transfer learning for low-resource neural machine translation. In EMNLP, 2016.
    • Zoph, Barret, Vasudevan, Vijay, Shlens, Jonathon, and Le, Quoc V. Learning transferable architectures for scalable image recognition. In CVPR, 2018.