Neural Optimizer Search with Reinforcement Learning

    Irwan Bello
    Barret Zoph

    ICML, pp. 459-468, 2017.

    Keywords:
    neural network architecture, machine translation, CIFAR-10, domain specific language, deep learning architecture

    Abstract:

    We present an approach to automate the process of discovering optimization methods, with a focus on deep learning architectures. We train a Recurrent Neural Network controller to generate a string in a domain specific language that describes a mathematical update equation based on a list of primitive functions, such as the gradient and the running average of the gradient.
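
    As a rough illustration of the search loop described in the abstract, the sketch below runs a toy REINFORCE-style controller over 5-token update-rule strings (the string format is described later in this document); the token vocabularies, the reward function, and all hyperparameters here are illustrative placeholders, whereas the paper uses an RNN controller rewarded with the validation accuracy of a small child network.

      import numpy as np

      # Toy REINFORCE loop over 5-token update-rule strings
      # (operand1, operand2, unary1, unary2, binary).
      # Vocabularies and the reward function are placeholders.
      rng = np.random.default_rng(0)
      OPERANDS = ["g", "m", "1"]
      UNARY = ["identity", "sign", "exp"]
      BINARY = ["mul", "add"]
      VOCABS = [OPERANDS, OPERANDS, UNARY, UNARY, BINARY]
      logits = [np.zeros(len(v)) for v in VOCABS]      # one softmax per token slot

      def sample_rule():
          probs = [np.exp(l) / np.exp(l).sum() for l in logits]
          tokens = [rng.choice(len(v), p=p) for v, p in zip(VOCABS, probs)]
          return tokens, probs

      def reward(tokens):
          # Placeholder reward; the paper instead trains a small ConvNet for a few
          # epochs with the decoded update rule and uses its validation accuracy.
          return float(tokens[4] == BINARY.index("mul"))

      baseline, step_size = 0.0, 0.5
      for step in range(200):
          tokens, probs = sample_rule()
          r = reward(tokens)
          baseline = 0.9 * baseline + 0.1 * r          # moving-average baseline
          for slot, (tok, p) in enumerate(zip(tokens, probs)):
              grad_logp = -p                            # d log pi / d logits ...
              grad_logp[tok] += 1.0                     # ... = one_hot(tok) - p
              logits[slot] += step_size * (r - baseline) * grad_logp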

    Introduction
    • The choice of the right optimization method plays a major role in the success of training deep learning models.
    • While Stochastic Gradient Descent (SGD) often works well out of the box, more advanced optimization methods such as Adam (Kingma & Ba, 2015) or Adagrad (Duchi et al., 2011) can be faster, especially for training very deep networks.
    • The authors consider an approach to automate the process of designing update rules for optimization methods, especially for deep learning architectures.
    Highlights
    • The choice of the right optimization method plays a major role in the success of training deep learning models
    • We consider an approach to automate the process of designing update rules for optimization methods, especially for deep learning architectures
    • To map strings sampled by the controller to an update rule, we design a domain specific language that relies on a parenthesis-free notation
    • Our choice of domain specific language (DSL) is motivated by the observation that the computational graph of most common optimizers can be represented as a simple binary expression tree, assuming input primitives such as the gradient or the running average of the gradient and basic unary and binary functions
    • We express each update rule with a string describing 1) the first operand to select, 2) the second operand to select, 3) the unary function to apply on the first operand, 4) the unary function to apply on the second operand and 5) the binary function to apply to combine the outputs of the unary functions
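
    To make the encoding concrete, the following is a minimal sketch of how one such 5-element string could be decoded into an update expression; the operand names and the unary/binary function tables are illustrative placeholders rather than the paper's full primitive sets.

      import numpy as np

      # Illustrative operand and function tables (not the paper's full lists).
      UNARY = {"identity": lambda x: x, "sign": np.sign, "exp": np.exp}
      BINARY = {"mul": lambda a, b: a * b, "add": lambda a, b: a + b}

      def decode(tokens, operands):
          """tokens = (op1, op2, u1, u2, b): one 5-element string from the controller."""
          op1, op2, u1, u2, b = tokens
          return BINARY[b](UNARY[u1](operands[op1]), UNARY[u2](operands[op2]))

      # sign(g) * sign(m): +1 where the gradient agrees with its running average.
      # Note that different strings can decode to the same rule, e.g. swapping the
      # two operands of a symmetric binary function.
      g, m = np.array([0.3, -0.2]), np.array([0.1, 0.1])
      direction = decode(("g", "m", "sign", "sign", "mul"), {"g": g, "m": m})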
    Methods
    • The authors express each update rule with a string describing 1) the first operand to select, 2) the second operand to select, 3) the unary function to apply on the first operand, 4) the unary function to apply on the second operand and 5) the binary function to apply to combine the outputs of the unary functions.
    • The output of the binary function is either temporarily stored in the operand bank or used as the final weight update, Δw = λ * b(u1(op1), u2(op2)), where op1 and op2 are the selected operands, u1 and u2 the unary functions, b the binary function, and λ the learning rate.
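
    A sketch of how a sampled program with several such groups could be evaluated is shown below: each group's output is appended to an operand bank so that later groups can reuse it, and the last output forms the final weight update Δw = λ * b(u1(op1), u2(op2)). The function tables, operand names, and the descent sign convention are assumptions for illustration.

      import numpy as np

      # Illustrative function tables (placeholders, not the paper's primitive sets).
      UNARY = {"identity": lambda x: x, "sign": np.sign, "exp": np.exp}
      BINARY = {"mul": lambda a, b: a * b, "add": lambda a, b: a + b}

      def apply_program(w, program, bank, lr=1e-3):
          """program: list of 5-tuples (op1, op2, u1, u2, b); bank: named operands."""
          out = bank["g"]                               # fallback if program is empty
          for i, (op1, op2, u1, u2, b) in enumerate(program):
              out = BINARY[b](UNARY[u1](bank[op1]), UNARY[u2](bank[op2]))
              bank["out%d" % i] = out                   # store for later groups
          return w - lr * out                           # final update (sign convention assumed)

      # Two groups: sign(g) * sign(m), then exp(.) * g -> an e^(sign(g)*sign(m)) * g rule.
      w = np.ones(3)
      g, m = np.array([0.3, -0.2, 0.1]), np.array([0.1, 0.1, 0.1])
      program = [("g", "m", "sign", "sign", "mul"),
                 ("out0", "g", "exp", "identity", "mul")]
      w = apply_program(w, program, {"g": g, "m": m})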
    Results
    • The authors' results show that the controller discovers many different updates that perform well during training and that the maximum accuracy increases over time.
    • In Figure 4, the authors show the learning curve of the controller as more optimizers are sampled (reward versus number of sampled optimizers); the top candidate optimizers are then run for 300 epochs on the full CIFAR-10 dataset.
    • The controller not only discovered update rules that work well, but also produced update equations that are fairly intuitive.
    • Among the top candidates is the update function e^(sign(g)*sign(m)) * g, which scales the gradient up when its sign agrees with the sign of its running average and down when it does not.
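
    As a reference point, here is a hedged sketch of one training step with this kind of sign-comparison update; the decay rate, learning rate, and variable names are illustrative defaults, not the paper's tuned settings.

      import numpy as np

      def sign_agreement_step(w, g, m, lr=1e-3, beta=0.9):
          """One step of an e^(sign(g)*sign(m)) * g style update.
          w: parameters, g: current gradient, m: running average of the gradient."""
          m = beta * m + (1 - beta) * g                 # update the running average
          scale = np.exp(np.sign(g) * np.sign(m))       # >1 if signs agree, <1 otherwise
          w = w - lr * scale * g                        # scaled gradient descent step
          return w, m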
    Conclusion
    • This paper considers an approach for automating the discovery of optimizers with a focus on deep neural network architectures.
    • One may, for example, use the method to discover optimizers that perform well in scenarios where computations are only carried out using 4 bits, or in a distributed setup where workers can only communicate a few bits of information to a shared parameter server.
    • Unlike previous approaches in learning to learn, the update rules in the form of equations can be transferred to other optimization tasks.
    • In addition to opening up new ways to design update rules, the discovered update rules can be used to improve the training of deep networks.
    Summary
    • Objectives: The goal of the work is to search for better update rules for neural networks in the space of well-known primitives.
    Tables
    • Table 1: Performance of Neural Optimizer Search and standard optimizers on the Wide-ResNet architecture (Zagoruyko & Komodakis, 2016) on CIFAR-10. Final Val and Final Test refer to the final validation and test accuracy after training for 300 epochs. Best Val corresponds to the best validation accuracy over the 300 epochs, and Best Test is the test accuracy at the epoch where the validation accuracy was highest. For each optimizer, the best results out of seven learning rates on a logarithmic scale are reported, selected according to validation accuracy.
    • Table 2: Performance of our optimizer versus Adam in a strong baseline GNMT model on WMT 2014 English → German.
    Related work
    • Neural networks are difficult and slow to train, and many methods have been designed to tackle this difficulty (e.g., Riedmiller & Braun (1992); LeCun et al. (1998); Schraudolph (2002); Martens (2010); Le et al. (2011); Duchi et al. (2011); Zeiler (2012); Martens & Sutskever (2012); Schaul et al. (2013); Pascanu & Bengio (2013); Pascanu et al. (2013); Kingma & Ba (2014); Ba et al. (2017)). More recent optimization methods combine insights from both stochastic and batch methods: they use a small minibatch, similar to SGD, yet they implement many heuristics to estimate diagonal second-order information, similar to Hessian-free or L-BFGS (Liu & Nocedal, 1989). This combination often yields faster convergence for practical problems (Duchi et al., 2011; Dean et al., 2012; Kingma & Ba, 2014). For example, Adam (Kingma & Ba, 2014), a commonly used optimizer in deep learning, implements simple heuristics to estimate the mean and variance of the gradient, which are used to generate more stable updates during training.
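
    For concreteness, the Adam heuristics referred to above amount to the following per-parameter update (standard formulation with the commonly used default hyperparameters; not specific to this paper):

      import numpy as np

      def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
          """One Adam step: m and v are running estimates of the gradient's mean and
          uncentered variance; bias correction accounts for their zero initialization."""
          m = beta1 * m + (1 - beta1) * g
          v = beta2 * v + (1 - beta2) * g * g
          m_hat = m / (1 - beta1 ** t)                  # bias-corrected first moment
          v_hat = v / (1 - beta2 ** t)                  # bias-corrected second moment
          w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # stable per-parameter step
          return w, m, v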

      Many of the above update rules are designed by borrowing ideas from convex analysis, even though optimization problems in neural networks are non-convex. Recent empirical results with non-monotonic learning rate heuristics (Loshchilov & Hutter, 2017) suggest that there are still many unknowns in training neural networks and that many ideas in non-convex optimization can be used to improve it.
    Findings
    • Presents an approach to automate the process of discovering optimization methods, with a focus on deep learning architectures
    • Notes that multiple strings in the prediction scheme can map to the same underlying update rule, including strings of different lengths. This is a feature both of the action space of mathematical expressions and of the choice of domain specific language, and the authors argue that it makes for interesting exploration dynamics, because a competitive optimizer may be obtained by expressing a standard optimizer in an expanded fashion and modifying it slightly.
    • Update rules found on a small ConvNet architecture, when applied to the Wide ResNet architecture, improved accuracy over Adam, RMSProp, Momentum, and SGD by a margin of up to 2% on the test set.
    • Finally, the discovered update rule is also more memory efficient, as it keeps only one running average per parameter, compared to two running averages for Adam. This has practical implications for much larger translation models, where Adam cannot currently be used due to memory constraints (Shazeer et al., 2017).