NAS evaluation is frustratingly hard

Antoine Yang
Pedro M. Esperança
Fabio M. Carlucci

ICLR, 2020.

Keywords: neural architecture search (NAS), benchmark, reproducibility, HARKing

Abstract:

Neural Architecture Search (NAS) is an exciting new field which promises to be as much of a game-changer as Convolutional Neural Networks were in 2012. Despite many great works leading to substantial improvements on a variety of tasks, comparison between different methods is still very much an open issue. While most algorithms are tested …
Highlights
  • As the deep learning revolution helped us move away from hand-crafted features (Krizhevsky et al., 2012) and reach new heights (He et al., 2016; Szegedy et al., 2017), so does Neural Architecture Search (NAS) hold the promise of freeing us from hand-crafted architectures, which require tedious and expensive tuning for each new task or dataset
  • Using a simple metric—the relative improvement over the average architecture of the search space—we find that most Neural Architecture Search methods perform very similarly, and rarely substantially above this baseline
  • As a matter of fact, we found that all Neural Architecture Search methods neglect to report the time needed to optimize hyperparameters
  • Our full protocol uses several tricks which have been used in recent works (Xie et al., 2019b; Nayman et al., 2019): Auxiliary Towers (A), DropPath (D; Larsson & Shakhnarovich, 2017), Cutout (C; DeVries & Taylor, 2017), AutoAugment (AA; Cubuk et al., 2018), extended training for 1500 epochs (1500E), and an increased number of channels (50C). In between these two extremes (the base and the full protocol), by selectively enabling and disabling each component, we evaluated a further 8 intermediate training protocols (see the configuration sketch after this list)
  • In Figure 10 (Appendix A.3.1) we show similar results when training a ResNet-50 (He et al., 2016) with the same protocols
  • In this paper we have shown that, for many Neural Architecture Search methods, the search space has been engineered such that all architectures perform well and that their relative ranking can shift
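The configuration sketch referenced above: the training-protocol ablation can be written down as a set of flags per protocol. This is a minimal illustration under assumptions: the flag names mirror the listed components (A, D, C, AA, 1500E, 50C), while the base-protocol defaults (600 epochs, 36 channels) and the particular intermediate combinations shown are hypothetical, not the authors' exact settings.

    # Sketch (Python) of the training-protocol ablation as flag dictionaries.
    # Base defaults (600 epochs, 36 channels) are assumed DARTS-style values.
    BASE = dict(auxiliary_towers=False, drop_path=False, cutout=False,
                autoaugment=False, epochs=600, init_channels=36)
    FULL = dict(auxiliary_towers=True, drop_path=True, cutout=True,
                autoaugment=True, epochs=1500, init_channels=50)

    def protocol(**overrides):
        """Start from the base protocol and selectively enable components."""
        cfg = dict(BASE)
        cfg.update(overrides)
        return cfg

    # Illustrative intermediate protocols, enabling components one at a time.
    ablation = [
        protocol(auxiliary_towers=True),
        protocol(auxiliary_towers=True, drop_path=True),
        protocol(auxiliary_towers=True, drop_path=True, cutout=True),
        FULL,
    ]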
Summary
  • As the deep learning revolution helped us move away from hand-crafted features (Krizhevsky et al., 2012) and reach new heights (He et al., 2016; Szegedy et al., 2017), so does Neural Architecture Search (NAS) hold the promise of freeing us from hand-crafted architectures, which require tedious and expensive tuning for each new task or dataset.
  • To partly answer their second point, and understand how much the final accuracy depends on the specific architecture, we implement an in-depth study of the widely employed DARTS (Liu et al., 2019) search space and perform an ablation on the commonly used training techniques (e.g., Cutout, DropPath, AutoAugment).
  • Report mean and standard deviation of the top-1 test accuracy, obtained at the end of the augmentation stage, for both the randomly sampled and the searched architectures. Since both learned and randomly sampled architectures share the same search space and training protocol, calculating a relative improvement over this random baseline as RI = 100 × (Acc_m − Acc_r) / Acc_r can offer insights into the quality of the search strategy alone.
  • We use the following process: 1) sample 8 random architectures, 2) train them with different training protocols, and 3) report the mean, standard deviation and maximum of the top-1 test accuracy at the end of the training process (a summary-statistics sketch follows this list).
  • To better understand the results from the previous section, we sampled a considerable number of architectures (214) from the most commonly used search space (Liu et al., 2019) and fully trained them with the matching training protocol (Cutout+DropPath+Auxiliary Towers).
  • We sampled 56 architectures from this new search space and trained them with the DARTS training protocol (Cutout+DropPath+Auxiliary Towers), for fair comparison with the results from the previous section.
  • The second is that, if the lottery ticket hypothesis holds (so that specific sub-networks are better mainly due to their lucky initialization; Frankle & Carbin, 2018), then, together with our findings, this could be an additional reason why methods searching on a different number of cells than the final model struggle to significantly improve on the average randomly sampled architecture.
  • Search Space: it is difficult to evaluate the effectiveness of any given proposed method without a measure of how good randomly sampled architectures are.
  • The best solution for this is likely to test NAS algorithms on a battery of datasets, with different characteristics: image sizes, number of samples, class granularity and learning task.
  • Investigating hidden components: as our experiments in Sections 4 and 5.2 show, the DARTS search space is not only effective due to specific operations that are being chosen, but in greater part due to the overall macro-structure and the training protocol used.
  • In this paper we have shown that, for many NAS methods, the search space has been engineered such that all architectures perform well and that their relative ranking can shift.
  • We have provided some suggestions on how to make future research more robust to these issues.
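The summary-statistics sketch referenced in the three-step process above: the reported statistics for a batch of randomly sampled architectures can be computed with a few lines of NumPy. The function name and accuracy values below are hypothetical placeholders, not numbers from the paper.

    import numpy as np

    def summarize(test_accs):
        """Mean, standard deviation and maximum of top-1 test accuracy for a set
        of randomly sampled architectures trained under one protocol (step 3)."""
        a = np.asarray(test_accs, dtype=float)
        return {"mean": a.mean(), "std": a.std(ddof=1), "max": a.max()}

    # Hypothetical accuracies of 8 random architectures under one training protocol.
    print(summarize([96.2, 95.9, 96.4, 96.0, 96.1, 95.7, 96.3, 96.2]))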
Tables
  • Table 1: Relative improvement metric, RI = 100 × (Acc_m − Acc_r) / Acc_r (in %), where Acc_m and Acc_r are the accuracies of the search method and the random sampling baseline, respectively
  • Table 2: Hyperparameters for different NAS methods. “S” denotes the search stage and “A” the augmentation stage. For learning rates, a ↓ b means cosine annealing from a to b
Related work
  • As mentioned, NAS methods have the potential to truly revolutionize the field, but to do so it is crucial that future research avoids common mistakes. Some of these concerns have been recently raised by the community.

    For example, Li & Talwalkar (2019) highlight that most NAS methods a) fail to compare against an adequate baseline, such as a properly implemented random search strategy, b) are overly complex, with no ablation to properly assign credit to the important components, and c) fail to provide all details needed for successfully reproducing their results. In our paper we go one step further and argue that the relative improvement over the average (randomly sampled) architecture is a useful tool to quantify the effectiveness of a proposed solution and to compare it with competing methods. To partly answer their second point, and understand how much the final accuracy depends on the specific architecture, we implement an in-depth study of the widely employed DARTS (Liu et al., 2019) search space and perform an ablation on the commonly used training techniques (e.g., Cutout, DropPath, AutoAugment).
Funding
  • Proposes using a method’s relative improvement over the randomly sampled average architecture, which effectively removes advantages arising from expertly engineered search spaces or training protocols
  • Finds that most NAS methods perform very similarly, and rarely substantially above this baseline
Study subjects and analysis
datasets: 5
As such, and due to the under-use of ablation studies, there is a lack of clarity regarding why certain methods are more effective than others. Our first contribution is a benchmark of 8 NAS methods on 5 datasets. To overcome the hurdle of comparing methods with different search spaces, we propose using a method’s relative improvement over the randomly sampled average architecture, which effectively removes advantages arising from expertly engineered search spaces or training protocols
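To make the metric concrete, the relative improvement from Table 1 is a one-line computation; the sketch below (function name and accuracy values are hypothetical) simply restates the formula in Python.

    def relative_improvement(acc_method, acc_random):
        """RI = 100 * (Acc_m - Acc_r) / Acc_r, in %, as defined in Table 1."""
        return 100.0 * (acc_method - acc_random) / acc_random

    # Hypothetical accuracies: searched architecture vs. random-sampling baseline.
    print(relative_improvement(97.1, 96.8))  # ≈ 0.31% relative improvement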

well known CV datasets: 5
Our findings suggest that most of the gains in accuracy in recent contributions to NAS have come from manual improvements in the training protocol, not in the search algorithms. As a step towards understanding which methods are more effective, we have collected code for 8 reasonably fast (search time of less than 4 days) NAS algorithms, and benchmarked them on 5 well known CV datasets. Using a simple metric—the relative improvement over the average architecture of the search space—we find that most NAS methods perform very similarly and rarely substantially above this baseline

datasets: 5
Thus, we further confirm the authors’ claim, showing that indeed the average architecture performs extremely well and that how you train a model has more impact than any specific architecture. In this section we present a systematic evaluation of 8 methods on 5 datasets using a strategy that is designed to reveal the quality of each method’s search strategy, removing the effect of the manually engineered training protocol and search space. The goal is to find general trends and highlight common features rather than just pin-pointing the most accurate algorithm

datasets: 5
3.2 RESULTS. Figure 1 shows the evaluation results on the 5 datasets, from which we draw two main conclusions. First, the improvements over random sampling tend to be small

samples: 32
[Figure residue: panels plotting accuracy against the number of cells (x-axis 4–24); Kendall-tau correlation between rankings: 0.48 and 0.54 (32 samples each).]
5.3 DOES CHANGING SEED AND NUMBER OF CELLS AFFECT RANKING?
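For reference, the Kendall-tau values quoted in the figure snippet above are rank correlations between how the same set of architectures rank under two settings (e.g., different seeds or numbers of cells). A minimal sketch of how such a correlation could be computed with SciPy, using placeholder accuracies rather than values from the paper:

    from scipy.stats import kendalltau

    # Hypothetical top-1 accuracies of the same five architectures at two cell counts.
    acc_at_8_cells = [96.1, 95.8, 96.3, 95.9, 96.0]
    acc_at_20_cells = [97.0, 96.7, 96.9, 96.8, 96.6]

    tau, p_value = kendalltau(acc_at_8_cells, acc_at_20_cells)
    print(f"Kendall-tau: {tau:.2f} (p = {p_value:.3f})")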
