Modular Meta-Learning with Shrinkage

NeurIPS 2020.

Keywords:
deep network, model-agnostic meta-learning, new task, long adaptation, mean opinion score

Abstract:

The modular nature of deep networks allows some components to learn general features, while others learn more task-specific features. When a deep model is then fine-tuned on a new task, each component adapts differently. For example, the input layers of an image classification convnet typically adapt very little, while the output layers …

Introduction
  • The goal of meta-learning is to extract shared knowledge from a large set of training tasks to solve held-out tasks more efficiently.
  • Reusing or repurposing modules can reduce overfitting in low-data regimes, improve interpretability, and facilitate the deployment of large multi-task models on limited-resource devices as parameter sharing allows for significant savings in memory.
  • These considerations are important in domains such as few-shot text-to-speech synthesis (TTS), characterized by large speaker-adaptable models, limited training data for speaker adaptation, and long adaptation horizons.
  • The authors would like to automatically learn the smallest set of modules needed to adapt to a new speaker and allow those to adapt for as long as needed
Highlights
  • The goal of meta-learning is to extract shared knowledge from a large set of training tasks to solve held-out tasks more efficiently
  • Reusing or repurposing modules can reduce overfitting in low-data regimes, improve interpretability, and facilitate the deployment of large multi-task models on limited-resource devices as parameter sharing allows for significant savings in memory
  • These considerations are important in domains such as few-shot text-to-speech synthesis (TTS), characterized by large speaker-adaptable models, limited training data for speaker adaptation, and long adaptation horizons
  • We show in Appendix B.2 that the meta update for φ is equivalent to that of implicit MAML (iMAML) when σ_m^2 is constant for all m, and refer to this more general method as σ-iMAML (see the sketch after this list)
  • This paper presents a general meta-learning technique to automatically identify task-specific modules in a model for few-shot machine learning problems
  • Learning and using our shrinkage prior helps prevent overfitting and improves performance in low-data, long-adaptation regimes
  • General practitioners who cannot afford to collect a large amount of labeled data would be able to take advantage of a pre-trained generic meta-model, and adapt its task-specific components for a new task based on limited data
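The sketch below spells out, in notation assumed by this summary rather than quoted from the paper, the per-module shrinkage prior that these highlights refer to: every module m of the network draws its task-specific parameters from a Gaussian centred on the shared meta-parameters, and task adaptation minimizes the task loss plus the corresponding penalty.

    \theta_{t,m} \sim \mathcal{N}\!\left(\phi_m,\ \sigma_m^2 I\right), \qquad m = 1, \dots, M
    \theta_t^{*} = \arg\min_{\theta_t}\ \ell_t(\theta_t) + \sum_{m=1}^{M} \frac{1}{2\sigma_m^2} \big\lVert \theta_{t,m} - \phi_m \big\rVert^2

A module with a small learned σ_m^2 is pinned to the meta-parameters and effectively shared across tasks, while a large σ_m^2 lets that module adapt freely; this is how task-specific modules are identified. When σ_m^2 is the same constant for every m, the penalty collapses to the single proximal term of iMAML, hence the name σ-iMAML for the more general method.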
Results
  • Learning and using the shrinkage prior helps prevent overfitting and improves performance in low-data, long-adaptation regimes.
Conclusion
  • The authors answer all three experimental questions in the affirmative
  • In both image classification and text-to-speech, the learned shrinkage priors correspond to meaningful and interesting task-specific modules.
  • This paper presents a general meta-learning technique to automatically identify task-specific modules in a model for few-shot machine learning problems
  • It reduces the requirement for domain knowledge to hand-design task-specific architectures, and can have a positive societal impact by democratizing machine learning techniques.
  • An example application is to adapt a multilingual text-to-speech model to a low-resource language or dialect for minority ethnic groups
Summary
  • Introduction:

    The goal of meta-learning is to extract shared knowledge from a large set of training tasks to solve held-out tasks more efficiently.
  • Reusing or repurposing modules can reduce overfitting in low-data regimes, improve interpretability, and facilitate the deployment of large multi-task models on limited-resource devices as parameter sharing allows for significant savings in memory.
  • These considerations are important in domains such as few-shot text-to-speech synthesis (TTS), characterized by large speaker-adaptable models, limited training data for speaker adaptation, and long adaptation horizons.
  • The authors would like to automatically learn the smallest set of modules needed to adapt to a new speaker and allow those to adapt for as long as needed
  • Objectives:

    The authors' goal is to minimize the average negative predictive log-likelihood over T validation tasks (see the formula following this summary).
  • Results:

    Learning and using the shrinkage prior helps prevent overfitting and improves performance in low-data, long-adaptation regimes.
  • Conclusion:

    The authors answer all three experimental questions in the affirmative
  • In both image classification and text-to-speech, the learned shrinkage priors correspond to meaningful and interesting task-specific modules.
  • This paper presents a general meta-learning technique to automatically identify task-specific modules in a model for few-shot machine learning problems
  • It reduces the requirement for domain knowledge to hand-design task-specific architectures, and can have a positive societal impact by democratizing machine learning techniques.
  • An example application is to adapt a multilingual text-to-speech model to a low-resource language or dialect for minority ethnic groups
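To make the Objectives bullet concrete, the formula below writes the meta-training criterion in assumed notation: the shared meta-parameters φ and module-level variances σ are chosen so that models adapted on each task's training data predict that task's validation data well, with the task-level posterior approximated by the adapted (MAP) parameters θ_t^*.

    \min_{\phi,\,\sigma}\ \frac{1}{T} \sum_{t=1}^{T} -\log p\!\left(\mathcal{D}_t^{\mathrm{val}} \mid \mathcal{D}_t^{\mathrm{train}}, \phi, \sigma\right) \;\approx\; \frac{1}{T} \sum_{t=1}^{T} -\log p\!\left(\mathcal{D}_t^{\mathrm{val}} \mid \theta_t^{*}(\phi, \sigma)\right)

Different approximations to this predictive likelihood give the algorithm variants compared in Table 1 below.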
Tables
  • Table1: The above algorithms result from different approximations to the predictive likelihood
  • Table2: Average test accuracy and 95% confidence intervals for 10 runs on large-data augmented Omniglot. Highest accuracy in bold
  • Figure3: Learned σ^2 of WaveNet modules
  • Table3: Mean opinion score of sample naturalness from utterances (higher is better)
  • Table4: Hyperparameters for the large-data augmented Omniglot classification experiment
  • Table5: Hyperparameters for the small-data augmented Omniglot experiment
  • Table6: Task optimizer learning rate for different numbers of training instances per character class in small-data augmented Omniglot
  • Table7: Hyperparameters for the few-shot sinusoid regression experiment
  • Table8: Hyperparameters for the few-shot Omniglot classification
  • Table9: Hyperparameters for the few-shot miniImageNet classification experiment
  • Table10: Test accuracy on few-shot Omniglot and few-shot miniImageNet. For each algorithm, we report the mean and 95% confidence interval over 10 different runs. For each pair of corresponding methods, we bold the entry with highest mean accuracy
Related work
  • Multiple Bayesian meta-learning approaches have been proposed to either provide model uncertainty in few-shot learning [22,23,24,25] or to provide a probabilistic interpretation and extend existing non-Bayesian works [26,27,28]. However, to the best of our knowledge, none of these account for modular structure in their formulation. While we use point estimates of variables for computational reasons, more sophisticated inference methods from these works can also be used within our framework.

    Modular meta-learning approaches based on MAML-style backpropagation through short task adaptation horizons have also been proposed. The most relevant of these, Alet et al. [29], proposes to learn a modular network architecture, whereas our work identifies the adaptability of each module. In other work, Zintgraf et al. [16] hand-design the task-specific and shared parameters, and the M-Net in Lee and Choi [14] provides an alternative method for learning adaptable modules by sampling binary mask variables. In all of the above, however, backpropagating through task adaptation is computationally prohibitive when applied to problems that require longer adaptation horizons. While it is worth investigating how to extend these works to this setting, we leave this for future work.
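To make the long-horizon point concrete, here is a minimal, self-contained sketch (plain NumPy, with hypothetical module names and hyperparameters; not the authors' implementation) of task adaptation under a per-module shrinkage prior. Each iteration takes a gradient step on the task loss and then applies the exact proximal update for the Gaussian penalty, in the spirit of forward-backward splitting (Singer and Duchi), so the loop can run for as many steps as the task needs without ever backpropagating through it.

    # Illustrative sketch, not the authors' implementation: long-horizon task
    # adaptation under a per-module Gaussian shrinkage prior. Modules with a tiny
    # variance stay at the meta-parameters (effectively shared); modules with a
    # large variance are free to specialise to the task.
    import numpy as np

    def adapt_task(phi, sigma2, grad_loss, steps=1000, lr=1e-2):
        """MAP task adaptation; phi and sigma2 map module names to arrays / scalars."""
        theta = {m: p.copy() for m, p in phi.items()}
        for _ in range(steps):
            g = grad_loss(theta)
            for m in theta:
                v = theta[m] - lr * g[m]  # gradient step on the task loss
                # exact proximal step for ||theta_m - phi_m||^2 / (2 * sigma2_m)
                theta[m] = (sigma2[m] * v + lr * phi[m]) / (sigma2[m] + lr)
        return theta

    # Toy task whose loss pulls every module one unit away from the meta-parameters.
    rng = np.random.default_rng(0)
    phi = {"encoder": rng.normal(size=3), "head": rng.normal(size=2)}  # hypothetical modules
    sigma2 = {"encoder": 1e-4, "head": 10.0}  # encoder ~ shared, head ~ task-specific
    target = {m: p + 1.0 for m, p in phi.items()}
    grad_loss = lambda th: {m: th[m] - target[m] for m in th}
    adapted = adapt_task(phi, sigma2, grad_loss)
    print({m: round(float(np.abs(adapted[m] - phi[m]).mean()), 3) for m in adapted})
    # -> encoder moves ~0.000, head moves ~0.909: only the high-variance module adapts.

The proximal step is used here only to keep the toy loop stable when a variance is very small; plain gradient descent on the penalized objective with a suitably small learning rate behaves the same way.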
Reference
  • Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.
  • Aravind Rajeswaran, Chelsea Finn, Sham M Kakade, and Sergey Levine. Meta-learning with implicit gradients. In Advances in Neural Information Processing Systems, pages 113–124, 2019.
  • Sebastian Flennerhag, Pablo G. Moreno, Neil D. Lawrence, and Andreas Damianou. Transferring knowledge across learning processes. In International Conference on Learning Representations, 2019.
  • Sebastian Flennerhag, Andrei A Rusu, Razvan Pascanu, Francesco Visin, Hujun Yin, and Raia Hadsell. Meta-learning with warped gradient descent. In International Conference on Learning Representations, 2019.
  • Sercan Arik, Jitong Chen, Kainan Peng, Wei Ping, and Yanqi Zhou. Neural voice cloning with a few samples. In Advances in Neural Information Processing Systems, pages 10019–10029, 2018.
  • Yutian Chen, Yannis Assael, Brendan Shillingford, David Budden, Scott Reed, Heiga Zen, Quan Wang, Luis C. Cobo, Andrew Trask, Ben Laurie, Caglar Gulcehre, Aaron van den Oord, Oriol Vinyals, and Nando de Freitas. Sample efficient adaptive text-to-speech. In International Conference on Learning Representations, 2019.
  • Ye Jia, Yu Zhang, Ron Weiss, Quan Wang, Jonathan Shen, Fei Ren, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, and Yonghui Wu. Transfer learning from speaker verification to multispeaker text-to-speech synthesis. In Advances in Neural Information Processing Systems, pages 4480–4490, 2018.
  • Yaniv Taigman, Lior Wolf, Adam Polyak, and Eliya Nachmani. Voiceloop: Voice fitting and synthesis via a phonological loop. In International Conference on Learning Representations, 2018.
  • Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press, 2017.
  • Giambattista Parascandolo, Niki Kilbertus, Mateo Rojas-Carulla, and Bernhard Schölkopf. Learning independent causal mechanisms. In International Conference on Machine Learning, pages 4036–4044, 2018.
  • Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.
  • Yoshua Bengio, Tristan Deleu, Nasim Rahaman, Nan Rosemary Ke, Sebastien Lachapelle, Olexa Bilaniuk, Anirudh Goyal, and Christopher Pal. A meta-transfer objective for learning to disentangle causal mechanisms. In International Conference on Learning Representations, 2020.
  • Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135, 2017.
  • Yoonho Lee and Seungjin Choi. Gradient-based meta-learning with learned layerwise metric and subspace. In International Conference on Machine Learning, 2018.
  • Eunbyung Park and Junier B Oliva. Meta-curvature. In Advances in Neural Information Processing Systems, pages 3309–3319, 2019.
  • Luisa M. Zintgraf, Kyriacos Shiarlis, Vitaly Kurin, Katja Hofmann, and Shimon Whiteson. Fast context adaptation via meta-learning. In International Conference on Machine Learning, 2019.
  • Aniruddh Raghu, Maithra Raghu, Samy Bengio, and Oriol Vinyals. Rapid learning or feature reuse? towards understanding the effectiveness of MAML. arXiv preprint arXiv:1909.09157, 2019.
  • Andrei A. Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Meta-learning with latent embedding optimization. In International Conference on Learning Representations, 2019.
  • Sébastien MR Arnold, Shariq Iqbal, and Fei Sha. Decoupling adaptation from modeling with metaoptimizers for meta learning. arXiv preprint arXiv:1910.13603, 2019.
  • Andrew Gelman, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari, and Donald B Rubin. Bayesian Data Analysis. Chapman and Hall/CRC, 2013.
  • Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
  • Sachin Ravi and Alex Beatson. Amortized Bayesian meta-learning. In International Conference on Learning Representations, 2019.
  • Harrison Edwards and Amos Storkey. Towards a neural statistician. In International Conference on Learning Representations, 2017.
  • Marta Garnelo, Dan Rosenbaum, Christopher Maddison, Tiago Ramalho, David Saxton, Murray Shanahan, Yee Whye Teh, Danilo Rezende, and S. M. Ali Eslami. Conditional neural processes. In International Conference on Machine Learning, pages 1704–1713, 2018.
  • Jonathan Gordon, John Bronskill, Matthias Bauer, Sebastian Nowozin, and Richard E. Turner. Metalearning probabilistic inference for prediction. In International Conference on Learning Representations, 2019.
  • Erin Grant, Chelsea Finn, Sergey Levine, Trevor Darrell, and Thomas Griffiths. Recasting gradient-based meta-learning as hierarchical Bayes. In International Conference on Learning Representations, 2018.
  • Jaesik Yoon, Taesup Kim, Ousmane Dia, Sungwoong Kim, Yoshua Bengio, and Sungjin Ahn. Bayesian model-agnostic meta-learning. In Conference on Neural Information Processing Systems, pages 7332–7342, 2018.
  • Chelsea Finn, Kelvin Xu, and Sergey Levine. Probabilistic model-agnostic meta-learning. In Conference on Neural Information Processing Systems, pages 9516–9527, 2018.
  • Ferran Alet, Tomás Lozano-Pérez, and Leslie P Kaelbling. Modular meta-learning. In 2nd Conference on Robot Learning, 2018.
  • Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. Meta-SGD: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835, 2017.
  • Hae Beom Lee, Hayeon Lee, Donghyun Na, Saehoon Kim, Minseop Park, Eunho Yang, and Sung Ju Hwang. Learning to balance: Bayesian meta-learning for imbalanced and out-of-distribution tasks. In International Conference on Learning Representations, 2020.
  • Mikhail Khodak, Maria-Florina F Balcan, and Ameet S Talwalkar. Adaptive gradient-based meta-learning methods. In Advances in Neural Information Processing Systems, pages 5915–5926, 2019.
  • Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In Conference on Neural Information Processing Systems, pages 3981–3989, 2016.
  • Yuhuai Wu, Mengye Ren, Renjie Liao, and Roger Grosse. Understanding short-horizon bias in stochastic meta-optimization. International Conference on Learning Representations, 2018.
  • James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 2017.
  • Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In International Conference on Machine Learning, 2017.
  • Giulia Denevi, Carlo Ciliberto, Riccardo Grazzi, and Massimiliano Pontil. Learning-to-learn stochastic gradient descent with biased regularization. In International Conference on Machine Learning, pages 1566–1575, 2019.
  • Giulia Denevi, Carlo Ciliberto, Dimitris Stamos, and Massimiliano Pontil. Learning to learn around a common mean. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, pages 10169–10179, 2018.
  • Pan Zhou, Xiaotong Yuan, Huan Xu, Shuicheng Yan, and Jiashi Feng. Efficient meta learning via minibatch proximal update. In Advances in Neural Information Processing Systems, pages 1532–1542, 2019.
  • Michael E Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1(Jun):211–244, 2001.
  • Edwin Fong and Chris Holmes. On the marginal likelihood and cross-validation. Biometrika, to appear. arXiv preprint arXiv:1905.08737, 2019.
  • Yoshua Bengio. Gradient-based optimization of hyperparameters. Neural computation, 12(8):1889–1900, 2000.
  • Fabian Pedregosa. Hyperparameter optimization with approximate gradient. In International Conference on Machine Learning, pages 737–746, 2016.
  • Jonathan Lorraine, Paul Vicol, and David Duvenaud. Optimizing millions of hyperparameters by implicit differentiation. arXiv preprint arXiv:1911.02590, 2019.
  • Yoram Singer and John C Duchi. Efficient learning using forward-backward splitting. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems, pages 495–503. Curran Associates, Inc., 2009.
  • Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2014.
  • Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno. Generalized end-to-end loss for speaker verification. In International Conference on Acoustics, Speech, and Signal Processing, pages 4879–4883. IEEE, 2018.
  • Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
  • Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning, pages 1842–1850, 2016.
  • Oriol Vinyals, Charles Blundell, Timothy P. Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning. In Conference on Neural Information Processing Systems, pages 3630–3638, 2016.
  • Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations, 2017.
  • Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.