# Few-Shot Learning via Learning the Representation, Provably

ICLR, 2021.

Abstract:

This paper studies few-shot learning via representation learning, where one uses $T$ source tasks with $n_1$ data per task to learn a representation in order to reduce the sample complexity of a target task for which there is only $n_2 (\ll n_1)$ data. Specifically, we focus on the setting where there exists a good common representation ...

Introduction

- A popular scheme for few-shot learning, i.e., learning in a data scarce environment, is representation learning, where one first learns a feature extractor, or representation, e.g., the last layer of a convolutional neural network, from different but related source tasks, and uses a simple predictor on top of this representation in the target task.
- The authors study the setting where there exists a common well-specified low-dimensional representation in source and target tasks, and obtain a risk bound on the target task with a representation-learning term of order $\frac{dk}{n_1 T}$.
- Maurer et al (2016) and follow-up work gave analyses on the benefit of representation learning for reducing the sample complexity of the target task.
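
To get a feel for the $dk/(n_1 T)$ quantity mentioned above, the following sketch plugs in some made-up problem sizes. It is an illustration only: constants and logarithmic factors are ignored, the numbers are hypothetical, and the `k / n2` and `d / n2` terms are the classical orders for least squares in $k$ and $d$ dimensions respectively, included here as an assumption for comparison rather than taken from this summary.

```python
def source_side_term(d, k, n1, T):
    """Order of the representation-learning term d*k / (n1*T) (constants ignored)."""
    return d * k / (n1 * T)


def least_squares_order(dim, n):
    """Classical dim / n order for least squares with n samples in `dim` dimensions."""
    return dim / n


# Hypothetical sizes: ambient dimension d, representation dimension k,
# T source tasks with n1 samples each, and n2 target samples.
d, k, n1, T, n2 = 100, 5, 1000, 50, 50

source_term = source_side_term(d, k, n1, T)   # cost of learning the representation
target_term = least_squares_order(k, n2)      # cost of fitting k weights on the target
direct_term = least_squares_order(d, n2)      # cost of fitting d weights on the target alone

print(source_term, target_term, direct_term)  # 0.01 0.1 2.0
```

With these sizes the source-side term is small because all $n_1 T$ source samples are pooled, so the target task only pays for $k$ parameters instead of $d$.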

Highlights

- The hope is that the learned representation captures the common structure across tasks, which makes a linear predictor sufficient for the target task
- It is not guaranteed that the representation found will be useful for the target task unless one makes some assumptions to characterize the connections between different tasks
- In Section 6, we present our result for representation learning in neural networks
- We gave the first statistical analysis showing that representation learning can fully exploit all data points from source tasks to enable few-shot learning on a target task
- There are many important directions to pursue in representation learning and few-shot learning

Results

- The concurrent work of Tripuraneni et al (2020) studies low-dimensional linear representation learning and obtains a result similar to the authors' in this case, but it assumes isotropic inputs for all tasks, which is a special case of the authors' result.
- The authors use different linear predictors on top of a common representation function $\phi$ to model the input-output relations in different source tasks.
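- As described in Section 3, the authors assume that all $T + 1$ tasks share a common ground-truth representation specified by a matrix $B^* \in \mathbb{R}^{d \times k}$ such that a sample $(x, y) \sim \mu_t$ satisfies $x \sim p_t$ and $y = (w_t^*)^\top (B^*)^\top x + z$, where $w_t^* \in \mathbb{R}^k$ is the task-specific linear predictor and $z \sim \mathcal{N}(0, \sigma^2)$ is independent of $x$.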
- Under Assumptions 4.1, 4.2, 4.3 and 4.4, the authors further assume $2k \le \min\{d, T\}$ and that the sample sizes in the source and target tasks satisfy $n_1$ ...
- With probability at least $1 - \delta$ over the samples, the expected excess risk of the learned predictor $x \mapsto \hat{w}_{T+1}^\top \hat{B}^\top x$ on the target task satisfies ...
- Similar to the setting in Section 4, the authors assume the target task data is subgaussian as in Assumption 4.1.
- With probability at least $1 - \delta$ over the samples, the expected excess risk of the learned predictor $x \mapsto (B\lambda)\, x$ on the target task satisfies: $\mathbb{E}_{\theta^*_{T+1} \sim \nu}[\mathrm{ER}(\hat{B}, \hat{w}_{T+1})] \le \sigma R \cdot O(\ldots)$
- With probability at least $1 - \delta$ over the samples, the expected excess risk of the learned predictor $x \mapsto \hat{w}_{T+1}^\top (\hat{B}^\top x)_+$ on the target task satisfies: $\mathbb{E}_{B^*_{T+1} w^*_{T+1} \sim \nu}[\mathrm{ER}(f_{\hat{B}, \hat{w}_{T+1}})] \le \sigma R \cdot O(\ldots)$, a bound that depends on $n_1 T$
- To highlight the advantage of representation learning, the authors compare to training a neural network with weight decay directly on the target task: $(\hat{B}, \hat{w}) = \arg\min \ldots$
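
The two-stage scheme in the linear setting above (learn a shared representation from the source tasks, then fit a small linear predictor on the target) can be sketched on synthetic data. This is a minimal illustration under stated assumptions, not the authors' estimator: the representation is estimated here by a simple swapped-in spectral method (per-task least squares followed by an SVD) rather than by joint empirical risk minimization, inputs are isotropic Gaussian, and every name and problem size (`simulate`, `d`, `k`, `T`, `n1`, `n2`, `sigma`) is hypothetical.

```python
import numpy as np


def simulate(d=50, k=3, T=40, n1=200, n2=20, sigma=0.1, seed=0):
    rng = np.random.default_rng(seed)

    # Ground truth: shared B* (d x k, orthonormal columns) and per-task w_t*.
    B_star, _ = np.linalg.qr(rng.standard_normal((d, k)))
    W_star = rng.standard_normal((k, T + 1))  # last column is the target task

    def sample(t, n):
        # y = (w_t*)^T (B*)^T x + z with isotropic Gaussian x (an assumption).
        X = rng.standard_normal((n, d))
        y = X @ (B_star @ W_star[:, t]) + sigma * rng.standard_normal(n)
        return X, y

    # Stage 1 (swapped-in spectral estimator): per-task least squares,
    # then the top-k left singular vectors of the stacked estimates.
    Theta_hat = np.column_stack(
        [np.linalg.lstsq(*sample(t, n1), rcond=None)[0] for t in range(T)]
    )
    U, _, _ = np.linalg.svd(Theta_hat, full_matrices=False)
    B_hat = U[:, :k]

    # Stage 2: k-dimensional least squares on the n2 target samples.
    X_tgt, y_tgt = sample(T, n2)
    w_hat = np.linalg.lstsq(X_tgt @ B_hat, y_tgt, rcond=None)[0]
    theta_rep = B_hat @ w_hat

    # Baseline: minimum-norm least squares directly in d dimensions (n2 < d).
    theta_direct = np.linalg.lstsq(X_tgt, y_tgt, rcond=None)[0]

    # With identity input covariance, squared parameter error equals the
    # expected squared prediction error up to the normalization of the risk.
    theta_star = B_star @ W_star[:, T]
    err = lambda theta: float(np.sum((theta - theta_star) ** 2))
    return err(theta_rep), err(theta_direct)


rep_err, direct_err = simulate()
print(f"with learned representation: {rep_err:.4f}")
print(f"target data only:            {direct_err:.4f}")
```

In this toy setup the target task has fewer samples than ambient dimensions, so the direct baseline cannot identify the predictor, while the two-stage predictor only needs to fit `k` weights.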

Conclusion

- The authors gave the first statistical analysis showing that representation learning can fully exploit all data points from source tasks to enable few-shot learning on a target task.
- The authors' results in Sections 5 and 6 indicate that explicit low dimensionality is not necessary, and norm-based capacity control forces the classifier to learn good representations.
- Further questions include whether this is a general phenomenon in all deep learning models, whether other capacity control can be applied, and how to optimize to attain good representations.
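
One standard way to see why norm-based capacity control can favor low-dimensional structure is the known identity that, over all factorizations $\Theta = BW$, the smallest value of $\frac{1}{2}(\|B\|_F^2 + \|W\|_F^2)$ equals the nuclear norm of $\Theta$ (see Srebro and Shraibman, 2005, in the reference list). The sketch below checks this numerically for one balanced factorization. Whether this matches the exact regularizer used in Sections 5 and 6 is an assumption; the matrix sizes and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# A rank-3 matrix Theta of shape 8 x 6, playing the role of stacked task predictors.
Theta = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 6))

# Balanced factorization from the SVD: B = U sqrt(S), W = sqrt(S) V^T.
U, s, Vt = np.linalg.svd(Theta, full_matrices=False)
B = U * np.sqrt(s)               # scales column j of U by sqrt(s_j)
W = np.sqrt(s)[:, None] * Vt     # scales row j of V^T by sqrt(s_j)

nuclear_norm = s.sum()
balanced_penalty = 0.5 * (np.sum(B**2) + np.sum(W**2))

# Rescaling the factors keeps the product but increases the penalty,
# so weight decay on (B, W) prefers the balanced factorization.
c = 3.0
unbalanced_penalty = 0.5 * (np.sum((c * B) ** 2) + np.sum((W / c) ** 2))

print(nuclear_norm, balanced_penalty, unbalanced_penalty)
```

The balanced penalty coincides with the sum of singular values, which is why penalizing the Frobenius norms of both factors acts like a nuclear-norm penalty on the product.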

Related work

- The idea of multitask representation learning dates back at least to Caruana (1997), Thrun and Pratt (1998), and Baxter (2000). Empirically, representation learning has proven powerful in various domains; see Bengio et al (2013) for a survey. In particular, representation learning is widely adopted for few-shot learning tasks (Sun et al, 2017; Goyal et al, 2019). Representation learning is also closely connected to meta-learning (Schaul and Schmidhuber, 2010). Recent work by Raghu et al (2019) suggested empirically that the effectiveness of the popular meta-learning algorithm Model Agnostic Meta-Learning (MAML) is due to its ability to learn a useful representation. The scheme analyzed in this paper is closely related to the meta-learning methods of Lee et al (2019) and Bertinetto et al (2018).

Reference

- Pierre Alquier, The Tien Mai, and Massimiliano Pontil. Regret bounds for lifelong learning. arXiv preprint arXiv:1610.08628, 2016.
- Sanjeev Arora, Hrishikesh Khandeparkar, Mikhail Khodak, Orestis Plevrakis, and Nikunj Saunshi. A theoretical analysis of contrastive unsupervised representation learning. In Proceedings of the 36th International Conference on Machine Learning, 2019.
- Jonathan Baxter. A model of inductive bias learning. J. Artif. Int. Res., 2000.
- Yoshua Bengio, Nicolas L Roux, Pascal Vincent, Olivier Delalleau, and Patrice Marcotte. Convex neural networks. In Advances in neural information processing systems, pages 123–130, 2006.
- Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
- Luca Bertinetto, Joao F Henriques, Philip HS Torr, and Andrea Vedaldi. Meta-learning with differentiable closed-form solvers. arXiv preprint arXiv:1805.08136, 2018.
- Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, Jul 1997. ISSN 1573-0565. doi: 10.1023/A:1007379606734. URL https://doi.org/10.1023/A:1007379606734.
- Giulia Denevi, Carlo Ciliberto, Dimitris Stamos, and Massimiliano Pontil. Incremental learning-to-learn with statistical guarantees. arXiv preprint arXiv:1803.08089, 2018.
- Giulia Denevi, Carlo Ciliberto, Riccardo Grazzi, and Massimiliano Pontil. Learning-to-learn stochastic gradient descent with biased regularization. In Proceedings of the 36th International Conference on Machine Learning, 2019.
- Chelsea Finn, Aravind Rajeswaran, Sham Kakade, and Sergey Levine. Online meta-learning. In Proceedings of the 36th International Conference on Machine Learning, 2019.
- Tomer Galanti, Lior Wolf, and Tamir Hazan. A theoretical framework for deep transfer learning. Information and Inference: A Journal of the IMA, 5(2):159–209, 2016.
- Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points − online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory, pages 797–842, 2015.
- Priya Goyal, Dhruv Mahajan, Abhinav Gupta, and Ishan Misra. Scaling and benchmarking self-supervised visual representation learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 6391–6400, 2019.
- Benjamin Haeffele, Eric Young, and Rene Vidal. Structured low-rank matrix factorization: Optimality, algorithm, and applications to image processing. In International conference on machine learning, pages 2007–2015, 2014.
- Daniel Hsu, Sham Kakade, Tong Zhang, et al. A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability, 17, 2012a.
- Daniel Hsu, Sham M Kakade, and Tong Zhang. Random design analysis of ridge regression. In Conference on learning theory, pages 9–1, 2012b.
- Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning, pages 1724–1732, 2017.
- Mikhail Khodak, Maria-Florina Balcan, and Ameet Talwalkar. Adaptive gradient-based meta-learning methods. arXiv preprint arXiv:1906.02717, 2019.
- Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In Conference on Learning Theory, pages 1246–1257, 2016.
- Kwonjoon Lee, Subhransu Maji, Avinash Ravichandran, and Stefano Soatto. Meta-learning with differentiable convex optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10657–10665, 2019.
- Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes. The benefit of multitask representation learning. The Journal of Machine Learning Research, 17(1):2853–2884, 2016.
- Daniel McNamara and Maria-Florina Balcan. Risk bounds for transferring representations with and without fine-tuning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 2373–2381. JMLR.org, 2017.
- Aniruddh Raghu, Maithra Raghu, Samy Bengio, and Oriol Vinyals. Rapid learning or feature reuse? Towards understanding the effectiveness of MAML. arXiv preprint arXiv:1909.09157, 2019.
- Tom Schaul and Jürgen Schmidhuber. Metalearning. Scholarpedia, 5(6):4650, 2010.
- Nathan Srebro and Adi Shraibman. Rank, trace-norm and max-norm. In International Conference on ... Springer, 2005.
- Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE International Conference on Computer Vision, pages 843–852, 2017.
- Sebastian Thrun and Lorien Pratt. Learning to Learn: Introduction and Overview, pages 3–17. Springer US, Boston, MA, 1998. ISBN 978-1-4615-5529-2. doi: 10.1007/978-1-4615-5529-2_1. URL https://doi.org/10.1007/978-1-4615-5529-2_1.
- Nilesh Tripuraneni, Chi Jin, and Michael I Jordan. Provable meta-learning of linear representations. arXiv preprint arXiv:2002.11684, 2020.
- Joel A Tropp et al. An introduction to matrix concentration inequalities. Foundations and Trends® in Machine Learning, 8(1-2):1–230, 2015.
- Roman Vershynin. Four lectures on probabilistic methods for data science. https://arxiv.org/pdf/1612.06661.pdf, 2017.
- Colin Wei, Jason D Lee, Qiang Liu, and Tengyu Ma. Regularization matters: Generalization and optimization of neural nets vs their induced kernel. In Advances in Neural Information Processing Systems, pages 9709–9721, 2019.
