# Impact of Representation Learning in Linear Bandits

ICLR, 2021.

Abstract:

We study how representation learning can improve the efficiency of bandit problems. We study the setting where we play T linear bandits with dimension d concurrently, and these T bandit tasks share a common k (≪ d) dimensional linear representation. For the finite-action setting, we present a new algorithm which achieves Õ(T√(kN) + √(dkNT)) regret...

Introduction

- This paper investigates the benefit of using representation learning for sequential decision-making problems.
- Representation learning has become a popular approach for improving sample efficiency across various machine learning tasks (Bengio et al, 2013).
- Many sequential decision-making tasks share the same environment but have different reward functions.
- While representation learning is already widely applied in sequential decision-making problems empirically, its theoretical foundation is still limited.
- One important problem remains open: When does representation learning provably improve efficiency of sequential decision-making problems?

Highlights

- This paper investigates the benefit of using representation learning for sequential decision-making problems.
- We initiate the study of the benefits of representation learning in bandits.
- We propose new algorithms and demonstrate that, in multi-task linear bandits, if all tasks share a common linear feature extractor, representation learning provably reduces the regret.
- We leave it as an open problem to develop an algorithm with an Õ(T√(kN) + √(dkNT)) upper bound, or to show that this bound is not achievable, in the adversarial-contexts setting.
- One central challenge for the upper bound is that existing analyses for multi-task representation learning require i.i.d. inputs even in the supervised learning setting.
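
To make the scaling of the Õ(T√(kN) + √(dkNT)) bound concrete, the sketch below compares it with the baseline of running a standard finite-action linear bandit algorithm independently on each task, which has regret of order T√(dN) up to logarithmic factors. It assumes N is the number of rounds played per task, drops all constants and logarithmic factors, and uses hypothetical function names, so the numbers only illustrate orders of magnitude.

```python
import math


def shared_representation_bound(T, N, d, k):
    """Order of the multi-task bound: T*sqrt(kN) + sqrt(dkNT), no constants or logs."""
    return T * math.sqrt(k * N) + math.sqrt(d * k * N * T)


def independent_tasks_bound(T, N, d):
    """Order of the baseline that solves each of the T tasks separately: T*sqrt(dN)."""
    return T * math.sqrt(d * N)


def bound_ratio(T, N, d, k):
    """Ratio of the two orders; values below 1 favour the shared representation."""
    return shared_representation_bound(T, N, d, k) / independent_tasks_bound(T, N, d)


# Illustrative (assumed) problem sizes: ambient dimension d, representation size k,
# and N rounds per task.
d, k, N = 100, 2, 10_000
ratios = {T: bound_ratio(T, N, d, k) for T in (1, 10, 100, 1000)}
for T, ratio in ratios.items():
    print(f"T={T:5d}  ratio={ratio:.3f}")
```

Algebraically the ratio simplifies to √(k/d) + √(k/T), so it exceeds 1 when T is small and approaches √(k/d) as T grows. This is consistent with the observation that representation learning needs enough tasks before it helps.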

Methods

**METHOD-OF-MOMENTS ESTIMATOR UNDER BANDIT SETTING**

- The following theorem shows the guarantee of the method-of-moments (MoM) estimator the authors used to find the linear feature extractor B.
- Theorem 5 (MoM Estimator).
- The key differences from the estimator of Tripuraneni et al (2020) are: (i) the authors use a uniform distribution to find the feature extractor, while Tripuraneni et al assumed the input distribution is standard d-dimensional Gaussian; (ii) the SNR in the linear bandit setting is worse than that in the supervised learning setting, and the authors get an extra d factor in the theorem.
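
The idea behind a method-of-moments estimator of this kind can be sketched as follows. This is a minimal illustration, not the paper's exact estimator or constants: it assumes actions drawn uniformly from the unit sphere, rewards r = ⟨x, θ_t⟩ + noise with θ_t = B w_t, and an estimate of B given by the top-k eigenvectors of the empirical second moment (1/n) Σ r² x xᵀ. All function names and problem sizes are hypothetical.

```python
import numpy as np


def sample_sphere(rng, n, d):
    """Draw n points uniformly from the unit sphere in R^d."""
    g = rng.standard_normal((n, d))
    return g / np.linalg.norm(g, axis=1, keepdims=True)


def mom_feature_extractor(X, r, k):
    """Top-k eigenvectors of (1/n) * sum_i r_i^2 x_i x_i^T as an orthonormal d x k basis."""
    n = X.shape[0]
    M = (X * (r ** 2)[:, None]).T @ X / n
    _, eigvecs = np.linalg.eigh(M)  # eigenvalues are returned in ascending order
    return eigvecs[:, -k:]


def subspace_distance(B, B_hat):
    """Spectral norm of the part of B that lies outside span(B_hat)."""
    return np.linalg.norm(B - B_hat @ (B_hat.T @ B), 2)


rng = np.random.default_rng(0)
d, k, T, n_per_task, noise_std = 10, 2, 50, 4000, 0.1

# Ground-truth shared representation B (orthonormal columns) and task weights w_t.
B, _ = np.linalg.qr(rng.standard_normal((d, k)))
W = rng.standard_normal((k, T))
W /= np.linalg.norm(W, axis=0, keepdims=True)
Theta = B @ W  # column t is theta_t = B w_t

# Exploration phase: uniformly random actions on every task, with noisy rewards.
X_parts, r_parts = [], []
for t in range(T):
    X_t = sample_sphere(rng, n_per_task, d)
    r_t = X_t @ Theta[:, t] + noise_std * rng.standard_normal(n_per_task)
    X_parts.append(X_t)
    r_parts.append(r_t)
X = np.vstack(X_parts)
r = np.concatenate(r_parts)

B_hat = mom_feature_extractor(X, r, k)
dist = subspace_distance(B, B_hat)
print(f"subspace distance between B and its estimate: {dist:.3f}")
```

For uniform actions on the unit sphere, the expectation of r² x xᵀ is a multiple of the identity plus a positive multiple of the average of θ_t θ_tᵀ, so its top-k eigenvectors span the column space of B whenever the task weights are diverse enough. The eigengap shrinks with d, which gives some intuition for why the bandit setting has a worse SNR than the Gaussian supervised setting.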

Results

**MAIN RESULTS FOR FINITE-ACTION SETTING**

- Representation learning may have an adverse effect without enough tasks.
- In the figures, this was established by noting that the algorithm cannot outperform PEGE when T is small.
- This corresponds to the “negative transfer” phenomenon observed in previous work (Wang et al, 2019).
- By comparing the two figures, the authors notice that the algorithm has a bigger advantage when k is smaller, which corroborates the scaling with respect to k in the regret upper bound.
- PEGE does not benefit from a smaller k.

Conclusion

- The authors initiate the study of the benefits of representation learning in bandits. The authors propose new algorithms and demonstrate that, in multi-task linear bandits, if all tasks share a common linear feature extractor, representation learning provably reduces the regret.
- One central challenge for the upper bound is that existing analyses for multi-task representation learning require i.i.d. inputs even in the supervised learning setting.
- Another challenge is how to develop a confidence interval for an unseen input in the multi-task linear bandits setting.
- This confidence interval should utilize the common feature extractor and be tighter than the standard confidence interval for linear bandits, e.g. LinUCB.
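
For intuition about the last point, the sketch below computes the standard ridge-regression confidence width β·√(xᵀV⁻¹x) used by LinUCB-style algorithms, first in the ambient d-dimensional space and then in the k-dimensional space obtained by projecting with a feature extractor B. It makes the idealised assumption that B is known exactly and has orthonormal columns, which is not the open problem described above (there B must itself be estimated). The values of `beta` and `lam`, the helper name, and the random data are all hypothetical.

```python
import numpy as np


def confidence_width(features, x, beta=1.0, lam=1.0):
    """Width beta * sqrt(x^T V^{-1} x), where V = lam * I + features^T features."""
    V = lam * np.eye(features.shape[1]) + features.T @ features
    return beta * np.sqrt(x @ np.linalg.solve(V, x))


rng = np.random.default_rng(0)
d, k, n = 20, 3, 15

# Assumed-known feature extractor with orthonormal columns (an idealisation).
B, _ = np.linalg.qr(rng.standard_normal((d, k)))

X_hist = rng.standard_normal((n, d))  # previously played actions
x_new = rng.standard_normal(d)        # unseen action

full_width = confidence_width(X_hist, x_new)                  # ambient dimension d
reduced_width = confidence_width(X_hist @ B, B.T @ x_new)     # projected dimension k
print(f"full: {full_width:.3f}  reduced: {reduced_width:.3f}")
```

With the same `beta` and `lam`, the projected width is never larger than the ambient one when B has orthonormal columns, because the projected quantity maximises the same ratio over a restricted set of directions. How to obtain a comparable guarantee with an estimated B is exactly what the conclusion leaves open.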


Related work

- Here we mainly focus on related theoretical results. We refer readers to Bengio et al (2013) for empirical results of using representation learning.

For supervised learning, there is a long line of works on multi-task learning and representation learning with various assumptions (Baxter, 2000; Ando & Zhang, 2005; Ben-David & Schuller, 2003; Maurer, 2006; Cavallanti et al, 2010; Maurer et al, 2016; Du et al, 2020; Tripuraneni et al, 2020). All these results assumed the existence of a common representation shared among all tasks. However, this assumption alone is not sufficient. For example, Maurer et al (2016) further assumed every task is i.i.d. drawn from an underlying distribution. Recently, Du et al (2020) replaced the i.i.d. assumption with a deterministic assumption on the input distribution. Finally, it is worth mentioning that Tripuraneni et al (2020) gave the method-of-moments estimator and built the confidence ball for the feature extractor, which inspired our algorithm for the infinite-action setting.

Experiments

- The experimental results are displayed in Figure 3 for T = 10 (tasks constructed using the first five digits) and Figure 4 for T = 45. For both T = 10 and T = 45, our algorithm significantly outperforms the naive algorithm for all k.

Reference

- Yasin Abbasi-Yadkori, David Pal, and Csaba Szepesvari. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pp. 2312–2320, 2011.
- Shipra Agrawal, Vashist Avadhanula, Vineet Goyal, and Assaf Zeevi. Mnl-bandit: A dynamic learning approach to assortment selection. Operations Research, 67(5):1453–1485, 2019.
- Pierre Alquier, The Tien Mai, and Massimiliano Pontil. Regret bounds for lifelong learning. arXiv preprint arXiv:1610.08628, 2016.
- Rie Kubota Ando and Tong Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(Nov):1817–1853, 2005.
- Sanjeev Arora, Hrishikesh Khandeparkar, Mikhail Khodak, Orestis Plevrakis, and Nikunj Saunshi. A theoretical analysis of contrastive unsupervised representation learning. In Proceedings of the 36th International Conference on Machine Learning, 2019.
- Sanjeev Arora, Simon S Du, Sham Kakade, Yuping Luo, and Nikunj Saunshi. Provable representation learning for imitation learning via bi-level optimization. arXiv preprint arXiv:2002.10544, 2020.
- Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
- Hamsa Bastani, David Simchi-Levi, and Ruihao Zhu. Meta dynamic pricing: Learning across experiments. arXiv preprint arXiv:1902.10918, 2019.
- Jonathan Baxter. A model of inductive bias learning. Journal of artificial intelligence research, 12: 149–198, 2000.
- Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
- Luca Bertinetto, Joao F Henriques, Philip HS Torr, and Andrea Vedaldi. Meta-learning with differentiable closed-form solvers. arXiv preprint arXiv:1805.08136, 2018.
- Rajendra Bhatia. Matrix analysis, volume 169. Springer Science & Business Media, 2013.
- Rich Caruana. Multitask learning. Machine learning, 28(1):41–75, 1997.
- Giovanni Cavallanti, Nicolo Cesa-Bianchi, and Claudio Gentile. Linear algorithms for online multitask classification. Journal of Machine Learning Research, 11(Oct):2901–2934, 2010.
- Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 208–214, 2011.
- Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In COLT, 2008.
- Giulia Denevi, Carlo Ciliberto, Dimitris Stamos, and Massimiliano Pontil. Incremental learning-to-learn with statistical guarantees. arXiv preprint arXiv:1803.08089, 2018.
- Giulia Denevi, Carlo Ciliberto, Riccardo Grazzi, and Massimiliano Pontil. Learning-to-learn stochastic gradient descent with biased regularization. In Proceedings of the 36th International Conference on Machine Learning, 2019.
- Carlo D’Eramo, Davide Tateo, Andrea Bonarini, Marcello Restelli, and Jan Peters. Sharing knowledge in multi-task deep reinforcement learning. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rkgpv2VFvr.
- Aniket Anand Deshmukh, Urun Dogan, and Clay Scott. Multi-task learning for contextual bandits. In Advances in neural information processing systems, pp. 4848–4856, 2017.
- Simon S Du, Wei Hu, Sham M Kakade, Jason D Lee, and Qi Lei. Few-shot learning via learning the representation, provably. arXiv preprint arXiv:2002.09434, 2020.
- Kai Wang Fang. Symmetric multivariate and related distributions. CRC Press, 2018.
- Chelsea Finn, Aravind Rajeswaran, Sham Kakade, and Sergey Levine. Online meta-learning. In Proceedings of the 36th International Conference on Machine Learning, 2019.
- Tomer Galanti, Lior Wolf, and Tamir Hazan. A theoretical framework for deep transfer learning. Information and Inference: A Journal of the IMA, 5(2):159–209, 2016.
- Zijun Gao, Yanjun Han, Zhimei Ren, and Zhengqing Zhou. Batched multi-armed bandits problem. In Advances in Neural Information Processing Systems, pp. 503–513, 2019.
- Yanjun Han, Zhengqing Zhou, Zhengyuan Zhou, Jose Blanchet, Peter W Glynn, and Yinyu Ye. Sequential batch learning in finite-action linear contextual bandits. arXiv preprint arXiv:2004.06321, 2020.
- Matteo Hessel, Hubert Soyer, Lasse Espeholt, Wojciech Czarnecki, Simon Schmitt, and Hado van Hasselt. Multi-task deep reinforcement learning with popart. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 3796–3803, 2019.
- Irina Higgins, Arka Pal, Andrei Rusu, Loic Matthey, Christopher Burgess, Alexander Pritzel, Matthew Botvinick, Charles Blundell, and Alexander Lerchner. Darla: Improving zero-shot transfer in reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1480–1490. JMLR. org, 2017.
- Kwang-Sung Jun, Rebecca Willett, Stephen Wright, and Robert Nowak. Bilinear bandits with low-rank structure. In International Conference on Machine Learning, pp. 3163–3172, 2019.
- Mikhail Khodak, Maria-Florina Balcan, and Ameet Talwalkar. Adaptive gradient-based meta-learning methods. arXiv preprint arXiv:1906.02717, 2019.
- Sahin Lale, Kamyar Azizzadenesheli, Anima Anandkumar, and Babak Hassibi. Stochastic linear bandits with hidden low rank structure. arXiv preprint arXiv:1901.09490, 2019.
- Alessandro Lazaric and Marcello Restelli. Transfer from multiple mdps. In Advances in Neural Information Processing Systems, pp. 1746–1754, 2011.
- Yann LeCun, Corinna Cortes, and CJ Burges. Mnist handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
- Kwonjoon Lee, Subhransu Maji, Avinash Ravichandran, and Stefano Soatto. Meta-learning with differentiable convex optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10657–10665, 2019.
- Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pp. 661–670, 2010.
- Lihong Li, Yu Lu, and Dengyong Zhou. Provably optimal algorithms for generalized linear contextual bandits. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2071–2080. JMLR.org, 2017.
- Yingkai Li, Yining Wang, and Yuan Zhou. Nearly minimax-optimal regret for linearly parameterized bandits. In Conference on Learning Theory, pp. 2173–2174, 2019a.
- Yingkai Li, Yining Wang, and Yuan Zhou. Tight regret bounds for infinite-armed linear contextual bandits. arXiv preprint arXiv:1905.01435, 2019b.
- Lydia T Liu, Urun Dogan, and Katja Hofmann. Decoding multitask dqn in the world of minecraft. In The 13th European Workshop on Reinforcement Learning (EWRL) 2016, 2016.
- Yangyi Lu, Amirhossein Meisami, and Ambuj Tewari. Low-rank generalized linear bandit problems. arXiv preprint arXiv:2006.02948, 2020.
- Andreas Maurer. Bounds for linear multi-task learning. Journal of Machine Learning Research, 7 (Jan):117–139, 2006.
- Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes. The benefit of multitask representation learning. The Journal of Machine Learning Research, 17(1):2853–2884, 2016.
- Daniel McNamara and Maria-Florina Balcan. Risk bounds for transferring representations with and without fine-tuning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2373–2381. JMLR. org, 2017.
- Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Actor-mimic: Deep multitask and transfer reinforcement learning. arXiv preprint arXiv:1511.06342, 2015.
- Aniruddh Raghu, Maithra Raghu, Samy Bengio, and Oriol Vinyals. Rapid learning or feature reuse? towards understanding the effectiveness of maml. arXiv preprint arXiv:1909.09157, 2019.
- Yufei Ruan, Jiaqi Yang, and Yuan Zhou. Linear bandits with limited adaptivity and learning distributional optimal design. arXiv preprint arXiv:2007.01980, 2020.
- Paat Rusmevichientong and John N Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.
- Andrei A Rusu, Sergio Gomez Colmenarejo, Caglar Gulcehre, Guillaume Desjardins, James Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. Policy distillation. arXiv preprint arXiv:1511.06295, 2015.
- Tom Schaul and Jurgen Schmidhuber. Metalearning. Scholarpedia, 5(6):4650, 2010.
- David Simchi-Levi and Yunzong Xu. Phase transitions and cyclic phenomena in bandits with switching constraints. In Advances in Neural Information Processing Systems, pp. 7523–7532, 2019.
- Marta Soare, Ouais Alsharif, Alessandro Lazaric, and Joelle Pineau. Multi-task linear bandits. 2018.
- Matthew E Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(Jul):1633–1685, 2009.
- Yee Teh, Victor Bapst, Wojciech M Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess, and Razvan Pascanu. Distral: Robust multitask reinforcement learning. In Advances in Neural Information Processing Systems, pp. 4496–4506, 2017.
- Nilesh Tripuraneni, Chi Jin, and Michael I Jordan. Provable meta-learning of linear representations. arXiv preprint arXiv:2002.11684, 2020.
- Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.
- Zirui Wang, Zihang Dai, Barnabas Poczos, and Jaime Carbonell. Characterizing and avoiding negative transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11293–11302, 2019.
- Lemma 1 (General Hoeffding’s inequality, Vershynin (2018), Theorem 2.6.2). Let X1,..., Xn be independent random variables such that E[Xi] = 0 and Xi is σi-sub-Gaussian. Then there exists a constant c > 0, such that for any δ > 0, we have
