# Taylor Expansion Policy Optimization

ICML, pp. 9397-9406, 2020.

Keywords:

machine learning, Markov decision process, importance sampling, special case, deep reinforcement learning

Abstract:

In this work, we investigate the application of Taylor expansions in reinforcement learning. In particular, we propose Taylor expansion policy optimization, a policy optimization formalism that generalizes prior work (e.g., TRPO) as a first-order special case. We also show that Taylor expansions intimately relate to off-policy evaluation...

Introduction

- Policy optimization is a major framework in model-free reinforcement learning (RL), with successful applications in challenging domains (Silver et al, 2016; Berner et al, 2019; Vinyals et al, 2019).
- Espeholt et al (2018) observed that such corrections are especially useful for distributed algorithms, where the behavior policy and target policy typically differ.
- Both algorithmic ideas have contributed significantly to stabilizing policy optimization.

Highlights

- Policy optimization is a major framework in model-free reinforcement learning (RL), with successful applications in challenging domains (Silver et al, 2016; Berner et al, 2019; Vinyals et al, 2019)
- When we apply the same technique to the reinforcement learning objective, we reuse the general result and derive a higher-order policy optimization objective. This leads to Section 3, where we formally present the Taylor Expansion Policy Optimization (TayPO) and generalize prior work (Schulman et al, 2015; 2017) as a first-order special case
- The idea of importance sampling is the core of most off-policy evaluation techniques (Precup et al, 2000; Harutyunyan et al, 2016; Munos et al, 2016)
- We showed that Taylor expansions construct approximations to the full importance sampling corrections and intimately relate to established off-policy evaluation techniques
- Prior work focuses on applying off-policy corrections directly to policy gradient estimators (Jie and Abbeel, 2010; Espeholt et al, 2018) instead of the surrogate objectives which generate the gradients
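The relation between Taylor expansions and full importance sampling corrections can be illustrated with a small numerical sketch (ours, not the paper's implementation). The trajectory-level IS weight is a product of per-step ratios ρ_t = π(a_t|x_t)/μ(a_t|x_t); writing δ_t = ρ_t − 1, that product expands exactly as 1 + Σ_t δ_t + Σ_{s<t} δ_s δ_t + ..., so truncating the expansion gives first-order, second-order, etc. approximations:

```python
import itertools
import random

def product_is_weight(rhos):
    """Exact trajectory-level importance weight: the product of per-step ratios."""
    w = 1.0
    for r in rhos:
        w *= r
    return w

def taylor_is_weight(rhos, order):
    """Truncated expansion of prod(rho_t) around rho_t = 1.

    With delta_t = rho_t - 1, the product expands exactly as
    1 + sum_t delta_t + sum_{s<t} delta_s*delta_t + ...,
    so `order` controls how many interaction terms are kept.
    """
    deltas = [r - 1.0 for r in rhos]
    total = 1.0
    for k in range(1, order + 1):
        for combo in itertools.combinations(deltas, k):
            term = 1.0
            for d in combo:
                term *= d
            total += term
    return total

random.seed(0)
rhos = [1.0 + random.uniform(-0.1, 0.1) for _ in range(5)]  # near-on-policy ratios
exact = product_is_weight(rhos)
first = taylor_is_weight(rhos, order=1)
second = taylor_is_weight(rhos, order=2)
# The second-order truncation tracks the exact weight more closely than the
# first-order one, mirroring the motivation for higher-order corrections.
```

Truncating at order 1 keeps only single-ratio terms, which is the level of correction underlying TRPO-style surrogates; order 2 adds the cross-step interaction terms that TayPO exploits.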

Methods

- The authors evaluate the potential benefits of applying second-order expansions in a diverse set of scenarios.
- The authors test if the second-order correction helps with (1) policy-based and (2) value-based algorithms.
- In large-scale experiments, to take advantage of computational architectures, actors (μ) and learners (π) are not perfectly synchronized.
- In Section 5.2, the authors study how the performance of a general distributed policy-based agent (e.g., IMPALA, Espeholt et al, 2018) is influenced by the discrepancy between actors and learners.
- For case (2), in Section 5.3, the authors show the benefits of second-order expansions with a state-of-the-art value-based agent
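As context for case (1), the first-order special case that the paper generalizes is the familiar TRPO/PPO-style surrogate E_μ[(π/μ)·A]. A minimal sketch, assuming per-step log-probabilities and advantage estimates have already been logged (the function and argument names are ours, not the paper's):

```python
import numpy as np

def first_order_surrogate(logp_pi, logp_mu, advantages):
    """TRPO-style first-order surrogate: a sample estimate of E_mu[(pi/mu) * A].

    logp_pi / logp_mu are per-step log-probabilities of the sampled actions
    under the target and behavior policies; advantages are the corresponding
    advantage estimates.
    """
    rho = np.exp(np.asarray(logp_pi) - np.asarray(logp_mu))  # per-step ratios
    return float(np.mean(rho * np.asarray(advantages)))
```

When actors and learners are perfectly synchronized (π = μ), the ratios are all one and the surrogate reduces to the mean advantage; the second-order objective adds cross-step correction terms on top of this (see the paper for the exact form).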

Results

- All evaluations are conducted on the entire suite of 57 Atari games (Bellemare et al, 2013).
- Architecture for distributed agents.
- Distributed agents generally consist of a central learner and multiple actors (Nair et al, 2015; Mnih et al, 2016; Babaeizadeh et al, 2017; Barth-Maron et al, 2018; Horgan et al, 2018).
- The authors focus on two main setups: Type I includes agents such as IMPALA (Espeholt et al, 2018).
- The authors provide details on hyper-parameters of experiment setups in respective subsections in Appendix H
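For reference, the V-trace baseline that these distributed agents compare against corrects value targets with clipped per-step importance ratios (Espeholt et al, 2018). A simplified sketch (our naming; the λ trace coefficient is omitted):

```python
import numpy as np

def vtrace_targets(rewards, values, bootstrap, ratios, gamma=0.99,
                   rho_bar=1.0, c_bar=1.0):
    """V-trace value targets (Espeholt et al., 2018), simplified.

    rewards, values, ratios: length-T arrays for one trajectory segment;
    bootstrap is V(x_T). Per-step ratios pi/mu are clipped at rho_bar for
    the TD-error term and at c_bar for the trace coefficients.
    """
    T = len(rewards)
    rhos = np.minimum(ratios, rho_bar)
    cs = np.minimum(ratios, c_bar)
    next_values = np.append(values[1:], bootstrap)
    deltas = rhos * (rewards + gamma * next_values - values)
    vs = np.zeros(T)
    acc = 0.0
    for t in reversed(range(T)):  # backward recursion: v_t - V_t = delta_t + gamma*c_t*(v_{t+1} - V_{t+1})
        acc = deltas[t] + gamma * cs[t] * acc
        vs[t] = values[t] + acc
    return vs
```

On-policy (all ratios equal to one, with the default clips), the targets reduce to ordinary n-step discounted returns bootstrapped from V(x_T).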

Conclusion

**Discussion and conclusion**

The idea of IS is the core of most off-policy evaluation techniques (Precup et al, 2000; Harutyunyan et al, 2016; Munos et al, 2016).
- Closely related is the work of Tomczak et al (2019), who identified such optimization objectives as biased approximations to the full IS objective (Metelli et al, 2018).
- The authors characterized such approximations as the first-order special case of Taylor expansions and derived their natural generalizations.
- The authors find that enumerating all steps along the trajectory works well

Tables

- Table 1: Scores across 57 Atari levels for experiments on general policy optimization with a distributed architecture and no artificial delays between actors and the learner. We compare several alternatives for off-policy correction: V-trace, first-order, and second-order. We also provide scores for a random policy and human players as reference. All scores are obtained by training for 400M frames. Best results per game are highlighted in bold font
- Table 2: Scores across 57 Atari levels for experiments on general policy optimization with a distributed architecture and severe delays between actors and the learner. We compare several alternatives for off-policy correction: V-trace, first-order, and second-order. We also provide scores for a random policy and human players as reference. All scores are obtained by training for 400M frames. While performance across all algorithms degrades significantly compared to Table 1, the second-order correction degrades more gracefully than the other baselines. Best results per game are highlighted in bold

Reference

- Abdolmaleki, A., Springenberg, J. T., Tassa, Y., Munos, R., Heess, N., and Riedmiller, M. (2018). Maximum a posteriori policy optimisation. In International Conference on Learning Representations.
- Babaeizadeh, M., Frosio, I., Tyree, S., Clemons, J., and Kautz, J. (2017). Reinforcement learning through asynchronous advantage actor-critic on a GPU. International Conference on Learning Representations.
- Barth-Maron, G., Hoffman, M. W., Budden, D., Dabney, W., Horgan, D., TB, D., Muldal, A., Heess, N., and Lillicrap, T. (2018). Distributional policy gradients. In International Conference on Learning Representations.
- Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. (2013). The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279.
- Berner, C., Brockman, G., Chan, B., Cheung, V., Debiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C., et al. (2019). Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680.
- Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., Legg, S., and Kavukcuoglu, K. (2018). IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In International Conference on Machine Learning.
- Gruslys, A., Dabney, W., Azar, M. G., Piot, B., Bellemare, M., and Munos, R. (2018). The reactor: A fast and sample-efficient actor-critic agent for reinforcement learning. In International Conference on Learning Representations.
- Harutyunyan, A., Bellemare, M. G., Stepleton, T., and Munos, R. (2016). Q(λ) with Off-Policy Corrections. In Algorithmic Learning Theory.
- He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Computer Vision and Pattern Recognition.
- Horgan, D., Quan, J., Budden, D., Barth-Maron, G., Hessel, M., van Hasselt, H., and Silver, D. (2018). Distributed prioritized experience replay. In International Conference on Learning Representations.
- Jie, T. and Abbeel, P. (2010). On a connection between importance sampling and the likelihood ratio policy gradient. In Neural Information Processing Systems.
- Kakade, S. and Langford, J. (2002). Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning.
- Kapturowski, S., Ostrovski, G., Dabney, W., Quan, J., and Munos, R. (2019). Recurrent experience replay in distributed reinforcement learning. In International Conference on Learning Representations.
- Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Metelli, A. M., Papini, M., Faccio, F., and Restelli, M. (2018). Policy optimization via importance sampling. In Neural Information Processing Systems.
- Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning.
- Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. In NIPS Deep Learning Workshop.
- Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. (2016). Safe and efficient off-policy reinforcement learning. In Neural Information Processing Systems.
- Nair, A., Srinivasan, P., Blackwell, S., Alcicek, C., Fearon, R., De Maria, A., Panneershelvam, V., Suleyman, M., Beattie, C., Petersen, S., et al. (2015). Massively parallel methods for deep reinforcement learning. arXiv preprint arXiv:1507.04296.
- Pohlen, T., Piot, B., Hester, T., Azar, M. G., Horgan, D., Budden, D., Barth-Maron, G., Van Hasselt, H., Quan, J., Vecerık, M., et al. (2018). Observe and look further: Achieving consistent performance on Atari. arXiv preprint arXiv:1805.11593.
- Precup, D., Sutton, R. S., and Singh, S. P. (2000). Eligibility traces for off-policy policy evaluation. In International Conference on Machine Learning.
- Rowland, M., Dabney, W., and Munos, R. (2020). Adaptive trade-offs in off-policy learning. In International Conference on Artificial Intelligence and Statistics.
- Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In International Conference on Machine Learning.
- Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., et al. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354.
- Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., and de Freitas, N. (2017). Sample efficient actor-critic with experience replay. International Conference on Learning Representations.
- Wang, Z., Schaul, T., Hessel, M., Hasselt, H., Lanctot, M., and Freitas, N. (2016). Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning.
- Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. (2016). High-dimensional continuous control using generalized advantage estimation. In International Conference on Learning Representations.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. (2016). Mastering the game of go with deep neural networks and tree search. Nature, 529:484–503.
- Song, H. F., Abdolmaleki, A., Springenberg, J. T., Clark, A., Soyer, H., Rae, J. W., Noury, S., Ahuja, A., Liu, S., Tirumala, D., Heess, N., Belov, D., Riedmiller, M., and Botvinick, M. M. (2020). V-MPO: on-policy maximum a posteriori policy optimization for discrete and continuous control. In International Conference on Learning Representations.
- Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Neural Information Processing Systems.
- Tieleman, T. and Hinton, G. (2012). Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31.
- Tomczak, M. B., Kim, D., Vrancx, P., and Kim, K.-E. (2019). Policy optimization through approximated importance sampling. arXiv preprint arXiv:1910.03857.
- Proof. It is known that for K = 1, replacing Q^μ(x, a) by A^μ(x, a) in the estimation can potentially reduce variance (Schulman et al., 2015; 2017) yet keeps the estimate unbiased. Below, we show that in general, replacing Q^π(x, a) by A^π(x, a) renders the estimate of L_K(π, μ) unbiased for general K ≥ 1.
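The mechanism behind the K = 1 case can be checked numerically: importance weights average to one under μ (Σ_a μ(a)·π(a)/μ(a) = 1), so subtracting the action-independent baseline V^μ(x) from Q^μ(x, ·) shifts the surrogate by a π-independent constant and leaves its gradient unchanged. A toy sketch at a single state, with made-up distributions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 4
mu = rng.dirichlet(np.ones(n_actions))  # behavior policy at a fixed state x
pi = rng.dirichlet(np.ones(n_actions))  # target policy at the same state
q = rng.normal(size=n_actions)          # Q^mu(x, .)
v = mu @ q                              # V^mu(x) = E_{a~mu}[Q^mu(x, a)]

rho = pi / mu
# Exact expectations under the behavior policy:
obj_q = mu @ (rho * q)        # E_mu[rho * Q] = E_pi[Q]
obj_a = mu @ (rho * (q - v))  # E_mu[rho * A], with A = Q - V
# Replacing Q by A shifts the objective by exactly the constant V^mu(x),
# which does not depend on pi, so the estimate stays unbiased as a surrogate.
```

Here `obj_q - obj_a` equals `v` to numerical precision, confirming the shift is the π-independent baseline.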
