# Policy Optimization With Penalized Point Probability Distance: An Alternative To Proximal Policy Optimization

arXiv: Learning, Volume abs/1807.00442, 2018.

Keywords:

Trust Region Policy Optimization, deep reinforcement learning, proximal policy optimization, penalized point probability distance, Actor Critic using Kronecker-factored Trust Region

Abstract:

As the most influential variant and improvement of Trust Region Policy Optimization (TRPO), proximal policy optimization (PPO) has been widely applied across various domains since its publication, owing to its inherent advantages in sample efficiency, implementation, and parallelism. In this paper, a first order gradient reinforcement learnin...

Introduction

- The basis of a reinforcement learning algorithm is generalized policy iteration [Sutton and Barto, 2018], which states two essential iterative steps: policy evaluation and improvement.
- Related policy optimization methods include Actor Critic using Kronecker-factored Trust Region (ACKTR) [Wu et al, 2017] and Proximal Policy Optimization (PPO) [Schulman et al, 2017].
- Improving the policy monotonically was nontrivial before Trust Region Policy Optimization (TRPO) was proposed [Schulman et al, 2015a].
- Given Q(s, a), which represents the agent’s return in state s after taking action a, the objective function can be written as $\max_\theta \mathbb{E}_{s,a}\left[\log \pi_\theta(a \mid s)\, Q(s, a)\right]$.
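The objective in the last bullet can be estimated from sampled data as a plain average. The following is a minimal sketch, assuming a tabular softmax policy; the state names, logits, and Q estimates are made up for illustration and do not come from the paper.

```python
import math

def softmax(logits):
    """Convert raw action preferences into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_objective(batch, logits_by_state):
    """Sample average of log pi_theta(a|s) * Q(s, a) over (s, a, q) tuples."""
    total = 0.0
    for state, action, q_value in batch:
        probs = softmax(logits_by_state[state])
        total += math.log(probs[action]) * q_value
    return total / len(batch)

# Hypothetical toy data: two states, two actions, made-up Q estimates.
logits_by_state = {"s0": [0.0, 0.0], "s1": [1.0, -1.0]}
batch = [("s0", 0, 1.0), ("s1", 0, 2.0), ("s1", 1, -1.0)]
objective = policy_gradient_objective(batch, logits_by_state)
print(objective)
```

In practice the logits would come from a neural network with parameters θ, and the objective would be maximized by gradient ascent on θ.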

Highlights

- With the development of deep reinforcement learning, impressive results have been produced in a wide range of fields, such as playing Atari games [Mnih et al, 2015; Hessel et al, 2017], robot control [Lillicrap et al, 2015], Go [Silver et al, 2017], and neural architecture search [Tan et al, 2018; Pham et al, 2018].

- The pessimistic surrogate objective, which is the most critical component of Proximal Policy Optimization (PPO), still has some limitations. As another potential improvement on Trust Region Policy Optimization (TRPO) and an alternative to PPO, this paper focuses on a policy optimization algorithm.
- For games in which POP3D scores highest, BASELINE scores worse than PPO more often than the other way round, which suggests that POP3D is not just an approximate version of BASELINE.
- We introduce a new reinforcement learning algorithm called POP3D (Policy Optimization with Penalized Point Probability Distance), which acts as a TRPO variant like PPO.
- Compared with KLD, which is an upper bound on the square of the total variation divergence between two distributions, the penalized point probability distance is a symmetric lower bound.
- It effectively expands the optimal solution manifold while encouraging exploration, a mechanism that PPO possesses implicitly.
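The bound relationship described above can be checked numerically. The sketch below assumes that the point probability distance is the squared difference between the probabilities the two policies assign to the sampled action; this summary does not spell out the definition, so treat that formula as an assumption. The two distributions are made up for illustration.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def total_variation(p, q):
    """Total variation divergence: half the L1 distance."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def point_probability_distance(p, q, action):
    """Assumed definition: squared probability gap at one sampled action."""
    return (p[action] - q[action]) ** 2

# Hypothetical old and new policies over four actions in one state.
old_policy = [0.4, 0.3, 0.2, 0.1]
new_policy = [0.3, 0.2, 0.3, 0.2]

tv_squared = total_variation(old_policy, new_policy) ** 2
kl_forward = kl_divergence(old_policy, new_policy)
kl_reverse = kl_divergence(new_policy, old_policy)
point_distances = [
    point_probability_distance(old_policy, new_policy, a) for a in range(4)
]

print("point distances:", point_distances)
print("TV squared:", tv_squared)
print("KL forward / reverse:", kl_forward, kl_reverse)
```

For these inputs every point distance lies below the squared total variation, which in turn lies below the KL divergence. Swapping the two policies leaves the point distance unchanged but changes the KL value, which is the symmetry the bullet refers to.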

Methods

- OpenAI Gym is a well-known simulation environment for testing and evaluating reinforcement learning algorithms, covering both discrete (Atari) and continuous (MuJoCo) domains [Brockman et al, 2016].
- Since PPO is a distinguished RL algorithm that outperforms methods such as A3C, A2C, and ACKTR, the authors focus on a detailed quantitative comparison with fine-tuned PPO.
- Quantitative comparisons between KLD and the point probability penalty help to establish the critical role of the latter. The former strategy is named fixed KLD in [Schulman et al, 2017] and serves as another good baseline in this context, referred to as BASELINE below.
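To make the comparison between the two penalties concrete, here is a minimal per-sample sketch. The fixed-KLD form follows the KL-penalized surrogate of [Schulman et al, 2017], that is, the probability ratio times the advantage minus a fixed coefficient times the KL divergence. The point-penalty variant swaps in the squared probability gap at the sampled action, which is an assumption about POP3D rather than something stated in this summary. The function name, `beta`, and all numbers are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def penalized_surrogate(old_probs, new_probs, action, advantage, beta, penalty):
    """Per-sample surrogate: ratio * advantage - beta * distance."""
    ratio = new_probs[action] / old_probs[action]
    if penalty == "kld":
        # Fixed-coefficient KL penalty (the BASELINE strategy).
        distance = kl_divergence(old_probs, new_probs)
    elif penalty == "point":
        # Assumed point probability penalty at the sampled action.
        distance = (old_probs[action] - new_probs[action]) ** 2
    else:
        raise ValueError(f"unknown penalty: {penalty}")
    return ratio * advantage - beta * distance

# Hypothetical single sample: two actions, action 0 taken, advantage 1.
old_probs = [0.5, 0.5]
new_probs = [0.6, 0.4]
kld_value = penalized_surrogate(old_probs, new_probs, 0, 1.0, 1.0, "kld")
point_value = penalized_surrogate(old_probs, new_probs, 0, 1.0, 1.0, "point")
print(kld_value, point_value)
```

With the same coefficient the point penalty is the smaller of the two here, so an actual comparison would tune the coefficient separately for each variant.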

Results

- The final score of each game is averaged over three different seeds, and the highest is shown in bold.
- For games in which POP3D scores highest, BASELINE scores worse than PPO more often than the other way round, which suggests that POP3D is not just an approximate version of BASELINE.
- On another metric, POP3D wins 20 of the 49 Atari games, comparable to PPO with 18, followed by BASELINE with 6 and TRPO with 5.
- Each algorithm’s score as a function of iteration steps is shown in Figure 2.
- Both metrics indicate that POP3D is competitive with PPO in the continuous domain.
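The "games won" metric can be reproduced by averaging each algorithm's score over seeds and counting the best per game. This is a minimal sketch of that bookkeeping; the game names and scores below are invented placeholders, not results from the paper.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

def count_wins(scores):
    """Count, per algorithm, the games where its seed-averaged score is highest.

    scores maps game -> algorithm -> list of per-seed scores.
    """
    wins = {}
    for game, per_algorithm in scores.items():
        averaged = {algo: mean(seeds) for algo, seeds in per_algorithm.items()}
        winner = max(averaged, key=averaged.get)
        wins[winner] = wins.get(winner, 0) + 1
    return wins

# Placeholder scores for three made-up games and three seeds each.
scores = {
    "game_a": {"PPO": [10, 12, 11], "POP3D": [13, 12, 14]},
    "game_b": {"PPO": [5, 7, 6], "POP3D": [4, 5, 3]},
    "game_c": {"PPO": [1, 2, 3], "POP3D": [3, 3, 3]},
}
wins = count_wins(scores)
print(wins)
```

Ties are broken by dictionary order in this sketch; a real evaluation would need an explicit tie rule.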

Conclusion

**Discussion about Pessimistic Proximal Policy**

PPO is called pessimistic proximal policy optimization because of the way its objective is constructed.

Without loss of generality, suppose $A_t > 0$ for a given state $s_t$ and action $a_t$, and the optimal choice is $a_t$.

- The pessimistic mechanism plays a critical role in PPO: it expresses only a relatively weak preference for a good action in a given state, which in turn affects learning efficiency.
- In this paper, the authors introduce a new reinforcement learning algorithm called POP3D (Policy Optimization with Penalized Point Probability Distance), which acts as a TRPO variant like PPO.
- POP3D is highly competitive and an alternative to PPO.
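The pessimistic construction can be seen in the clipped surrogate of [Schulman et al, 2017], which takes the minimum of the unclipped and clipped terms. The sketch below evaluates it for a single sample; `epsilon=0.2` is an illustrative value, not necessarily the setting used in this paper.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """PPO clipped objective for one sample: min(r * A, clip(r) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)

# With a positive advantage, raising the action probability stops paying
# off once the ratio exceeds 1 + epsilon.
gain_small_step = clipped_surrogate(ratio=1.1, advantage=1.0)
gain_large_step = clipped_surrogate(ratio=1.5, advantage=1.0)

# Moving in the wrong direction is never clipped away.
loss_wrong_way = clipped_surrogate(ratio=0.5, advantage=1.0)
print(gain_small_step, gain_large_step, loss_wrong_way)
```

Because the minimum always takes the less favorable term, the reward for preferring a good action is capped while the cost of a bad move is kept in full, which is one way to read the "relatively weak preference" described above.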

Summary

## Objectives:

The optimal solution can hardly be obtained exactly; instead, the goal is a good enough answer.

- Table 1: The number of games "won" by each algorithm on Atari games, where the score metric is averaged over three seeds
- Table 2: The number of games won by each algorithm on MuJoCo games, where the score metric is averaged over three seeds
- Table 3: Mean final scores (last 100 episodes) of PPO, POP3D, BASELINE and TRPO on Atari games after 40M frames. The results are averaged over three trials
- Table 4: PPO’s hyper-parameters for Atari games
- Table 5: POP3D’s hyper-parameters for Atari games
- Table 6: BASELINE’s hyper-parameters for Atari games
- Table 7: PPO’s hyper-parameters for MuJoCo games
- Table 8: POP3D’s hyper-parameters for MuJoCo games
- Table 9: All-episode mean scores of PPO, POP3D, BASELINE and TRPO on Atari games after 40M frames. The results are averaged over three trials
- Table 10: Mean final scores (last 100 episodes) of PPO and POP3D on MuJoCo games after 10M frames. The results are averaged over three trials
- Table 11: All-episode mean scores of PPO and POP3D on MuJoCo games after 10M frames. The results are averaged over three trials

Reference

- [Bellemare et al., 2017] Marc G Bellemare, Will Dabney, and Remi Munos. A distributional perspective on reinforcement learning. arXiv preprint arXiv:1707.06887, 2017.
- [Brockman et al., 2016] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
- [Espeholt et al., 2018] Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.
- [Goodfellow et al., 2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning, volume 1. MIT press Cambridge, 2016.
- [Henderson et al., 2017] Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. arXiv preprint arXiv:1709.06560, 2017.
- [Hessel et al., 2017] Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. arXiv preprint arXiv:1710.02298, 2017.
- [Horgan et al., 2018] Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado Van Hasselt, and David Silver. Distributed prioritized experience replay. arXiv preprint arXiv:1803.00933, 2018.
- [Kingma and Ba, 2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [Lillicrap et al., 2015] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
- [Mnih et al., 2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
- [Mnih et al., 2016] V. Mnih, A. Puigdomenech Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous Methods for Deep Reinforcement Learning. ArXiv e-prints, February 2016.
- [Pham et al., 2018] Hieu Pham, Melody Y Guan, Barret Zoph, Quoc V Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268, 2018.
- [Robert, 2014] Christian Robert. Machine learning, a probabilistic perspective, 2014.
- [Schaul et al., 2015] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
- [Schulman et al., 2015a] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. Trust Region Policy Optimization. ArXiv e-prints, February 2015.
- [Schulman et al., 2015b] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-Dimensional Continuous Control Using Generalized Advantage Estimation. ArXiv e-prints, June 2015.
- [Schulman et al., 2017] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal Policy Optimization Algorithms. ArXiv e-prints, July 2017.
- [Silver et al., 2017] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
- [Sutton and Barto, 2018] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
- [Tan et al., 2018] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, and Quoc V Le. Mnasnet: Platform-aware neural architecture search for mobile. arXiv preprint arXiv:1807.11626, 2018.
- [Van Hasselt et al., 2016] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In AAAI, volume 16, pages 2094–2100, 2016.
- [Wang et al., 2015] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.
- [Wu et al., 2017] Yuhuai Wu, Elman Mansimov, Roger B Grosse, Shun Liao, and Jimmy Ba. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Advances in neural information processing systems, pages 5279–5288, 2017.
