# DualSMC: Tunneling Differentiable Filtering and Planning under Continuous POMDPs

IJCAI 2020, pp. 4190-4198, 2020.

Abstract:

A major difficulty of solving continuous POMDPs is to infer the multi-modal distribution of the unobserved true states and to make the planning algorithm dependent on the perceived uncertainty. We cast POMDP filtering and planning problems as two closely related Sequential Monte Carlo (SMC) processes, one over the real states and the othe...

Introduction

- Partially Observable Markov Decision Processes (POMDPs) formulate reinforcement learning problems where the agent’s instant observation is insufficient for optimal decision making [Kaelbling et al., 1998].
- Since conventional POMDP problems usually present an explicit state formulation, executing the planning algorithm in a latent space makes it difficult to adopt any useful prior knowledge.
- Whenever these models fail to perform well, it is difficult to analyze which part causes the failure, as they are less interpretable.

Highlights

- Partially Observable Markov Decision Processes (POMDPs) formulate reinforcement learning problems where the agent’s instant observation is insufficient for optimal decision making [Kaelbling et al., 1998].
- Approximate solutions to Partially Observable Markov Decision Processes based on deep reinforcement learning can directly encode the history of past observations with deep models like RNNs [Hausknecht and Stone, 2015].
- We present a simple but effective model named Dual Sequential Monte Carlo (DualSMC).
- On the other hand, compared with the existing Bayesian reinforcement learning literature on Partially Observable Markov Decision Processes [Ross et al., 2008], our work focuses more on deep reinforcement learning solutions to continuous Partially Observable Markov Decision Processes.
- We provided an end-to-end neural network named Dual Sequential Monte Carlo to solve continuous Partially Observable Markov Decision Processes, which has three advantages.
- Dual Sequential Monte Carlo combines the richness of neural networks with the interpretability of classical sequential Monte Carlo methods.

Methods

Compared methods (row labels of the results table; the reported columns are Success and # Steps):

- DVRL [Igl et al., 2018]
- LSTM filter + SMCP [Piche et al., 2018]
- Regressive PF ( 2, top-1) + SMCP
- Regressive PF + PI-SMCP
- Adversarial PF + SMCP
- Adversarial PF + PI-SMCP
- DualSMC with regressive PF ( 2)
- DualSMC with regressive PF
- DualSMC w/o proposer
- DualSMC with adversarial PF

Abbreviated labels that also appear: Reg PF + SMCP, Adv PF + SMCP, DualSMC with Adv PF, PF w/o proposer.

Results

- The adversarial PF significantly outperforms other differentiable state estimation approaches, such as (1) the existing DPFs that perform density estimation [Jonschkowski et al., 2018], and (2) the deterministic LSTM model that was previously used as a strong baseline in [Karkus et al., 2018; Jonschkowski et al., 2018].

Conclusion

- The authors provided an end-to-end neural network named DualSMC to solve continuous POMDPs, which has three advantages.
- It learns plausible belief states for high-dimensional POMDPs with an adversarial particle filter.
- DualSMC plans future actions by considering the distributions of the learned belief states.
- DualSMC combines the richness of neural networks with the interpretability of classical sequential Monte Carlo methods.
- The authors empirically validated the effectiveness of DualSMC on different tasks, including visual navigation and control.

Summary

- Table 1: Training hyper-parameters for the (A) floor positioning, (B) 3D light-dark, and (C) modified reacher domains. Recovered entries: state = (0.95, 0.8); obs = (0.95, 1.05, 0.3, 0.2).
- Table 2: The success rate and the average number of steps over 1,000 tests in the floor positioning domain (PF is short for particle filter).
- Table 3: The average result of 100 tests for the planning part of 3D light-dark navigation. The robot changes its plan from taking a detour, shown in Figure 5(a), to walking directly toward the target area, shown in Figure 5(b). It performs on par with the standard SMCP, with a 100.0% success rate and an average of 21.3 steps (vs. 20.7 steps for SMCP). We may conclude that DualSMC provides policies based on the distribution of filtered particles. We may also conclude that DualSMC trained under POMDPs generalizes well to similar tasks with less uncertainty.
- Table 4: Network details of each module in DualSMC.

Related work

- Planning under uncertainty. Due to the high computation cost of POMDPs, many previous approaches used sampling-based techniques for belief update, planning, or both. For instance, a variety of Monte Carlo tree search methods have shown success in relatively large POMDPs by constructing a search tree of history based on rollout simulations [Silver and Veness, 2010; Somani et al., 2013; Seiler et al., 2015; Sunberg and Kochenderfer, 2018]. Later work further improved the efficiency by limiting the search space or reusing plans [Somani et al., 2013; Kurniawati and Yadav, 2016]. Although considerable progress has been made to enlarge the set of solvable POMDPs, it remains hard for pure sampling-based methods to deal with unknown dynamics and complex observations like visual inputs. Therefore, in this work, we provide one approach to combine the efficiency and interpretability of conventional sampling-based methods with the flexibility of deep learning networks for complex POMDP modeling.
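
To make the sampling-based belief update mentioned above concrete, here is a minimal sketch of a classical bootstrap particle filter step in the spirit of [Gordon et al., 1993]. It is not the adversarial or learned filter from the paper: the 1D additive transition model, the Gaussian observation likelihood, and all names (`pf_step`, `belief_mean`, `run_demo`, `trans_std`, `obs_std`) are assumptions chosen only for illustration.

```python
import math
import random


def pf_step(particles, weights, action, obs, rng, trans_std=0.1, obs_std=0.2):
    """One bootstrap particle filter update for a toy 1D model (illustrative only)."""
    n = len(particles)
    # 1) Resample particles in proportion to their current weights.
    resampled = rng.choices(particles, weights=weights, k=n)
    # 2) Propagate through the assumed transition model x' = x + a + noise.
    moved = [x + action + rng.gauss(0.0, trans_std) for x in resampled]
    # 3) Reweight with the assumed Gaussian observation likelihood o = x + noise.
    liks = [math.exp(-0.5 * ((obs - x) / obs_std) ** 2) for x in moved]
    total = sum(liks)
    if total == 0.0:
        # All likelihoods underflowed: fall back to uniform weights.
        new_weights = [1.0 / n] * n
    else:
        new_weights = [lik / total for lik in liks]
    return moved, new_weights


def belief_mean(particles, weights):
    """Weighted mean of the particle set, used as a point estimate of the state."""
    return sum(x * w for x, w in zip(particles, weights))


def run_demo(seed=0, steps=20, n=500):
    """Track a drifting hidden state starting from a broad uniform prior."""
    rng = random.Random(seed)
    true_x = 1.5
    particles = [rng.uniform(-3.0, 3.0) for _ in range(n)]
    weights = [1.0 / n] * n
    for _ in range(steps):
        action = 0.1
        true_x += action + rng.gauss(0.0, 0.1)
        obs = true_x + rng.gauss(0.0, 0.2)
        particles, weights = pf_step(particles, weights, action, obs, rng)
    return true_x, belief_mean(particles, weights)


if __name__ == "__main__":
    true_x, estimate = run_demo()
    print(f"true state = {true_x:.3f}, belief mean = {estimate:.3f}")
```

The weighted particle set returned by `pf_step` is the kind of belief representation that a downstream planner can consume, which is why sampling-based filtering and planning pair naturally.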

Funding

- This work is in part supported by ONR MURI N00014-16-12007.

A Particle-Independent SMC Planning

- As shown in Alg. 3, the planner takes the top-M particle states (for computational efficiency) and plans N future trajectories independently from each particle state.
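
The particle-independent planning idea described above can be sketched roughly as follows. This is a hedged illustration, not the paper's Alg. 3: the inner planner (`plan_from_state`) is a toy random-shooting stand-in for SMCP, the 1D dynamics and reward are invented, and combining the per-particle actions by a weight-normalized average is an assumption made here. The names `pi_smcp`, `top_m`, `n_traj`, and `temperature` are hypothetical.

```python
import math
import random


def plan_from_state(x0, rng, n_traj=200, horizon=5, goal=0.0, temperature=5.0):
    """Toy planner: sample n_traj action sequences from x0 and return a
    softmax-weighted average of their first actions (stand-in for SMCP)."""
    first_actions, returns = [], []
    for _ in range(n_traj):
        x, total = x0, 0.0
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        for a in actions:
            x += a                   # assumed dynamics: x' = x + a
            total += -abs(x - goal)  # assumed reward: negative distance to goal
        first_actions.append(actions[0])
        returns.append(total)
    best = max(returns)
    # Subtract the best return before exponentiating for numerical stability.
    ws = [math.exp((r - best) / temperature) for r in returns]
    z = sum(ws)
    return sum(a * w for a, w in zip(first_actions, ws)) / z


def pi_smcp(particles, weights, rng, top_m=3, n_traj=200):
    """Plan independently from the top_m highest-weight particles, then merge
    the resulting first actions by normalized particle weight (an assumption)."""
    ranked = sorted(zip(weights, particles), reverse=True)[:top_m]
    z = sum(w for w, _ in ranked)
    if z == 0.0:
        # Degenerate weights: treat the selected particles as equally likely.
        ranked = [(1.0, x) for _, x in ranked]
        z = float(len(ranked))
    action = 0.0
    for w, x in ranked:
        action += (w / z) * plan_from_state(x, rng, n_traj=n_traj)
    return action


if __name__ == "__main__":
    rng = random.Random(0)
    particles = [5.0, 6.0, 7.0, -4.0]
    weights = [0.4, 0.3, 0.2, 0.1]
    # The low-weight particle at -4.0 is dropped by the top-M selection.
    print(pi_smcp(particles, weights, rng))
```

Restricting planning to the top-M particles trades some fidelity to the full belief for computation, which matches the efficiency motivation stated above.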

Reference

- [Beattie et al., 2016] Charles Beattie, Joel Z Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Kuttler, Andrew Lefrancq, Simon Green, Víctor Valdes, Amir Sadik, et al. DeepMind Lab. arXiv preprint arXiv:1612.03801, 2016.
- [Brockman et al., 2016] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
- [Doucet and Johansen, 2009] Arnaud Doucet and Adam M Johansen. A tutorial on particle filtering and smoothing: Fifteen years later. Handbook of Nonlinear Filtering, 12(656-704):3, 2009.
- [Goodfellow et al., 2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, pages 2672–2680, 2014.
- [Gordon et al., 1993] Neil J Gordon, David J Salmond, and Adrian FM Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. In IEE Proceedings F (Radar and Signal Processing), pages 107–113, 1993.
- [Gu et al., 2015] Shixiang Shane Gu, Zoubin Ghahramani, and Richard E Turner. Neural adaptive sequential Monte Carlo. In NeurIPS, pages 2629–2637, 2015.
- [Hafner et al., 2019] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In ICML, pages 2555–2565, 2019.
- [Hausknecht and Stone, 2015] Matthew Hausknecht and Peter Stone. Deep recurrent Q-learning for partially observable MDPs. In 2015 AAAI Fall Symposium Series, 2015.
- [Igl et al., 2018] Maximilian Igl, Luisa Zintgraf, Tuan Anh Le, Frank Wood, and Shimon Whiteson. Deep variational reinforcement learning for POMDPs. In ICML, pages 2117–2126, 2018.
- [Jonschkowski et al., 2018] Rico Jonschkowski, Divyam Rastogi, and Oliver Brock. Differentiable particle filters: End-to-end learning with algorithmic priors. In RSS, 2018.
- [Kaelbling et al., 1998] Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2):99–134, 1998.
- [Kappen et al., 2012] Hilbert J Kappen, Vicenc Gomez, and Manfred Opper. Optimal control as a graphical model inference problem. Machine Learning, 87(2):159–182, 2012.
- [Karkus et al., 2017] Peter Karkus, David Hsu, and Wee Sun Lee. QMDP-net: Deep learning for planning under partial observability. In NeurIPS, pages 4694–4704, 2017.
- [Karkus et al., 2018] Peter Karkus, David Hsu, and Wee Sun Lee. Particle filter networks with application to visual localization. In CoRL, 2018.
- [Kempinska and Shawe-Taylor, 2017] Kira Kempinska and John Shawe-Taylor. Adversarial sequential Monte Carlo. In Bayesian Deep Learning (NeurIPS Workshop), 2017.
- [Kingma and Ba, 2015] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
- [Kurniawati and Yadav, 2016] Hanna Kurniawati and Vinay Yadav. An online POMDP solver for uncertainty planning in dynamic environment. In Robotics Research, pages 611–629. 2016.
- [Levine and Koltun, 2013] Sergey Levine and Vladlen Koltun. Variational policy search via trajectory optimization. In NeurIPS, pages 207–215, 2013.
- [Levine, 2018] Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.
- [Littman et al., 1995] Michael L Littman, Anthony R Cassandra, and Leslie Pack Kaelbling. Learning policies for partially observable environments: Scaling up. In ICML, 1995.
- [Maddison et al., 2017] Chris J Maddison, John Lawson, George Tucker, Nicolas Heess, Mohammad Norouzi, Andriy Mnih, Arnaud Doucet, and Yee Teh. Filtering variational objectives. In NeurIPS, pages 6573–6583, 2017.
- [Naesseth et al., 2018] Christian A Naesseth, Scott W Linderman, Rajesh Ranganath, and David M Blei. Variational sequential Monte Carlo. In AISTATS, 2018.
- [Papadimitriou and Tsitsiklis, 1987] Christos H Papadimitriou and John N Tsitsiklis. The complexity of Markov decision processes. Mathematics of Operations Research, 12(3):441–450, 1987.
- [Piche et al., 2018] Alexandre Piche, Valentin Thomas, Cyril Ibrahim, Yoshua Bengio, and Chris Pal. Probabilistic planning with sequential Monte Carlo methods. In ICLR, 2018.
- [Platt Jr et al., 2010] Robert Platt Jr, Russ Tedrake, Leslie Kaelbling, and Tomas Lozano-Perez. Belief space planning assuming maximum likelihood observations. In RSS, 2010.
- [Ross et al., 2008] Stephane Ross, Brahim Chaib-draa, and Joelle Pineau. Bayes-adaptive POMDPs. In NeurIPS, pages 1225–1232, 2008.
- [Seiler et al., 2015] Konstantin M Seiler, Hanna Kurniawati, and Surya PN Singh. An online and approximate solver for POMDPs with continuous action space. In ICRA, pages 2290–2297, 2015.
- [Silver and Veness, 2010] David Silver and Joel Veness. Monte-Carlo planning in large POMDPs. In NeurIPS, pages 2164–2172, 2010.
- [Somani et al., 2013] Adhiraj Somani, Nan Ye, David Hsu, and Wee Sun Lee. DESPOT: Online POMDP planning with regularization. In NeurIPS, pages 1772–1780, 2013.
- [Sunberg and Kochenderfer, 2018] Zachary N Sunberg and Mykel J Kochenderfer. Online algorithms for POMDPs with continuous state, action, and observation spaces. In ICAPS, 2018.
- [Todorov et al., 2012] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In IROS, pages 5026–5033, 2012.
- [Todorov, 2008] Emanuel Todorov. General duality between optimal control and estimation. In CDC, pages 4286–4292, 2008.
- [Toussaint, 2009] Marc Toussaint. Robot trajectory optimization using approximate inference. In ICML, pages 1049–1056, 2009.
- [Zhu et al., 2018] Pengfei Zhu, Xin Li, Pascal Poupart, and Guanghui Miao. On improving deep reinforcement learning for POMDPs. arXiv preprint arXiv:1804.06309, 2018.
