Bootstrap Latent-Predictive Representations for Multitask Reinforcement Learning

International Conference on Machine Learning (ICML), 2020.

Keywords:
multitask setting, Predictions of Bootstrapped Latents, RL agent, POMDPs, auxiliary task

Abstract:

Learning a good representation is an essential component for deep reinforcement learning (RL). Representation learning is especially important in multitask and partially observable settings, where building a representation of the unknown environment is crucial to solving the tasks. Here we introduce Predictions of Bootstrapped Latents (PBL), a s…

Introduction
  • Deep reinforcement learning (RL) has seen many successes in recent years (Mnih et al., 2015; Levine et al., 2016; Silver et al., 2017; Vinyals et al., 2019).
  • To learn a rich and useful representation, these methods may demand a high degree of accuracy in multistep prediction of future observations.
  • This degree of accuracy can be difficult to achieve in many problems, especially in partially observable and multitask settings, where uncertainty in the agent state and complex, diverse observations make prediction more challenging.
  • For a fixed policy π and for all t ≥ 0, the authors recursively define the states S_t^π, the observations O_t^π, the histories H_t^π, and the actions A_t^π, starting from some initial distribution ρ and following π. Initialization: S_0^π is drawn from ρ and H_0^π = O_0^π.
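To make this recursive definition concrete, here is a minimal sketch of a POMDP rollout under a fixed policy π. The tiny transition and observation tables, and the uniform-random policy, are illustrative placeholders rather than anything from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder POMDP: 3 hidden states, 2 observations, 2 actions (illustrative only).
P = rng.dirichlet(np.ones(3), size=(3, 2))  # P[s, a] = distribution over next states
Obs = rng.dirichlet(np.ones(2), size=3)     # Obs[s]  = distribution over observations
rho = np.array([0.5, 0.3, 0.2])             # initial state distribution

def pi(history):
    """A fixed policy pi(a | H_t); uniform random here."""
    return int(rng.integers(2))

# Initialization: S_0 ~ rho, O_0 drawn given S_0, H_0 = (O_0).
s = rng.choice(3, p=rho)
o = rng.choice(2, p=Obs[s])
history = [o]

# Recursion for t >= 0: A_t ~ pi(.|H_t), S_{t+1} ~ P(.|S_t, A_t),
# O_{t+1} drawn given S_{t+1}, H_{t+1} = (H_t, A_t, O_{t+1}).
for t in range(5):
    a = pi(history)
    s = rng.choice(3, p=P[s, a])
    o = rng.choice(2, p=Obs[s])
    history += [a, o]

print(history)  # alternating observations and actions along the trajectory
```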
Highlights
  • Deep reinforcement learning (RL) has seen many successes in recent years (Mnih et al., 2015; Levine et al., 2016; Silver et al., 2017; Vinyals et al., 2019)
  • We introduce Predictions of Bootstrapped Latents (PBL, “pebble”), a new representation learning technique for deep reinforcement learning agents
  • We considered the problem of representation learning for deep reinforcement learning agents in partially observable, multitask settings
  • We introduced Predictions of Bootstrapped Latents (PBL), a representation learning technique that provides a novel way of learning meaningful future latent observations through bootstrapped forward and reverse predictions
  • We demonstrated that Predictions of Bootstrapped Latents outperforms the state-of-the-art representation learning technique in DMLab-30
  • Our results show that agents in the multitask setting significantly benefit from learning to predict a meaningful representation of the future in response to their actions
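The bootstrapped forward and reverse predictions mentioned in the highlights can be sketched in a few lines. Below is a minimal NumPy illustration of the two coupled loss terms, with linear maps standing in for the paper's embedding, prediction, and recurrent networks; all shapes and names are illustrative assumptions, and the forward predictor's conditioning on intervening actions is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only).
T, d_state, d_obs, d_latent, k = 8, 16, 32, 8, 3

b = rng.normal(size=(T, d_state))    # agent states b_t from the history RNN
obs = rng.normal(size=(T, d_obs))    # observations o_t

# Linear stand-ins for the networks.
W_embed = rng.normal(size=(d_obs, d_latent)) / np.sqrt(d_obs)     # f: o_t -> latent
W_fwd = rng.normal(size=(d_state, d_latent)) / np.sqrt(d_state)   # forward predictor
W_rev = rng.normal(size=(d_latent, d_state)) / np.sqrt(d_latent)  # reverse predictor

z = obs @ W_embed  # latent embeddings f(o_t)

# Forward loss: from b_t, predict the latent embedding of the observation k steps
# ahead; in actual training the target f(o_{t+k}) would be treated as fixed
# (a stop-gradient), so this term would shape the agent state and forward predictor.
forward_loss = np.mean((b[:-k] @ W_fwd - z[k:]) ** 2)

# Reverse loss: train the embedding f so that f(o_t) predicts the (fixed,
# stop-gradient) agent state b_t; this term would shape the embedding and reverse
# predictor, which is what closes the bootstrapping loop.
reverse_loss = np.mean((z @ W_rev - b) ** 2)

pbl_loss = forward_loss + reverse_loss
print(forward_loss, reverse_loss, pbl_loss)
```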
Methods
  • The methods compared are PBL, Simcore DRAW, Pixel Control, CPC, and RL only.
  • Simcore DRAW outperforms the standard pixel control, albeit to a lesser extent.
  • This may be explained by the intuition that pixel-based representation learning may struggle with the large variety of tasks.
  • Diverse images produced by different tasks are easy to distinguish from each other, so CPC may not require rich and informative representations to distinguish between the positive and negative examples, resulting in worse overall performance.
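To make the contrastive comparison concrete, here is a minimal InfoNCE-style sketch (not the paper's exact CPC implementation; names and shapes are illustrative). Each prediction is scored against every candidate future embedding in the batch; when the negatives come from visually very different tasks, the diagonal positives can be identified with coarse features, which is the intuition above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy batch of context-based predictions and true future embeddings.
batch, d = 4, 8
pred = rng.normal(size=(batch, d))     # predictions of future latents from context
future = rng.normal(size=(batch, d))   # embeddings of the actual future observations

# Score every prediction against every candidate; diagonal entries are positives,
# off-diagonal entries serve as negatives (drawn from other trajectories/tasks).
logits = pred @ future.T
logits -= logits.max(axis=1, keepdims=True)                      # numerical stability
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
info_nce_loss = -np.mean(np.diag(log_probs))
print(info_nce_loss)
```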
Results
  • The authors see that PBL improves overall performance on this benchmark, and they demonstrate that PBL outperforms the state-of-the-art representation learning technique on DMLab-30.
Conclusion
  • The authors considered the problem of representation learning for deep RL agents in partially observable, multitask settings.
  • The authors' results show that agents in the multitask setting significantly benefit from learning to predict a meaningful representation of the future in response to their actions.
  • This is highlighted by the fact that, when the authors used random projections of future observations, there was no benefit in trying to predict farther into the future.
  • Going forward, a promising direction would be to investigate ways to predict more relevant and diverse information over long horizons.
Tables
  • Table 1: Main architecture parameters. These are the same as Table 7 of Hessel et al. (2019), except for parameters in bold.
  • Table 2: Parameters for action-conditional predictions of the future.
  • Table 3: PBL parameters.
  • Table 4: CPC parameters.
  • Table 5: DRAW loss weight. See Gregor et al. (2019) for full details, and Tables 1 and 2.
  • Table 6: Pixel control parameters. See Hessel et al. (2019) for full details. Different parameters are designated in bold.
  • Table 7: PopArt parameters, the same as used by Hessel et al. (2019).
  • Table 8: RL and other training parameters. These match the parameters used by Hessel et al. (2019), with differences in bold.
  • Table 9: Human normalized scores across tasks in the last 5% of training (the last 500M out of 10B frames). Statistically significant performance improvements in bold.
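For Table 9, human-normalized scores are presumably computed with the usual convention (0 for a random policy, 100 for human-level performance). A small sketch, with made-up numbers rather than the paper's results:

```python
import numpy as np

def human_normalized_score(agent, random, human):
    """Standard human-normalized score in percent:
    0 corresponds to a random policy, 100 to human-level performance."""
    return 100.0 * (agent - random) / (human - random)

# Illustrative numbers only, not taken from the paper's tables.
print(human_normalized_score(agent=np.array([25.0, 9.0]),
                             random=np.array([5.0, 1.0]),
                             human=np.array([45.0, 11.0])))  # -> [50. 80.]
```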
Related work
  • Littman et al. (2001) and Singh et al. (2003) introduced predictive state representations (PSRs) and an algorithm for learning such representations. PSRs are akin to action-conditional predictions of the future where the prediction tasks are indicators of events on the finite observation space.

    The power of PSRs is the ability to act as a compact replacement for the state that can be constructed from observable quantities. That is, PSRs are compact sufficient statistics of the history. Inspired by this line of research, we focus on learning neural representations that are predictive of observations multiple timesteps into the future, conditioned on the agent’s actions.

    A number of other works investigated learning predictive representations (with action-conditional predictions of the future) to improve RL performance. Oh et al. (2015) use generated observations in addition to the actions when compressing partial histories. Amos et al. (2018) learned representations by maximizing the likelihood of future proprioceptive observations using a PreCo model, which interleaves an action-dependent (predictor) RNN with an observation-dependent (corrector) RNN for compressing full histories, and uses the predictor RNN for the action-only sequences in partial histories (cf. fig. 1, where one RNN is used for the full histories, and another for the partial histories). Oh et al. (2017) and Schrittwieser et al. (2019) learn to predict the base elements needed for Monte Carlo tree search conditioned on actions (Schrittwieser et al., 2019) or options (Oh et al., 2017): rewards, values, logits of the search prior policy (in the case of Schrittwieser et al., 2019), and termination/discount (in the case of Oh et al., 2017). Guo et al. (2018), Moreno et al. (2018), and Gregor et al. (2019) used an architecture similar to fig. 1 with a particular focus on learning representations of the belief state. Ha & Schmidhuber (2018) and Hafner et al. (2019) used variational autoencoders (VAEs; Kingma & Welling, 2014; Gregor et al., 2015) to shape the full-history RNN representation, and trained latent models from the partial histories to maximize the likelihood of the respective full histories. The Simcore DRAW technique (Gregor et al., 2019) is a demonstrably strong VAE-based representation learning technique for single-task deep RL. It uses the same architecture as fig. 1 to compress histories, but trains h^f and h^p by trying to autoencode observations. Taking Z_{t+k} ∼ P_φ(Z_{t+k} | O_{t+k}) to be the DRAW latent with parameters φ, the technique maximizes the (parametric) likelihood P_φ(O_{t+k} | Z_{t+k}) subject to the posterior P_φ(Z_{t+k} | O_{t+k}) being close (in KL divergence) to the prior P_φ(Z_{t+k} | B^θ_{t+k}).
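For comparison with PBL's purely latent targets, the Simcore DRAW objective described above has the shape of a conditional-VAE loss. The sketch below uses diagonal Gaussians and a fixed linear decoder as stand-ins (the actual method uses a recurrent DRAW model with the prior conditioned on the belief B^θ_{t+k}); parameters and shapes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d_z, d_o = 4, 6
o = rng.normal(size=d_o)  # future observation O_{t+k}

# Parameters of the posterior q(z | o) and of the conditional prior p(z | b);
# in practice both would be produced by networks from o and the belief state b.
mu_q, logvar_q = rng.normal(size=d_z), rng.normal(size=d_z)
mu_p, logvar_p = rng.normal(size=d_z), rng.normal(size=d_z)

# Reparameterized sample z ~ q(z | o).
z = mu_q + np.exp(0.5 * logvar_q) * rng.normal(size=d_z)

# Reconstruction term: log p(o | z) under a unit-variance Gaussian decoder
# (a fixed linear map here, standing in for the DRAW decoder).
W_dec = rng.normal(size=(d_z, d_o)) / np.sqrt(d_z)
log_lik = -0.5 * np.sum((o - z @ W_dec) ** 2)  # up to an additive constant

# KL( q(z | o) || p(z | b) ) between two diagonal Gaussians.
kl = 0.5 * np.sum(
    logvar_p - logvar_q
    + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
    - 1.0
)

neg_elbo = kl - log_lik  # the quantity minimized during representation learning
print(neg_elbo)
```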
Reference
  • Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Largescale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
  • Amos, B., Dinh, L., Cabi, S., Rothörl, T., Colmenarejo, S. G., Muldal, A., Erez, T., Tassa, Y., de Freitas, N., and Denil, M. Learning awareness models. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.
  • Aytar, Y., Pfaff, T., Budden, D., Paine, T. L., Wang, Z., and de Freitas, N. Playing hard exploration games by watching YouTube. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pp. 2935–2945, 2018.
  • Azar, M. G., Piot, B., Pires, B. A., Grill, J.-B., Altché, F., and Munos, R. World discovery models. arXiv preprint arXiv:1902.07685, 2019.
  • Beattie, C., Leibo, J. Z., Teplyashin, D., Ward, T., Wainwright, M., Küttler, H., Lefrancq, A., Green, S., Valdés, V., Sadik, A., Schrittwieser, J., Anderson, K., York, S., Cant, M., Cain, A., Bolton, A., Gaffney, S., King, H., Hassabis, D., Legg, S., and Petersen, S. DeepMind lab, 2016.
  • Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
  • Brunskill, E. and Li, L. Sample complexity of multi-task reinforcement learning. In Proceedings of the TwentyNinth Conference on Uncertainty in Artificial Intelligence, UAI’13, pp. 122–131, Arlington, Virginia, USA, 2013. AUAI Press.
  • Burda, Y., Edwards, H., Storkey, A. J., and Klimov, O. Exploration by random network distillation. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, 2019.
  • Cassandra, A. R., Kaelbling, L. P., and Littman, M. L. Acting optimally in partially observable stochastic domains. In Proceedings of the 12th National Conference on Artificial Intelligence, Seattle, WA, USA, July 31 - August 4, 1994, Volume 2, pp. 1023–1028, 1994.
  • Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., Legg, S., and Kavukcuoglu, K. IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 1406–1415, 2018.
  • Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., and Wierstra, D. DRAW: A recurrent neural network for image generation. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pp. 1462–1471, 2015.
  • Gregor, K., Jimenez Rezende, D., Besse, F., Wu, Y., Merzic, H., and van den Oord, A. Shaping belief states with generative environment models for RL. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32, pp. 13475–13487. Curran Associates, Inc., 2019.
  • Guo, Z. D., Azar, M. G., Piot, B., Pires, B. A., Pohlen, T., and Munos, R. Neural predictive belief representations. arXiv preprint arXiv:1811.06407, 2018.
  • Ha, D. and Schmidhuber, J. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pp. 2455–2467, 2018.
  • Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., and Davidson, J. Learning latent dynamics for planning from pixels. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 2555–2565, Long Beach, California, USA, 09–15 Jun 2019. PMLR.
  • He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 770–778, 2016a. doi: 10.1109/CVPR.2016.90.
  • He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European Conference on Computer Vision (ECCV), pp. 630–645. Springer, 2016b.
  • Hessel, M., Soyer, H., Espeholt, L., Czarnecki, W., Schmitt, S., and van Hasselt, H. Multi-task deep reinforcement learning with PopArt. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 3796–3803, 2019.
  • Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. doi: 10. 1162/neco.1997.9.8.1735.
  • Houthooft, R., Chen, X., Duan, Y., Schulman, J., Turck, F. D., and Abbeel, P. VIME: variational information maximizing exploration. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 1109–1117, 2016.
  • Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. Reinforcement learning with unsupervised auxiliary tasks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.
  • Kingma, D. P. and Welling, M. Auto-encoding variational bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.
  • Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-toend training of deep visuomotor policies. Journal of Machine Learning Research, 17(1):1334–1373, January 2016. ISSN 1532-4435.
  • Littman, M. L., Sutton, R. S., and Singh, S. P. Predictive representations of state. In Advances in Neural Information Processing Systems 14 [Neural Information Processing Systems: Natural and Synthetic, NIPS 2001, December 3-8, 2001, Vancouver, British Columbia, Canada], pp. 1555–1561, 2001.
  • Mirowski, P., Pascanu, R., Viola, F., Soyer, H., Ballard, A., Banino, A., Denil, M., Goroshin, R., Sifre, L., Kavukcuoglu, K., Kumaran, D., and Hadsell, R. Learning to navigate in complex environments. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.
  • Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540): 529, 2015.
  • Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pp. 1928–1937, 2016.
  • Moreno, P., Humplik, J., Papamakarios, G., Pires, B. A., Buesing, L., Heess, N., and Weber, T. Neural belief states for partially observed domains. In NeurIPS 2018 workshop on Reinforcement Learning under Partial Observability, 2018.
  • Nair, V. and Hinton, G. E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML10), June 21-24, 2010, Haifa, Israel, pp. 807–814, 2010.
  • Oh, J., Guo, X., Lee, H., Lewis, R. L., and Singh, S. P. Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp. 2863–2871, 2015.
  • Oh, J., Singh, S., and Lee, H. Value prediction network. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 6118–6128, 2017.
  • Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 2778–2787, 2017.
  • Puigdomènech Badia, A., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., and Blundell, C. Never give up: Learning directed exploration strategies. In International Conference on Learning Representations, 2020.
  • Zhang, M., Vikram, S., Smith, L., Abbeel, P., Johnson, M. J., and Levine, S. SOLAR: Deep structured representations for model-based reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pp. 7444–7453, 2019.
  • Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., et al. Mastering Atari, Go, chess and shogi by planning with a learned model. arXiv preprint arXiv:1911.08265, 2019.
  • Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.
  • Singh, S. P., Littman, M. L., Jong, N. K., Pardoe, D., and Stone, P. Learning predictive state representations. In Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), August 21-24, 2003, Washington, DC, USA, pp. 712–719, 2003.
  • Song, H. F., Abdolmaleki, A., Springenberg, J. T., Clark, A., Soyer, H., Rae, J. W., Noury, S., Ahuja, A., Liu, S., Tirumala, D., Heess, N., Belov, D., Riedmiller, M., and Botvinick, M. M. V-MPO: On-policy maximum a posteriori policy optimization for discrete and continuous control. In (to appear) 8th International Conference on Learning Representations, ICLR 2020, 2020.
  • Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT press, 2018.
  • Szepesvári, C. Algorithms for reinforcement learning. Synthesis lectures on artificial intelligence and machine learning, 4(1):1–103, 2010.
  • Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., Casas, D. d. L., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., et al. DeepMind control suite. arXiv preprint arXiv:1801.00690, 2018.
  • Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575 (7782):350–354, 2019.
  • Table 1 collects the different architecture parameters for the “main agent networks”. These include the observation processing, the RNN for full histories, and the MLPs for value estimates and policies. Our choices follow Hessel et al. (2019), with a few exceptions that are given in bold in Table 1. The differences are limited to increases in network size. The RNNs used are LSTMs (Hochreiter & Schmidhuber, 1997) and the networks used for image processing are ResNets (He et al., 2016a). The DMLab-30 observations we use and the way we process them are the same as in Hessel et al. (2019) (with the exception of