# Discovering Diverse Multi-Agent Strategic Behavior via Reward Randomization

ICLR, 2021.

Abstract:

We propose a simple, general and effective technique, Reward Randomization, for discovering diverse strategic policies in complex multi-agent games. Combining reward randomization and policy gradient, we derive a new algorithm, Reward-Randomized Policy Gradient (RPG). RPG is able to discover a set of multiple distinctive human-interpretabl...

Introduction

- Games have been a long-standing benchmark for artificial intelligence, which prompts persistent technical advances towards the ultimate goal of building intelligent agents like humans, from Shannon’s initial interest in Chess (Shannon, 1950) and IBM DeepBlue (Campbell et al, 2002), to the most recent deep reinforcement learning breakthroughs in Go (Silver et al, 2017), Dota II (OpenAI et al, 2019) and Starcraft (Vinyals et al, 2019).
- The authors propose a simple technique, Reward Randomization (RR), which can help PG discover the “risky” cooperation strategy in the stag-hunt game with theoretical guarantees.

Highlights

- Nash Equilibrium (NE) (Nash, 1951), where no player can benefit from altering its strategy unilaterally, provides a general solution concept and serves as a goal for policy learning. It has attracted increasing interest from AI researchers (Heinrich & Silver, 2016; Lanctot et al, 2017; Foerster et al, 2018; Kamra et al, 2019; Han & Hu, 2019; Bai & Jin, 2020; Perolat et al, 2020): many existing works study how to design practical multi-agent reinforcement learning (MARL) algorithms that provably converge to an NE in Markov games, in the zero-sum setting
- Despite the empirical success of these algorithms, a fundamental question remains largely unstudied in the field: even if an MARL algorithm converges to an NE, which equilibrium will it converge to? The existence of multiple NEs is extremely common in many multi-agent games
- In many games where multiple distinct NEs exist, the popular decentralized policy gradient algorithm (PG), which has led to great successes in numerous games including Dota II and StarCraft, always converges to a particular NE with non-optimal payoffs and fails to explore more diverse modes in the strategy space
- Note that in Stag Hunt, we focus on the Stag NE, which has the highest payoff for both agents. In general, Reward Randomization (RR) can be applied to NE selection in other matrix-form games using a payoff evaluation function E(π1, π2)
- Neither population-based training (PBT) nor Random Network Distillation (RND) was able to find any cooperative strategies in the aggressive game, while Reward-Randomized Policy Gradient (RPG) stably discovers a cooperative equilibrium with a significantly higher reward
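
To make the multiple-equilibria point concrete, the following minimal sketch enumerates the pure-strategy NEs of a 2x2 stag-hunt game. The payoff layout follows the usual stag-hunt convention with a > b ≥ d > c; the numeric values and the helper name `pure_nash_equilibria` are assumptions chosen for illustration, not taken from the paper.

```python
from itertools import product

# Illustrative payoffs satisfying a > b >= d > c (the values are assumptions).
a, b, c, d = 4.0, 3.0, -10.0, 1.0
STAG, HARE = 0, 1

# payoff[i][j] = (row player's payoff, column player's payoff)
payoff = [
    [(a, a), (c, b)],  # row player hunts the stag
    [(b, c), (d, d)],  # row player hunts the hare
]

def pure_nash_equilibria(payoff):
    """Return all pure action pairs where neither player gains by deviating alone."""
    equilibria = []
    for i, j in product(range(2), repeat=2):
        row_ok = all(payoff[i][j][0] >= payoff[k][j][0] for k in range(2))
        col_ok = all(payoff[i][j][1] >= payoff[i][k][1] for k in range(2))
        if row_ok and col_ok:
            equilibria.append((i, j))
    return equilibria

equilibria = pure_nash_equilibria(payoff)
print(equilibria)
```

Both (Stag, Stag) and (Hare, Hare) pass the deviation check, so a learner that converges to an NE can still end up at the lower-payoff one.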

Results

- If the authors can define an appropriate space R over different utility functions and draw samples from it, they may discover desired novel strategies by running PG on a sampled utility function R′ and evaluating the obtained policy profile on the original game with its original utility function.
- Remark 2: Thm. 2 suggests that, compared with policy randomization, perturbing the payoff matrix makes it substantially easier to discover a strategy that can hardly be reached in the original game.
- The authors can define a reward function space R, train a population of policy profiles in parallel with reward functions sampled from R, and select the desired strategy by evaluating the obtained policy profiles in the original game M.
- With a diverse set of strategies, the authors can build an adaptive agent by training with a random opponent policy sampled from the set per episode, so that the agent is
- The authors present empirical results showing that in all the introduced testbeds, including the real-world game Agar.io, RPG always discovers diverse strategic behaviors and achieves an equilibrium with substantially higher rewards than standard multi-agent PG methods.
- Standard setting: PG in the original game (w = [1, 0]) leads to typical trust-dilemma dynamics: the two agents first learn to hunt and occasionally Cooperate (Fig. 9(a)), i.e., eat a script cell with the other agent close by; then one agent accidentally Attacks the other (Fig. 9(b)), which yields a big immediate bonus and makes the policy aggressive; finally, the policies converge to the non-cooperative equilibrium where both agents keep apart and hunt alone.
- Neither PBT nor RND was able to find any cooperative strategies in the aggressive game while RPG stably discovers a cooperative equilibrium with a significantly higher reward.
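
The sample, train, evaluate, select loop described above can be sketched on a 2x2 stag-hunt matrix game. This is a toy illustration under stated assumptions, not the authors' implementation: each agent plays Stag with probability sigmoid(θ), gradients are computed exactly rather than estimated from rollouts, the payoff values and the sampling range [-4, 4] are made up, and `evaluate` (sum of both agents' payoffs) is a hypothetical choice of evaluation function.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def expected_payoff(game, p, q):
    """Expected payoff of a player hunting the stag w.p. p against a partner doing so w.p. q."""
    a, b, c, d = game
    return p * q * a + p * (1 - q) * c + (1 - p) * q * b + (1 - p) * (1 - q) * d

def train_pg(game, steps=2000, lr=0.5):
    """Independent exact-gradient ascent for both agents, starting from uniform policies."""
    a, b, c, d = game
    theta1 = theta2 = 0.0
    for _ in range(steps):
        p, q = sigmoid(theta1), sigmoid(theta2)
        # Derivative of each agent's own expected payoff w.r.t. its own logit.
        grad1 = p * (1 - p) * (q * (a - b) + (1 - q) * (c - d))
        grad2 = q * (1 - q) * (p * (a - b) + (1 - p) * (c - d))
        theta1 += lr * grad1
        theta2 += lr * grad2
    return sigmoid(theta1), sigmoid(theta2)

def evaluate(original, profile):
    """Hypothetical evaluation function: sum of both agents' payoffs in the original game."""
    p, q = profile
    return expected_payoff(original, p, q) + expected_payoff(original, q, p)

def reward_randomization(original, candidates):
    """Train one profile per candidate game; keep the one scoring best on the original game."""
    trained = [(game, train_pg(game)) for game in candidates]
    return max(trained, key=lambda item: evaluate(original, item[1]))

original = (4.0, 3.0, -10.0, 1.0)  # assumed values with a > b >= d > c
rng = random.Random(0)
candidates = [tuple(rng.uniform(-4.0, 4.0) for _ in range(4)) for _ in range(16)]

baseline = train_pg(original)  # plain PG on the original game
best_game, best_profile = reward_randomization(original, candidates)
print("baseline score:", evaluate(original, baseline))
print("selected score:", evaluate(original, best_profile))
```

From a uniform start, the gradient in the original game points toward Hare whenever the partner hunts the stag with probability below 11/12 for these values, which is why plain PG settles on the low-payoff equilibrium. A sampled game in which Stag is attractive from the start yields a profile that remains an equilibrium when evaluated on the original payoffs.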

Conclusion

- The evaluation results are shown in Tab. 4, where the adaptive policy successfully exploits all the test-time opponents, including M(onster)-Alone, which was trained to actively avoid the other agent.
- The authors primarily focus on how reward randomization empirically helps MARL discover better strategies in practice and only consider stag hunt as a challenging example where an “optimal” NE with a high payoff for every agent exists.
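
The adaptive-agent idea (train against an opponent drawn at random from the strategy set each episode) can be sketched with a small tabular learner. Everything here is a simplifying assumption: the two opponents are fixed hand-written strategies standing in for RPG-discovered policies, the agent observes only the opponent's previous action, and learning is a running-mean bandit update rather than the policy-gradient training used in the paper.

```python
import random

STAG, HARE = 0, 1
# Agent payoff for (own action, opponent action); the values are assumptions.
PAYOFF = {(STAG, STAG): 4.0, (STAG, HARE): -10.0, (HARE, STAG): 3.0, (HARE, HARE): 1.0}

# Stand-ins for a diverse strategy set: each opponent always plays one action.
OPPONENTS = {"cooperative": STAG, "non_cooperative": HARE}

def greedy(q, obs):
    """Action with the highest estimated reward for this observation."""
    return max((STAG, HARE), key=lambda act: q[obs][act])

def train_adaptive(episodes=500, rounds=4, epsilon=0.3, seed=0):
    rng = random.Random(seed)
    # q[obs][action] is a running mean reward; obs is the opponent's previous action.
    q = {obs: [0.0, 0.0] for obs in (None, STAG, HARE)}
    counts = {obs: [0, 0] for obs in (None, STAG, HARE)}
    for _ in range(episodes):
        # A fresh opponent is sampled from the set at the start of every episode.
        opp_action = OPPONENTS[rng.choice(sorted(OPPONENTS))]
        obs = None
        for _ in range(rounds):
            if rng.random() < epsilon:
                act = rng.choice((STAG, HARE))  # explore
            else:
                act = greedy(q, obs)            # exploit
            reward = PAYOFF[(act, opp_action)]
            counts[obs][act] += 1
            q[obs][act] += (reward - q[obs][act]) / counts[obs][act]
            obs = opp_action
    return q

q_table = train_adaptive()
print("after seeing Stag:", greedy(q_table, STAG))
print("after seeing Hare:", greedy(q_table, HARE))
```

Because the opponent changes between episodes, a single fixed action cannot be optimal; the learned table conditions on what the opponent just did, which is the behavior the adaptive policy is meant to exhibit.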

- Table 1: The stag-hunt game, a > b ≥ d > c
- Table 2: Results in the standard setting of
- Table 3: Results in the aggressive setting of Agar.io: PBT: population
- Table 4: Stats. of the adaptive agent in Monster-Hunt with hold-out test-time opponents. #C(oop.)-H(unt)
- Table 5: Adaptation test in Agar.io. Opponent type is switched half-way per episode. #Attack, #Coop.: episode statistics; Rew.: agent reward. Adaptive agents’ rewards are close to oracles. Tab. 5 compares the second-half behavior of the adaptive agent with the oracle pure-competitive/cooperative agents
- Table 6: PPO hyper-parameters used in Gridworld games, learning rate is linearly annealed during training
- Table 7: PPO hyper-parameters used in Agar.io
- Table 8: Frequencies of 4 types of events and rewards of different policies of Agar.io after completely training
- Table 9: Evaluation of different policy profiles obtained via RR in original Iterative Stag-Hunt. Note that w = [4, 0, 0, 0] has the best performance among the policy profiles, and is the optimal NE with no further fine-tuning
- Table 10: Statistics of the adaptive policy in Iterative Stag-Hunt with 4 hand-designed opponents with different behavior preferences. #Stag: the adaptive agent hunts the stag; #Hare: the adaptive agent eats the hare; The adaptive policy successfully exploits different opponents, including cooperating with TFT opponent, which is totally different from trained opponents

Related work

Our core idea is reward perturbation. In game theory, this is aligned with the quantal response equilibrium (McKelvey & Palfrey, 1995), a smoothed version of NE obtained when payoffs are perturbed by Gumbel noise. In RL, reward shaping is popular for learning desired behavior in various domains (Ng et al, 1999; Babes et al, 2008; Devlin & Kudenko, 2011), which inspires our idea for finding diverse strategic behavior. By contrast, state-space exploration methods (Pathak et al, 2017; Burda et al, 2019; Eysenbach et al, 2019; Sharma et al, 2020) only learn low-level primitives without strategy-level diversity (Baker et al, 2020). RR trains a set of policies, which is aligned with population-based training in MARL (Jaderberg et al, 2017; 2019; Vinyals et al, 2019; Long et al, 2020). RR is conceptually related to domain randomization (Tobin et al, 2017), with the difference that we train separate policies instead of a single universal one, which suffers from mode collapse (see appendix). RPG is also inspired by the MAP-Elites algorithm (Cully et al, 2015) from the evolutionary learning community, which optimizes multiple objectives simultaneously for sufficiently diverse policies. Besides, RPG helps train adaptive policies against a set of opponents, which is related to Bayesian games (Dekel et al, 2004; Hartline et al, 2015). In RL, there are works on learning when to cooperate/compete (Littman, 2001; Peysakhovich & Lerer, 2018a; Kleiman-Weiner et al, 2016; Woodward et al, 2019; McKee et al, 2020), which is a special case of ours, or learning robust policies (Li et al, 2019; Shen & How, 2019; Hu et al, 2020), which complements our method.
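
As a small illustration of the quantal-response connection: under i.i.d. Gumbel payoff noise, a player's choice probabilities take the logit (softmax) form in the expected utilities. The sketch below computes that response for one stag-hunt player; the payoff values, the opponent's mixing probability, and the precision parameter `lam` are assumptions for illustration only.

```python
import math

def logit_response(utilities, lam):
    """Softmax over expected utilities; lam = 0 is uniform, large lam approaches best response."""
    top = max(utilities)
    weights = [math.exp(lam * (u - top)) for u in utilities]  # shift by max for stability
    total = sum(weights)
    return [w / total for w in weights]

# Expected utilities of (Stag, Hare) against an opponent hunting the stag w.p. q.
a, b, c, d = 4.0, 3.0, -10.0, 1.0  # assumed values with a > b >= d > c
q = 0.5
utilities = [q * a + (1 - q) * c, q * b + (1 - q) * d]

uniform = logit_response(utilities, 0.0)
sharp = logit_response(utilities, 5.0)
print(uniform, sharp)
```

With these numbers Hare has the higher expected utility, so the response mass shifts toward Hare as `lam` grows, recovering the unperturbed best response in the limit.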

Reference

- Bo An, Milind Tambe, Fernando Ordonez, Eric Shieh, and Christopher Kiekintveld. Refinement of strong stackelberg equilibria in security games. In Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011.
- Monica Babes, Enrique Munoz de Cote, and Michael L Littman. Social reward shaping in the prisoner’s dilemma. In Proceedings of the 7th international joint conference on Autonomous agents and multiagent systems-Volume 3, pp. 1389–139. International Foundation for Autonomous Agents and Multiagent Systems, 2008.
- Yu Bai and Chi Jin. Provable self-play algorithms for competitive reinforcement learning. arXiv preprint arXiv:2002.04017, 2020.
- Bowen Baker, Ingmar Kanitscheider, Todor Markov, Yi Wu, Glenn Powell, Bob McGrew, and Igor Mordatch. Emergent tool use from multi-agent autocurricula, 2019.
- Bowen Baker, Ingmar Kanitscheider, Todor Markov, Yi Wu, Glenn Powell, Bob McGrew, and Igor Mordatch. Emergent tool use from multi-agent autocurricula. In International Conference on Learning Representations, 2020.
- David Balduzzi, Marta Garnelo, Yoram Bachrach, Wojciech M Czarnecki, Julien Perolat, Max Jaderberg, and Thore Graepel. Open-ended learning in symmetric zero-sum games. arXiv preprint arXiv:1901.08106, 2019.
- Jeffrey S Banks and Joel Sobel. Equilibrium selection in signaling games. Econometrica: Journal of the Econometric Society, pp. 647–661, 1987.
- B Douglas Bernheim, Bezalel Peleg, and Michael D Whinston. Coalition-proof Nash equilibria i. concepts. Journal of Economic Theory, 42(1):1–12, 1987.
- George W Brown. Iterative solution of games by fictitious play. Activity analysis of production and allocation, 13(1):374–376, 1951.
- Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. In International Conference on Learning Representations, 2019.
- Murray Campbell, A Joseph Hoane Jr, and Feng-hsiung Hsu. Deep blue. Artificial intelligence, 134 (1-2):57–83, 2002.
- Antoine Cully, Jeff Clune, Danesh Tarapore, and Jean-Baptiste Mouret. Robots that can adapt like animals. Nature, 521(7553):503–507, 2015.
- Eddie Dekel, Drew Fudenberg, and David K Levine. Learning to play Bayesian games. Games and Economic Behavior, 46(2):282–303, 2004.
- Sam Devlin and Daniel Kudenko. Theoretical considerations of potential-based reward shaping for multi-agent systems. In The 10th International Conference on Autonomous Agents and Multiagent Systems-Volume 1, pp. 225–232. International Foundation for Autonomous Agents and Multiagent Systems, 2011.
- Glenn Ellison. Learning, local interaction, and coordination. Econometrica: Journal of the Econometric Society, pp. 1047–1071, 1993.
- Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. In International Conference on Learning Representations, 2019.
- Christina Fang, Steven Orla Kimbrough, Stefano Pace, Annapurna Valluri, and Zhiqiang Zheng. On adaptive emergence of trust behavior in the game of stag hunt. Group Decision and Negotiation, 11(6):449–467, 2002.
- Fei Fang, Albert Xin Jiang, and Milind Tambe. Protecting moving targets with multiple mobile resources. Journal of Artificial Intelligence Research, 48:583–634, 2013.
- Jakob Foerster, Richard Y Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. Learning with opponent-learning awareness. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 122–130. International Foundation for Autonomous Agents and Multiagent Systems, 2018.
- Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pp. 797–842, 2015.
- Russell Golman and Scott E Page. Individual and cultural learning in stag hunt games with multiple actions. Journal of Economic Behavior & Organization, 73(3):359–376, 2010.
- Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart J Russell, and Anca Dragan. Inverse reward design. In Advances in neural information processing systems, pp. 6765–6774, 2017.
- Jiequn Han and Ruimeng Hu. Deep fictitious play for finding Markovian Nash equilibrium in multi-agent games. arXiv preprint arXiv:1912.01809, 2019.
- Jason Hartline, Vasilis Syrgkanis, and Eva Tardos. No-regret learning in Bayesian games. In Advances in Neural Information Processing Systems, pp. 3061–3069, 2015.
- Johannes Heinrich and David Silver. Deep reinforcement learning from self-play in imperfect-information games. arXiv preprint arXiv:1603.01121, 2016.
- Hengyuan Hu, Adam Lerer, Alex Peysakhovich, and Jakob Foerster. Other-play for zero-shot coordination. arXiv preprint arXiv:2003.02979, 2020.
- Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017.
- Max Jaderberg, Wojciech M Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castaneda, Charles Beattie, Neil C Rabinowitz, Ari S Morcos, Avraham Ruderman, et al. Human-level performance in 3D multiplayer games with population-based reinforcement learning. Science, 364(6443):859–865, 2019.
- Nitin Kamra, Umang Gupta, Kai Wang, Fei Fang, Yan Liu, and Milind Tambe. Deep fictitious play for games with continuous action spaces. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pp. 2042–2044. International Foundation for Autonomous Agents and Multiagent Systems, 2019.
- Michihiro Kandori, George J Mailath, and Rafael Rob. Learning, mutation, and long run equilibria in games. Econometrica: Journal of the Econometric Society, pp. 29–56, 1993.
- Max Kleiman-Weiner, Mark K Ho, Joseph L Austerweil, Michael L Littman, and Joshua B Tenenbaum. Coordinate to cooperate or compete: abstract goals and joint intentions in social interaction. In CogSci, 2016.
- Robert Kleinberg, Yuanzhi Li, and Yang Yuan. An alternative view: When does sgd escape local minima? arXiv preprint arXiv:1802.06175, 2018.
- Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Pérolat, David Silver, and Thore Graepel. A unified game-theoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems, pp. 4190–4203, 2017.
- Joel Z Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. Multi-agent reinforcement learning in sequential social dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pp. 464–473, 2017.
- Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
- Richard Li, Allan Jabri, Trevor Darrell, and Pulkit Agrawal. Towards practical multi-object manipulation using relational reinforcement learning. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2020.
- Shihui Li, Yi Wu, Xinyue Cui, Honghua Dong, Fei Fang, and Stuart Russell. Robust multi-agent reinforcement learning via minimax deep deterministic policy gradient. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 4213–4220, 2019.
- Michael L Littman. Friend-or-foe q-learning in general-sum games. In ICML, volume 1, pp. 322–328, 2001.
- Qian Long, Zihan Zhou, Abhinav Gupta, Fei Fang, Yi Wu, and Xiaolong Wang. Evolutionary population curriculum for scaling multi-agent reinforcement learning. In International Conference on Learning Representations, 2020.
- Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in neural information processing systems, pp. 6379–6390, 2017.
- Anuj Mahajan, Tabish Rashid, Mikayel Samvelyan, and Shimon Whiteson. Maven: Multi-agent variational exploration. In Advances in Neural Information Processing Systems, pp. 7611–7622, 2019.
- Kevin R McKee, Ian Gemp, Brian McWilliams, Edgar A Duéñez-Guzmán, Edward Hughes, and Joel Z Leibo. Social diversity and social preferences in mixed-motive reinforcement learning. arXiv preprint arXiv:2002.02325, 2020.
- Richard D McKelvey and Thomas R Palfrey. Quantal response equilibria for normal form games. Games and economic behavior, 10(1):6–38, 1995.
- H Brendan McMahan, Geoffrey J Gordon, and Avrim Blum. Planning in the presence of cost functions controlled by an adversary. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 536–543, 2003.
- Piotr Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andrew J Ballard, Andrea Banino, Misha Denil, Ross Goroshin, Laurent Sifre, Koray Kavukcuoglu, et al. Learning to navigate in complex environments. arXiv preprint arXiv:1611.03673, 2016.
- Dov Monderer and Lloyd S Shapley. Potential games. Games and economic behavior, 14(1):124–143, 1996.
- Roger B Myerson. Refinements of the Nash equilibrium concept. International journal of game theory, 7(2):73–80, 1978.
- John Nash. Non-cooperative games. Annals of mathematics, pp. 286–295, 1951.
- Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pp. 278–287, 1999.
- Andrew Y Ng, Stuart J Russell, et al. Algorithms for inverse reinforcement learning. In Icml, volume 1, pp. 2, 2000.
- OpenAI, Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Debiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, Rafal Józefowicz, Scott Gray, Catherine Olsson, Jakub Pachocki, Michael Petrov, Henrique Pondé de Oliveira Pinto, Jonathan Raiman, Tim Salimans, Jeremy Schlatter, Jonas Schneider, Szymon Sidor, Ilya Sutskever, Jie Tang, Filip Wolski, and Susan Zhang. Dota 2 with large scale deep reinforcement learning, 2019.
- Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 16–17, 2017.
- Julien Perolat, Remi Munos, Jean-Baptiste Lespiau, Shayegan Omidshafiei, Mark Rowland, Pedro Ortega, Neil Burch, Thomas Anthony, David Balduzzi, Bart De Vylder, et al. From Poincare recurrence to convergence in imperfect information games: Finding equilibrium via regularization. arXiv preprint arXiv:2002.08456, 2020.
- Alexander Peysakhovich and Adam Lerer. Consequentialist conditional cooperation in social dilemmas with imperfect information. In International Conference on Learning Representations, 2018a.
- Alexander Peysakhovich and Adam Lerer. Prosocial learning agents solve generalized stag hunts better than selfish ones. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 2043–2044. International Foundation for Autonomous Agents and Multiagent Systems, 2018b.
- Julia Robinson. An iterative method of solving a game. Annals of mathematics, pp. 296–301, 1951.
- Jean-Jacques Rousseau. A discourse on inequality. Penguin, 1984.
- John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- R Selten. Reexamination of the perfectness concept for equilibrium points in extensive games. International Journal of Game Theory, 4(1):25–55, 1975.
- Reinhard Selten. Spieltheoretische behandlung eines oligopolmodells mit nachfrageträgheit: Teil i: Bestimmung des dynamischen preisgleichgewichts. Zeitschrift für die gesamte Staatswissenschaft/Journal of Institutional and Theoretical Economics, (H. 2):301–324, 1965.
- Claude E Shannon. Xxii. programming a computer for playing chess. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 41(314):256–275, 1950.
- Archit Sharma, Shixiang Gu, Sergey Levine, Vikash Kumar, and Karol Hausman. Dynamics-aware unsupervised discovery of skills. In International Conference on Learning Representations, 2020.
- Macheng Shen and Jonathan P How. Robust opponent modeling via adversarial ensemble reinforcement learning in asymmetric imperfect-information games. arXiv preprint arXiv:1909.08735, 2019.
- David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, 2017.
- David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419): 1140–1144, 2018.
- Satinder P Singh, Michael J Kearns, and Yishay Mansour. Nash convergence of gradient dynamics in general-sum games. In UAI, pp. 541–548, 2000.
- Brian Skyrms. The stag hunt and the evolution of social structure. Cambridge University Press, 2004.
- Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman, Filip DeTurck, and Pieter Abbeel. #exploration: A study of count-based exploration for deep reinforcement learning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems 30, pp. 2753–2762. 2017.
- Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS), pp. 23–30. IEEE, 2017.
- Oriol Vinyals, Timo Ewalds, Sergey Bartunov, Petko Georgiev, Alexander Sasha Vezhnevets, Michelle Yeo, Alireza Makhzani, Heinrich Küttler, John Agapiou, Julian Schrittwieser, et al. Starcraft II: A new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782, 2017.
- Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
- Yufei Wang, Zheyuan Ryan Shi, Lantao Yu, Yi Wu, Rohit Singh, Lucas Joppa, and Fei Fang. Deep reinforcement learning for green security games with real-time information. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 1401–1408, 2019.
- Wikipedia. Agar.io, 2020. URL http://en.wikipedia.org/wiki/Agar.io. Accessed 3-June-2020.
- Mark Woodward, Chelsea Finn, and Karol Hausman. Learning to interactively learn and assist. arXiv preprint arXiv:1906.10187, 2019.
- Yi Wu, Yuxin Wu, Georgia Gkioxari, and Yuandong Tian. Building generalizable agents with a realistic and rich 3D environment. arXiv preprint arXiv:1801.02209, 2018.
- Yuxin Wu and Yuandong Tian. Training agent for first-person shooter game with actor-critic curriculum learning. 2016.
- Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning (CoRL), 2019.
- We suggest visiting https://sites.google.com/view/staghuntrpg for example videos.
