# A Generalized Training Approach for Multiagent Learning

ICLR, 2020.

Keywords:

multiagent learning game theory training games

Abstract:

This paper investigates a population-based training regime based on game-theoretic principles called Policy-Space Response Oracles (PSRO). PSRO is general in the sense that it (1) encompasses well-known algorithms such as fictitious play and double oracle as special cases, and (2) in principle applies to general-sum, many-player games. D...

Introduction

- Creating agents that learn to interact in large-scale systems is a key challenge in artificial intelligence.
- Prior applications of PSRO have used Nash equilibria as the policy-selection distribution (Lanctot et al, 2017; Balduzzi et al, 2019), which limits the scalability of PSRO to general games: Nash equilibria are intractable to compute in general (Daskalakis et al, 2009); computing approximate Nash equilibria is intractable, even for some classes of two-player games (Daskalakis, 2013); when they can be computed, Nash equilibria suffer from a selection problem (Harsanyi et al, 1988; Goldberg et al, 2013).
- The authors conduct preliminary evaluations in MuJoCo soccer (Liu et al, 2019), another complex domain, using reinforcement learning agents as oracles in the proposed PSRO variants to illustrate the feasibility of the approach.
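The selection problem mentioned above can be illustrated on a small general-sum game. The following is a minimal sketch, not taken from the paper: the payoff values are hypothetical Game-of-Chicken-style numbers chosen for illustration, and the function names are assumptions. It enumerates pure-strategy Nash equilibria by checking unilateral deviations.

```python
from itertools import product

# Hypothetical Chicken-style payoffs (illustrative values, not the paper's table).
# Actions: 0 = Swerve, 1 = Dare.
# PAYOFFS[(a1, a2)] = (payoff to player 1, payoff to player 2).
PAYOFFS = {
    (0, 0): (0, 0),
    (0, 1): (-1, 1),
    (1, 0): (1, -1),
    (1, 1): (-10, -10),
}
ACTIONS = (0, 1)


def is_pure_nash(profile):
    """Return True if no player gains by deviating unilaterally."""
    a1, a2 = profile
    u1, u2 = PAYOFFS[profile]
    p1_ok = all(PAYOFFS[(d, a2)][0] <= u1 for d in ACTIONS)
    p2_ok = all(PAYOFFS[(a1, d)][1] <= u2 for d in ACTIONS)
    return p1_ok and p2_ok


def pure_nash_equilibria():
    """List every pure-strategy profile that is a Nash equilibrium."""
    return [p for p in product(ACTIONS, ACTIONS) if is_pure_nash(p)]


if __name__ == "__main__":
    print(pure_nash_equilibria())
```

With these assumed payoffs the game has two pure equilibria, (Swerve, Dare) and (Dare, Swerve), and nothing in the Nash concept itself says which one a training procedure should target.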

Highlights

- Creating agents that learn to interact in large-scale systems is a key challenge in artificial intelligence
- Prior applications of Policy-Space Response Oracles have used Nash equilibria as the policy-selection distribution (Lanctot et al, 2017; Balduzzi et al, 2019), which limits the scalability of Policy-Space Response Oracles to general games: Nash equilibria are intractable to compute in general (Daskalakis et al, 2009); computing approximate Nash equilibria is intractable, even for some classes of two-player games (Daskalakis, 2013); when they can be computed, Nash equilibria suffer from a selection problem (Harsanyi et al, 1988; Goldberg et al, 2013)
- We study several Policy-Space Response Oracles variants in the context of general-sum, many-player games, providing convergence guarantees in several classes of such games for Policy-Space Response Oracles instances that use α-Rank as a meta-solver
- We have shown that for general-sum multi-player games, it is possible to give theoretical guarantees for a version of Policy-Space Response Oracles driven by α-Rank in several circumstances
- We conduct evaluations on games of increasing complexity, extending beyond prior Policy-Space Response Oracles applications that have focused on two-player zero-sum games
- The PCS-SCORE here is typically either (a) greater than 95%, or (b) less than 5%, and only rarely between 5% and 95%
- This paper studied variants of Policy-Space Response Oracles using α-Rank as a meta-solver, which were shown to be competitive with Nash-based Policy-Space Response Oracles in zero-sum games, and scale effortlessly to general-sum many-player games, in contrast to Nash-based Policy-Space Response Oracles

Results

- The authors conduct evaluations on games of increasing complexity, extending beyond prior PSRO applications that have focused on two-player zero-sum games.
- The authors conduct 10 trials per game, in each trial running the BR and PBR oracles starting from a random strategy in the corresponding response graph, iteratively expanding the population space until convergence.
- This implies that the starting strategy may not even be in a sink strongly-connected component (SSCC) of the response graph.
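The trial procedure described above (start from one strategy, query an oracle, grow the population until nothing new is added) can be sketched on a toy game. This is a minimal sketch under stated assumptions, not the paper's implementation: the game is rock-paper-scissors, the two meta-solvers shown (uniform and latest-only) are simple stand-ins rather than α-Rank or Nash, the oracle is an exact best response rather than PBR, and all function names are hypothetical.

```python
# Payoff to the row player in rock-paper-scissors.
# Index 0 = rock, 1 = paper, 2 = scissors.
PAYOFF = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]


def uniform_meta_solver(population):
    """Stand-in meta-solver: equal weight on every population member."""
    weight = 1.0 / len(population)
    return {s: weight for s in population}


def latest_meta_solver(population):
    """Stand-in meta-solver: all weight on the most recently added member."""
    return {population[-1]: 1.0}


def best_response(meta):
    """Exact best-response oracle against the meta-distribution."""
    def value(action):
        return sum(p * PAYOFF[action][s] for s, p in meta.items())
    return max(range(len(PAYOFF)), key=value)


def psro(start, meta_solver, max_iters=10):
    """Grow the population until the oracle returns a known strategy."""
    population = [start]
    for _ in range(max_iters):
        meta = meta_solver(population)
        response = best_response(meta)
        if response in population:
            break  # nothing new to add: the population has converged
        population.append(response)
    return population


if __name__ == "__main__":
    print(psro(0, latest_meta_solver))
    print(psro(0, uniform_meta_solver))
```

In this toy run, starting from rock, the latest-only meta-solver grows the population to all three strategies, while the uniform one stops after adding paper. The difference is only meant to show that the meta-solver determines which strategies the oracle ever gets asked to beat.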

Conclusion

- This paper studied variants of PSRO using α-Rank as a meta-solver, which were shown to be competitive with Nash-based PSRO in zero-sum games, and scale effortlessly to general-sum many-player games, in contrast to Nash-based PSRO.
- The authors strongly believe that the theoretical and empirical results established in this paper will play a key role in scaling up multiagent training in general settings.
- The authors gratefully thank Bart De Vylder for providing helpful feedback on the paper draft

Summary

## Introduction:

- Creating agents that learn to interact in large-scale systems is a key challenge in artificial intelligence.
- Prior applications of PSRO have used Nash equilibria as the policy-selection distribution (Lanctot et al, 2017; Balduzzi et al, 2019), which limits the scalability of PSRO to general games: Nash equilibria are intractable to compute in general (Daskalakis et al, 2009); computing approximate Nash equilibria is intractable, even for some classes of two-player games (Daskalakis, 2013); when they can be computed, Nash equilibria suffer from a selection problem (Harsanyi et al, 1988; Goldberg et al, 2013).
- The authors conduct preliminary evaluations in MuJoCo soccer (Liu et al, 2019), another complex domain wherein the authors use reinforcement learning agents as oracles in the proposed PSRO variants, illustrating the feasibility of the approach.

## Results:

- The authors conduct evaluations on games of increasing complexity, extending beyond prior PSRO applications that have focused on two-player zero-sum games.
- The authors conduct 10 trials per game, in each trial running the BR and PBR oracles starting from a random strategy in the corresponding response graph, iteratively expanding the population space until convergence.
- This implies that the starting strategy may not even be in an SSCC.

## Conclusion:

- This paper studied variants of PSRO using α-Rank as a meta-solver, which were shown to be competitive with Nash-based PSRO in zero-sum games, and scale effortlessly to general-sum many-player games, in contrast to Nash-based PSRO.
- The authors strongly believe that the theoretical and empirical results established in this paper will play a key role in scaling up multiagent training in general settings.
- The authors gratefully thank Bart De Vylder for providing helpful feedback on the paper draft.

- Table 1: Theory overview. SP and MP, resp., denote single and multi-population games. BR and PBR, resp., denote best response and preference-based best response. †Defined in the noted propositions
- Table 2: Symmetric zero-sum game used to analyze the behavior of PSRO in Example 1
- Table 3: Illustrative games used to analyze the behavior of PSRO in Example 4. Here, 0 < ε ≪ 1. The first game is symmetric, whilst the second is zero-sum. Both tables specify the payoff to Player 1 under each strategy profile
- Table 4: Game of Chicken payoff table
- Table 5: Prisoner’s Dilemma payoff table
- Table 6: PSRO(Rectified Nash, BR) evaluated on 2-player Kuhn Poker. Player 1’s payoff matrix shown for each respective training iteration
Related work

- We discuss the most closely related work along two axes. We start with PSRO-based research and some multiagent deep RL work that focuses on training of networks in various multiagent settings. We then continue with related work that uses evolutionary dynamics (α-Rank and replicator dynamics) as a solution concept to examine the underlying behavior of multiagent interactions using meta-games.
- Policy-Space Response Oracles (Lanctot et al, 2017) unify many existing approaches to multiagent learning. Notable examples include fictitious play (Brown, 1951; Robinson, 1951), independent reinforcement learning (Matignon et al, 2012) and the double oracle algorithm (McMahan et al, 2003). PSRO also relies, fundamentally, on principles from empirical game-theoretic analysis (EGTA) (Walsh et al, 2002; Phelps et al, 2004; Tuyls et al, 2018; Wellman, 2006; Vorobeychik, 2010; Wiedenbeck and Wellman, 2012; Wiedenbeck et al, 2014).
- The related Parallel Nash Memory (PNM) algorithm (Oliehoek et al, 2006), which can also be seen as a generalization of the double oracle algorithm, incrementally grows the space of strategies, though using a search heuristic rather than exact best responses. PNMs have been successfully applied to game settings utilizing function approximation, notably to address exploitability issues when training Generative Adversarial Networks (GANs) (Oliehoek et al, 2019).
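Fictitious play, one of the algorithms cited above as a special case of PSRO, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not code from the paper: it uses symmetric self-play on rock-paper-scissors, breaks ties by lowest action index, and the function name and parameters are hypothetical. At each step the player best-responds to the empirical distribution of all past actions.

```python
# Payoff to the row player in rock-paper-scissors.
# Index 0 = rock, 1 = paper, 2 = scissors.
PAYOFF = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]


def fictitious_play(num_steps, first_action=0):
    """Run symmetric fictitious play and return empirical action frequencies."""
    counts = [0, 0, 0]
    counts[first_action] = 1
    for _ in range(num_steps - 1):
        # Cumulative payoff of each action against the whole history so far.
        values = [
            sum(PAYOFF[a][s] * counts[s] for s in range(3))
            for a in range(3)
        ]
        # Best response to the empirical history (ties go to the lowest index).
        response = values.index(max(values))
        counts[response] += 1
    total = sum(counts)
    return [c / total for c in counts]


if __name__ == "__main__":
    print(fictitious_play(10000))
```

The individual plays cycle through rock, paper and scissors in ever longer runs, but the empirical frequencies drift toward the uniform mixture, which is the Nash equilibrium of this zero-sum game.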

Funding

- Investigates a population-based training regime based on game-theoretic principles called Policy-Space Response Oracles
- Demonstrates the competitive performance of α-Rank-based PSRO against an exact Nash solver-based PSRO in 2-player Kuhn and Leduc Poker
- Studies several PSRO variants in the context of general-sum, many-player games, providing convergence guarantees in several classes of such games for PSRO instances that use α-Rank as a meta-solver
- Develops a new notion of best response that guarantees convergence to the α-Rank distribution in several classes of games, verifying this empirically in randomly-generated general-sum games
- Demonstrates empirical results extending beyond the reach of PSRO with Nash as a meta-solver by evaluating training in 3- to 5-player games

Reference

- David Balduzzi, Marta Garnelo, Yoram Bachrach, Wojciech Czarnecki, Julien Perolat, Max Jaderberg, and Thore Graepel. Open-ended learning in symmetric zero-sum games. In International Conference on Machine Learning (ICML), 2019.
- Daan Bloembergen, Karl Tuyls, Daniel Hennes, and Michael Kaisers. Evolutionary dynamics of multi-agent learning: A survey. J. Artif. Intell. Res. (JAIR), 53:659–697, 2015.
- George W Brown. Iterative solution of games by fictitious play. Activity Analysis of Production and Allocation, 13(1):374–376, 1951.
- Ross Cressman and Yi Tao. The replicator equation and other game dynamics. Proceedings of the National Academy of Sciences USA, 111:10810–10817, 2014.
- Constantinos Daskalakis. On the complexity of approximating a Nash equilibrium. ACM Transactions on Algorithms, 9(3):23, 2013.
- Constantinos Daskalakis, Paul W Goldberg, and Christos H Papadimitriou. The complexity of computing a Nash equilibrium. SIAM Journal on Computing, 39(1):195–259, 2009.
- Arpad E Elo. The rating of chessplayers, past and present. Arco Pub., 1978.
- Jakob Foerster, Ioannis Alexandros Assael, Nando de Freitas, and Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems (NIPS), 2016.
- Jakob N Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In AAAI Conference on Artificial Intelligence, 2018.
- Paul W Goldberg, Christos H Papadimitriou, and Rahul Savani. The complexity of the homotopy method, equilibrium selection, and Lemke-Howson solutions. ACM Transactions on Economics and Computation, 1(2):9, 2013.
- John C Harsanyi, Reinhard Selten, et al. A general theory of equilibrium selection in games. MIT Press Books, 1, 1988.
- Daniel Hennes, Daniel Claes, and Karl Tuyls. Evolutionary advantage of reciprocity in collision avoidance. In AAMAS Workshop on Autonomous Robots and Multirobot Systems (ARMS), 2013.
- Pablo Hernandez-Leal, Bilal Kartal, and Matthew E Taylor. A survey and critique of multiagent deep reinforcement learning. Autonomous Agents and Multi-Agent Systems, pages 1–48, 2019.
- Shariq Iqbal and Fei Sha. Actor-attention-critic for multi-agent reinforcement learning. In International Conference on Machine Learning, pages 2961–2970, 2019.
- Max Jaderberg, Wojciech M. Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castaneda, Charles Beattie, Neil C. Rabinowitz, Ari S. Morcos, Avraham Ruderman, Nicolas Sonnerat, Tim Green, Louise Deason, Joel Z. Leibo, David Silver, Demis Hassabis, Koray Kavukcuoglu, and Thore Graepel. Human-level performance in 3D multiplayer games with population-based reinforcement learning. Science, 364(6443):859–865, 2019.
- Shauharda Khadka, Somdeb Majumdar, and Kagan Tumer. Evolutionary reinforcement learning for sample-efficient multiagent coordination. arXiv preprint arXiv:1906.07315, 2019.
- Harold W Kuhn. A simplified two-person poker. Contributions to the Theory of Games, 1:97–103, 1950.
- Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Perolat, David Silver, and Thore Graepel. A unified game-theoretic approach to multiagent reinforcement learning. In Neural Information Processing Systems (NIPS), 2017.
- Marc Lanctot, Edward Lockhart, Jean-Baptiste Lespiau, Vinicius Zambaldi, Satyaki Upadhyay, Julien Perolat, Sriram Srinivasan, Finbarr Timbers, Karl Tuyls, Shayegan Omidshafiei, Daniel Hennes, Dustin Morrill, Paul Muller, Timo Ewalds, Ryan Faulkner, Janos Kramar, Bart De Vylder, Brennan Saeta, James Bradbury, David Ding, Sebastian Borgeaud, Matthew Lai, Julian Schrittwieser, Thomas Anthony, Edward Hughes, Ivo Danihelka, and Jonah Ryan-Davis. OpenSpiel: A framework for reinforcement learning in games. arXiv preprint arXiv:1908.09453, 2019.
- Siqi Liu, Guy Lever, Josh Merel, Saran Tunyasuvunakool, Nicolas Heess, and Thore Graepel. Emergent coordination through competition. In International Conference on Learning Representations (ICLR), 2019.
- Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems (NIPS), 2017.
- Laetitia Matignon, Guillaume J. Laurent, and Nadine Le Fort-Piat. Independent reinforcement learners in cooperative Markov games: A survey regarding coordination problems. The Knowledge Engineering Review, 27(1):1–31, 2012.
- J. Maynard Smith and G. R. Price. The logic of animal conflicts. Nature, 246:15–18, 1973.
- H. Brendan McMahan, Geoffrey J. Gordon, and Avrim Blum. Planning in the presence of cost functions controlled by an adversary. In International Conference on Machine Learning (ICML), 2003.
- Anna Nagurney and Ding Zhang. Projected dynamical systems and variational inequalities with applications, volume 2. Springer Science & Business Media, 2012.
- John F Nash. Equilibrium points in n-person games. Proceedings of the National Academy of Sciences, 36(1):48–49, 1950.
- Frans A. Oliehoek, Edwin D. de Jong, and Nikos Vlassis. The parallel Nash memory for asymmetric games. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pages 337–344, July 2006. doi: 10.1145/1143997.1144059. URL http://www.cs.bham.ac.uk/~wbl/biblio/gecco2006/docs/p337.pdf. (best paper nominee in coevolution track).
- Frans A. Oliehoek, Rahul Savani, Jose Gallego, Elise van der Pol, and Roderich Groß. Beyond local nash equilibria for adversarial networks. In Martin Atzmueller and Wouter Duivesteijn, editors, Artificial Intelligence, pages 73–89, Cham, 2019. Springer International Publishing. ISBN 978-3-030-31978-6.
- Shayegan Omidshafiei, Jason Pazis, Christopher Amato, Jonathan P. How, and John Vian. Deep decentralized multi-task multi-agent reinforcement learning under partial observability. In International Conference on Machine Learning (ICML), 2017.
- Shayegan Omidshafiei, Christos Papadimitriou, Georgios Piliouras, Karl Tuyls, Mark Rowland, Jean-Baptiste Lespiau, Wojciech M Czarnecki, Marc Lanctot, Julien Perolat, and Remi Munos. α-Rank: Multi-agent evaluation by evolution. Scientific Reports, 9, 2019.
- Gerasimos Palaiopanos, Ioannis Panageas, and Georgios Piliouras. Multiplicative weights update with constant step-size in congestion games: Convergence, limit cycles and chaos. In Neural Information Processing Systems (NIPS), 2017.
- Gregory Palmer, Karl Tuyls, Daan Bloembergen, and Rahul Savani. Lenient multi-agent deep reinforcement learning. In Autonomous Agents and Multiagent Systems (AAMAS), 2018.
- Peng Peng, Ying Wen, Yaodong Yang, Quan Yuan, Zhenkun Tang, Haitao Long, and Jun Wang. Multiagent bidirectionally-coordinated nets: Emergence of human-level coordination in learning to play starcraft combat games. arXiv preprint arXiv:1703.10069, 2017.
- Steve Phelps, Simon Parsons, and Peter McBurney. An evolutionary game-theoretic comparison of two double-auction market designs. In AAMAS Workshop on Agent-Mediated Electronic Commerce, 2004.
- Daniel M Reeves and Michael P Wellman. Computing best-response strategies in infinite games of incomplete information. In Uncertainty in Artificial Intelligence (UAI), 2004.
- Julia Robinson. An iterative method of solving a game. Annals of Mathematics, 54(2):296–301, 1951.
- Mark Rowland, Shayegan Omidshafiei, Karl Tuyls, Julien Perolat, Michal Valko, Georgios Piliouras, and Remi Munos. Multiagent evaluation under incomplete information. To appear in Neural Information Processing Systems (NeurIPS), 2019.
- Peter Schuster and Karl Sigmund. Replicator dynamics. Journal of Theoretical Biology, 100(3): 533–538, 1983.
- David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018.
- Finnegan Southey, Michael Bowling, Bryce Larson, Carmelo Piccione, Neil Burch, Darse Billings, and Chris Rayner. Bayes’ bluff: Opponent modelling in poker. In Uncertainty in Artificial Intelligence (UAI), 2005.
- Peter D Taylor and Leo B Jonker. Evolutionary stable strategies and game dynamics. Mathematical Biosciences, 40(1-2):145–156, 1978.
- Karl Tuyls and Simon Parsons. What evolutionary game theory tells us about multiagent learning. Artif. Intell., 171(7):406–416, 2007.
- Karl Tuyls, Julien Perolat, Marc Lanctot, Joel Z Leibo, and Thore Graepel. A generalised method for empirical game theoretic analysis. In Autonomous Agents and Multiagent Systems (AAMAS), 2018.
- Yevgeniy Vorobeychik. Probabilistic analysis of simulation-based games. ACM Trans. Model. Comput. Simul., 20(3):16:1–16:25, October 2010.
- William E Walsh, Rajarshi Das, Gerald Tesauro, and Jeffrey O Kephart. Analyzing complex strategic interactions in multi-agent systems. In AAAI Workshop on Game-Theoretic and Decision-Theoretic Agents, 2002.
- William E Walsh, David C Parkes, and Rajarshi Das. Choosing samples to compute heuristic-strategy Nash equilibrium. In International Workshop on Agent-Mediated Electronic Commerce, pages 109–123. Springer, 2003.
- Ermo Wei, Drew Wicke, David Freelan, and Sean Luke. Multiagent soft q-learning. In 2018 AAAI Spring Symposium Series, 2018.
- Michael P Wellman. Methods for empirical game-theoretic analysis. In AAAI Conference on Artificial Intelligence, 2006.
- Bryce Wiedenbeck and Michael P. Wellman. Scaling simulation-based game analysis through deviation-preserving reduction. In Autonomous Agents and Multiagent Systems (AAMAS), 2012.
- Bryce Wiedenbeck, Ben-Alexander Cassell, and Michael P. Wellman. Bootstrap statistics for empirical games. In Autonomous Agents and MultiAgent Systems (AAMAS), pages 597–604, 2014.
- Christian Wirth, Riad Akrour, Gerhard Neumann, and Johannes Furnkranz. A survey of preference-based reinforcement learning methods. The Journal of Machine Learning Research, 18(1):4945–4990, 2017.
