What game are we playing? End-to-end learning in normal and extensive form games

Chun Kai Ling, Fei Fang, J. Zico Kolter

IJCAI, pp. 396-402, 2018.

Keywords:
limit poker, Nash equilibrium, artificial intelligence, form game, regret minimization

Abstract:

Although recent work in AI has made great progress in solving large, zero-sum, extensive-form games, the underlying assumption in most past work is that the parameters of the game itself are known to the agents. This paper deals with the relatively under-explored but equally important inverse setting, where the parameters of the underlying game must instead be inferred from observations of play.

Introduction
  • Recent work in artificial intelligence has led to huge advances in methods for solving large-scale, zero-sum, extensive form games, both from methodological and applied standpoints.
  • There have been a number of recent breakthroughs, including exceeding human performance in no-limit poker [Brown and Sandholm, 2017; Moravčík et al., 2017], essentially weakly solving limit poker [Bowling et al., 2015], and work in security games with applications to infrastructure security [Pita et al., 2009], among many others.
  • Virtually all this progress in game theoretic approaches to large games has operated under the assumption that the parameters of the game are known to the solvers, and that the main challenge is finding the optimal strategy.
  • In security games, for example, the authors may want to understand the underlying payoffs of an adversary, rather than just their observed strategy, in order to better understand how aspects of the game can be manipulated or changed to achieve a desirable outcome.
Highlights
  • Recent work in artificial intelligence has led to huge advances in methods for solving large-scale, zero-sum, extensive-form games, both from methodological and applied standpoints.
  • There have been a number of recent breakthroughs, including exceeding human performance in no-limit poker [Brown and Sandholm, 2017; Moravčík et al., 2017], essentially weakly solving limit poker [Bowling et al., 2015], and work in security games with applications to infrastructure security [Pita et al., 2009], among many others. Virtually all of this progress in game-theoretic approaches to large games has operated under the assumption that the parameters of the game are known to the solvers, and that the main challenge is finding the optimal strategy.
  • One of the works most closely related to our own is the Computational Rationalization framework [Waugh et al., 2011], though 1) our approach differs in how the utilities/payoffs are modeled; and 2) we focus heavily on the extensive-form setting, whereas this past work considered only normal-form games.
  • We show that the solution of the quantal response equilibrium is a differentiable function of the game payoff matrix, and that backpropagation can be computed analytically via implicit differentiation (see the sketch following this list).
  • We demonstrate the effectiveness of our approach on several domains: a toy normal-form game where payoffs depend on external context; a one-card poker game; and a security resource allocation game, an extensive-form generalization of defender-attacker games in the security domain.
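To make the highlighted result concrete, here is a minimal sketch of the normal-form case, assuming the standard logit quantal response equilibrium (temperature 1) for a two-player zero-sum game with payoff matrix $P$; the exact regularization and notation in the paper may differ slightly. The QRE is the unique saddle point of the entropy-regularized min-max problem

$$
\min_{u \in \Delta_m} \; \max_{v \in \Delta_n} \; u^\top P v + \sum_i u_i \log u_i - \sum_j v_j \log v_j ,
$$

whose optimality conditions imply $u^* \propto \exp(-P v^*)$ and $v^* \propto \exp(P^\top u^*)$. Writing the KKT conditions of this problem as $g(u, v, \mu, \nu; P) = 0$, with $\mu, \nu$ the multipliers for the simplex constraints, the implicit function theorem gives

$$
\frac{\partial (u^*, v^*, \mu^*, \nu^*)}{\partial P} \;=\; -\Big(\nabla_{(u,v,\mu,\nu)}\, g\Big)^{-1} \nabla_P\, g ,
$$

so the gradient of any loss on $(u^*, v^*)$ with respect to $P$ can be obtained by solving a single linear system with the (transposed) KKT Jacobian. This is what lets the QRE solve act as a differentiable layer in end-to-end training.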
Methods
  • The authors empirically demonstrate the module's novel aspects: learning extensive-form games in the presence of side information and under partial observations.
  • The authors' module works well with a medium or large batch size (e.g., 128) and either the RMSProp [Tieleman and Hinton, 2012] or Adam [Kingma and Ba, 2014] optimizer with learning rates in the range [0.0001, 0.01]; a minimal training-loop sketch follows below.
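As an illustration of how such a differentiable game-solving layer might be trained with these hyperparameters, below is a minimal, self-contained PyTorch sketch. Everything in it is an assumption for illustration: the names qre_layer and payoff_net, the 3x3 game size, and the synthetic data are not the authors' code or API, and the QRE solve here uses damped, unrolled fixed-point iterations under autograd rather than the paper's analytic implicit differentiation.

    import torch

    torch.manual_seed(0)

    def qre_layer(P, iters=100, damping=0.5):
        # Entropy-regularized (logit QRE) saddle point via a damped fixed-point
        # iteration on u = softmax(-P v), v = softmax(P^T u). Gradients w.r.t. P
        # flow by unrolling these iterations with autograd, a simple stand-in for
        # the analytic implicit differentiation described in the paper.
        b, m, n = P.shape
        u = P.new_full((b, m), 1.0 / m)
        v = P.new_full((b, n), 1.0 / n)
        for _ in range(iters):
            u = (1 - damping) * u + damping * torch.softmax(
                -torch.bmm(P, v.unsqueeze(2)).squeeze(2), dim=1)
            v = (1 - damping) * v + damping * torch.softmax(
                torch.bmm(P.transpose(1, 2), u.unsqueeze(2)).squeeze(2), dim=1)
        return u, v

    # Hypothetical setup: a linear net maps side information to a 3x3 payoff matrix.
    ctx_dim, n_act, batch = 4, 3, 128
    payoff_net = torch.nn.Linear(ctx_dim, n_act * n_act)
    opt = torch.optim.Adam(payoff_net.parameters(), lr=1e-3)  # lr within [0.0001, 0.01]

    for step in range(200):
        x = torch.randn(batch, ctx_dim)                      # synthetic contexts
        P = payoff_net(x).view(batch, n_act, n_act)
        u, v = qre_layer(P)
        # Synthetic "observed" actions; in the real setting these come from data.
        a = torch.randint(n_act, (batch,))
        c = torch.randint(n_act, (batch,))
        idx = torch.arange(batch)
        loss = -(torch.log(u[idx, a]) + torch.log(v[idx, c])).mean()  # neg. log-likelihood of observed play
        opt.zero_grad()
        loss.backward()
        opt.step()

For modestly scaled payoffs the damped iteration converges quickly in practice; swapping in the analytic implicit gradient, as the paper does, avoids storing the unrolled iterations and scales better to large games.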
Conclusion
  • The quality of learned parameters improves as the number of data points increases.
  • Security Resource Allocation Game: in this set of experiments, the authors demonstrate the ability to learn from incomplete observations in a setting that abstracts attacks in the cybersecurity domain. If there are two defenders guarding T1, the chance of a successful attack on …
  • In this paper, the authors present a fully differentiable module capable of learning payoffs and other parameters in zero-sum games, given side information and partial observability.
  • Future work includes faster solvers that exploit structure in the KKT matrix, as well as extensions to learning general-sum games.
References
  • [Amin et al., 2016] Kareem Amin, Satinder Singh, and Michael P. Wellman. Gradient methods for Stackelberg security games. In Conference on Uncertainty in Artificial Intelligence, pages 2–11, 2016.
  • [Amos and Kolter, 2017] Brandon Amos and J. Zico Kolter. OptNet: Differentiable optimization as a layer in neural networks. arXiv preprint arXiv:1703.00443, 2017.
  • [Blum et al., 2014] Avrim Blum, Nika Haghtalab, and Ariel D. Procaccia. Learning optimal commitment to overcome insecurity. In Advances in Neural Information Processing Systems, pages 1826–1834, 2014.
  • [Bowling and Veloso, 2000] Michael Bowling and Manuela Veloso. An analysis of stochastic game theory for multiagent reinforcement learning. Technical report, Carnegie Mellon University, School of Computer Science, 2000.
  • [Bowling et al., 2015] Michael Bowling, Neil Burch, Michael Johanson, and Oskari Tammelin. Heads-up limit hold'em poker is solved. Science, 347(6218):145–149, 2015.
  • [Boyd and Vandenberghe, 2004] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
  • [Brown and Sandholm, 2017] Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, page eaao1733, 2017.
  • [Busoniu et al., 2008] Lucian Busoniu, Robert Babuska, and Bart De Schutter. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 38(2), 2008.
  • [Fang et al., 2016] Fei Fang, Thanh Hong Nguyen, Rob Pickles, Wai Y. Lam, Gopalasamy R. Clements, Bo An, Amandeep Singh, Milind Tambe, and Andrew Lemieux. Deploying PAWS: Field optimization of the protection assistant for wildlife security. 2016.
  • [Fearnley et al., 2015] John Fearnley, Martin Gairing, Paul W. Goldberg, and Rahul Savani. Learning equilibria of games via payoff queries. Journal of Machine Learning Research, 16:1305–1344, 2015.
  • [Gould et al., 2016] Stephen Gould, Basura Fernando, Anoop Cherian, Peter Anderson, Rodrigo Santa Cruz, and Edison Guo. On differentiating parameterized argmin and argmax problems with application to bi-level optimization. arXiv preprint arXiv:1607.05447, 2016.
  • [Hoda et al., 2010] Samid Hoda, Andrew Gilpin, Javier Peña, and Tuomas Sandholm. Smoothing techniques for computing Nash equilibria of sequential games. Mathematics of Operations Research, 35(2):494–512, 2010.
  • [Johnson et al., 2016] Matthew Johnson, David K. Duvenaud, Alex Wiltschko, Ryan P. Adams, and Sandeep R. Datta. Composing graphical models with neural networks for structured representations and fast inference. In Advances in Neural Information Processing Systems, pages 2946–2954, 2016.
  • [Kingma and Ba, 2014] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [Kroer et al., 2017] Christian Kroer, Kevin Waugh, Fatma Kilinc-Karzan, and Tuomas Sandholm. Theoretical and practical advances on smoothing for extensive-form games. arXiv preprint arXiv:1702.04849, 2017.
  • [Letchford et al., 2009] Joshua Letchford, Vincent Conitzer, and Kamesh Munagala. Learning and approximating the optimal strategy to commit to. In International Symposium on Algorithmic Game Theory, pages 250–262. Springer, 2009.
  • [McKelvey and Palfrey, 1995] Richard D. McKelvey and Thomas R. Palfrey. Quantal response equilibria for normal form games. Games and Economic Behavior, 10(1):6–38, 1995.
  • [McKelvey and Palfrey, 1998] Richard D. McKelvey and Thomas R. Palfrey. Quantal response equilibria for extensive form games. Experimental Economics, 1(1):9–41, 1998.
  • [Mertikopoulos and Sandholm, 2016] Panayotis Mertikopoulos and William H. Sandholm. Learning in games via reinforcement and regularization. Mathematics of Operations Research, 41(4):1297–1324, 2016.
  • [Moravčík et al., 2017] Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337):508–513, 2017.
  • [Paszke et al., 2017] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
  • [Pita et al., 2009] James Pita, Manish Jain, Fernando Ordóñez, Christopher Portway, Milind Tambe, Craig Western, Praveen Paruchuri, and Sarit Kraus. Using game theory for Los Angeles Airport security. AI Magazine, 30(1):43, 2009.
  • [Tieleman and Hinton, 2012] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5, RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
  • [Von Stengel, 1996] Bernhard von Stengel. Efficient computation of behavior strategies. Games and Economic Behavior, 14(2):220–246, 1996.
  • [Vorobeychik et al., 2007] Yevgeniy Vorobeychik, Michael P. Wellman, and Satinder Singh. Learning payoff functions in infinite games. Machine Learning, 67(1–2):145–168, 2007.
  • [Waugh et al., 2011] Kevin Waugh, Brian D. Ziebart, and J. Andrew Bagnell. Computational rationalization: The inverse equilibrium problem. In Proceedings of the 28th International Conference on Machine Learning, pages 1169–1176. Omnipress, 2011.
  • [Zinkevich et al., 2008] Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. Regret minimization in games with incomplete information. In Advances in Neural Information Processing Systems, pages 1729–1736, 2008.
Best Paper of IJCAI, 2018