Fast computation of Nash Equilibria in Imperfect Information Games

ICML, pp. 7119-7129, 2020.

Keywords: extensive form game, computational cost, mirror ascent, zero-sum game, Monte Carlo Tree Search

Abstract:

We introduce and analyze a class of algorithms, called Mirror Ascent against an Improved Opponent (MAIO), for computing Nash equilibria in two-player zero-sum games, both in normal form and in sequential form with imperfect information. These algorithms update the policy of each player with a mirror-ascent step to maximize the value of playing against an improved policy of the other player.

Introduction
  • This paper considers the problem of computing a Nash equilibrium in two types of two-player zero-sum games: normal-form games and imperfect information games (IIGs) in extensive form.
  • The proposed algorithms converge exponentially fast to a Nash equilibrium; by this the authors mean that a weighted ℓ2 distance between the policies produced by the algorithm and the set of Nash equilibria decreases as O(exp(−βt)), for some problem-dependent constant β > 0, where t is the number of iterations of the algorithm.
  • The authors' analysis shows that the speed of convergence to the set of Nash equilibria depends on a measure of how much each player is able to improve its own policy against a fixed opponent.
  • The authors' analysis shows convergence for all such cases, which opens new avenues for designing algorithms with convergence guarantees while offering a trade-off between the computational cost of computing improved policies and the speed of convergence toward the Nash equilibrium (a minimal sketch of the resulting update is given below).
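The bullets above describe the update only abstractly. The following minimal Python sketch shows one plausible instantiation of MAIO on a normal-form zero-sum game, using the entropy mirror map (a multiplicative-weights update) and the opponent's best response as the improved policy; the function names, the decaying step size, and the Rock-Paper-Scissors test case are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def best_response(payoffs):
    """Pure-strategy best response: all mass on a maximizing action."""
    br = np.zeros_like(payoffs)
    br[np.argmax(payoffs)] = 1.0
    return br

def maio_normal_form(A, c=1.0, n_steps=5000):
    """Sketch of MAIO on a normal-form zero-sum game with payoff matrix A.

    Player 1 (policy x) maximizes x^T A y; player 2 (policy y) minimizes it.
    Each iteration, every player takes a mirror-ascent step -- here the
    entropy mirror map, i.e. a multiplicative-weights update -- on its
    payoff against an *improved* opponent policy (the best response).
    """
    n, m = A.shape
    x = np.full(n, 1.0 / n)  # uniform initial policies
    y = np.full(m, 1.0 / m)
    for t in range(n_steps):
        eta = c / np.sqrt(t + 1.0)              # decaying step size (an assumption)
        y_improved = best_response(-(A.T @ x))  # improved opponent of player 1
        x_improved = best_response(A @ y)       # improved opponent of player 2
        x *= np.exp(eta * (A @ y_improved))     # ascent step for the maximizer
        x /= x.sum()
        y *= np.exp(-eta * (A.T @ x_improved))  # descent step for the minimizer
        y /= y.sum()
    return x, y

# Rock-Paper-Scissors: the unique Nash equilibrium is uniform play.
A = np.array([[0., -1., 1.],
              [1., 0., -1.],
              [-1., 1., 0.]])
x, y = maio_normal_form(A)
print(x, y)  # both policies should drift toward [1/3, 1/3, 1/3]
```

Note that with the best response as the improvement, A @ y_improved is a supergradient of player 1's worst-case value min_y xᵀAy (and symmetrically for player 2), so this particular instance amounts to each player running mirror ascent on its own worst-case payoff.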
Highlights
  • This paper considers the problem of computing a Nash equilibrium in two types of two-player zero-sum games: normal-form games and imperfect information games (IIGs) in extensive form
  • We introduce and analyze a class of algorithms, called Mirror Ascent against an Improved Opponent (MAIO), which updates the policy of each player with a mirror-ascent step that maximizes its expected reward against an improved policy of the opponent
  • Examples of improved policies are the greedy policy; a multi-step improved policy, as in Monte Carlo Tree Search (MCTS); a policy improved by policy gradient; or a policy improved by any other reinforcement learning or search algorithm (two such operators are sketched after this list)
  • We introduced a new class of algorithms for computing a Nash equilibrium in zero-sum normal form games and sequential information games and provided an analysis of the speed of convergence in terms of the notion of improvement
  • We show a new trade-off between the computational complexity of computing improved policies and the speed of convergence to the set of Nash equilibria. Under some conditions, exponential convergence is achieved when the best response is used as the improved policy
  • Perhaps the main contribution of Mirror Ascent against an Improved Opponent is that it offers a principled way to use any reinforcement learning policy improvement technique to generate a sequence of policies with a convergence guarantee to the set of Nash equilibria
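To make the "any policy improvement technique" point concrete, here are two further improvement operators that could replace best_response in the sketch above; the function names and parameter values are hypothetical.

```python
import numpy as np

def improve_epsilon_greedy(policy, payoffs, eps=0.1):
    """Mix the current policy with its greedy best response.

    eps controls how much the opponent is improved per step; per the
    paper's analysis, stronger improvement buys a faster rate at a
    higher computational cost.
    """
    greedy = np.zeros_like(policy)
    greedy[np.argmax(payoffs)] = 1.0
    return (1.0 - eps) * policy + eps * greedy

def improve_policy_gradient(policy, payoffs, lr=0.5):
    """One exponentiated policy-gradient step on the given payoffs."""
    improved = policy * np.exp(lr * payoffs)
    return improved / improved.sum()
```

A single gradient step is cheap but improves the opponent only slightly, whereas an exact best response (or a deep MCTS search) improves it more at a higher per-iteration cost; this is exactly the computation-versus-convergence-speed trade-off stated above.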
Conclusion
  • The authors introduced a new class of algorithms for computing a Nash equilibrium in zero-sum normal form games and sequential IIGs and provided an analysis of the speed of convergence in terms of the notion of improvement.
  • The authors observe exponential convergence with a rate that depends on ε (Fig. 1(a)) and on the constant c used in the learning rate η_t = c · I (Fig. 1(b)).
  • This is exactly what is predicted by the theory, since the value of κ in Lemma 1 is ε/√2 here (see the display below).
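In symbols (with notation assumed from the paper), the observed behavior matches a bound of the form below, where D_t denotes the weighted ℓ2 distance from the current policy pair to the set of Nash equilibria and the rate β grows with the improvement constant κ of Lemma 1:

```latex
% Hedged restatement of the claimed rate; the exact constants are the paper's.
% D_t: weighted \ell_2 distance from the current policy pair to the Nash set.
D_t \;\le\; D_0 \, e^{-\beta t}, \qquad \beta = \beta(\kappa, \eta_t) > 0,
\qquad \text{with } \kappa = \frac{\varepsilon}{\sqrt{2}} \text{ in this experiment.}
```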