Nested Rollout Policy Adaptation for Monte Carlo Tree Search

IJCAI, pp. 649-654, 2011.

Keywords:
Monte Carlo tree search, new MCTS method, rollout policy, domain-specific policy, search efficiency

Abstract:

Monte Carlo tree search (MCTS) methods have had recent success in games, planning, and optimization. MCTS uses results from rollouts to guide search; a rollout is a path that descends the tree with a randomized decision at each ply until reaching a leaf. MCTS results can be strongly influenced by the choice of appropriate policy to bias t…
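The rollout described above can be sketched as a policy-weighted random descent: at each ply a legal move is sampled with probability proportional to exp(policy[code]). The snippet below is a minimal illustration, not the paper's code; the problem interface (legal_moves, apply_move, code, score) and the toy digit-picking domain are hypothetical stand-ins.

    import math
    import random

    def rollout(root_state, policy, legal_moves, apply_move, code, score):
        """One randomized descent: at each ply, sample a legal move with
        probability proportional to exp(policy[code(state, move)])."""
        state, sequence = root_state, []
        while True:
            moves = legal_moves(state)
            if not moves:                        # leaf reached
                return score(state), sequence
            weights = [math.exp(policy.get(code(state, m), 0.0)) for m in moves]
            move = random.choices(moves, weights=weights, k=1)[0]
            sequence.append(move)
            state = apply_move(state, move)

    # Purely illustrative toy domain: pick 5 digits, score is their sum.
    if __name__ == "__main__":
        s, seq = rollout(
            root_state=(),
            policy={},
            legal_moves=lambda st: list(range(10)) if len(st) < 5 else [],
            apply_move=lambda st, m: st + (m,),
            code=lambda st, m: m,
            score=lambda st: sum(st),
        )
        print(s, seq)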

Introduction
  • Monte Carlo tree search (MCTS) methods have had substantial recent success in two-player games [Gelly et al., 2007; Finnsson et al., 2010], planning [Nakhost et al., 2009; Silver et al., 2010], optimization and one-player games [Cazenave, 2009; Rimmel et al., 2011; Mehat et al., 2010], and practical applications [de Mesmay et al., 2009; Cazenave et al., 2009].
  • Nested Monte Carlo search has been successful, with world-record results in several problems [Cazenave, 2009; Bjarnason et al., 2007].
  • Methods for adapting rollout policies exist in control and reinforcement learning [Bertsekas, 1997; Fern et al., 2003; Veness et al., 2011].
Highlights
  • Monte Carlo tree search (MCTS) methods have had substantial recent success in two-player games [Gelly et al., 2007; Finnsson et al., 2010], planning [Nakhost et al., 2009; Silver et al., 2010], optimization and one-player games [Cazenave, 2009; Rimmel et al., 2011; Mehat et al., 2010], and practical applications [de Mesmay et al., 2009; Cazenave et al., 2009].
  • Most prior Monte Carlo tree search work uses static policies, but some work has appeared on adapting rollout policies in two-player games [Silver et al., 2009; Tesauro et al., 1996; Finnsson et al., 2010].
  • Multiple independent timelines can be combined to form a picture of the typical trajectory. This has been used to illustrate that Nested Monte Carlo Search typically becomes more efficient as nesting level increases [Cazenave, 2009], and Nested Rollout Policy Adaptation shows a similar trend (Fig. 5)
  • We examine the codes returned by code() for a solution's sequence of actions and categorize each code as one of: Prefix, part of the initial segment that exactly matches the previous solution's initial segment; Permutation, present in the previous solution but in a permuted order rather than within the Prefix; Hybrid, not in the immediately previous solution but in an older solution at this level; or New, not used in any previous solution at this level (a sketch of this categorization appears after this list).
  • We have presented Nested Rollout Policy Adaptation, a Monte Carlo tree search algorithm that uses gradient ascent on its rollout policy to navigate search
  • Nested Rollout Policy Adaptation is the first computer search method to improve upon a human-generated Morpion Solitaire record that had stood for over 30 years
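    The categorization above can be made concrete with a short sketch. This is an illustrative reconstruction under stated assumptions (repeated codes beyond the prefix are matched by count, and older_codes collects the codes of all earlier solutions at the level), not the authors' exact bookkeeping.

        from collections import Counter

        def categorize_codes(new_codes, prev_codes, older_codes):
            """Label each code of a new solution relative to earlier solutions
            found at the same nesting level."""
            labels = []
            # Prefix: longest initial segment identical to the previous solution.
            prefix_len = 0
            for a, b in zip(new_codes, prev_codes):
                if a != b:
                    break
                prefix_len += 1
            remaining_prev = Counter(prev_codes[prefix_len:])  # non-prefix part
            older_set = set(older_codes)
            for i, c in enumerate(new_codes):
                if i < prefix_len:
                    labels.append("Prefix")
                elif remaining_prev[c] > 0:          # in previous solution, permuted
                    remaining_prev[c] -= 1
                    labels.append("Permutation")
                elif c in older_set:                 # only in an older solution
                    labels.append("Hybrid")
                else:                                # never used at this level
                    labels.append("New")
            return labels

        if __name__ == "__main__":
            print(categorize_codes([3, 5, 9, 2], [3, 5, 2, 7], older_codes={8, 9}))
            # -> ['Prefix', 'Prefix', 'Hybrid', 'Permutation']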
Methods
  • Timed runs report median scores at fixed wall-clock budgets, comparing NRPA against sampled NMCS variants (Sample-5 at level 4 and Sample-15 at level 3; see Table 3).
  • Since NRPA adapts its rollout policy, it is natural to also compare against NMCS equipped with a domain-specific policy.
  • Before running NRPA on MorpT, the authors manually tuned a rollout policy via small experiments evaluating policy elements’ impact on results.
  • The tuned policy was then used with NMCS at level 3 for comparison (Table 4).
Results
  • All NRPA runs use Alpha=1.0 and N=100 iterations per level. These values were chosen via a limited set of initial experiments, and appeared to work well across problems.
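    These two parameters slot directly into the nested recursion. Below is a minimal sketch of that recursion as summarized here (N rollout-and-adapt iterations per level, a gradient step of size Alpha on the rollout policy toward the best sequence found so far); the single-player problem interface (legal_moves, apply_move, code, score) is a hypothetical stand-in, and details of the adaptation step may differ from the authors' implementation.

        import math
        import random

        ALPHA = 1.0   # policy learning rate ("Alpha" above)
        N = 100       # iterations per nesting level

        def playout(state, policy, legal_moves, apply_move, code, score):
            """Rollout: sample each move with probability proportional to
            exp(policy[code(state, move)]); return (score, move sequence)."""
            seq = []
            while True:
                moves = legal_moves(state)
                if not moves:
                    return score(state), seq
                w = [math.exp(policy.get(code(state, m), 0.0)) for m in moves]
                m = random.choices(moves, weights=w, k=1)[0]
                seq.append(m)
                state = apply_move(state, m)

        def adapt(policy, seq, root, legal_moves, apply_move, code, alpha=ALPHA):
            """Gradient step: raise the weight of each code in the best sequence
            and lower competing codes in proportion to their softmax probability."""
            new = dict(policy)
            state = root
            for chosen in seq:
                moves = legal_moves(state)
                z = sum(math.exp(policy.get(code(state, m), 0.0)) for m in moves)
                new[code(state, chosen)] = new.get(code(state, chosen), 0.0) + alpha
                for m in moves:
                    g = alpha * math.exp(policy.get(code(state, m), 0.0)) / z
                    new[code(state, m)] = new.get(code(state, m), 0.0) - g
                state = apply_move(state, chosen)
            return new

        def nrpa(level, policy, root, legal_moves, apply_move, code, score):
            """Each level runs N iterations of the level below on its own copy of
            the policy, adapting it toward the best sequence found so far."""
            if level == 0:
                return playout(root, policy, legal_moves, apply_move, code, score)
            best_score, best_seq = float("-inf"), []
            for _ in range(N):
                s, seq = nrpa(level - 1, dict(policy), root,
                              legal_moves, apply_move, code, score)
                if s >= best_score:
                    best_score, best_seq = s, seq
                policy = adapt(policy, best_seq, root, legal_moves, apply_move, code)
            return best_score, best_seq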

    4.1 Comparing Efficiency of Search

    The authors wish to compare the effectiveness of NMCS and NRPA.
  • Multiple independent timelines can be combined to form a picture of the typical trajectory (a simple sketch of this combination appears after this list).
  • This has been used to illustrate that NMCS typically becomes more efficient as nesting level increases [Cazenave, 2009], and NRPA shows a similar trend (Fig. 5).
  • At the nesting levels selected, approximate per-run reference machine time is 1 hour for MorpD, 15 hours for CrossP, 24 hours for CrossC, and 1 week for MorpT.
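    One simple way to combine independent timelines into such a picture (not necessarily the authors' exact procedure) is to record, for each run, the best score found by each point in time and take the median across runs at each reporting time; this is also how median scores at fixed time budgets can be tabulated.

        import statistics

        def median_trajectory(runs, times):
            """runs: one list per independent run of (elapsed_seconds, score)
            events in time order. Returns the median best-so-far score across
            runs at each requested time point."""
            medians = []
            for t in times:
                bests = []
                for run in runs:
                    scores = [s for (elapsed, s) in run if elapsed <= t]
                    bests.append(max(scores) if scores else float("-inf"))
                medians.append(statistics.median(bests))
            return medians

        # Illustrative toy data: three runs, medians at 10^2 and 10^3 seconds.
        runs = [
            [(50, 10), (400, 14), (900, 17)],
            [(80, 12), (600, 15)],
            [(30, 9), (200, 13), (800, 16)],
        ]
        print(median_trajectory(runs, times=[100, 1000]))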
Conclusion
  • The authors have presented NRPA, an MCTS algorithm that uses gradient ascent on its rollout policy to navigate search.
  • NRPA yields substantial search efficiency improvements as well as new record solutions on the test problems.
  • NRPA is the first computer search method to improve upon a human-generated Morpion Solitaire record that had stood for over 30 years.
  • Ongoing work includes more complex applications, enabling code() to return a feature vector, and parallelization
Tables
  • Table 1: Test problems (depth, branching factor, old record)
  • Table 2: Median scores from timed runs, by method and level, at 10^2 sec and 10^3 sec
  • Table 3: CrossC median scores with sampled NMCS
  • Table 4: MorpT median scores with tuned NMCS
  • Table 5: NRPA results from longer runs
  • Table 6: NRPA intermediate-solution content for MorpT
Funding
  • This work was supported in part by the DARPA GALE project, Contract No. HR0011-08-C-0110.
References
  • [Akiyama et al., 2010] H. Akiyama et al. Nested Monte-Carlo search with AMAF heuristic. In TAAI, 2010.
  • [Bertsekas, 1997] D. Bertsekas. Differential training of rollout policies. In Allerton Conf., 1997.
  • [Bjarnason et al., 2007] R. Bjarnason et al. Searching solitaire in real time. ICGA J., 2007.
  • [Boyer, 2010] C. Boyer. Science & Vie, page 144, Nov. 2010.
  • [Boyer, 2011] C. Boyer. http://morpionsolitaire.com, 2011.
  • [Bruneau, 1976] C.-H. Bruneau. Science & Vie, April 1976.
  • [Cazenave, 2007] T. Cazenave. Reflexive Monte-Carlo search. In CGW, 2007.
  • [Cazenave, 2009] T. Cazenave. Nested Monte-Carlo search. In IJCAI, 2009.
  • [Cazenave et al., 2009] T. Cazenave et al. Monte-Carlo bus regulation. In ITSC, 2009.
  • [Coulom, 2007] R. Coulom. Computing Elo ratings of move patterns in the game of Go. In CGW, 2007.
  • [CrossC, 2006] GAMES Magazine, page 76, August 2006. Winning solution: page 93, December 2006.
  • [CrossP, 1994] GAMES Magazine, page 8, June 1994. Winning solution: page 67, October 1994.
  • [de Mesmay et al., 2009] F. de Mesmay et al. Bandit-based optimization for library performance tuning. In ICML, 2009.
  • [Demaine et al., 2006] E. D. Demaine et al. Morpion Solitaire. Theory Comput. Syst., 2006.
  • [Fern et al., 2003] A. Fern et al. Approximate policy iteration with a policy language bias. In NIPS, 2003.
  • [Finnsson et al., 2010] H. Finnsson et al. Learning simulation control in GGP agents. In AAAI, 2010.
  • [Garey and Johnson, 1979] M. Garey and D. Johnson. Computers and Intractability. 1979.
  • [Gelly et al., 2007] S. Gelly et al. Combining online and offline knowledge in UCT. In ICML, 2007.
  • [Ginsberg et al., 1990] M. Ginsberg et al. Search lessons learned from crossword puzzles. In AAAI, 1990.
  • [Larranaga et al., 2002] P. Larranaga et al. Estimation of Distribution Algorithms. Kluwer, 2002.
  • [Mehat et al., 2010] J. Mehat et al. Combining UCT and NMCS for single-player GGP. IEEE TCIAIG, 2010.
  • [Nakhost et al., 2009] H. Nakhost et al. Monte-Carlo exploration for deterministic planning. In IJCAI, 2009.
  • [Rimmel et al., 2011] A. Rimmel et al. Optimization of the Nested Monte-Carlo Algorithm on the Traveling Salesman Problem with Time Windows. In Evostar, 2011.
  • [Saffidine et al., 2010] A. Saffidine et al. UCD: Upper confidence bound for directed acyclic graphs. In TAAI, 2010.
  • [Silver et al., 2009] D. Silver et al. Monte-Carlo simulation balancing. In ICML, 2009.
  • [Silver et al., 2010] D. Silver et al. Monte-Carlo planning in large POMDPs. In NIPS, 2010.
  • [Tesauro et al., 1996] G. Tesauro et al. On-line policy improvement using Monte-Carlo search. In NIPS, 1996.
  • [Veness et al., 2011] J. Veness et al. A Monte-Carlo AIXI approximation. JAIR, 2011.
Best Paper
Best Paper of IJCAI, 2011