Nonasymptotic Analysis of Monte Carlo Tree Search
Devavrat Shah,
Qiaomin Xie and
Zhi Xu
Additional contact information
Devavrat Shah: Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139
Qiaomin Xie: Industrial and Systems Engineering, University of Wisconsin-Madison, Madison, Wisconsin 53706
Zhi Xu: Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139
Operations Research, 2022, vol. 70, issue 6, 3234-3260
Abstract:
In this work, we consider the popular tree-based search strategy within the framework of reinforcement learning, the Monte Carlo tree search (MCTS), in the context of the infinite-horizon discounted-cost Markov decision process (MDP). Although MCTS is believed to provide an approximate value function for a given state with enough simulations, the claimed proof of this property is incomplete. This is because the variant of MCTS analyzed in prior works, the upper confidence bound for trees (UCT), uses a "logarithmic" bonus term for balancing exploration and exploitation within the tree-based search, following the insights from the stochastic multi-armed bandit (MAB) literature. In effect, such an approach assumes that the regret of the underlying recursively dependent nonstationary MABs concentrates around its mean exponentially in the number of steps, which is unlikely to hold, even for stationary MABs. As the key contribution of this work, we establish a polynomial concentration property of regret for a class of nonstationary MABs. This in turn establishes that MCTS with an appropriate polynomial, rather than logarithmic, bonus term in UCB has the claimed property. Interestingly, empirically successful approaches use a similar polynomial form of bonus, as suggested by our result. Using this as a building block, we argue that MCTS, combined with nearest neighbor supervised learning, acts as a "policy improvement" operator; that is, it iteratively improves the value function approximation for all states, despite evaluating at only finitely many states, because of the combination with supervised learning. In effect, we establish that to learn an ε-approximation of the value function with respect to the ℓ∞ norm, MCTS combined with nearest neighbor requires a sample size scaling as Õ(ε^{−(d+4)}), where d is the dimension of the state space. This is nearly optimal because of a minimax lower bound of Ω̃(ε^{−(d+2)}), suggesting the strength of the variant of MCTS we propose here and of our resulting analysis.
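To illustrate the distinction the abstract draws, the following Python sketch contrasts the classical logarithmic UCB bonus of standard UCT with a generic polynomial bonus of the form β·t^α/s^ξ. This is a minimal sketch under stated assumptions: the constants beta, alpha, and xi, and the function names, are placeholders for illustration, not the specific values and algorithm derived in the paper.

import math

# Classical UCT bonus: c * sqrt(log(t) / s), following the stochastic
# MAB insight the abstract says does not carry over to the tree setting.
def logarithmic_bonus(t, s, c=2.0):
    return c * math.sqrt(math.log(t) / s)

# Polynomial bonus ~ beta * t^alpha / s^xi. The exponents here are
# placeholders; the paper derives specific values from its polynomial
# concentration results for nonstationary MABs.
def polynomial_bonus(t, s, beta=1.0, alpha=0.5, xi=0.5):
    return beta * (t ** alpha) / (s ** xi)

def select_action(mean_values, counts, t, bonus=polynomial_bonus):
    """Pick the action maximizing empirical mean + exploration bonus.

    mean_values[a]: empirical mean return of action a at this tree node
    counts[a]:      number of times action a has been simulated
    t:              total simulations through this node
    """
    best_a, best_score = None, -math.inf
    for a, mu in enumerate(mean_values):
        s = counts[a]
        # Unvisited actions are tried first (infinite bonus).
        score = math.inf if s == 0 else mu + bonus(t, s)
        if score > best_score:
            best_a, best_score = a, score
    return best_a

# Example: three actions after 10 simulations through a node.
action = select_action([0.40, 0.55, 0.30], [4, 5, 1], t=10)

Note that widely used practical variants (e.g., the PUCT rule in AlphaGo-style systems) also use a bonus that is polynomial in the visit counts, consistent with the abstract's observation that empirically successful approaches match the polynomial form.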
Keywords: Machine Learning and Data Science; Monte Carlo tree search; nonstationary multi-armed bandit; reinforcement learning
Date: 2022
Downloads: http://dx.doi.org/10.1287/opre.2021.2239 (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:inm:oropre:v:70:y:2022:i:6:p:3234-3260
More articles in Operations Research from INFORMS.
Bibliographic data for series maintained by Chris Asher.