Is Q-Learning Minimax Optimal? A Tight Sample Complexity Analysis
Gen Li,
Changxiao Cai,
Yuxin Chen,
Yuting Wei and
Yuejie Chi
Additional contact information
Gen Li: Department of Statistics and Data Science, Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104
Changxiao Cai: Department of Biostatistics, University of Pennsylvania, Philadelphia, Pennsylvania 19104
Yuxin Chen: Department of Statistics and Data Science, Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104
Yuting Wei: Department of Statistics and Data Science, Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104
Yuejie Chi: Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213
Operations Research, 2024, vol. 72, issue 1, 222-236
Abstract:
Q-learning, which seeks to learn the optimal Q-function of a Markov decision process (MDP) in a model-free fashion, lies at the heart of reinforcement learning. When it comes to the synchronous setting (such that independent samples for all state–action pairs are drawn from a generative model in each iteration), substantial progress has been made toward understanding the sample efficiency of Q-learning. Consider a γ-discounted infinite-horizon MDP with state space S and action space A: to yield an entry-wise ε-approximation of the optimal Q-function, state-of-the-art theory for Q-learning requires a sample size exceeding the order of |S||A| / ((1−γ)^5 ε^2), which fails to match existing minimax lower bounds. This gives rise to natural questions: What is the sharp sample complexity of Q-learning? Is Q-learning provably suboptimal? This paper addresses these questions for the synchronous setting: (1) when the action space contains a single action (so that Q-learning reduces to TD learning), we prove that the sample complexity of TD learning is minimax optimal and scales as |S| / ((1−γ)^3 ε^2) (up to a log factor); (2) when the action space contains at least two actions, we settle the sample complexity of Q-learning to be on the order of |S||A| / ((1−γ)^4 ε^2) (up to a log factor). Our theory unveils the strict suboptimality of Q-learning when the action space contains at least two actions and rigorizes the negative impact of overestimation in Q-learning. Finally, we extend our analysis to accommodate asynchronous Q-learning (i.e., the case with Markovian samples), sharpening the horizon dependency of its sample complexity to 1/(1−γ)^4.
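To make the synchronous setting concrete, the sketch below runs synchronous Q-learning with a generative model on a toy MDP and compares the result to the exact optimal Q-function. The randomly generated transition kernel and rewards, the iteration count, and the rescaled linear step size are illustrative assumptions for this sketch only, not the paper's construction or learning-rate schedule.

```python
# A minimal sketch of synchronous Q-learning with a generative model.
# The toy MDP (random kernel P, rewards r) and the step-size schedule
# are illustrative assumptions, not the paper's construction.
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9              # state/action space sizes, discount factor

# Hypothetical toy MDP: P[s, a] is a distribution over next states.
P = rng.dirichlet(np.ones(S), size=(S, A))   # shape (S, A, S)
r = rng.uniform(size=(S, A))

Q = np.zeros((S, A))
T = 10_000                            # iterations; total samples ~ S * A * T
for t in range(1, T + 1):
    eta = 1.0 / (1 + (1 - gamma) * t)        # rescaled linear step size (assumed)
    Q_new = Q.copy()
    for s in range(S):
        for a in range(A):
            # Generative model: draw one fresh next-state sample for EVERY
            # (s, a) pair in each iteration (the synchronous setting).
            s_next = rng.choice(S, p=P[s, a])
            target = r[s, a] + gamma * Q[s_next].max()
            Q_new[s, a] = (1 - eta) * Q[s, a] + eta * target
    Q = Q_new

# Exact Q* via value iteration on the known model, to measure the
# entry-wise (L-infinity) error of the Q-learning iterate.
Q_star = np.zeros((S, A))
for _ in range(2000):
    Q_star = r + gamma * np.einsum("sat,t->sa", P, Q_star.max(axis=1))
print("max entry-wise error:", np.abs(Q - Q_star).max())
```

The entry-wise (L-infinity) error printed at the end is the ε-accuracy metric used in the abstract; how large T must be for this error to fall below ε, as a function of |S|, |A|, and 1/(1−γ), is the sample-complexity question the paper settles.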
Keywords: Machine Learning and Data Science; Q-learning; temporal difference learning; effective horizon; sample complexity; minimax optimality; lower bound; overestimation
Date: 2024
Downloads: http://dx.doi.org/10.1287/opre.2023.2450 (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:inm:oropre:v:72:y:2024:i:1:p:222-236