Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model
Gen Li,
Yuting Wei,
Yuejie Chi and
Yuxin Chen
Author affiliations
Gen Li: Department of Statistics and Data Science, The Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104
Yuting Wei: Department of Statistics and Data Science, The Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104
Yuejie Chi: Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213
Yuxin Chen: Department of Statistics and Data Science, The Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104; Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, Pennsylvania 19104
Operations Research, 2024, vol. 72, issue 1, 203-221
Abstract:
This paper is concerned with the sample efficiency of reinforcement learning, assuming access to a generative model (or simulator). We first consider γ-discounted infinite-horizon Markov decision processes (MDPs) with state space S and action space A. Despite a number of prior works tackling this problem, a complete picture of the trade-off between sample complexity and statistical accuracy has yet to be determined. In particular, all prior results suffer from a severe sample size barrier in the sense that their claimed statistical guarantees hold only when the sample size exceeds at least |S||A|/(1 − γ)². The current paper overcomes this barrier by certifying the minimax optimality of two algorithms—a perturbed model-based algorithm and a conservative model-based algorithm—as soon as the sample size exceeds the order of |S||A|/(1 − γ) (modulo some log factor). Moving beyond infinite-horizon MDPs, we further study time-inhomogeneous finite-horizon MDPs and prove that a plain model-based planning algorithm suffices to achieve minimax-optimal sample complexity given any target accuracy level. To the best of our knowledge, this work delivers the first minimax-optimal guarantees that accommodate the entire range of sample sizes (beyond which finding a meaningful policy is information-theoretically infeasible).
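To illustrate the model-based paradigm the abstract refers to, the following is a minimal Python sketch of plain model-based planning with a generative model: estimate the transition kernel from n simulator draws per state–action pair, then run value iteration on the resulting empirical MDP. This is only a generic illustration, not the perturbed or conservative algorithms analyzed in the paper; the function names and the sample_next_state simulator interface are hypothetical.

```python
# Generic sketch of model-based planning with a generative model
# (illustrative only; not the paper's perturbed/conservative variants).
import numpy as np

def empirical_mdp(sample_next_state, num_states, num_actions, n):
    """Estimate P_hat[s, a, s'] from n generative-model calls per (s, a)."""
    P_hat = np.zeros((num_states, num_actions, num_states))
    for s in range(num_states):
        for a in range(num_actions):
            for _ in range(n):
                P_hat[s, a, sample_next_state(s, a)] += 1.0 / n
    return P_hat

def value_iteration(P_hat, R, gamma, tol=1e-8):
    """Plan on the empirical MDP; return a greedy policy and its value estimate."""
    num_states, num_actions, _ = P_hat.shape
    V = np.zeros(num_states)
    while True:
        # Q[s, a] = r(s, a) + gamma * sum_{s'} P_hat(s' | s, a) * V(s')
        Q = R + gamma * (P_hat @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1), V_new
        V = V_new
```

Under this sketch, the quantity the abstract bounds is the total number of simulator calls, n·|S||A|, needed for the greedy policy on the empirical MDP (or its perturbed/conservative refinements) to be near-optimal.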
Keywords: Machine Learning and Data Science; model-based reinforcement learning; minimaxity; policy evaluation; generative model
Date: 2024
Downloads: http://dx.doi.org/10.1287/opre.2023.2451 (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:inm:oropre:v:72:y:2024:i:1:p:203-221