Optimistic Posterior Sampling for Reinforcement Learning: Worst-Case Regret Bounds
Shipra Agrawal and
Randy Jia
Additional contact information
Shipra Agrawal: Department of Industrial Engineering and Operations Research, Columbia University, New York, New York 10027
Randy Jia: Department of Industrial Engineering and Operations Research, Columbia University, New York, New York 10027
Mathematics of Operations Research, 2023, vol. 48, issue 1, 363-392
Abstract:
We present an algorithm based on posterior sampling (aka Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov decision process (MDP) is communicating with a finite, although unknown, diameter. Our main result is a high-probability regret upper bound of $\tilde{O}(D\sqrt{SAT})$ for any communicating MDP with S states, A actions, and diameter D. Here, regret compares the total reward achieved by the algorithm to the total expected reward of an optimal infinite-horizon undiscounted average-reward policy over time horizon T. This result closely matches the known lower bound of $\Omega(\sqrt{DSAT})$. Our techniques involve proving some novel results about the anti-concentration of the Dirichlet distribution, which may be of independent interest.
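The algorithmic idea sketched in the abstract, maintaining a Dirichlet posterior over transition probabilities and acting greedily with respect to an MDP sampled from that posterior, can be illustrated with a minimal example. The sketch below is a generic posterior-sampling outline, not the paper's optimistic variant; the function names, the symmetric Dirichlet prior, and the finite-horizon value-iteration planner are assumptions made for illustration only.

```python
import numpy as np

def sample_transition_model(counts, prior=1.0, rng=None):
    """Draw one plausible transition model P(s' | s, a) from a Dirichlet
    posterior, given observed transition counts[s, a, s'].

    Illustrative sketch: a symmetric Dirichlet prior is assumed here."""
    rng = np.random.default_rng() if rng is None else rng
    S, A, _ = counts.shape
    P = np.empty((S, A, S))
    for s in range(S):
        for a in range(A):
            P[s, a] = rng.dirichlet(counts[s, a] + prior)
    return P

def greedy_policy(P, R, horizon=200):
    """Approximate an average-reward-optimal policy for the sampled MDP
    via finite-horizon value iteration (a simple stand-in planner)."""
    S, A, _ = P.shape
    v = np.zeros(S)
    for _ in range(horizon):
        q = R + P @ v        # q[s, a] = R[s, a] + sum_{s'} P[s, a, s'] * v[s']
        v = q.max(axis=1)
        v -= v.max()         # keep relative values bounded
    return q.argmax(axis=1)  # deterministic policy: state -> action
```

In an epoch-based loop, one would sample a model, compute its policy, follow that policy while updating the transition counts, and resample at the start of the next epoch. The paper's algorithm additionally injects optimism into the sampling step; the worst-case $\tilde{O}(D\sqrt{SAT})$ bound relies on that optimistic modification and on the Dirichlet anti-concentration results mentioned in the abstract, none of which this sketch attempts to reproduce.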
Keywords: Primary: 68T05; 90-10; Thompson sampling; reinforcement learning; Markov decision process; regret bounds
Date: 2023
Downloads: http://dx.doi.org/10.1287/moor.2022.1266 (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:inm:ormoor:v:48:y:2023:i:1:p:363-392
More articles in Mathematics of Operations Research from INFORMS.
Bibliographic data for this series is maintained by Chris Asher.