EconPapers    

Optimistic Posterior Sampling for Reinforcement Learning: Worst-Case Regret Bounds

Shipra Agrawal and Randy Jia
Additional contact information
Shipra Agrawal: Department of Industrial Engineering and Operations Research, Columbia University, New York, New York 10027
Randy Jia: Department of Industrial Engineering and Operations Research, Columbia University, New York, New York 10027

Mathematics of Operations Research, 2023, vol. 48, issue 1, 363-392

Abstract: We present an algorithm based on posterior sampling (aka Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov decision process (MDP) is communicating with a finite, although unknown, diameter. Our main result is a high-probability regret upper bound of Õ(D√(SAT)) for any communicating MDP with S states, A actions, and diameter D. Here, regret compares the total reward achieved by the algorithm to the total expected reward of an optimal infinite-horizon undiscounted average-reward policy over time horizon T. This result closely matches the known lower bound of Ω(√(DSAT)). Our techniques involve proving some novel results about the anti-concentration of the Dirichlet distribution, which may be of independent interest.
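The core idea described in the abstract is to maintain a Dirichlet posterior over each row of the unknown transition matrix and to act greedily in an MDP sampled from that posterior. The following is a minimal Python/NumPy sketch of that idea, not the paper's exact algorithm: the paper draws multiple optimistic Dirichlet samples per epoch and uses doubling epoch lengths, whereas this sketch draws a single sampled MDP per fixed-length epoch, assumes known rewards, and uses discounted value iteration as a stand-in for average-reward planning. The function name, toy MDP, and parameters are all hypothetical.

```python
import numpy as np

def posterior_sampling_rl(true_P, true_R, horizon=200, epoch_len=20, seed=0):
    """Simplified posterior-sampling (Thompson sampling) loop for a finite MDP.

    true_P[s, a, s'] : true transition probabilities (used only to simulate)
    true_R[s, a]     : known rewards
    """
    rng = np.random.default_rng(seed)
    S, A = true_R.shape
    alpha = np.ones((S, A, S))  # Dirichlet(1,...,1) prior over P(. | s, a)
    s, t, total_reward = 0, 0, 0.0
    while t < horizon:
        # Sample one transition model from the current posterior.
        P_hat = np.empty((S, A, S))
        for si in range(S):
            for ai in range(A):
                P_hat[si, ai] = rng.dirichlet(alpha[si, ai])
        # Plan in the sampled MDP (discounted value iteration as a proxy).
        V = np.zeros(S)
        for _ in range(200):
            Q = true_R + 0.95 * (P_hat @ V)  # (S, A, S) @ (S,) -> (S, A)
            V = Q.max(axis=1)
        policy = Q.argmax(axis=1)
        # Follow the sampled policy for one epoch, updating posterior counts.
        for _ in range(min(epoch_len, horizon - t)):
            a = policy[s]
            s_next = rng.choice(S, p=true_P[s, a])
            total_reward += true_R[s, a]
            alpha[s, a, s_next] += 1.0  # conjugate update: increment a count
            s, t = s_next, t + 1
    return total_reward, alpha

# Toy two-state, two-action MDP (hypothetical example data).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.05, 0.95]]])  # P[s, a, s']
R = np.array([[0.0, 0.1],
              [0.0, 1.0]])                  # R[s, a]
reward, alpha = posterior_sampling_rl(P, R, horizon=200)
```

Because the Dirichlet is conjugate to the categorical transition distribution, the posterior update after each observed transition is just a count increment, which is what makes posterior sampling cheap to run per step.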

Keywords: Primary: 68T05; 90-10; Thompson sampling; reinforcement learning; Markov decision process; regret bounds
Date: 2023

Downloads: http://dx.doi.org/10.1287/moor.2022.1266 (application/pdf)



Persistent link: https://EconPapers.repec.org/RePEc:inm:ormoor:v:48:y:2023:i:1:p:363-392


More articles in Mathematics of Operations Research from INFORMS.

 
Page updated 2025-03-19
Handle: RePEc:inm:ormoor:v:48:y:2023:i:1:p:363-392