EconPapers    

Optimistic Posterior Sampling for Reinforcement Learning: Worst-Case Regret Bounds

Shipra Agrawal and Randy Jia
Additional contact information
Shipra Agrawal: Department of Industrial Engineering and Operations Research, Columbia University, New York, New York 10027
Randy Jia: Department of Industrial Engineering and Operations Research, Columbia University, New York, New York 10027

Mathematics of Operations Research, 2023, vol. 48, issue 1, 363-392

Abstract: We present an algorithm based on posterior sampling (aka Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov decision process (MDP) is communicating with a finite, although unknown, diameter. Our main result is a high-probability regret upper bound of Õ(D√(SAT)) for any communicating MDP with S states, A actions, and diameter D. Here, regret compares the total reward achieved by the algorithm to the total expected reward of an optimal infinite-horizon undiscounted average-reward policy over time horizon T. This result closely matches the known lower bound of Ω(√(DSAT)). Our techniques involve proving some novel results about the anti-concentration of the Dirichlet distribution, which may be of independent interest.
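The core idea described in the abstract is to maintain a Dirichlet posterior over each row of the unknown transition matrix and to act greedily in an MDP sampled from that posterior. The following is a minimal Python/NumPy sketch of that idea, not the paper's exact algorithm: the paper draws multiple optimistic Dirichlet samples per epoch and uses doubling epoch lengths, whereas this sketch draws a single sampled MDP per fixed-length epoch, assumes known rewards, and uses discounted value iteration as a stand-in for average-reward planning. The function name, toy MDP, and parameters are all hypothetical.

```python
import numpy as np

def posterior_sampling_rl(true_P, true_R, horizon=200, epoch_len=20, seed=0):
    """Simplified posterior-sampling (Thompson sampling) loop for a finite MDP.

    true_P[s, a, s'] : true transition probabilities (used only to simulate)
    true_R[s, a]     : known rewards
    """
    rng = np.random.default_rng(seed)
    S, A = true_R.shape
    alpha = np.ones((S, A, S))  # Dirichlet(1,...,1) prior over P(. | s, a)
    s, t, total_reward = 0, 0, 0.0
    while t < horizon:
        # Sample one transition model from the current posterior.
        P_hat = np.empty((S, A, S))
        for si in range(S):
            for ai in range(A):
                P_hat[si, ai] = rng.dirichlet(alpha[si, ai])
        # Plan in the sampled MDP (discounted value iteration as a proxy).
        V = np.zeros(S)
        for _ in range(200):
            Q = true_R + 0.95 * (P_hat @ V)  # (S, A, S) @ (S,) -> (S, A)
            V = Q.max(axis=1)
        policy = Q.argmax(axis=1)
        # Follow the sampled policy for one epoch, updating posterior counts.
        for _ in range(min(epoch_len, horizon - t)):
            a = policy[s]
            s_next = rng.choice(S, p=true_P[s, a])
            total_reward += true_R[s, a]
            alpha[s, a, s_next] += 1.0  # conjugate update: increment a count
            s, t = s_next, t + 1
    return total_reward, alpha

# Toy two-state, two-action MDP (hypothetical example data).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.05, 0.95]]])  # P[s, a, s']
R = np.array([[0.0, 0.1],
              [0.0, 1.0]])                  # R[s, a]
reward, alpha = posterior_sampling_rl(P, R, horizon=200)
```

Because the Dirichlet is conjugate to the categorical transition distribution, the posterior update after each observed transition is just a count increment, which is what makes posterior sampling cheap to run per step.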

Keywords: Primary: 68T05; 90-10; Thompson sampling; reinforcement learning; Markov decision process; regret bounds
Date: 2023

Downloads: http://dx.doi.org/10.1287/moor.2022.1266 (application/pdf)



Persistent link: https://EconPapers.repec.org/RePEc:inm:ormoor:v:48:y:2023:i:1:p:363-392


More articles in Mathematics of Operations Research from INFORMS.

 
Page updated 2025-03-19
Handle: RePEc:inm:ormoor:v:48:y:2023:i:1:p:363-392