Regret Analysis of a Markov Policy Gradient Algorithm for Multiarm Bandits

Neil Walton and Denis Denisov
Additional contact information
Neil Walton: Durham University Business School, Durham DH1 3LB, United Kingdom
Denis Denisov: Durham University Business School, Durham DH1 3LB, United Kingdom

Mathematics of Operations Research, 2023, vol. 48, issue 3, 1553-1588

Abstract: We consider a policy gradient algorithm applied to a finite-arm bandit problem with Bernoulli rewards. We allow learning rates to depend on the current state of the algorithm rather than using a deterministic time-decreasing learning rate. The state of the algorithm forms a Markov chain on the probability simplex. We apply Foster–Lyapunov techniques to analyze the stability of this Markov chain. We prove that, if learning rates are well chosen, the policy gradient algorithm is a transient Markov chain, and the state of the chain converges to the optimal arm with logarithmic or polylogarithmic regret.
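
For readers who want a concrete picture of the setting, the following is a minimal Python sketch of a policy gradient bandit update whose state is a point p on the probability simplex and whose learning rate is a function of p rather than of time, as the abstract describes. The particular update rule and the placeholder schedule gamma = min(p) are illustrative assumptions for exposition, not the schedule analyzed in the paper.

    import numpy as np

    def policy_gradient_bandit(means, horizon, rng=None):
        # State p is a point on the probability simplex; an arm A is drawn
        # from p, a Bernoulli reward R is observed, and p is moved toward
        # the unit vector e_A by a state-dependent step gamma(p).
        rng = np.random.default_rng() if rng is None else rng
        k = len(means)
        p = np.full(k, 1.0 / k)                        # start at the uniform policy
        best, regret = max(means), 0.0
        for _ in range(horizon):
            arm = rng.choice(k, p=p)                   # play A ~ p
            reward = float(rng.random() < means[arm])  # Bernoulli(means[A]) reward
            gamma = p.min()                            # assumed state-dependent rate
            # Convex-combination update; p stays on the simplex because
            # gamma * reward <= 1:  p <- (1 - gamma*R) * p + gamma*R * e_A.
            p += gamma * reward * (np.eye(k)[arm] - p)
            regret += best - means[arm]
        return p, regret

    # Example: three Bernoulli arms; p should concentrate on the 0.7 arm.
    p, regret = policy_gradient_bandit([0.3, 0.5, 0.7], horizon=10_000)

In expectation this update follows replicator-type dynamics on the simplex, one natural form a policy gradient step can take in a bandit; the paper's contribution is the regret analysis of such a Markov chain under well-chosen state-dependent rates.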

Keywords: Primary: 62M20; secondary: 62L20; 60J05; regret; policy gradient; multiarm bandit; Markov chains; Foster–Lyapunov
Date: 2023

Downloads: http://dx.doi.org/10.1287/moor.2022.1311 (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:inm:ormoor:v:48:y:2023:i:3:p:1553-1588

Handle: RePEc:inm:ormoor:v:48:y:2023:i:3:p:1553-1588