Regret Analysis of a Markov Policy Gradient Algorithm for Multiarm Bandits
Neil Walton and Denis Denisov
Affiliations:
Neil Walton: Durham University Business School, Durham DH1 3LB, United Kingdom
Denis Denisov: Durham University Business School, Durham DH1 3LB, United Kingdom
Mathematics of Operations Research, 2023, vol. 48, issue 3, 1553-1588
Abstract:
We consider a policy gradient algorithm applied to a finite-arm bandit problem with Bernoulli rewards. We allow the learning rates to depend on the current state of the algorithm rather than using a deterministic time-decreasing learning rate. The state of the algorithm forms a Markov chain on the probability simplex. We apply Foster–Lyapunov techniques to analyze the stability of this Markov chain. We prove that, if the learning rates are well chosen, then the policy gradient algorithm is a transient Markov chain, and the state of the chain converges to the optimal arm with logarithmic or polylogarithmic regret.
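As a rough illustration of the type of algorithm analyzed in the paper, the sketch below implements a REINFORCE-style policy gradient for a Bernoulli bandit in which the step size depends on the current state of the policy rather than on a deterministic time schedule. This is a minimal sketch for intuition only, not the authors' exact algorithm: the softmax parameterization and the particular state-dependent learning-rate rule are assumptions made for illustration.

```python
import numpy as np

def policy_gradient_bandit(means, horizon, seed=0):
    """Softmax policy gradient on a Bernoulli bandit (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    k = len(means)
    theta = np.zeros(k)          # softmax logits; the policy p_t lives on the simplex
    best = max(means)
    regret = 0.0
    for _ in range(horizon):
        p = np.exp(theta - theta.max())
        p /= p.sum()             # current policy p_t on the probability simplex
        arm = rng.choice(k, p=p)
        reward = float(rng.random() < means[arm])   # Bernoulli reward
        # Hypothetical state-dependent learning rate: large while the policy
        # is spread out, small once it concentrates on a single arm.
        gamma = 1.0 - p.max() + 1e-3
        # REINFORCE gradient: reward * d/dtheta log p(arm) = reward * (e_arm - p)
        grad = -reward * p
        grad[arm] += reward
        theta += gamma * grad
        regret += best - means[arm]
    return p, regret

p, regret = policy_gradient_bandit([0.3, 0.5, 0.7], horizon=20000)
print("final policy:", np.round(p, 3))
print("cumulative regret:", round(regret, 1))
```

With a state-dependent rate like the one above, the update depends on the current policy alone, so the sequence of policies is a Markov chain on the simplex; the step size stays large while the policy is undecided and shrinks as probability mass concentrates on one arm.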
Keywords: Primary: 62M20; secondary: 62L20; 60J05; regret; policy gradient; multiarm bandit; Markov chains; Foster–Lyapunov
Date: 2023
Downloads: http://dx.doi.org/10.1287/moor.2022.1311 (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:inm:ormoor:v:48:y:2023:i:3:p:1553-1588