Fast Global Convergence of Natural Policy Gradient Methods with Entropy Regularization

Shicong Cen, Chen Cheng, Yuxin Chen, Yuting Wei and Yuejie Chi
Additional contact information
Shicong Cen: Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213
Chen Cheng: Department of Statistics, Stanford University, Stanford, California 94305
Yuxin Chen: Department of Electrical and Computer Engineering, Princeton University, Princeton, New Jersey 08544
Yuting Wei: Department of Statistics and Data Science, The Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104
Yuejie Chi: Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213

Operations Research, 2022, vol. 70, issue 4, 2563-2578

Abstract: Natural policy gradient (NPG) methods are among the most widely used policy optimization algorithms in contemporary reinforcement learning. This class of methods is often applied in conjunction with entropy regularization—an algorithmic scheme that encourages exploration—and is closely related to soft policy iteration and trust region policy optimization. Despite their empirical success, the theoretical underpinnings of NPG methods remain limited, even in the tabular setting. This paper develops nonasymptotic convergence guarantees for entropy-regularized NPG methods under softmax parameterization, focusing on discounted Markov decision processes (MDPs). Assuming access to exact policy evaluation, we demonstrate that the algorithm converges linearly—and even quadratically once it enters a local region around the optimal policy—when computing optimal value functions of the regularized MDP. Moreover, the algorithm is provably stable vis-à-vis inexactness of policy evaluation. Our convergence results accommodate a wide range of learning rates and shed light on the role of entropy regularization in enabling fast convergence.
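To make the setting concrete, below is a minimal Python sketch of entropy-regularized NPG with exact policy evaluation on a tabular discounted MDP. The toy MDP, the parameter values, and the multiplicative-weights form of the update (which reduces to soft policy iteration at the learning rate eta = (1 - gamma) / tau) are illustrative assumptions motivated by the soft-policy-iteration connection noted in the abstract; this is not code from the paper.

```python
import numpy as np

# Toy discounted MDP (hypothetical, for illustration only): random transitions and rewards.
rng = np.random.default_rng(0)
nS, nA = 5, 3          # number of states / actions
gamma, tau = 0.9, 0.1  # discount factor, entropy-regularization weight

P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a] is a distribution over next states
r = rng.uniform(size=(nS, nA))                 # rewards in [0, 1]

def soft_policy_evaluation(pi, iters=2000):
    """Exact evaluation of the entropy-regularized (soft) Q-function of pi,
    obtained by iterating the regularized Bellman equation to its fixed point."""
    V = np.zeros(nS)
    for _ in range(iters):
        Q = r + gamma * (P @ V)                          # Q(s,a) = r(s,a) + gamma * E[V(s')]
        V = np.sum(pi * (Q - tau * np.log(pi)), axis=1)  # soft value: expected Q plus entropy bonus
    return Q

def npg_step(pi, Q, eta):
    """One entropy-regularized NPG update under softmax parameterization, written
    in multiplicative-weights form (an assumed standard form, not quoted from the paper)."""
    alpha = 1.0 - eta * tau / (1.0 - gamma)
    logits = alpha * np.log(pi) + (eta / (1.0 - gamma)) * Q
    logits -= logits.max(axis=1, keepdims=True)          # subtract max for numerical stability
    pi_new = np.exp(logits)
    return pi_new / pi_new.sum(axis=1, keepdims=True)

pi = np.full((nS, nA), 1.0 / nA)   # uniform initial policy
eta = (1.0 - gamma) / tau          # this learning rate recovers soft policy iteration
for t in range(30):
    pi = npg_step(pi, soft_policy_evaluation(pi), eta)
print(np.round(pi, 3))             # near-greedy policy, smoothed toward uniform by the entropy term
```

With exact evaluation, the iterates settle rapidly on the optimal policy of the regularized MDP, qualitatively matching the fast-convergence regime the abstract describes; the quantitative linear and quadratic rates are established in the paper itself.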

Keywords: Machine Learning and Data Science; reinforcement learning; natural policy gradient methods; entropy regularization; global convergence
Date: 2022

Downloads: (external link)
http://dx.doi.org/10.1287/opre.2021.2151 (application/pdf)



Persistent link: https://EconPapers.repec.org/RePEc:inm:oropre:v:70:y:2022:i:4:p:2563-2578


More articles in Operations Research from INFORMS.

 
Handle: RePEc:inm:oropre:v:70:y:2022:i:4:p:2563-2578