Sublinear regret for learning POMDPs
Yi Xiong, Ningyuan Chen, Xuefeng Gao and Xiang Zhou
Production and Operations Management, 2022, vol. 31, issue 9, 3491-3504
Abstract:
We study model-based undiscounted reinforcement learning for partially observable Markov decision processes (POMDPs). The oracle we consider is the optimal policy of the POMDP with a known environment, evaluated by the average reward over an infinite horizon. We propose a learning algorithm for this problem, building on spectral method-of-moments estimation for hidden Markov models, belief error control in POMDPs, and upper confidence bound methods for online learning. We establish a regret bound of $O(T^{2/3}\sqrt{\log T})$ for the proposed learning algorithm, where $T$ is the learning horizon. This is, to the best of our knowledge, the first algorithm achieving sublinear regret with respect to our oracle for learning general POMDPs.
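For context, a standard way to formalize the regret being bounded here is the gap between the oracle's long-run average reward and the rewards collected while learning. The display below is a hedged sketch in our own notation, not necessarily the authors' exact definition: $\rho^{*}$ denotes the optimal long-run average reward of the POMDP when the environment is known, and $r_t$ the reward earned by the learning algorithm at step $t$.

% Hedged sketch: standard average-reward regret notation (our notation,
% not necessarily the paper's exact definition of regret).
\[
  \mathrm{Regret}(T) \;=\; T\,\rho^{*} \;-\; \sum_{t=1}^{T} r_{t},
  \qquad
  \mathrm{Regret}(T) \;=\; O\!\left(T^{2/3}\sqrt{\log T}\right).
\]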
Date: 2022
DOI: https://doi.org/10.1111/poms.13778
Persistent link: https://EconPapers.repec.org/RePEc:bla:popmgt:v:31:y:2022:i:9:p:3491-3504