Sublinear regret for learning POMDPs
Yi Xiong, Ningyuan Chen, Xuefeng Gao and Xiang Zhou
Production and Operations Management, 2022, vol. 31, issue 9, 3491-3504
Abstract:
We study model-based undiscounted reinforcement learning for partially observable Markov decision processes (POMDPs). The oracle we consider is the optimal policy of the POMDP with a known environment, evaluated by the average reward over an infinite horizon. We propose a learning algorithm for this problem, building on spectral method-of-moments estimation for hidden Markov models, belief error control in POMDPs, and upper confidence bound methods for online learning. We establish a regret bound of $O(T^{2/3}\sqrt{\log T})$ for the proposed learning algorithm, where $T$ is the learning horizon. This is, to the best of our knowledge, the first algorithm achieving sublinear regret with respect to our oracle for learning general POMDPs.
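For context, a standard way to formalize the regret being bounded here is the gap between the oracle's long-run average reward and the rewards collected while learning. The display below is a hedged sketch in our own notation, not necessarily the authors' exact definition: $\rho^{*}$ denotes the optimal long-run average reward of the POMDP when the environment is known, and $r_t$ the reward earned by the learning algorithm at step $t$.

% Hedged sketch: standard average-reward regret notation (our notation,
% not necessarily the paper's exact definition of regret).
\[
  \mathrm{Regret}(T) \;=\; T\,\rho^{*} \;-\; \sum_{t=1}^{T} r_{t},
  \qquad
  \mathrm{Regret}(T) \;=\; O\!\left(T^{2/3}\sqrt{\log T}\right).
\]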
Date: 2022
DOI: https://doi.org/10.1111/poms.13778
Persistent link: https://EconPapers.repec.org/RePEc:bla:popmgt:v:31:y:2022:i:9:p:3491-3504