Modified monotone policy iteration for interpretable policies in Markov decision processes and the impact of state ordering rules
Sun Ju Lee, Xingyu Gong and Gian-Gabriel Garcia
Additional contact information
Sun Ju Lee: Georgia Institute of Technology
Xingyu Gong: Georgia Institute of Technology
Gian-Gabriel Garcia: Georgia Institute of Technology
Annals of Operations Research, 2025, vol. 347, issue 2, No 2, 783-841
Abstract:
Optimizing interpretable policies for Markov decision processes (MDPs) can be computationally intractable for large-scale MDPs; for example, for monotone policies, the optimal interpretable policy depends on the initial state distribution, precluding standard dynamic programming techniques. Previous work proposed monotone policy iteration (MPI) to produce a feasible solution for warm-starting a mixed-integer linear program that finds an optimal monotone policy. However, that work investigated neither the convergence and optimality of the algorithm nor the impact of state ordering rules, i.e., the order in which policy improvement steps are performed in MPI. In this study, we analytically characterize the convergence and optimality of the MPI algorithm, introduce a modified MPI (MMPI) algorithm, and show that MMPI improves upon MPI. To test MMPI numerically, we conduct experiments in two settings: (1) perturbations of a machine maintenance problem in which the optimal policy is guaranteed to be monotone or near-monotone and (2) randomly generated MDPs. We propose and investigate 19 state ordering rules for MMPI based on each state's value function, initial probability, and stationary distribution. Computational results reveal a trade-off between computational time and optimality gap: in the structured machine maintenance setting, the fastest state ordering rules still yield high-quality policies, while the trade-off is more pronounced in the random MDP setting. Across both settings, the random state ordering rule performs best in terms of optimality gap (less than approximately 5% on average) at the expense of computational time.
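As a rough illustration of the idea described in the abstract, the following Python/NumPy sketch performs one monotone policy-improvement sweep over a finite MDP, visiting states in a caller-supplied order. It assumes states and actions are indexed so that a monotone policy has a nondecreasing action index; the function name, array shapes, and feasibility rule are hypothetical simplifications for illustration, not the authors' exact MMPI algorithm.

import numpy as np

def monotone_improvement_sweep(P, R, gamma, policy, order):
    # Illustrative sketch only, not the paper's exact MMPI algorithm.
    # P: (S, A, S) transition probabilities, R: (S, A) rewards,
    # gamma: discount factor, policy: length-S integer array that is
    # nondecreasing in the state index (a monotone policy),
    # order: sequence of states, i.e., the state ordering rule.
    S, A = R.shape
    # Policy evaluation: solve (I - gamma * P_pi) v = r_pi for the current policy.
    P_pi = P[np.arange(S), policy, :]
    r_pi = R[np.arange(S), policy]
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    new_policy = policy.copy()
    for s in order:
        # Keep the policy monotone: the action at s must be at least the
        # largest action used at lower-indexed states and at most the
        # smallest action used at higher-indexed states.
        lo = new_policy[:s].max() if s > 0 else 0
        hi = new_policy[s + 1:].min() if s < S - 1 else A - 1
        q_s = R[s] + gamma * P[s] @ v  # one-step lookahead values Q(s, a)
        feasible = np.arange(lo, hi + 1)
        new_policy[s] = feasible[np.argmax(q_s[feasible])]
    return new_policy, v

Different choices of the order argument (for example, sorting states by value function, initial probability, or stationary distribution, or permuting them at random) correspond to the kinds of state ordering rules compared in the paper.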
Keywords: Markov decision process; Interpretable; Monotone policies; Monotone policy iteration; State ordering rules
Date: 2025
Downloads: http://link.springer.com/10.1007/s10479-024-06158-3 (text/html)
Access to the full text of the articles in this series is restricted.
Persistent link: https://EconPapers.repec.org/RePEc:spr:annopr:v:347:y:2025:i:2:d:10.1007_s10479-024-06158-3
Ordering information: This journal article can be ordered from http://www.springer.com/journal/10479
DOI: 10.1007/s10479-024-06158-3
Annals of Operations Research is currently edited by Endre Boros