Solving Semi-Markov Decision Problems Using Average Reward Reinforcement Learning

Das, Tapas K.; Gosavi, Abhijit; Mahadevan, Sridhar; Marchalleck, Nicholas

Additional contact information
Tapas K. Das: Department of Industrial and Management Systems Engineering, University of South Florida, Tampa, Florida 33620
Abhijit Gosavi: Department of Industrial and Management Systems Engineering, University of South Florida, Tampa, Florida 33620
Sridhar Mahadevan: Department of Computer Science, Michigan State University, East Lansing, Michigan 48824
Nicholas Marchalleck: Cybear, Inc., 2709 Rocky Pointe Drive, Tampa, Florida 33607

Management Science, 1999, vol. 45, issue 4, 560-574

Abstract: A large class of sequential decision-making problems under uncertainty, in which the underlying probability structure is a Markov process, can be modeled as stochastic dynamic programs (referred to, in general, as Markov decision problems or MDPs). However, the computational complexity of the classical MDP algorithms, such as value iteration and policy iteration, is prohibitive and can grow intractably with the size of the problem and its related data. Furthermore, these techniques require, for each action, the one-step transition probability and reward matrices, and obtaining these is often unrealistic for large and complex systems. Recently, there has been much interest in a simulation-based stochastic approximation framework called reinforcement learning (RL) for computing near-optimal policies for MDPs. RL has been successfully applied to very large problems, such as elevator scheduling and dynamic channel allocation in cellular telephone systems. In this paper, we extend RL to a more general class of decision tasks referred to as semi-Markov decision problems (SMDPs). In particular, we focus on SMDPs under the average-reward criterion. We present a new model-free RL algorithm called SMART (Semi-Markov Average Reward Technique) and study it in detail on a combinatorially large problem: determining the optimal preventive maintenance schedule of a production inventory system. Numerical results from both the theoretical model and the RL algorithm are presented and compared.

Keywords: semi-Markov decision processes (SMDP); reinforcement learning; average reward; preventive maintenance
Date: 1999

Downloads: http://dx.doi.org/10.1287/mnsc.45.4.560

Persistent link: https://EconPapers.repec.org/RePEc:inm:ormnsc:v:45:y:1999:i:4:p:560-574