Learning Unknown Service Rates in Queues: A Multiarmed Bandit Approach
Subhashini Krishnasamy,
Rajat Sen,
Ramesh Johari and
Sanjay Shakkottai
Additional contact information
Subhashini Krishnasamy: Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, Texas 78712
Rajat Sen: Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, Texas 78712
Ramesh Johari: Department of Management Science and Engineering, Stanford University, Stanford, California 94305
Sanjay Shakkottai: Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, Texas 78712
Operations Research, 2021, vol. 69, issue 1, 315-330
Abstract:
Consider a queueing system consisting of multiple servers. Jobs arrive over time and enter a queue for service; the goal is to minimize the size of this queue. At each opportunity for service, at most one server can be chosen, and at most one job can be served. Service is successful with a probability (the service probability) that is a priori unknown for each server. An algorithm that knows the service probabilities (the “genie”) can always choose the server of highest service probability. We study algorithms that learn the unknown service probabilities. Our goal is to minimize queue regret: the (expected) difference between the queue lengths obtained by the algorithm and those obtained by the “genie.” Because queue regret cannot be larger than classical regret, results for the standard multiarmed bandit problem give algorithms for which queue regret increases no more than logarithmically in time. Our paper shows surprisingly more complex behavior. In particular, as long as the bandit algorithm’s queues have relatively long regenerative cycles, queue regret is similar to cumulative regret and scales (essentially) logarithmically. However, we show that this “early stage” of the queueing bandit eventually gives way to a “late stage,” where the optimal queue-regret scaling is O(1/t). We demonstrate an algorithm that (orderwise) achieves this asymptotic queue regret in the late stage. Our results are developed in a more general model that allows for multiple job classes as well.
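To make the setting concrete, the following is a minimal simulation sketch of the single-class queueing bandit described in the abstract. It is not the paper's algorithm: it pairs a standard UCB1 index rule (a plain bandit baseline) against the genie that always picks the best server, and estimates queue regret by Monte Carlo averaging of the queue-length difference. Arrival rate `lam`, service probabilities `mu`, and the function names are illustrative assumptions, not from the paper.

```python
import math
import random

def simulate(T, lam, mu, policy, seed):
    """Run one sample path of the discrete-time queue for T slots.

    lam: Bernoulli arrival probability per slot.
    mu:  list of per-server service success probabilities.
    policy: "genie" (always best server) or "ucb" (UCB1 on service outcomes).
    Returns the queue-length trajectory.
    """
    rng = random.Random(seed)
    K = len(mu)
    q = 0
    counts = [0] * K   # times each server was scheduled
    succ = [0] * K     # successful services per server
    trajectory = []
    for t in range(1, T + 1):
        if rng.random() < lam:          # Bernoulli arrival
            q += 1
        if q > 0:                       # at most one service attempt per slot
            if policy == "genie":
                k = max(range(K), key=lambda i: mu[i])
            elif min(counts) == 0:      # UCB1: try each server once first
                k = counts.index(0)
            else:                       # UCB1 index: mean + exploration bonus
                k = max(range(K), key=lambda i: succ[i] / counts[i]
                        + math.sqrt(2 * math.log(t) / counts[i]))
            served = rng.random() < mu[k]
            counts[k] += 1
            succ[k] += served
            if served:
                q -= 1
        trajectory.append(q)
    return trajectory

def queue_regret(T, lam, mu, runs=100):
    """Monte Carlo estimate of E[Q_ucb(t)] - E[Q_genie(t)] for t = 1..T."""
    reg = [0.0] * T
    for r in range(runs):
        qa = simulate(T, lam, mu, "ucb", seed=r)
        qg = simulate(T, lam, mu, "genie", seed=r)
        for t in range(T):
            reg[t] += (qa[t] - qg[t]) / runs
    return reg
```

Plotting `queue_regret(T, 0.3, [0.7, 0.5])` over a long horizon illustrates the two-phase behavior the paper analyzes: an early stage where regret grows while the bandit explores, followed by a late stage where regenerative (empty-queue) cycles let regret decay.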
Keywords: bandit algorithms; queueing systems; scheduling; regret (search for similar items in EconPapers)
Date: 2021
Downloads:
https://doi.org/10.1287/opre.2020.1995 (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:inm:oropre:v:69:y:2021:i:1:p:315-330
More articles in Operations Research from INFORMS.
Bibliographic data for series maintained by Chris Asher.