Fast-exploring reinforcement learning with applications to stochastic networks

D. Mastropietro, U. Ayesta, M. Jonckheere and S. Majewski
Additional contact information
D. Mastropietro: CNRS
U. Ayesta: CNRS
M. Jonckheere: CNRS
S. Majewski: Ecole Polytechnique

Queueing Systems: Theory and Applications, 2025, vol. 109, issue 3, No 7, 49 pages

Abstract: We introduce FVRL (Fleming–Viot reinforcement learning), a reinforcement learning algorithm for optimisation problems in which a long-term objective is largely influenced by states that are very rarely observed under all policies. In this context, the usual discovery techniques, including importance sampling, are inapplicable because no alternative policy exists that increases the observed frequency of the rare states. We instead propose a novel approach based on Fleming–Viot particle systems, a family of stochastic processes evolving simultaneously under the same law, which exploits prior knowledge of the environment to boost exploration of the rare states. A renewal theory argument allows us to consistently estimate the stationary probability of the rare states from excursions that have considerably lower sample complexity than usual Monte Carlo explorations. We demonstrate how to combine this estimator with policy gradient learning to construct the FVRL algorithm, which is suited to efficiently solving problems where the objective is expressed as a long-run expectation, such as the long-run expected reward. We show that the FVRL algorithm converges to a local optimiser of the parameterised objective function, and illustrate the method on two optimisation problems that aim at minimising the long-run expected cost under admission control policies of threshold type: a simple M/M/1 queue and a two-job-class loss network. Our experimental results show that, at the same sample complexity, FVRL outperforms a vanilla Monte Carlo reinforcement learning method by converging to the optimal thresholds considerably faster.
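The Fleming–Viot mechanism behind the estimator can be sketched in a few lines. One treats a frequently visited set (for the M/M/1 queue length, the empty state) as absorbing and runs N coupled copies of the chain; whenever a copy is absorbed, it instantly restarts from the current position of another, uniformly chosen copy. The empirical occupation of the particles then concentrates on excursions away from the absorbing set, and a renewal-reward argument (the stationary probability of a state equals the expected time spent there per regeneration cycle divided by the expected cycle length) turns such excursions into a consistent estimator of rare-state probabilities. The Python sketch below is purely illustrative and is not the authors' implementation; the uniformised jump-chain dynamics and all parameter names are our own assumptions.

```python
import random

def fv_occupation_mm1(n_particles=100, lam=0.7, mu=1.0, steps=200_000, seed=1):
    """Illustrative Fleming-Viot particle system for an M/M/1 queue length.

    The empty state {0} is treated as absorbing: a particle that would
    enter it is instantly relocated to the position of another, uniformly
    chosen particle.  The returned empirical occupation thus approximates
    the queue-length distribution conditioned on avoiding 0, boosting
    visits to rarely observed high states.  This is a toy sketch, not
    the FVRL algorithm from the paper.
    """
    rng = random.Random(seed)
    particles = [1] * n_particles          # start just outside the absorbing set
    p_birth = lam / (lam + mu)             # uniformised jump chain: up w.p. p_birth
    occupation = {}
    for _ in range(steps):
        i = rng.randrange(n_particles)     # move one uniformly chosen particle
        x = particles[i] + (1 if rng.random() < p_birth else -1)
        if x == 0:                         # Fleming-Viot jump on absorption:
            j = rng.randrange(n_particles - 1)
            x = particles[j + 1 if j >= i else j]   # copy another particle's state
        particles[i] = x
        occupation[x] = occupation.get(x, 0) + 1
    total = sum(occupation.values())
    return {s: c / total for s, c in sorted(occupation.items())}

if __name__ == "__main__":
    occ = fv_occupation_mm1()
    for state, freq in list(occ.items())[:10]:
        print(f"state {state:2d}: {freq:.4f}")
```

In the paper, this estimator is combined with policy gradient learning over parameterised admission thresholds; a smoothed (for instance sigmoidal) acceptance rule around the threshold is one standard way to make the long-run expected cost differentiable in the threshold parameter, though the specific parameterisation used by the authors is not reproduced here.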

Keywords: Fleming–Viot particle system; Policy gradient; Queueing system; Renewal theory; Threshold-type policy (MSC: 60J28; 60K20; 65C35; 90C40)
Date: 2025

Downloads: http://link.springer.com/10.1007/s11134-025-09950-5 (abstract, text/html)
Access to the full text of the articles in this series is restricted.

Persistent link: https://EconPapers.repec.org/RePEc:spr:queues:v:109:y:2025:i:3:d:10.1007_s11134-025-09950-5

Ordering information: This journal article can be ordered from
http://www.springer.com/journal/11134/

DOI: 10.1007/s11134-025-09950-5

Queueing Systems: Theory and Applications is currently edited by Sergey Foss

More articles in Queueing Systems: Theory and Applications from Springer
Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.

Handle: RePEc:spr:queues:v:109:y:2025:i:3:d:10.1007_s11134-025-09950-5