M-Learning: Heuristic Approach for Delayed Rewards in Reinforcement Learning

Charry, Cesar Andrey Perdomo; Cortes, Marlon Sneider Mora; Perdomo, Oscar J.

Cesar Andrey Perdomo Charry, Marlon Sneider Mora Cortes and Oscar J. Perdomo
Additional contact information
Cesar Andrey Perdomo Charry: Faculty of Engineering, Universidad Distrital Francisco José de Caldas, Bogotá 111611, Colombia
Marlon Sneider Mora Cortes: Faculty of Engineering, Universidad Distrital Francisco José de Caldas, Bogotá 111611, Colombia
Oscar J. Perdomo: Department of Electrical and Electronic, Universidad Nacional de Colombia, Bogotá 111321, Colombia

Mathematics, 2025, vol. 13, issue 13, 1-21

Abstract: Current reinforcement learning methods require extensive computational resources. Algorithms such as Deep Q-Network (DQN) have achieved outstanding results and advanced the field. However, the need to tune thousands of parameters and run millions of training episodes remains a significant challenge. This paper presents a comparative analysis of the Q-Learning algorithm, which laid the foundations for Deep Q-Learning, and our proposed method, termed M-Learning. The comparison uses Markov Decision Processes with delayed rewards as a general test bench. First, the paper describes the main challenges of implementing Q-Learning, particularly those arising from its multiple parameters. Then, the foundations of the proposed heuristic are presented, including its formulation, and the algorithm is described in detail. Both algorithms were compared by training them in the Frozen Lake environment. The experimental results, together with an analysis of the best solutions, show that our proposal requires fewer episodes and exhibits less variability in its outcomes. Specifically, M-Learning trains agents 30.7% faster in the deterministic environment and 61.66% faster in the stochastic environment. It also achieves greater consistency, reducing the standard deviation of scores by 58.37% and 49.75% in the deterministic and stochastic settings, respectively. The code will be made available in a GitHub repository upon this paper’s publication.
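
For orientation, the Q-Learning baseline named in the abstract can be sketched as a small tabular agent on a deterministic, Frozen Lake style grid with a single delayed reward at the goal. This is not the authors' code and it does not implement M-Learning, whose formulation is not given in this record. The grid layout, the function names (`step`, `train_q_learning`, `greedy_rollout`) and the values of `alpha`, `gamma`, `epsilon` and `episodes` are all assumptions chosen for illustration; they stand for the kind of parameters the abstract says must be tuned.

```python
import random

# Assumed 4x4 map: S = start, F = frozen (safe), H = hole, G = goal.
GRID = ["SFFF", "FHFH", "FFFH", "HFFG"]
N = 4
# Assumed action encoding: 0 = left, 1 = down, 2 = right, 3 = up.
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}


def step(state, action):
    """Deterministic transition; the only non-zero reward is 1.0 at the goal."""
    row, col = divmod(state, N)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), N - 1)  # moves into a wall leave the agent in place
    col = min(max(col + d_col, 0), N - 1)
    cell = GRID[row][col]
    reward = 1.0 if cell == "G" else 0.0
    done = cell in "GH"  # holes and the goal both end the episode
    return row * N + col, reward, done


def train_q_learning(episodes=5000, alpha=0.1, gamma=0.95, epsilon=0.1,
                     max_steps=100, seed=0):
    """Tabular Q-Learning with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    q = [[0.0] * 4 for _ in range(N * N)]
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            if rng.random() < epsilon:
                action = rng.randrange(4)  # explore
            else:
                best = max(q[state])       # exploit, breaking ties at random
                action = rng.choice([a for a in range(4) if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # Standard Q-Learning target: bootstrap from the best next action.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


def greedy_rollout(q, max_steps=100):
    """Follow the greedy policy from the start state and return the total reward."""
    state, total = 0, 0.0
    for _ in range(max_steps):
        action = max(range(4), key=lambda a: q[state][a])
        state, reward, done = step(state, action)
        total += reward
        if done:
            break
    return total


if __name__ == "__main__":
    q_table = train_q_learning()
    print("greedy return from the start state:", greedy_rollout(q_table))
```

Because the reward arrives only at the goal, value estimates must propagate backwards from the goal to the start over many episodes, which is why the number of training episodes is the natural cost measure in a comparison of this kind.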

Keywords: reinforcement learning; exploration–exploitation dilemma; Q-learning; frozen lake; heuristic approach
JEL-codes: C
Date: 2025

Downloads: (external link)
https://www.mdpi.com/2227-7390/13/13/2108/pdf (application/pdf)
https://www.mdpi.com/2227-7390/13/13/2108/ (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:13:y:2025:i:13:p:2108-:d:1689127

Mathematics is currently edited by Ms. Emma He

Bibliographic data for series maintained by MDPI Indexing Manager.