Off-Policy Temporal Difference Learning with Bellman Residuals

Shangdong Yang, Dingyuanhao Sun and Xingguo Chen
Additional contact information
Shangdong Yang: School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
Dingyuanhao Sun: School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
Xingguo Chen: School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China

Mathematics, 2024, vol. 12, issue 22, 1-18

Abstract: In reinforcement learning, off-policy temporal difference methods have attracted significant attention for their flexibility in reusing existing data. However, traditional off-policy temporal difference methods often suffer from poor convergence and stability on complex problems. To address these issues, this paper proposes an off-policy temporal difference algorithm with Bellman residuals (TDBR). By incorporating Bellman residuals, the proposed algorithm improves the convergence and stability of the off-policy learning process. The paper first introduces the basic concepts of reinforcement learning and value function approximation, highlighting the role of Bellman residuals in off-policy learning. It then details the theoretical foundation and implementation of the TDBR algorithm. Experimental results in multiple benchmark environments show that TDBR significantly outperforms traditional methods in both convergence speed and solution quality. Overall, TDBR provides an effective and stable solution for off-policy reinforcement learning with broad application prospects. Future work can further optimize the algorithm's parameters and extend it to continuous state and action spaces to enhance its applicability and performance in real-world problems.
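The abstract does not reproduce the paper's update rule, but the general idea it describes — driving the squared Bellman residual toward zero with an off-policy correction — can be illustrated with a minimal residual-gradient-style TD(0) sketch under linear value approximation. All names, the toy two-state chain, and the use of an importance-sampling ratio are assumptions for illustration only, not the paper's exact TDBR method:

```python
import numpy as np

def tdbr_step(w, phi_s, phi_s2, r, gamma, rho, alpha):
    """One residual-style update on linear values V(s) = phi(s) @ w.

    phi_s / phi_s2 : feature vectors of current and next state
    r, gamma       : reward and discount factor
    rho            : importance-sampling ratio pi(a|s) / mu(a|s)
                     (off-policy correction; 1.0 when behavior = target)
    alpha          : step size
    """
    # Bellman residual (the TD error for this transition)
    delta = r + gamma * phi_s2 @ w - phi_s @ w
    # Residual-gradient step: descend on 0.5 * (rho * delta)^2,
    # differentiating through BOTH value estimates (unlike standard TD,
    # which treats the bootstrapped target as a constant).
    grad = rho * delta * (gamma * phi_s2 - phi_s)
    return w - alpha * grad

# Toy deterministic chain: state 0 -> state 1 with reward 1,
# state 1 absorbing with reward 0, one-hot features.
w = np.zeros(2)
phi = np.eye(2)
for _ in range(20000):
    w = tdbr_step(w, phi[0], phi[1], 1.0, 0.9, 1.0, 0.1)
    w = tdbr_step(w, phi[1], phi[1], 0.0, 0.9, 1.0, 0.1)

# w approaches approximately [1.0, 0.0], the true values
# V(0) = 1 + 0.9 * V(1) and V(1) = 0 under gamma = 0.9.
print(w)
```

Because the transitions in this toy chain are deterministic, minimizing the squared residual recovers the true value function; on stochastic transitions a plain residual-gradient step is biased unless double sampling or a related correction is used, which is one motivation for more refined residual-based algorithms such as the TDBR method the abstract describes.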

Keywords: reinforcement learning; value function approximation; stability; off-policy; Bellman residual
JEL-codes: C
Date: 2024

Downloads: (external link)
https://www.mdpi.com/2227-7390/12/22/3603/pdf (application/pdf)
https://www.mdpi.com/2227-7390/12/22/3603/ (text/html)



Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:12:y:2024:i:22:p:3603-:d:1523394


Mathematics is currently edited by Ms. Emma He

More articles in Mathematics from MDPI
Bibliographic data for series maintained by MDPI Indexing Manager.

 
Page updated 2025-03-19
Handle: RePEc:gam:jmathe:v:12:y:2024:i:22:p:3603-:d:1523394