Efficiently Breaking the Curse of Horizon in Off-Policy Evaluation with Double Reinforcement Learning
Nathan Kallus and
Masatoshi Uehara
Additional contact information
Nathan Kallus: Cornell University, New York, New York 10044
Masatoshi Uehara: Cornell University, New York, New York 10044
Operations Research, 2022, vol. 70, issue 6, 3282-3302
Abstract:
Off-policy evaluation (OPE) in reinforcement learning is notoriously difficult in long- and infinite-horizon settings due to diminishing overlap between behavior and target policies. In this paper, we study the role of Markovian and time-invariant structure in efficient OPE. We first derive the efficiency bounds and efficient influence functions for OPE when one assumes each of these structures. This precisely characterizes the curse of horizon: in time-variant processes, OPE is feasible only in the near-on-policy setting, where behavior and target policies are sufficiently similar. But in time-invariant Markov decision processes, our bounds show that truly off-policy evaluation is feasible, even with just one dependent trajectory, and they provide the limits of how well we could hope to do. We develop a new estimator based on double reinforcement learning (DRL) that leverages this structure for OPE. Our DRL estimator simultaneously uses estimated stationary density ratios and q-functions; it remains efficient when both are estimated at slow, nonparametric rates, and it remains consistent when either one is estimated consistently. We investigate these properties and the performance benefits of leveraging the problem structure for more efficient OPE.
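To make the doubly robust structure described above concrete, the following display is a minimal sketch of the generic form that DRL-style estimates take in the discounted, time-invariant MDP setting. The notation is assumed here for illustration and follows common usage in this literature rather than quoting the paper's exact display: \hat{w} denotes the estimated stationary density ratio, \hat{q} the estimated q-function under the target policy, \gamma the discount factor, and d_0 the initial-state distribution.

% A minimal sketch of a doubly robust (DRL-style) OPE estimate in a
% discounted, time-invariant MDP. Notation is assumed for illustration:
%   \hat{w}(s,a): estimated stationary density ratio d^{\pi_e}(s,a)/d^{\pi_b}(s,a)
%   \hat{q}(s,a): estimated q-function under the target policy \pi_e
%   \hat{v}(s) = \mathbb{E}_{a \sim \pi_e(\cdot \mid s)}[\hat{q}(s,a)]
\[
  \hat{J}_{\mathrm{DRL}}
  = (1-\gamma)\,\mathbb{E}_{s_0 \sim d_0}\!\left[\hat{v}(s_0)\right]
  + \frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}
    \hat{w}(s_{it}, a_{it})
    \left( r_{it} + \gamma\,\hat{v}(s_{i,t+1}) - \hat{q}(s_{it}, a_{it}) \right)
\]

The first term is a plug-in value estimate from the q-function; the second is a correction term whose population mean is zero when either \hat{w} or \hat{q} is correct, which is the source of the consistency-under-either-nuisance property described in the abstract.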
Keywords: Machine Learning and Data Science; off-policy evaluation; Markov decision processes; infinite horizon; semiparametric efficiency
Date: 2022
Downloads: http://dx.doi.org/10.1287/opre.2021.2249 (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:inm:oropre:v:70:y:2022:i:6:p:3282-3302