Policy Learning with Adaptively Collected Data

Ruohan Zhan, Zhimei Ren, Susan Athey and Zhengyuan Zhou
Additional contact information
Ruohan Zhan: Department of Industrial Engineering and Decision Analytics, Hong Kong University of Science and Technology, Hong Kong
Zhimei Ren: Department of Statistics and Data Science, Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104
Zhengyuan Zhou: Stern School of Business, New York University, New York, New York 10012

Management Science, 2024, vol. 70, issue 8, 5270-5297

Abstract: In a wide variety of applications, including healthcare, bidding in first-price auctions, digital recommendations, and online education, it can be beneficial to learn a policy that assigns treatments to individuals based on their characteristics. The growing policy-learning literature focuses on settings in which policies are learned from historical data in which the treatment assignment rule is fixed throughout the data-collection period. However, adaptive data collection is becoming more common in practice from two primary sources: (1) data collected from adaptive experiments that are designed to improve inferential efficiency and (2) data collected from production systems that progressively evolve an operational policy to improve performance over time (e.g., contextual bandits). Yet adaptivity complicates the problem of learning an optimal policy ex post for two reasons: first, samples are dependent and, second, an adaptive assignment rule may not assign each treatment to each type of individual sufficiently often. In this paper, we address these challenges. We propose an algorithm based on generalized augmented inverse propensity weighted (AIPW) estimators, which nonuniformly reweight the elements of a standard AIPW estimator to control worst-case estimation variance. We establish a finite-sample regret upper bound for our algorithm and complement it with a regret lower bound that quantifies the fundamental difficulty of policy learning with adaptive data. When equipped with the best weighting scheme, our algorithm achieves minimax rate-optimal regret guarantees even with diminishing exploration. Finally, we demonstrate our algorithm’s effectiveness using both synthetic data and public benchmark data sets.
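
The generalized AIPW construction described in the abstract can be sketched in a few lines. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: it assumes logged contexts, actions, rewards, the (time-varying) assignment probabilities recorded during adaptive collection, and a fitted outcome model, and it uses a square-root weighting of the per-observation propensities as one variance-stabilizing choice in the spirit of the weighting schemes the paper studies. All function and variable names (weighted_aipw_value, mu_hat, etc.) are hypothetical.

```python
import numpy as np

def weighted_aipw_value(contexts, actions, rewards, propensities, mu_hat, policy):
    """Sketch of a nonuniformly weighted AIPW estimate of a policy's value.

    Hypothetical interface (not the authors' code):
      contexts:     (T, d) array of observed covariates X_t
      actions:      (T,) int array of logged treatments A_t
      rewards:      (T,) array of observed outcomes Y_t
      propensities: (T, K) array of assignment probabilities e_t(X_t, a)
                    recorded at collection time (adaptive, so they vary with t)
      mu_hat:       function (x, a) -> estimated mean outcome
      policy:       function x -> action chosen by the candidate policy
    """
    T = len(rewards)
    pi_actions = np.array([policy(contexts[t]) for t in range(T)], dtype=int)

    # Standard AIPW score per observation: outcome-model prediction plus an
    # inverse-propensity-weighted residual correction.
    direct = np.array([mu_hat(contexts[t], pi_actions[t]) for t in range(T)])
    match = (actions == pi_actions).astype(float)
    e_pi = propensities[np.arange(T), pi_actions]
    resid = rewards - np.array([mu_hat(contexts[t], actions[t]) for t in range(T)])
    scores = direct + match / e_pi * resid

    # Nonuniform weights h_t: here a square-root ("variance-stabilizing")
    # scheme that downweights observations whose assignment probability under
    # the candidate policy has decayed, controlling worst-case variance.
    h = np.sqrt(e_pi)
    return np.sum(h * scores) / np.sum(h)
```

Policy learning then amounts to maximizing this weighted value estimate over a class of candidate policies; with uniform weights h_t = 1 the expression reduces to the standard AIPW estimator.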

Keywords: off-line policy learning; adaptive data collection; minimax optimality; personalized decision making; contextual bandits
Date: 2024

Downloads: http://dx.doi.org/10.1287/mnsc.2023.4921 (application/pdf)

Related works:
Working Paper: Policy Learning with Adaptively Collected Data (2022)
Working Paper: Policy Learning with Adaptively Collected Data (2021)


Persistent link: https://EconPapers.repec.org/RePEc:inm:ormnsc:v:70:y:2024:i:8:p:5270-5297


More articles in Management Science from INFORMS. Contact information at EDIRC.
Bibliographic data for series maintained by Chris Asher.

 
Handle: RePEc:inm:ormnsc:v:70:y:2024:i:8:p:5270-5297