Offline Contextual Bandits in the Presence of New Actions
Ren Kishimoto,
Tatsuhiro Shimizu,
Kazuki Kawamura,
Takanori Muroi,
Yusuke Narita,
Yuki Sasamoto,
Kei Tateno,
Takuma Udagawa and
Yuta Saito
Additional contact information
Ren Kishimoto: Institute of Science Tokyo
Tatsuhiro Shimizu: Yale University
Kazuki Kawamura: Sony Group Corporation
Takanori Muroi: Sony Group Corporation
Yusuke Narita: Yale University
Yuki Sasamoto: Sony Group Corporation
Kei Tateno: Sony Group Corporation
Takuma Udagawa: Sony Group Corporation
Yuta Saito: Cornell University
No 2456, Cowles Foundation Discussion Papers from Cowles Foundation for Research in Economics, Yale University
Abstract:
Automated decision-making algorithms drive applications in domains such as recommendation systems and search engines. These algorithms often rely on off-policy learning (OPL) in contextual bandits. Conventionally, OPL selects actions that maximize the expected reward within an existing action set. However, in many real-world scenarios, actions (such as news articles or video content) change continuously, and the action space evolves over time relative to when the logged data were collected. We define actions introduced after the logging policy was deployed as new actions and focus on the problem of OPL with new actions. Existing OPL methods cannot learn or select new actions because no relevant data are logged for them. To address this limitation, we propose a new OPL method that leverages action features. In particular, we first introduce the Local Combination PseudoInverse (LCPI) estimator for the policy gradient, generalizing the PseudoInverse estimator originally proposed for off-policy evaluation of slate bandits. LCPI controls the trade-off between the reward-modeling condition and the data-collection condition on the action features, capturing interaction effects among different dimensions of the action features. Furthermore, we propose a generalized algorithm called Policy Optimization for Effective New Actions (PONA), which integrates LCPI, a component specialized for selecting new actions, with the Doubly Robust (DR) estimator, which excels at learning within the existing action set. PONA is defined as a weighted sum of the LCPI and DR estimators, optimizing the selection of both existing and new actions, and the proportion of new actions selected can be adjusted by controlling the weight parameter.
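The combination defining PONA can be read as a convex mixture of the two policy-gradient estimators. A minimal sketch, where the weight symbol \lambda and the gradient notation are illustrative assumptions based only on the abstract above, not taken from the paper itself:

\nabla_{\theta} \widehat{V}_{\mathrm{PONA}}(\pi_{\theta}) = \lambda \, \nabla_{\theta} \widehat{V}_{\mathrm{LCPI}}(\pi_{\theta}) + (1 - \lambda) \, \nabla_{\theta} \widehat{V}_{\mathrm{DR}}(\pi_{\theta}), \qquad \lambda \in [0, 1].

Under this reading, putting more weight on the LCPI term shifts the learned policy toward new actions, while more weight on the DR term favors existing actions; adjusting the weight parameter thus controls the proportion of new action selections.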
Pages: 7 pages
Date: 2025-08-20
Downloads: https://cowles.yale.edu/sites/default/files/2025-09/d2456.pdf (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:cwl:cwldpp:2456
Ordering information: This working paper can be ordered from
Cowles Foundation, Yale University, Box 208281, New Haven, CT 06520-8281 USA