Small-Loss Bounds for Online Learning with Partial Information
Thodoris Lykouris,
Karthik Sridharan and
Éva Tardos
Additional contact information
Thodoris Lykouris: Massachusetts Institute of Technology, Cambridge, Massachusetts 02139
Karthik Sridharan: Cornell University, Ithaca, New York 14850
Éva Tardos: Cornell University, Ithaca, New York 14850
Mathematics of Operations Research, 2022, vol. 47, issue 3, 2186-2218
Abstract:
We consider the problem of adversarial (nonstochastic) online learning with partial-information feedback, in which, at each round, a decision maker selects an action from a finite set of alternatives. We develop a black-box approach for such problems in which the learner observes as feedback only the losses of a subset of the actions that includes the selected action. When the losses of actions are nonnegative, under the graph-based feedback model introduced by Mannor and Shamir, we offer algorithms that attain the so-called "small-loss" Õ(√(α L⋆)) regret bounds with high probability, where α is the independence number of the feedback graph and L⋆ is the loss of the best action. Prior to our work, there was no data-dependent guarantee for general feedback graphs, even for pseudo-regret (that is, a bound without dependence on the number of actions, exploiting the increased information feedback). Taking advantage of the black-box nature of our technique, we extend our results to many other applications, such as combinatorial semi-bandits (including routing in networks), contextual bandits (even with an infinite comparator class), and learning with slowly changing (shifting) comparators. In the special case of multi-armed bandit and combinatorial semi-bandit problems, we provide optimal small-loss, high-probability regret guarantees of Õ(√(d L⋆)), where d is the number of actions, answering open questions of Neu. Previous bounds for multi-armed bandits and semi-bandits were known only for pseudo-regret and only in expectation. We also offer an optimal Õ(√(κ L⋆)) regret guarantee for fixed feedback graphs with clique-partition number at most κ.
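To make the graph-based feedback model concrete, the sketch below implements a generic Exp3-style learner under Mannor–Shamir feedback: playing an action reveals the losses of that action and its graph neighbors, and each observed loss is importance-weighted by its total observation probability. This is only an illustrative baseline under assumed conventions (directed neighbor sets, exponential weights, fixed learning rate `eta`), not the paper's black-box reduction, which achieves the stronger small-loss bounds stated above.

```python
import math
import random

def exp3_graph(loss_fn, neighbors, d, T, eta):
    """Exp3-style learning with graph feedback (illustrative sketch).

    loss_fn(t, i) -> loss of action i at round t, in [0, 1]
    neighbors[i]  -> set of actions whose losses are revealed when i is played
    Returns the learner's cumulative realized loss over T rounds.
    """
    weights = [1.0] * d
    total_loss = 0.0
    for t in range(T):
        z = sum(weights)
        p = [w / z for w in weights]
        # Sample an action from the exponential-weights distribution.
        i = random.choices(range(d), weights=p)[0]
        total_loss += loss_fn(t, i)
        # Update every action whose loss was revealed this round.
        for j in range(d):
            if j == i or j in neighbors[i]:
                # Probability that j's loss is observed: mass on j itself
                # plus mass on all actions having j as a neighbor.
                obs_prob = sum(p[k] for k in range(d)
                               if k == j or j in neighbors[k])
                est = loss_fn(t, j) / obs_prob  # importance-weighted estimate
                weights[j] *= math.exp(-eta * est)
    return total_loss
```

With the complete graph this recovers full-information exponential weights; with the empty graph it degrades to a bandit-style update where only the played action's loss is estimated.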
Keywords: Primary: 68Q32; online learning; feedback graphs; bandit algorithms; semi-bandits; contextual bandits; partial information; regret bounds; small-loss bounds; first-order bounds; high probability
Date: 2022
Downloads: http://dx.doi.org/10.1287/moor.2021.1204 (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:inm:ormoor:v:47:y:2022:i:3:p:2186-2218