Long-Term Values in Markov Decision Processes and Repeated Games, and a New Distance for Probability Spaces
Jérôme Renault and
Xavier Venel ()
Additional contact information
Xavier Venel: Paris School of Economics; and Université Paris 1 Panthéon-Sorbonne, Paris, France
Mathematics of Operations Research, 2017, vol. 42, issue 2, 349-376
Abstract:
We study long-term Markov decision processes (MDPs) and gambling houses, with applications to any partial observation MDPs with finitely many states and zero-sum repeated games with an informed controller. We consider a decision maker who is maximizing the weighted sum ∑ t ≥ 1 𝜃 t r t , where r t is the expected reward of the t -th stage. We prove the existence of a very strong notion of long-term value called general uniform value , representing the fact that the decision maker can play well independently of the evaluations (𝜃 t ) t ≥ 1 over stages, provided the total variation (or impatience) ∑ t ≥ 1 | 𝜃 t + 1 − 𝜃 t | is small enough. This result generalizes previous results of the literature that focus on arithmetic means and discounted evaluations. Moreover, we give a variational characterization of the general uniform value via the introduction of appropriate invariant measures for the decision problems, generalizing the fundamental theorem of gambling or the Aumann–Maschler cav( u ) formula for repeated games with incomplete information. Apart the introduction of appropriate invariant measures, the main innovation in our proofs is the introduction of a new metric, d * , such that partial observation MDPs and repeated games with an informed controller may be associated with auxiliary problems that are nonexpansive with respect to d * . Given two Borel probabilities over a compact subset X of a normed vector space, we define d ∗ ( u , v ) = sup f ∈ D 1 | u ( f ) − v ( f ) | , where D 1 is the set of functions satisfying ∀ x , y ∈ X , ∀ a , b ≥ 0, af ( x ) − bf ( y ) ≤ ‖ ax − by ‖. The particular case where X is a simplex endowed with the L 1 -norm is particularly interesting: d * is the largest distance over the probabilities with finite support over X , which makes every disintegration nonexpansive. Moreover, we obtain a Kantorovich–Rubinstein-type duality formula for d * ( u , v ), involving couples of measures ( α , β ) over X × X such that the first marginal of α is u and the second marginal of β is v .
Keywords: gambling houses; partial observation Markov decision processes; repeated games; Wasserstein metric; uniform value; characterization of the value (search for similar items in EconPapers)
Date: 2017
References: View references in EconPapers View complete reference list from CitEc
Citations: View citations in EconPapers (8)
Downloads: (external link)
https://doi.org/10.1287/moor.2016.0814 (application/pdf)
Related works:
Working Paper: Long-term values in Markov Decision Processes and Repeated Games, and a new distance for probability spaces (2017)
Working Paper: Long-term values in Markov Decision Processes and Repeated Games, and a new distance for probability spaces (2017)
Working Paper: Long-term values in Markov Decision Processes and Repeated Games, and a new distance for probability spaces (2017)
This item may be available elsewhere in EconPapers: Search for items with the same title.
Export reference: BibTeX
RIS (EndNote, ProCite, RefMan)
HTML/Text
Persistent link: https://EconPapers.repec.org/RePEc:inm:ormoor:v:42:y:2017:i:2:p:349-376
Access Statistics for this article
More articles in Mathematics of Operations Research from INFORMS Contact information at EDIRC.
Bibliographic data for series maintained by Chris Asher ().