-
Ergodic Unobservable MDPs: Decidability of Approximation
Authors:
Krishnendu Chatterjee,
David Lurie,
Raimundo Saona,
Bruno Ziliotto
Abstract:
Unobservable Markov decision processes (UMDPs) serve as a prominent mathematical framework for modeling sequential decision-making problems. A key aspect in computational analysis is the consideration of decidability, which concerns the existence of algorithms. In general, the computation of the exact and approximated values is undecidable for UMDPs with the long-run average objective. Building on matrix product theory and ergodic properties, we introduce a novel subclass of UMDPs, termed ergodic UMDPs. Our main result demonstrates that approximating the value within this subclass is decidable. However, we show that the exact problem remains undecidable. Finally, we discuss the primary challenges of extending these results to partially observable Markov decision processes.
Submitted 21 May, 2024;
originally announced May 2024.
-
Prophet Inequalities Require Only a Constant Number of Samples
Authors:
Andrés Cristi,
Bruno Ziliotto
Abstract:
In a prophet inequality problem, $n$ independent random variables are presented to a gambler one by one. The gambler decides when to stop the sequence and obtains the most recent value as reward. We evaluate a stopping rule by the worst-case ratio between its expected reward and the expectation of the maximum variable. In the classic setting, the order is fixed, and the optimal ratio is known to be 1/2. Three variants of this problem have been extensively studied: the prophet-secretary model, where variables arrive in uniformly random order; the free-order model, where the gambler chooses the arrival order; and the i.i.d. model, where the distributions are all the same, rendering the arrival order irrelevant.
Most of the literature assumes that distributions are known to the gambler. Recent work has considered the question of what is achievable when the gambler has access only to a few samples per distribution. Surprisingly, in the fixed-order case, a single sample from each distribution is enough to approximate the optimal ratio, but this is not the case in any of the three variants.
We provide a unified proof that for all three variants of the problem, a constant number of samples (independent of $n$) for each distribution is enough to approximate the optimal ratios.
Prior to our work, this was known to be the case only in the i.i.d. variant. We complement our result by showing that our algorithms can be implemented in polynomial time.
A key ingredient in our proof is an existential result based on a minimax argument, which states that there must exist an algorithm that attains the optimal ratio and does not rely on the knowledge of the upper tail of the distributions. A second key ingredient is a refined sample-based version of a decomposition of the instance into "small" and "large" variables, first introduced by Liu et al. [EC'21].
Submitted 15 November, 2023;
originally announced November 2023.
-
Blackwell's Approachability with Time-Dependent Outcome Functions and Dot Products. Application to the Big Match
Authors:
Joon Kwon,
Bruno Ziliotto
Abstract:
Blackwell's approachability is a very general sequential decision framework where a Decision Maker obtains vector-valued outcomes, and aims at the convergence of the average outcome to a given "target" set. Blackwell gave a sufficient condition for the Decision Maker to have a strategy guaranteeing such convergence against an adversarial environment, as well as what we now call Blackwell's algorithm, which then ensures convergence. Blackwell's approachability has since been applied to numerous problems, in online learning and game theory in particular. We extend this framework by allowing the outcome function and the dot product to be time-dependent. We establish a general guarantee for the natural extension of Blackwell's algorithm to this framework. In the case where the target set is an orthant, we present a family of time-dependent dot products which yields different convergence speeds for each coordinate of the average outcome. We apply this framework to the Big Match (one of the most important toy examples of stochastic games), where an $ε$-uniformly optimal strategy for Player I is given by Blackwell's algorithm in a well-chosen auxiliary approachability problem.
Submitted 8 March, 2023;
originally announced March 2023.
-
Unknown I.I.D. Prophets: Better Bounds, Streaming Algorithms, and a New Impossibility
Authors:
José Correa,
Paul Dütting,
Felix Fischer,
Kevin Schewior,
Bruno Ziliotto
Abstract:
A prophet inequality states, for some $α\in[0,1]$, that the expected value achievable by a gambler who sequentially observes random variables $X_1,\dots,X_n$ and selects one of them is at least an $α$ fraction of the maximum value in the sequence. We obtain three distinct improvements for a setting that was first studied by Correa et al. (EC, 2019) and is particularly relevant to modern applications in algorithmic pricing. In this setting, the random variables are i.i.d. from an unknown distribution and the gambler has access to an additional $βn$ samples for some $β\geq 0$. We first give improved lower bounds on $α$ for a wide range of values of $β$; specifically, $α\geq(1+β)/e$ when $β\leq 1/(e-1)$, which is tight, and $α\geq 0.648$ when $β=1$, which improves on a bound of around $0.635$ due to Correa et al. (SODA, 2020). Adding to their practical appeal, specifically in the context of algorithmic pricing, we then show that the new bounds can be obtained even in a streaming model of computation and thus in situations where the use of relevant data is complicated by the sheer amount of data available. We finally establish that the upper bound of $1/e$ for the case without samples is robust to additional information about the distribution, and applies also to sequences of i.i.d. random variables whose distribution is itself drawn, according to a known distribution, from a finite set of known candidate distributions. This implies a tight prophet inequality for exchangeable sequences of random variables, answering a question of Hill and Kertz (Contemporary Mathematics, 1992), but leaves open the possibility of better guarantees when the number of candidate distributions is small, a setting we believe is of strong interest to applications.
Submitted 20 November, 2020; v1 submitted 12 July, 2020;
originally announced July 2020.
-
Finite-Memory Strategies in POMDPs with Long-Run Average Objectives
Authors:
Krishnendu Chatterjee,
Raimundo Saona,
Bruno Ziliotto
Abstract:
Partially observable Markov decision processes (POMDPs) are standard models for dynamic systems with probabilistic and nondeterministic behaviour in uncertain environments. We prove that in POMDPs with long-run average objective, the decision maker has approximately optimal strategies with finite memory. This implies notably that approximating the long-run value is recursively enumerable, as well as a weak continuity property of the value with respect to the transition function.
Submitted 28 September, 2022; v1 submitted 30 April, 2019;
originally announced April 2019.
-
Prophet Secretary Through Blind Strategies
Authors:
Jose Correa,
Raimundo Saona,
Bruno Ziliotto
Abstract:
In the classic prophet inequality, samples from independent random variables arrive online. A gambler that knows the distributions must decide at each point in time whether to stop and pick the current sample or to continue and lose that sample forever. The goal of the gambler is to maximize the expected value of what she picks, and the performance measure is the worst-case ratio between the expected value the gambler gets and what a prophet, who sees all the realizations in advance, gets. In the late seventies, Krengel and Sucheston, and Gairing (1977) established that this worst-case ratio is a universal constant equal to 1/2. In the last decade, prophet inequalities have resurged as an important problem due to their connections to posted price mechanisms, frequently used in online sales. A very interesting variant is the Prophet Secretary problem, in which the only difference is that the samples arrive in a uniformly random order. For this variant several algorithms achieve a constant of 1-1/e, and very recently this barrier was slightly improved. This paper analyzes strategies that set a nonincreasing sequence of thresholds to be applied at different times. The gambler stops the first time a sample surpasses the corresponding threshold. Specifically, we consider a class of strategies called blind quantile strategies. They consist in fixing a function which is used to define a sequence of thresholds once the instance is revealed. Our main result shows that they can achieve a constant of 0.665, improving upon the best known result of Azar et al. (2018), and on Beyhaghi et al. (2018) (order selection). Our proof analyzes precisely the underlying stopping time distribution, relying on Schur-convexity theory. We further prove that blind strategies cannot achieve better than 0.675. Finally, we prove that no algorithm for the gambler can achieve better than 0.732.
Submitted 12 March, 2019; v1 submitted 19 July, 2018;
originally announced July 2018.
-
Hidden Stochastic Games and Limit Equilibrium Payoffs
Authors:
Jérôme Renault,
Bruno Ziliotto
Abstract:
We consider 2-player stochastic games with perfectly observed actions, and study the limit, as the discount factor goes to one, of the equilibrium payoffs set. In the usual setup where current states are observed by the players, we show that the set of stationary equilibrium payoffs always converges, and provide a simple example where the set of equilibrium payoffs has no limit. We then introduce the more general model of hidden stochastic game, where the players publicly receive imperfect signals over current states. In this setup we present an example where not only the limit set of equilibrium payoffs does not exist, but there is no converging selection of equilibrium payoffs. This second example is robust in many aspects, in particular to perturbations of the payoffs and to the introduction of correlation or communication devices.
Submitted 10 December, 2014; v1 submitted 11 July, 2014;
originally announced July 2014.
-
Zero-sum repeated games: Counterexamples to the existence of the asymptotic value and the conjecture $\operatorname{maxmin}=\operatorname{lim}v_n$
Authors:
Bruno Ziliotto
Abstract:
Mertens [In Proceedings of the International Congress of Mathematicians (Berkeley, Calif., 1986) (1987) 1528-1577 Amer. Math. Soc.] proposed two general conjectures about repeated games: the first one is that, in any two-person zero-sum repeated game, the asymptotic value exists, and the second one is that, when Player 1 is more informed than Player 2, in the long run Player 1 is able to guarantee the asymptotic value. We disprove these two long-standing conjectures by providing an example of a zero-sum repeated game with public signals and perfect observation of the actions, where the value of the $λ$-discounted game does not converge when $λ$ goes to 0. The aforementioned example involves seven states, two actions and two signals for each player. Remarkably, players observe the payoffs, and play in turn.
Submitted 15 March, 2016; v1 submitted 21 May, 2013;
originally announced May 2013.