-
Fast Last-Iterate Convergence of Learning in Games Requires Forgetful Algorithms
Authors:
Yang Cai,
Gabriele Farina,
Julien Grand-Clément,
Christian Kroer,
Chung-Wei Lee,
Haipeng Luo,
Weiqiang Zheng
Abstract:
Self-play via online learning is one of the premier ways to solve large-scale two-player zero-sum games, both in theory and practice. Particularly popular algorithms include optimistic multiplicative weights update (OMWU) and optimistic gradient-descent-ascent (OGDA). While both algorithms enjoy $O(1/T)$ ergodic convergence to Nash equilibrium in two-player zero-sum games, OMWU offers several adva…
▽ More
Self-play via online learning is one of the premier ways to solve large-scale two-player zero-sum games, both in theory and practice. Particularly popular algorithms include optimistic multiplicative weights update (OMWU) and optimistic gradient-descent-ascent (OGDA). While both algorithms enjoy $O(1/T)$ ergodic convergence to Nash equilibrium in two-player zero-sum games, OMWU offers several advantages including logarithmic dependence on the size of the payoff matrix and $\widetilde{O}(1/T)$ convergence to coarse correlated equilibria even in general-sum games. However, in terms of last-iterate convergence in two-player zero-sum games, an increasingly popular topic in this area, OGDA guarantees that the duality gap shrinks at a rate of $O(1/\sqrt{T})$, while the best existing last-iterate convergence for OMWU depends on some game-dependent constant that could be arbitrarily large. This begs the question: is this potentially slow last-iterate convergence an inherent disadvantage of OMWU, or is the current analysis too loose? Somewhat surprisingly, we show that the former is true. More generally, we prove that a broad class of algorithms that do not forget the past quickly all suffer the same issue: for any arbitrarily small $δ>0$, there exists a $2\times 2$ matrix game such that the algorithm admits a constant duality gap even after $1/δ$ rounds. This class of algorithms includes OMWU and other standard optimistic follow-the-regularized-leader algorithms.
△ Less
Submitted 15 June, 2024;
originally announced June 2024.
-
Extensive-Form Game Solving via Blackwell Approachability on Treeplexes
Authors:
Darshan Chakrabarti,
Julien Grand-Clément,
Christian Kroer
Abstract:
In this paper, we introduce the first algorithmic framework for Blackwell approachability on the sequence-form polytope, the class of convex polytopes capturing the strategies of players in extensive-form games (EFGs). This leads to a new class of regret-minimization algorithms that are stepsize-invariant, in the same sense as the Regret Matching and Regret Matching$^+$ algorithms for the simplex.…
▽ More
In this paper, we introduce the first algorithmic framework for Blackwell approachability on the sequence-form polytope, the class of convex polytopes capturing the strategies of players in extensive-form games (EFGs). This leads to a new class of regret-minimization algorithms that are stepsize-invariant, in the same sense as the Regret Matching and Regret Matching$^+$ algorithms for the simplex. Our modular framework can be combined with any existing regret minimizer over cones to compute a Nash equilibrium in two-player zero-sum EFGs with perfect recall, through the self-play framework. Leveraging predictive online mirror descent, we introduce Predictive Treeplex Blackwell$^+$ (PTB$^+$), and show a $O(1/\sqrt{T})$ convergence rate to Nash equilibrium in self-play. We then show how to stabilize PTB$^+$ with a stepsize, resulting in an algorithm with a state-of-the-art $O(1/T)$ convergence rate. We provide an extensive set of experiments to compare our framework with several algorithmic benchmarks, including CFR$^+$ and its predictive variant, and we highlight interesting connections between practical performance and the stepsize-dependence or stepsize-invariance properties of classical algorithms.
△ Less
Submitted 7 March, 2024;
originally announced March 2024.
-
Beyond discounted returns: Robust Markov decision processes with average and Blackwell optimality
Authors:
Julien Grand-Clement,
Marek Petrik,
Nicolas Vieille
Abstract:
Robust Markov Decision Processes (RMDPs) are a widely used framework for sequential decision-making under parameter uncertainty. RMDPs have been extensively studied when the objective is to maximize the discounted return, but little is known for average optimality (optimizing the long-run average of the rewards obtained over time) and Blackwell optimality (remaining discount optimal for all discou…
▽ More
Robust Markov Decision Processes (RMDPs) are a widely used framework for sequential decision-making under parameter uncertainty. RMDPs have been extensively studied when the objective is to maximize the discounted return, but little is known for average optimality (optimizing the long-run average of the rewards obtained over time) and Blackwell optimality (remaining discount optimal for all discount factors sufficiently close to 1). In this paper, we prove several foundational results for RMDPs beyond the discounted return. We show that average optimal policies can be chosen stationary and deterministic for sa-rectangular RMDPs but, perhaps surprisingly, that history-dependent (Markovian) policies strictly outperform stationary policies for average optimality in s-rectangular RMDPs. We also study Blackwell optimality for sa-rectangular RMDPs, where we show that {\em approximate} Blackwell optimal policies always exist, although Blackwell optimal policies may not exist. We also provide a sufficient condition for their existence, which encompasses virtually any examples from the literature. We then discuss the connection between average and Blackwell optimality, and we describe several algorithms to compute the optimal average return. Interestingly, our approach leverages the connections between RMDPs and stochastic games.
△ Less
Submitted 7 March, 2024; v1 submitted 6 December, 2023;
originally announced December 2023.
-
Last-Iterate Convergence Properties of Regret-Matching Algorithms in Games
Authors:
Yang Cai,
Gabriele Farina,
Julien Grand-Clément,
Christian Kroer,
Chung-Wei Lee,
Haipeng Luo,
Weiqiang Zheng
Abstract:
Algorithms based on regret matching, specifically regret matching$^+$ (RM$^+$), and its variants are the most popular approaches for solving large-scale two-player zero-sum games in practice. Unlike algorithms such as optimistic gradient descent ascent, which have strong last-iterate and ergodic convergence properties for zero-sum games, virtually nothing is known about the last-iterate properties…
▽ More
Algorithms based on regret matching, specifically regret matching$^+$ (RM$^+$), and its variants are the most popular approaches for solving large-scale two-player zero-sum games in practice. Unlike algorithms such as optimistic gradient descent ascent, which have strong last-iterate and ergodic convergence properties for zero-sum games, virtually nothing is known about the last-iterate properties of regret-matching algorithms. Given the importance of last-iterate convergence for numerical optimization reasons and relevance as modeling real-word learning in games, in this paper, we study the last-iterate convergence properties of various popular variants of RM$^+$. First, we show numerically that several practical variants such as simultaneous RM$^+$, alternating RM$^+$, and simultaneous predictive RM$^+$, all lack last-iterate convergence guarantees even on a simple $3\times 3$ game. We then prove that recent variants of these algorithms based on a smoothing technique do enjoy last-iterate convergence: we prove that extragradient RM$^{+}$ and smooth Predictive RM$^+$ enjoy asymptotic last-iterate convergence (without a rate) and $1/\sqrt{t}$ best-iterate convergence. Finally, we introduce restarted variants of these algorithms, and show that they enjoy linear-rate last-iterate convergence.
△ Less
Submitted 1 November, 2023;
originally announced November 2023.
-
Regret Matching+: (In)Stability and Fast Convergence in Games
Authors:
Gabriele Farina,
Julien Grand-Clément,
Christian Kroer,
Chung-Wei Lee,
Haipeng Luo
Abstract:
Regret Matching+ (RM+) and its variants are important algorithms for solving large-scale games. However, a theoretical understanding of their success in practice is still a mystery. Moreover, recent advances on fast convergence in games are limited to no-regret algorithms such as online mirror descent, which satisfy stability. In this paper, we first give counterexamples showing that RM+ and its p…
▽ More
Regret Matching+ (RM+) and its variants are important algorithms for solving large-scale games. However, a theoretical understanding of their success in practice is still a mystery. Moreover, recent advances on fast convergence in games are limited to no-regret algorithms such as online mirror descent, which satisfy stability. In this paper, we first give counterexamples showing that RM+ and its predictive version can be unstable, which might cause other players to suffer large regret. We then provide two fixes: restarting and chop** off the positive orthant that RM+ works in. We show that these fixes are sufficient to get $O(T^{1/4})$ individual regret and $O(1)$ social regret in normal-form games via RM+ with predictions. We also apply our stabilizing techniques to clairvoyant updates in the uncoupled learning setting for RM+ and prove desirable results akin to recent works for Clairvoyant online mirror descent. Our experiments show the advantages of our algorithms over vanilla RM+-based algorithms in matrix and extensive-form games.
△ Less
Submitted 24 May, 2023;
originally announced May 2023.
-
Reducing Blackwell and Average Optimality to Discounted MDPs via the Blackwell Discount Factor
Authors:
Julien Grand-Clément,
Marek Petrik
Abstract:
We introduce the Blackwell discount factor for Markov Decision Processes (MDPs). Classical objectives for MDPs include discounted, average, and Blackwell optimality. Many existing approaches to computing average-optimal policies solve for discounted optimal policies with a discount factor close to $1$, but they only work under strong or hard-to-verify assumptions such as ergodicity or weakly commu…
▽ More
We introduce the Blackwell discount factor for Markov Decision Processes (MDPs). Classical objectives for MDPs include discounted, average, and Blackwell optimality. Many existing approaches to computing average-optimal policies solve for discounted optimal policies with a discount factor close to $1$, but they only work under strong or hard-to-verify assumptions such as ergodicity or weakly communicating MDPs. In this paper, we show that when the discount factor is larger than the Blackwell discount factor $γ_{\mathrm{bw}}$, all discounted optimal policies become Blackwell- and average-optimal, and we derive a general upper bound on $γ_{\mathrm{bw}}$. The upper bound on $γ_{\mathrm{bw}}$ provides the first reduction from average and Blackwell optimality to discounted optimality, without any assumptions, and new polynomial-time algorithms for average- and Blackwell-optimal policies. Our work brings new ideas from the study of polynomials and algebraic numbers to the analysis of MDPs. Our results also apply to robust MDPs, enabling the first algorithms to compute robust Blackwell-optimal policies.
△ Less
Submitted 31 January, 2023;
originally announced February 2023.
-
On the convex formulations of robust Markov decision processes
Authors:
Julien Grand-Clément,
Marek Petrik
Abstract:
Robust Markov decision processes (MDPs) are used for applications of dynamic optimization in uncertain environments and have been studied extensively. Many of the main properties and algorithms of MDPs, such as value iteration and policy iteration, extend directly to RMDPs. Surprisingly, there is no known analog of the MDP convex optimization formulation for solving RMDPs. This work describes the…
▽ More
Robust Markov decision processes (MDPs) are used for applications of dynamic optimization in uncertain environments and have been studied extensively. Many of the main properties and algorithms of MDPs, such as value iteration and policy iteration, extend directly to RMDPs. Surprisingly, there is no known analog of the MDP convex optimization formulation for solving RMDPs. This work describes the first convex optimization formulation of RMDPs under the classical sa-rectangularity and s-rectangularity assumptions. By using entropic regularization and exponential change of variables, we derive a convex formulation with a number of variables and constraints polynomial in the number of states and actions, but with large coefficients in the constraints. We further simplify the formulation for RMDPs with polyhedral, ellipsoidal, or entropy-based uncertainty sets, showing that, in these cases, RMDPs can be reformulated as conic programs based on exponential cones, quadratic cones, and non-negative orthants. Our work opens a new research direction for RMDPs and can serve as a first step toward obtaining a tractable convex formulation of RMDPs.
△ Less
Submitted 13 December, 2023; v1 submitted 21 September, 2022;
originally announced September 2022.
-
The Best Decisions Are Not the Best Advice: Making Adherence-Aware Recommendations
Authors:
Julien Grand-Clément,
Jean Pauphilet
Abstract:
Many high-stake decisions follow an expert-in-loop structure in that a human operator receives recommendations from an algorithm but is the ultimate decision maker. Hence, the algorithm's recommendation may differ from the actual decision implemented in practice. However, most algorithmic recommendations are obtained by solving an optimization problem that assumes recommendations will be perfectly…
▽ More
Many high-stake decisions follow an expert-in-loop structure in that a human operator receives recommendations from an algorithm but is the ultimate decision maker. Hence, the algorithm's recommendation may differ from the actual decision implemented in practice. However, most algorithmic recommendations are obtained by solving an optimization problem that assumes recommendations will be perfectly implemented. We propose an adherence-aware optimization framework to capture the dichotomy between the recommended and the implemented policy and analyze the impact of partial adherence on the optimal recommendation. We show that overlooking the partial adherence phenomenon, as is currently being done by most recommendation engines, can lead to arbitrarily severe performance deterioration, compared with both the current human baseline performance and what is expected by the recommendation algorithm. Our framework also provides useful tools to analyze the structure and to compute optimal recommendation policies that are naturally immune against such human deviations, and are guaranteed to improve upon the baseline policy.
△ Less
Submitted 9 December, 2023; v1 submitted 5 September, 2022;
originally announced September 2022.
-
Solving optimization problems with Blackwell approachability
Authors:
Julien Grand-Clément,
Christian Kroer
Abstract:
We introduce the Conic Blackwell Algorithm$^+$ (CBA$^+$) regret minimizer, a new parameter- and scale-free regret minimizer for general convex sets. CBA$^+$ is based on Blackwell approachability and attains $O(\sqrt{T})$ regret. We show how to efficiently instantiate CBA$^+$ for many decision sets of interest, including the simplex, $\ell_{p}$ norm balls, and ellipsoidal confidence regions in the…
▽ More
We introduce the Conic Blackwell Algorithm$^+$ (CBA$^+$) regret minimizer, a new parameter- and scale-free regret minimizer for general convex sets. CBA$^+$ is based on Blackwell approachability and attains $O(\sqrt{T})$ regret. We show how to efficiently instantiate CBA$^+$ for many decision sets of interest, including the simplex, $\ell_{p}$ norm balls, and ellipsoidal confidence regions in the simplex. Based on CBA$^+$, we introduce SP-CBA$^+$, a new parameter-free algorithm for solving convex-concave saddle-point problems, which achieves a $O(1/\sqrt{T})$ ergodic rate of convergence. In our simulations, we demonstrate the wide applicability of SP-CBA$^+$ on several standard saddle-point problems, including matrix games, extensive-form games, distributionally robust logistic regression, and Markov decision processes. In each setting, SP-CBA$^+$ achieves state-of-the-art numerical performance, and outperforms classical methods, without the need for any choice of step sizes or other algorithmic parameters.
△ Less
Submitted 24 February, 2022;
originally announced February 2022.
-
Interpretable Machine Learning for Resource Allocation with Application to Ventilator Triage
Authors:
Julien Grand-Clément,
Carri Chan,
Vineet Goyal,
Elizabeth Chuang
Abstract:
Rationing of healthcare resources is a challenging decision that policy makers and providers may be forced to make during a pandemic, natural disaster, or mass casualty event. Well-defined guidelines to triage scarce life-saving resources must be designed to promote transparency, trust, and consistency. To facilitate buy-in and use during high-stress situations, these guidelines need to be interpr…
▽ More
Rationing of healthcare resources is a challenging decision that policy makers and providers may be forced to make during a pandemic, natural disaster, or mass casualty event. Well-defined guidelines to triage scarce life-saving resources must be designed to promote transparency, trust, and consistency. To facilitate buy-in and use during high-stress situations, these guidelines need to be interpretable and operational. We propose a novel data-driven model to compute interpretable triage guidelines based on policies for Markov Decision Process that can be represented as simple sequences of decision trees ("tree policies"). In particular, we characterize the properties of optimal tree policies and present an algorithm based on dynamic programming recursions to compute good tree policies. We utilize this methodology to obtain simple, novel triage guidelines for ventilator allocations for COVID-19 patients, based on real patient data from Montefiore hospitals. We also compare the performance of our guidelines to the official New York State guidelines that were developed in 2015 (well before the COVID-19 pandemic). Our empirical study shows that the number of excess deaths associated with ventilator shortages could be reduced significantly using our policy. Our work highlights the limitations of the existing official triage guidelines, which need to be adapted specifically to COVID-19 before being successfully deployed.
△ Less
Submitted 21 October, 2021;
originally announced October 2021.
-
Conic Blackwell Algorithm: Parameter-Free Convex-Concave Saddle-Point Solving
Authors:
Julien Grand-Clément,
Christian Kroer
Abstract:
We develop new parameter-free and scale-free algorithms for solving convex-concave saddle-point problems. Our results are based on a new simple regret minimizer, the Conic Blackwell Algorithm$^+$ (CBA$^+$), which attains $O(1/\sqrt{T})$ average regret. Intuitively, our approach generalizes to other decision sets of interest ideas from the Counterfactual Regret minimization (CFR$^+$) algorithm, whi…
▽ More
We develop new parameter-free and scale-free algorithms for solving convex-concave saddle-point problems. Our results are based on a new simple regret minimizer, the Conic Blackwell Algorithm$^+$ (CBA$^+$), which attains $O(1/\sqrt{T})$ average regret. Intuitively, our approach generalizes to other decision sets of interest ideas from the Counterfactual Regret minimization (CFR$^+$) algorithm, which has very strong practical performance for solving sequential games on simplexes. We show how to implement CBA$^+$ for the simplex, $\ell_{p}$ norm balls, and ellipsoidal confidence regions in the simplex, and we present numerical experiments for solving matrix games and distributionally robust optimization problems. Our empirical results show that CBA$^+$ is a simple algorithm that outperforms state-of-the-art methods on synthetic data and real data instances, without the need for any choice of step sizes or other algorithmic parameters.
△ Less
Submitted 14 October, 2021; v1 submitted 27 May, 2021;
originally announced May 2021.
-
From Convex Optimization to MDPs: A Review of First-Order, Second-Order and Quasi-Newton Methods for MDPs
Authors:
Julien Grand-Clément
Abstract:
In this paper we present a review of the connections between classical algorithms for solving Markov Decision Processes (MDPs) and classical gradient-based algorithms in convex optimization. Some of these connections date as far back as the 1980s, but they have gained momentum in recent years and have lead to faster algorithms for solving MDPs. In particular, two of the most popular methods for so…
▽ More
In this paper we present a review of the connections between classical algorithms for solving Markov Decision Processes (MDPs) and classical gradient-based algorithms in convex optimization. Some of these connections date as far back as the 1980s, but they have gained momentum in recent years and have lead to faster algorithms for solving MDPs. In particular, two of the most popular methods for solving MDPs, Value Iteration and Policy Iteration, can be linked to first-order and second-order methods in convex optimization. In addition, recent results in quasi-Newton methods lead to novel algorithms for MDPs, such as Anderson acceleration. By explicitly classifying algorithms for MDPs as first-order, second-order, and quasi-Newton methods, we hope to provide a better understanding of these algorithms, and, further expanding this analogy, to help to develop novel algorithms for MDPs, based on recent advances in convex optimization.
△ Less
Submitted 24 November, 2021; v1 submitted 19 April, 2021;
originally announced April 2021.
-
First-Order Methods for Wasserstein Distributionally Robust MDP
Authors:
Julien Grand-Clément,
Christian Kroer
Abstract:
Markov decision processes (MDPs) are known to be sensitive to parameter specification. Distributionally robust MDPs alleviate this issue by allowing for \emph{ambiguity sets} which give a set of possible distributions over parameter sets. The goal is to find an optimal policy with respect to the worst-case parameter distribution. We propose a framework for solving Distributionally robust MDPs via…
▽ More
Markov decision processes (MDPs) are known to be sensitive to parameter specification. Distributionally robust MDPs alleviate this issue by allowing for \emph{ambiguity sets} which give a set of possible distributions over parameter sets. The goal is to find an optimal policy with respect to the worst-case parameter distribution. We propose a framework for solving Distributionally robust MDPs via first-order methods, and instantiate it for several types of Wasserstein ambiguity sets. By develo** efficient proximal updates, our algorithms achieve a convergence rate of $O\left(NA^{2.5}S^{3.5}\log(S)\log(ε^{-1})ε^{-1.5} \right)$ for the number of kernels $N$ in the support of the nominal distribution, states $S$, and actions $A$; this rate varies slightly based on the Wasserstein setup. Our dependence on $N,A$ and $S$ is significantly better than existing methods, which have a complexity of $O\left(N^{3.5}A^{3.5}S^{4.5}\log^{2}(ε^{-1}) \right)$. Numerical experiments show that our algorithm is significantly more scalable than state-of-the-art approaches across several domains.
△ Less
Submitted 3 May, 2021; v1 submitted 14 September, 2020;
originally announced September 2020.
-
Scalable First-Order Methods for Robust MDPs
Authors:
Julien Grand-Clément,
Christian Kroer
Abstract:
Robust Markov Decision Processes (MDPs) are a powerful framework for modeling sequential decision-making problems with model uncertainty. This paper proposes the first first-order framework for solving robust MDPs. Our algorithm interleaves primal-dual first-order updates with approximate Value Iteration updates. By carefully controlling the tradeoff between the accuracy and cost of Value Iteratio…
▽ More
Robust Markov Decision Processes (MDPs) are a powerful framework for modeling sequential decision-making problems with model uncertainty. This paper proposes the first first-order framework for solving robust MDPs. Our algorithm interleaves primal-dual first-order updates with approximate Value Iteration updates. By carefully controlling the tradeoff between the accuracy and cost of Value Iteration updates, we achieve an ergodic convergence rate of $O \left( A^{2} S^{3}\log(S)\log(ε^{-1}) ε^{-1} \right)$ for the best choice of parameters on ellipsoidal and Kullback-Leibler $s$-rectangular uncertainty sets, where $S$ and $A$ is the number of states and actions, respectively. Our dependence on the number of states and actions is significantly better (by a factor of $O(A^{1.5}S^{1.5})$) than that of pure Value Iteration algorithms. In numerical experiments on ellipsoidal uncertainty sets we show that our algorithm is significantly more scalable than state-of-the-art approaches. Our framework is also the first one to solve robust MDPs with $s$-rectangular KL uncertainty sets.
△ Less
Submitted 14 January, 2021; v1 submitted 11 May, 2020;
originally announced May 2020.
-
Robust Policies For Proactive ICU Transfers
Authors:
Julien Grand-Clement,
Carri W. Chan,
Vineet Goyal,
Gabriel Escobar
Abstract:
Patients whose transfer to the Intensive Care Unit (ICU) is unplanned are prone to higher mortality rates than those who were admitted directly to the ICU. Recent advances in machine learning to predict patient deterioration have introduced the possibility of \emph{proactive transfer} from the ward to the ICU. In this work, we study the problem of finding \emph{robust} patient transfer policies wh…
▽ More
Patients whose transfer to the Intensive Care Unit (ICU) is unplanned are prone to higher mortality rates than those who were admitted directly to the ICU. Recent advances in machine learning to predict patient deterioration have introduced the possibility of \emph{proactive transfer} from the ward to the ICU. In this work, we study the problem of finding \emph{robust} patient transfer policies which account for uncertainty in statistical estimates due to data limitations when optimizing to improve overall patient care. We propose a Markov Decision Process model to capture the evolution of patient health, where the states represent a measure of patient severity. Under fairly general assumptions, we show that an optimal transfer policy has a threshold structure, i.e., that it transfers all patients above a certain severity level to the ICU (subject to available capacity). As model parameters are typically determined based on statistical estimations from real-world data, they are inherently subject to misspecification and estimation errors. We account for this parameter uncertainty by deriving a robust policy that optimizes the worst-case reward across all plausible values of the model parameters. We show that the robust policy also has a threshold structure under fairly general assumptions. Moreover, it is more aggressive in transferring patients than the optimal nominal policy, which does not take into account parameter uncertainty. We present computational experiments using a dataset of hospitalizations at 21 KNPC hospitals, and present empirical evidence of the sensitivity of various hospital metrics (mortality, length-of-stay, average ICU occupancy) to small changes in the parameters. Our work provides useful insights into the impact of parameter uncertainty on deriving simple policies for proactive ICU transfer that have strong empirical performance and theoretical guarantees.
△ Less
Submitted 22 January, 2021; v1 submitted 14 February, 2020;
originally announced February 2020.
-
A First-Order Approach To Accelerated Value Iteration
Authors:
Vineet Goyal,
Julien Grand-Clement
Abstract:
Markov decision processes (MDPs) are used to model stochastic systems in many applications. Several efficient algorithms to compute optimal policies have been studied in the literature, including value iteration (VI) and policy iteration. However, these do not scale well especially when the discount factor for the infinite horizon discounted reward, $λ$, gets close to $1$. In particular, the runni…
▽ More
Markov decision processes (MDPs) are used to model stochastic systems in many applications. Several efficient algorithms to compute optimal policies have been studied in the literature, including value iteration (VI) and policy iteration. However, these do not scale well especially when the discount factor for the infinite horizon discounted reward, $λ$, gets close to $1$. In particular, the running time scales as $O \left( 1/(1-λ) \right)$ for these algorithms. In this paper, our goal is to design new algorithms that scale better than previous approaches when $λ$ approaches $1$. Our main contribution is to present a connection between VI and gradient descent and adapt the ideas of acceleration and momentum in convex optimization to design faster algorithms for MDPs. We prove theoretical guarantees of a faster convergence of our algorithms for the computation of the value function of a policy, where the running times of our algorithms scale as $O \left( 1/\sqrt{1-λ} \right)$ for reversible MDP instances. The improvement is quite analogous to Nesterov's acceleration and momentum in convex optimization. We also provide a lower bound on the convergence properties of any first-order algorithm for solving MDPs, presenting a family of MDPs instances for which no algorithm can converge faster than VI when the number of iterations is smaller than the number of states. We introduce a Safe Accelerated Value Iteration (S-AVI), which alternates between accelerated updates and value iteration updates. Our algorithm S-AVI is worst-case optimal and retains the theoretical convergence properties of VI while exhibiting strong empirical performances, providing significant speedups compared to classical approaches (up to one order of magnitude in many cases) for a large test bed of MDP instances.
△ Less
Submitted 27 August, 2021; v1 submitted 23 May, 2019;
originally announced May 2019.
-
The operator approach to entropy games
Authors:
Marianne Akian,
Stéphane Gaubert,
Julien Grand-Clément,
Jérémie Guillaud
Abstract:
Entropy games and matrix multiplication games have been recently introduced by Asarin et al. They model the situation in which one player (Despot) wishes to minimize the growth rate of a matrix product, whereas the other player (Tribune) wishes to maximize it. We develop an operator approach to entropy games. This allows us to show that entropy games can be cast as stochastic mean payoff games in…
▽ More
Entropy games and matrix multiplication games have been recently introduced by Asarin et al. They model the situation in which one player (Despot) wishes to minimize the growth rate of a matrix product, whereas the other player (Tribune) wishes to maximize it. We develop an operator approach to entropy games. This allows us to show that entropy games can be cast as stochastic mean payoff games in which some action spaces are simplices and payments are given by a relative entropy (Kullback-Leibler divergence). In this way, we show that entropy games with a fixed number of states belonging to Despot can be solved in polynomial time. This approach also allows us to solve these games by a policy iteration algorithm, which we compare with the spectral simplex algorithm developed by Protasov.
△ Less
Submitted 10 April, 2019;
originally announced April 2019.
-
Robust Markov Decision Process: Beyond Rectangularity
Authors:
Vineet Goyal,
Julien Grand-Clément
Abstract:
We consider a robust approach to address uncertainty in model parameters in Markov Decision Processes (MDPs), which are widely used to model dynamic optimization in many applications. Most prior works consider the case where the uncertainty on transitions related to different states is uncoupled and the adversary is allowed to select the worst possible realization for each state unrelated to other…
▽ More
We consider a robust approach to address uncertainty in model parameters in Markov Decision Processes (MDPs), which are widely used to model dynamic optimization in many applications. Most prior works consider the case where the uncertainty on transitions related to different states is uncoupled and the adversary is allowed to select the worst possible realization for each state unrelated to others, potentially leading to highly conservative solutions. On the other hand, the case of general uncertainty sets is known to be intractable. We consider a factor model for probability transitions where the transition probability is a linear function of a factor matrix that is uncertain and belongs to a factor matrix uncertainty set. This is a fairly general model of uncertainty in probability transitions, allowing the decision maker to model dependence between probability transitions across different states and it is significantly less conservative than prior approaches. We show that under an underlying rectangularity assumption, we can efficiently compute an optimal robust policy under the factor matrix uncertainty model. Furthermore, we show that there is an optimal robust policy that is deterministic, which is of interest from an interpretability standpoint. We also introduce the robust counterpart of important structural results of classical MDPs, including the maximum principle and Blackwell optimality, and we provide a computational study to demonstrate the effectiveness of our approach in mitigating the conservativeness of robust policies.
△ Less
Submitted 1 September, 2021; v1 submitted 1 November, 2018;
originally announced November 2018.