-
Achieving Tractable Minimax Optimal Regret in Average Reward MDPs
Authors:
Victor Boone,
Zihan Zhang
Abstract:
In recent years, significant attention has been directed towards learning average-reward Markov Decision Processes (MDPs). However, existing algorithms either suffer from sub-optimal regret guarantees or computational inefficiencies. In this paper, we present the first tractable algorithm with minimax optimal regret of $\widetilde{\mathrm{O}}(\sqrt{\mathrm{sp}(h^*) S A T})$, where $\mathrm{sp}(h^*)$ is the span of the optimal bias function $h^*$, $S \times A$ is the size of the state-action space and $T$ the number of learning steps. Remarkably, our algorithm does not require prior information on $\mathrm{sp}(h^*)$. Our algorithm relies on a novel subroutine, Projected Mitigated Extended Value Iteration (PMEVI), to compute bias-constrained optimal policies efficiently. This subroutine can be applied to various previous algorithms to improve regret bounds.
Submitted 3 June, 2024;
originally announced June 2024.
-
The Sliding Regret in Stochastic Bandits: Discriminating Index and Randomized Policies
Authors:
Victor Boone
Abstract:
This paper studies the one-shot behavior of no-regret algorithms for stochastic bandits. Although many algorithms are known to be asymptotically optimal with respect to the expected regret, over a single run, their pseudo-regret seems to follow one of two tendencies: it is either smooth or bumpy. To measure this tendency, we introduce a new notion, the sliding regret, which measures the worst pseudo-regret over a time window of fixed length sliding to infinity. We show that randomized methods (e.g. Thompson Sampling and MED) have optimal sliding regret, while index policies (e.g. UCB, UCB-V, KL-UCB, MOSS, IMED), although possibly asymptotically optimal for the expected regret, have the worst possible sliding regret under regularity conditions on their index. We further analyze the average bumpiness of the pseudo-regret of index policies via the regret of exploration, which we show to be suboptimal as well.
Submitted 30 November, 2023;
originally announced November 2023.
-
When do discounted-optimal policies also optimize the gain?
Authors:
Victor Boone
Abstract:
In this technical note, we establish an upper bound on the discount-factor threshold beyond which all discounted-optimal deterministic policies are gain-optimal, and we prove this bound to be tight on an example. To address the computability issues of this theoretical threshold, we provide a weaker bound that can be computed in polynomial time on ergodic MDPs.
Submitted 17 April, 2023;
originally announced April 2023.
-
Towards model predictive control of supercritical CO2 cycles
Authors:
Viv Bone,
Michael Kearney,
Ingo Jahn
Abstract:
Control of non-condensing non-ideal-gas power cycles is challenging because their output power dynamics depend on complex system interactions, non-ideal-gas effects complicate turbomachinery behavior, and state constraints must be respected. This article presents a control methodology for these systems, comprising a control modeling approach and a model predictive control (MPC) strategy. This methodology is demonstrated on the high-pressure side of a simple supercritical CO2 cycle power block, composed of a variable-speed compressor, heat exchanger, and fixed-speed turbine. The control model is developed by applying timescale-separation arguments to a high-fidelity simulation model and locally linearizing non-ideal-gas turbomachinery performance maps. MPC is implemented by linearizing the control model online at each sampling instant. Closed-loop simulations with a full-order gas-dynamics truth model demonstrate the effectiveness of the proposed control methodology. In response to load changes, the controller maintains high turbine inlet temperatures while achieving net power output ramp rates in excess of 100% of nameplate output per minute. The controller often acts at the intersection of motor torque, compressor surge, and turbine inlet temperature constraints, and performs well from 35% to 105% of nameplate capacity with no parameter scheduling. Good performance and fast update rates are obtained via online linearization. The results demonstrate the suitability of MPC for the supercritical CO2 cycle, and the proposed methodology is extensible to more complex cycle variants such as the recuperated and recompression cycles.
Submitted 27 August, 2021;
originally announced August 2021.