-
Tree Search-Based Policy Optimization under Stochastic Execution Delay
Authors:
David Valensi,
Esther Derman,
Shie Mannor,
Gal Dalal
Abstract:
The standard formulation of Markov decision processes (MDPs) assumes that the agent's decisions are executed immediately. However, in numerous realistic applications such as robotics or healthcare, actions are performed with a delay whose value can even be stochastic. In this work, we introduce stochastic delayed execution MDPs, a new formalism addressing random delays without resorting to state augmentation. We show that given observed delay values, it is sufficient to perform a policy search in the class of Markov policies in order to reach optimal performance, thus extending the deterministic fixed delay case. Armed with this insight, we devise DEZ, a model-based algorithm that optimizes over the class of Markov policies. DEZ leverages Monte-Carlo tree search similar to its non-delayed variant EfficientZero to accurately infer future states from the action queue. Thus, it handles delayed execution while preserving the sample efficiency of EfficientZero. Through a series of experiments on the Atari suite, we demonstrate that although the previous baseline outperforms the naive method in scenarios with constant delay, it underperforms in the face of stochastic delays. In contrast, our approach significantly outperforms the baselines, for both constant and stochastic delays. The code is available at http://github.com/davidva1/Delayed-EZ .
Submitted 8 April, 2024;
originally announced April 2024.
-
Solving Non-Rectangular Reward-Robust MDPs via Frequency Regularization
Authors:
Uri Gadot,
Esther Derman,
Navdeep Kumar,
Maxence Mohamed Elfatihi,
Kfir Levy,
Shie Mannor
Abstract:
In robust Markov decision processes (RMDPs), it is assumed that the reward and the transition dynamics lie in a given uncertainty set. By targeting maximal return under the most adversarial model from that set, RMDPs address performance sensitivity to misspecified environments. Yet, to preserve computational tractability, the uncertainty set is traditionally independently structured for each state. This so-called rectangularity condition is solely motivated by computational concerns. As a result, it lacks a practical incentive and may lead to overly conservative behavior. In this work, we study coupled reward RMDPs where the transition kernel is fixed, but the reward function lies within an $α$-radius from a nominal one. We draw a direct connection between this type of non-rectangular reward-RMDPs and applying policy visitation frequency regularization. We introduce a policy-gradient method and prove its convergence. Numerical experiments illustrate the learned policy's robustness and its less conservative behavior when compared to rectangular uncertainty.
Submitted 12 February, 2024; v1 submitted 3 September, 2023;
originally announced September 2023.
-
Twice Regularized Markov Decision Processes: The Equivalence between Robustness and Regularization
Authors:
Esther Derman,
Yevgeniy Men,
Matthieu Geist,
Shie Mannor
Abstract:
Robust Markov decision processes (MDPs) aim to handle changing or partially known system dynamics. To solve them, one typically resorts to robust optimization methods. However, this significantly increases computational complexity and limits scalability in both learning and planning. On the other hand, regularized MDPs show more stability in policy learning without impairing time complexity. Yet, they generally do not encompass uncertainty in the model dynamics. In this work, we aim to learn robust MDPs using regularization. We first show that regularized MDPs are a particular instance of robust MDPs with uncertain reward. We thus establish that policy iteration on reward-robust MDPs can have the same time complexity as on regularized MDPs. We further extend this relationship to MDPs with uncertain transitions: this leads to a regularization term with an additional dependence on the value function. We then generalize regularized MDPs to twice regularized MDPs ($\text{R}^2$ MDPs), i.e., MDPs with $\textit{both}$ value and policy regularization. The corresponding Bellman operators enable us to derive planning and learning schemes with convergence and generalization guarantees, thus reducing robustness to regularization. We numerically show this two-fold advantage on tabular and physical domains, highlighting the fact that $\text{R}^2$ preserves its efficacy in continuous environments.
Submitted 12 March, 2023;
originally announced March 2023.
-
Policy Gradient for Rectangular Robust Markov Decision Processes
Authors:
Navdeep Kumar,
Esther Derman,
Matthieu Geist,
Kfir Levy,
Shie Mannor
Abstract:
Policy gradient methods have become a standard for training reinforcement learning agents in a scalable and efficient manner. However, they do not account for transition uncertainty, whereas learning robust policies can be computationally expensive. In this paper, we introduce robust policy gradient (RPG), a policy-based method that efficiently solves rectangular robust Markov decision processes (MDPs). We provide a closed-form expression for the worst occupation measure. Incidentally, we find that the worst kernel is a rank-one perturbation of the nominal. Combining the worst occupation measure with a robust Q-value estimation yields an explicit form of the robust gradient. Our resulting RPG can be estimated from data with the same time complexity as its non-robust equivalent. Hence, it relieves the computational burden of convex optimization problems required for training robust policies by current policy gradient approaches.
Submitted 10 December, 2023; v1 submitted 31 January, 2023;
originally announced January 2023.
-
Twice regularized MDPs and the equivalence between robustness and regularization
Authors:
Esther Derman,
Matthieu Geist,
Shie Mannor
Abstract:
Robust Markov decision processes (MDPs) aim to handle changing or partially known system dynamics. To solve them, one typically resorts to robust optimization methods. However, this significantly increases computational complexity and limits scalability in both learning and planning. On the other hand, regularized MDPs show more stability in policy learning without impairing time complexity. Yet, they generally do not encompass uncertainty in the model dynamics. In this work, we aim to learn robust MDPs using regularization. We first show that regularized MDPs are a particular instance of robust MDPs with uncertain reward. We thus establish that policy iteration on reward-robust MDPs can have the same time complexity as on regularized MDPs. We further extend this relationship to MDPs with uncertain transitions: this leads to a regularization term with an additional dependence on the value function. We finally generalize regularized MDPs to twice regularized MDPs (R${}^2$ MDPs), i.e., MDPs with $\textit{both}$ value and policy regularization. The corresponding Bellman operators enable developing policy iteration schemes with convergence and robustness guarantees. It also reduces planning and learning in robust MDPs to regularized MDPs.
Submitted 12 October, 2021;
originally announced October 2021.
-
Dataset Bias Mitigation Through Analysis of CNN Training Scores
Authors:
Ekberjan Derman
Abstract:
Training datasets are crucial for convolutional neural network-based algorithms, as they directly impact overall performance. As such, using a well-structured dataset that has a minimal level of bias is always desirable. In this paper, we propose a novel, domain-independent approach, called score-based resampling (SBR), to locate the under-represented samples of the original training dataset based on the model prediction scores obtained with that training set. In our method, once trained, we use the same CNN model to infer on its own training samples, obtain prediction scores, and, based on the distance between the predicted and ground-truth values, identify samples that are far away from their ground-truth and augment them in the original training set. The temperature term of the Sigmoid function is decreased to better differentiate scores. For experimental evaluation, we selected one Kaggle dataset for gender classification. We first used a CNN-based classifier with a relatively standard structure, trained on the training images, and evaluated on the provided validation samples of the original dataset. Then, we assessed it on a totally new test dataset consisting of light male, light female, dark male, and dark female groups. The obtained accuracies varied, revealing the existence of categorical bias against certain groups in the original dataset. Subsequently, we trained the model after resampling based on our proposed approach. We compared our method with a previously proposed variational autoencoder (VAE) based algorithm. The obtained results confirmed the validity of our proposed method regarding identifying under-represented samples in the original dataset to decrease the categorical bias of classifying certain groups. Although tested for gender classification, the proposed algorithm can be used for investigating the dataset structure of any CNN-based task.
Submitted 28 June, 2021;
originally announced June 2021.
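The resampling step described in the abstract above can be illustrated with a minimal sketch. It is not the author's implementation: the function name `score_based_resample`, the fixed distance `threshold`, and the choice to augment by plain duplication are all assumptions made for illustration, and the scores below are made-up numbers.

```python
def score_based_resample(samples, labels, scores, threshold=0.5, copies=1):
    """Append extra copies of samples whose score is far from its ground truth.

    `threshold` and `copies` are hypothetical parameters; duplication stands in
    for whatever augmentation the original method applies.
    """
    augmented_samples, augmented_labels = list(samples), list(labels)
    for sample, label, score in zip(samples, labels, scores):
        # Distance between the predicted score and the ground-truth label.
        if abs(score - label) > threshold:
            augmented_samples.extend([sample] * copies)
            augmented_labels.extend([label] * copies)
    return augmented_samples, augmented_labels


# Toy binary task: samples "c" and "d" are scored far from their labels.
samples = ["a", "b", "c", "d"]
labels = [1, 0, 1, 0]
scores = [0.95, 0.10, 0.30, 0.80]
new_samples, new_labels = score_based_resample(samples, labels, scores)
```

In this toy run only the two poorly scored samples are duplicated, which is the sense in which under-represented samples gain weight in the next training round.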
-
Acting in Delayed Environments with Non-Stationary Markov Policies
Authors:
Esther Derman,
Gal Dalal,
Shie Mannor
Abstract:
The standard Markov Decision Process (MDP) formulation hinges on the assumption that an action is executed immediately after it was chosen. However, this assumption is often unrealistic and can lead to catastrophic failures in applications such as robotic manipulation, cloud computing, and finance. We introduce a framework for learning and planning in MDPs where the decision-maker commits actions that are executed with a delay of $m$ steps. The brute-force state augmentation baseline where the state is concatenated to the last $m$ committed actions suffers from an exponential complexity in $m$, as we show for policy iteration. We then prove that with execution delay, deterministic Markov policies in the original state-space are sufficient for attaining maximal reward, but need to be non-stationary. As for stationary Markov policies, we show they are sub-optimal in general. Consequently, we devise a non-stationary Q-learning style model-based algorithm that solves delayed execution tasks without resorting to state-augmentation. Experiments on tabular, physical, and Atari domains reveal that it converges quickly to high performance even for substantial delays, while standard approaches that either ignore the delay or rely on state-augmentation struggle or fail due to divergence. The code is available at github.com/galdl/rl_delay_basic and github.com/galdl/rl_delay_atari.
Submitted 12 December, 2023; v1 submitted 28 January, 2021;
originally announced January 2021.
-
Distributional Robustness and Regularization in Reinforcement Learning
Authors:
Esther Derman,
Shie Mannor
Abstract:
Distributionally Robust Optimization (DRO) has made it possible to prove the equivalence between robustness and regularization in classification and regression, thus providing an analytical reason why regularization generalizes well in statistical learning. Although DRO's extension to sequential decision-making overcomes $\textit{external uncertainty}$ through the robust Markov Decision Process (MDP) setting, the resulting formulation is hard to solve, especially on large domains. On the other hand, existing regularization methods in reinforcement learning only address $\textit{internal uncertainty}$ due to stochasticity. Our study aims to facilitate robust reinforcement learning by establishing a dual relation between robust MDPs and regularization. We introduce Wasserstein distributionally robust MDPs and prove that they hold out-of-sample performance guarantees. Then, we introduce a new regularizer for empirical value functions and show that it lower bounds the Wasserstein distributionally robust value function. We extend the result to linear value function approximation for large state spaces. Our approach provides an alternative formulation of robustness with guaranteed finite-sample performance. Moreover, it suggests using regularization as a practical tool for dealing with $\textit{external uncertainty}$ in reinforcement learning methods.
Submitted 14 July, 2020; v1 submitted 5 March, 2020;
originally announced March 2020.
-
A Bayesian Approach to Robust Reinforcement Learning
Authors:
Esther Derman,
Daniel Mankowitz,
Timothy Mann,
Shie Mannor
Abstract:
Robust Markov Decision Processes (RMDPs) intend to ensure robustness with respect to changing or adversarial system behavior. In this framework, transitions are modeled as arbitrary elements of a known and properly structured uncertainty set and a robust optimal policy can be derived under the worst-case scenario. In this study, we address the issue of learning in RMDPs using a Bayesian approach. We introduce the Uncertainty Robust Bellman Equation (URBE) which encourages safe exploration for adapting the uncertainty set to new observations while preserving robustness. We propose a URBE-based algorithm, DQN-URBE, that scales this method to higher dimensional domains. Our experiments show that the derived URBE-based strategy leads to a better trade-off between less conservative solutions and robustness in the presence of model misspecification. In addition, we show that the DQN-URBE algorithm can adapt significantly faster to changing dynamics online compared to existing robust techniques with fixed uncertainty sets.
Submitted 23 July, 2019; v1 submitted 20 May, 2019;
originally announced May 2019.
-
Soft-Robust Actor-Critic Policy-Gradient
Authors:
Esther Derman,
Daniel J. Mankowitz,
Timothy A. Mann,
Shie Mannor
Abstract:
Robust Reinforcement Learning aims to derive optimal behavior that accounts for model uncertainty in dynamical systems. However, previous studies have shown that by considering the worst case scenario, robust policies can be overly conservative. Our soft-robust framework is an attempt to overcome this issue. In this paper, we present a novel Soft-Robust Actor-Critic algorithm (SR-AC). It learns an optimal policy with respect to a distribution over an uncertainty set and stays robust to model uncertainty but avoids the conservativeness of robust strategies. We show the convergence of SR-AC and test the efficiency of our approach on different domains by comparing it against regular learning methods and their robust formulations.
Submitted 24 October, 2018; v1 submitted 11 March, 2018;
originally announced March 2018.
-
Clustering and Model Selection via Penalized Likelihood for Different-sized Categorical Data Vectors
Authors:
Esther Derman,
Erwan Le Pennec
Abstract:
In this study, we consider unsupervised clustering of categorical vectors of possibly different sizes using mixture models. We use likelihood maximization to estimate the parameters of the underlying mixture model and a penalization technique to select the number of mixture components. Regardless of the true distribution that generated the data, we show that an explicit penalty, known up to a multiplicative constant, leads to a non-asymptotic oracle inequality with the Kullback-Leibler divergence on the two sides of the inequality. This theoretical result is illustrated by a document clustering application. To this end, a novel robust expectation-maximization algorithm is proposed to estimate the mixture parameters that best represent the different topics. Slope heuristics are used to calibrate the penalty and to select a number of clusters.
Submitted 7 September, 2017;
originally announced September 2017.
-
New transit observations for HAT-P-30 b, HAT-P-37 b, TrES-5 b, WASP-28 b, WASP-36 b, and WASP-39 b
Authors:
G. Maciejewski,
D. Dimitrov,
L. Mancini,
J. Southworth,
S. Ciceri,
G. D'Ago,
I. Bruni,
St. Raetz,
G. Nowak,
J. Ohlert,
D. Puchalski,
G. Saral,
E. Derman,
R. Petrucci,
E. Jofre,
M. Seeliger,
T. Henning
Abstract:
We present new transit light curves for planets in six extrasolar planetary systems. They were acquired with 0.4-2.2 m telescopes located in west Asia, Europe, and South America. When combined with literature data, they allowed us to redetermine system parameters in a homogeneous way. Our results for individual systems are in agreement with values reported in previous studies. We refined transit ephemerides and reduced uncertainties of orbital periods by a factor between 2 and 7. No sign of any variations in transit times was detected for the planets studied.
Submitted 10 March, 2016;
originally announced March 2016.
-
Transit Timing Analysis in the HAT-P-32 system
Authors:
M. Seeliger,
D. Dimitrov,
D. Kjurkchieva,
M. Mallonn,
M. Fernandez,
M. Kitze,
V. Casanova,
G. Maciejewski,
J. M. Ohlert,
J. G. Schmidt,
A. Pannicke,
D. Puchalski,
E. Göğüş,
T. Güver,
S. Bilir,
T. Ak,
M. M. Hohle,
T. O. B. Schmidt,
R. Errmann,
E. Jensen,
D. Cohen,
L. Marschall,
G. Saral,
I. Bernt,
E. Derman, et al. (2 additional authors not shown)
Abstract:
We present the results of 45 transit observations obtained for the transiting exoplanet HAT-P-32b. The transits have been observed using several telescopes mainly throughout the YETI network. In 25 cases, complete transit light curves with a timing precision better than $1.4\:$min have been obtained. These light curves have been used to refine the system properties, namely inclination $i$, planet-to-star radius ratio $R_\textrm{p}/R_\textrm{s}$, and the ratio between the semimajor axis and the stellar radius $a/R_\textrm{s}$. First analyses by Hartman et al. (2011) suggest the existence of a second planet in the system, thus we tried to find an additional body using the transit timing variation (TTV) technique. Taking also literature data points into account, we can explain all mid-transit times by refining the linear ephemeris by 21 ms. Thus we can exclude TTV amplitudes of more than $\sim1.5\:$min.
Submitted 25 April, 2014;
originally announced April 2014.
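The transit-timing abstract above refers to refining a linear ephemeris and looking for timing variations. A linear ephemeris predicts the mid-transit time of epoch $n$ as $T_n = T_0 + nP$, and TTVs appear as structure in the observed-minus-computed (O-C) residuals. The sketch below assumes an unweighted least-squares fit and uses synthetic numbers, not data from the paper; the function names are hypothetical.

```python
def fit_linear_ephemeris(epochs, times):
    """Unweighted least-squares fit of T_n = t0 + period * n."""
    n = len(epochs)
    mean_e = sum(epochs) / n
    mean_t = sum(times) / n
    sxx = sum((e - mean_e) ** 2 for e in epochs)
    sxy = sum((e - mean_e) * (t - mean_t) for e, t in zip(epochs, times))
    period = sxy / sxx
    t0 = mean_t - period * mean_e
    return t0, period


def o_minus_c(epochs, times, t0, period):
    """Observed minus computed mid-transit times for a linear ephemeris."""
    return [t - (t0 + period * e) for e, t in zip(epochs, times)]


# Synthetic mid-transit times lying exactly on a line (arbitrary time units).
epochs = [0, 1, 2, 3]
times = [100.0, 102.5, 105.0, 107.5]
t0, period = fit_linear_ephemeris(epochs, times)
residuals = o_minus_c(epochs, times, t0, period)
```

Real analyses usually weight each transit by its timing uncertainty; the unweighted fit is used here only to keep the example short.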
-
The Perception of Time, Risk and Return During Periods of Speculation
Authors:
Emanuel Derman
Abstract:
What return should you expect when you take on a given amount of risk? How should that return depend upon other people's behavior? What principles can you use to answer these questions? In this paper, we approach these topics by exploring the consequences of two simple hypotheses about risk.
The first is a common-sense invariance principle: assets with the same perceived risk must have the same expected return. The second hypothesis concerns the perception of time. We conjecture that in times of speculative excitement, short-term investors may instinctively imagine stock prices to be evolving in a time measure different from that of calendar time. They may instead perceive and experience the risk and return of a stock in intrinsic time, a dimensionless time scale that counts the number of trading opportunities that occur.
The most noteworthy result is that, in the short-term, a stock's trading frequency affects its expected return. We show that short-term stock speculators will expect returns proportional to the temperature of a stock, where temperature is defined as the product of the stock's traditional volatility and the square root of its trading frequency. We hope that this model will have some relevance to the behavior of investors expecting inordinate returns in highly speculative markets.
Submitted 18 January, 2002;
originally announced January 2002.
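The abstract above defines a stock's temperature as the product of its traditional volatility and the square root of its trading frequency. A minimal sketch of that definition follows. The function name, argument units, and example values are assumptions for illustration.

```python
import math


def temperature(volatility, trading_frequency):
    """Temperature = volatility * sqrt(trading frequency), as defined in the abstract."""
    return volatility * math.sqrt(trading_frequency)


# Two hypothetical stocks with equal volatility; the second trades 4x as often.
calm = temperature(0.2, 1.0)
busy = temperature(0.2, 4.0)
ratio = busy / calm
```

Because of the square root, quadrupling the trading frequency doubles the temperature. Under the abstract's claim, it also doubles the return that short-term speculators expect.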