-
Infinite-Horizon Offline Reinforcement Learning with Linear Function Approximation: Curse of Dimensionality and Algorithm
Authors:
Lin Chen,
Bruno Scherrer,
Peter L. Bartlett
Abstract:
In this paper, we investigate the sample complexity of policy evaluation in infinite-horizon offline reinforcement learning (also known as the off-policy evaluation problem) with linear function approximation. We identify a hard regime $dγ^{2}>1$, where $d$ is the dimension of the feature vector and $γ$ is the discount rate. In this regime, for any $q\in[γ^{2},1]$, we can construct a hard instance such that the smallest eigenvalue of its feature covariance matrix is $q/d$ and it requires $Ω\left(\frac{d}{γ^{2}\left(q-γ^{2}\right)\varepsilon^{2}}\exp\left(Θ\left(dγ^{2}\right)\right)\right)$ samples to approximate the value function up to an additive error $\varepsilon$. Note that the lower bound of the sample complexity is exponential in $d$. If $q=γ^{2}$, even infinite data cannot suffice. Under the low distribution shift assumption, we show that there is an algorithm that needs at most $O\left(\max\left\{ \frac{\left\Vert θ^π\right\Vert _{2}^{4}}{\varepsilon^{4}}\log\frac{d}δ,\frac{1}{\varepsilon^{2}}\left(d+\log\frac{1}δ\right)\right\} \right)$ samples ($θ^π$ is the parameter of the policy in linear function approximation) and guarantees approximation to the value function up to an additive error of $\varepsilon$ with probability at least $1-δ$.
Submitted 17 March, 2021;
originally announced March 2021.
-
Film flip and transfer process to enhance light harvesting in ultrathin absorber films on specular back-reflectors
Authors:
Asaf Kay,
Barbara Scherrer,
Yifat Piekner,
Kirtiman Deo Malviya,
Daniel A Grave,
Hen Dotan,
Avner Rothschild
Abstract:
Optical interference is used to enhance light-matter interaction and harvest broadband light in ultrathin semiconductor absorber films on specular back-reflectors. However, the high-temperature processing in oxygen atmosphere required for oxide absorbers often degrades metallic back-reflectors and their specular reflectance. In order to overcome this problem, we present a newly developed film flip and transfer process that allows for high-temperature processing without degradation of the metallic back-reflector and without the need of passivation interlayers. The film flip and transfer process improves the performance of photoanodes for photoelectrochemical water splitting comprising ultrathin (< 20 nm) hematite (Fe2O3) films on silver-gold alloy (90 at% Ag-10 at% Au) back-reflectors. We obtain specular back-reflectors with high reflectance below hematite films, which is necessary for maximizing the productive light absorption in the hematite film and minimizing non-productive absorption in the back-reflector. Furthermore, the film flip and transfer process opens up a new route to attach thin film stacks onto a wide range of substrates including flexible or temperature sensitive materials.
Submitted 8 December, 2020;
originally announced December 2020.
-
Defect segregation and its effect on the photoelectrochemical properties of Ti-doped hematite photoanodes for solar water splitting
Authors:
Barbara Scherrer,
Tong Li,
Anton Tsyganok,
Max Döbeli,
Bhavana Gupta,
Kirtiman Deo Malviya,
Olga Kasian,
Nitzan Maman,
Baptiste Gault,
Daniel A. Grave,
Alexander Mehlman,
Iris Visoly-Fisher,
Dierk Raabe,
Avner Rothschild
Abstract:
Optimising the photoelectrochemical performance of hematite photoanodes for solar water splitting requires better understanding of the relationships between dopant distribution, structural defects and photoelectrochemical properties. Here, we use complementary characterisation techniques including electron microscopy, conductive atomic force microscopy (CAFM), Rutherford backscattering spectroscopy (RBS), atom probe tomography (APT) and intensity modulated photocurrent spectroscopy (IMPS) to study this correlation in Ti-doped (1 cat.%) hematite films deposited by pulsed laser deposition (PLD) on F:SnO2 (FTO) coated glass substrates. The deposition was carried out at 300 °C, followed by annealing at 500 °C for 2 h. Upon annealing, Ti was observed by APT to segregate to the hematite/FTO interface and into some hematite grains. Since no other pronounced changes in microstructure and chemical composition were observed by electron microscopy and RBS after annealing, the non-uniform Ti redistribution seems to be the reason for a reduced interfacial recombination in the annealed films, as observed by IMPS. This results in a lower onset potential, higher photocurrent and larger fill factor with respect to the as-deposited state. This work provides atomic-scale insights into the microscopic inhomogeneity in Ti-doped hematite thin films and the role of defect segregation in their electrical and photoelectrochemical properties.
Submitted 8 December, 2020;
originally announced December 2020.
-
Magic DIAMOND: Multi-Fascicle Diffusion Compartment Imaging with Tensor Distribution Modeling and Tensor-Valued Diffusion Encoding
Authors:
A. Reymbaut,
A. Valcourt Caron,
G. Gilbert,
F. Szczepankiewicz,
M. Nilsson,
S. K. Warfield,
M. Descoteaux,
B. Scherrer
Abstract:
Diffusion tensor imaging provides increased sensitivity to microstructural tissue changes compared to conventional anatomical imaging but also presents limited specificity. To tackle this problem, the DIAMOND model subdivides the voxel content into diffusion compartments and draws from diffusion-weighted data to estimate compartmental non-central matrix-variate Gamma distribution of diffusion tensors, thereby resolving crossing fascicles while accounting for their respective heterogeneity. Alternatively, tensor-valued diffusion encoding defines new acquisition schemes tagging specific features of the intra-voxel diffusion tensor distribution directly from the outcome of the measurement. However, the impact of such schemes on estimating brain microstructural features has only been studied in a handful of parametric single-fascicle models. In this work, we derive a general Laplace transform for the non-central matrix-variate Gamma distribution, which enables the extension of DIAMOND to tensor-valued encoded data. We then evaluate this "Magic DIAMOND" model in silico and in vivo on various combinations of tensor-valued encoded data. Assessing uncertainty on parameter estimation via stratified bootstrap, we investigate both voxel-based and fixel-based metrics by carrying out multi-peak tractography. We show that our estimated metrics can be mapped along tracks robustly across regions of fiber crossing, which opens new perspectives for tractometry and microstructure mapping along specific white-matter tracts.
Submitted 17 April, 2020; v1 submitted 15 April, 2020;
originally announced April 2020.
-
Leverage the Average: an Analysis of KL Regularization in RL
Authors:
Nino Vieillard,
Tadashi Kozuno,
Bruno Scherrer,
Olivier Pietquin,
Rémi Munos,
Matthieu Geist
Abstract:
Recent Reinforcement Learning (RL) algorithms making use of Kullback-Leibler (KL) regularization as a core component have shown outstanding performance. Yet, so far, little is understood theoretically about why KL regularization helps. We study KL regularization within an approximate value iteration scheme and show that it implicitly averages q-values. Leveraging this insight, we provide a very strong performance bound, the very first to combine two desirable aspects: a linear dependency on the horizon (instead of quadratic) and an error propagation term involving an averaging effect of the estimation errors (instead of an accumulation effect). We also study the more general case of an additional entropy regularizer. The resulting abstract scheme encompasses many existing RL algorithms. Some of our assumptions do not hold with neural networks, so we complement this theoretical analysis with an extensive empirical study.
Submitted 6 January, 2021; v1 submitted 31 March, 2020;
originally announced March 2020.
-
Momentum in Reinforcement Learning
Authors:
Nino Vieillard,
Bruno Scherrer,
Olivier Pietquin,
Matthieu Geist
Abstract:
We adapt the concept of momentum from optimization to reinforcement learning. Seeing the state-action value functions as an analog to the gradients in optimization, we interpret momentum as an average of consecutive $q$-functions. We derive Momentum Value Iteration (MoVI), a variation of Value Iteration that incorporates this momentum idea. Our analysis shows that this allows MoVI to average errors over successive iterations. We show that the proposed approach can be readily extended to deep learning. Specifically, we propose a simple improvement on DQN based on MoVI, and evaluate it on Atari games.
Submitted 31 March, 2020; v1 submitted 21 October, 2019;
originally announced October 2019.
-
A Theory of Regularized Markov Decision Processes
Authors:
Matthieu Geist,
Bruno Scherrer,
Olivier Pietquin
Abstract:
Many recent successful (deep) reinforcement learning algorithms make use of regularization, generally based on entropy or Kullback-Leibler divergence. We propose a general theory of regularized Markov Decision Processes that generalizes these approaches in two directions: we consider a larger class of regularizers, and we consider the general modified policy iteration approach, encompassing both policy iteration and value iteration. The core building blocks of this theory are a notion of regularized Bellman operator and the Legendre-Fenchel transform, a classical tool of convex optimization. This approach allows for error propagation analyses of general algorithmic schemes of which (possibly variants of) classical algorithms such as Trust Region Policy Optimization, Soft Q-learning, Stochastic Actor Critic or Dynamic Policy Programming are special cases. This also draws connections to proximal convex optimization, especially to Mirror Descent.
Submitted 4 June, 2019; v1 submitted 31 January, 2019;
originally announced January 2019.
-
Anderson Acceleration for Reinforcement Learning
Authors:
Matthieu Geist,
Bruno Scherrer
Abstract:
Anderson acceleration is an old and simple method for accelerating the computation of a fixed point. However, as far as we know and quite surprisingly, it has never been applied to dynamic programming or reinforcement learning. In this paper, we explain briefly what Anderson acceleration is and how it can be applied to value iteration, this being supported by preliminary experiments showing a significant speed up of convergence, that we critically discuss. We also discuss how this idea could be applied more generally to (deep) reinforcement learning.
Submitted 25 September, 2018;
originally announced September 2018.
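As a rough illustration of the idea summarized in the abstract above, the following sketch applies a windowed Anderson acceleration step to value iteration on a small tabular MDP. It is not the authors' implementation: the array layout (`P` of shape `(A, S, S)`, `R` of shape `(S, A)`), the window size `m`, the ridge term in the least-squares solve, and the safeguard that falls back to a plain Bellman update are all assumptions made here for the example.

```python
import numpy as np

def bellman(P, R, gamma, v):
    """Bellman optimality operator; P has shape (A, S, S), R has shape (S, A)."""
    return (R + gamma * np.einsum("ast,t->sa", P, v)).max(axis=1)

def value_iteration(P, R, gamma, tol=1e-8, max_iter=10_000):
    """Plain value iteration, kept here as a baseline for comparison."""
    v = np.zeros(R.shape[0])
    for k in range(max_iter):
        g = bellman(P, R, gamma, v)
        if np.max(np.abs(g - v)) < tol:
            return g, k
        v = g
    return v, max_iter

def anderson_value_iteration(P, R, gamma, m=5, tol=1e-8, max_iter=10_000):
    """Value iteration with a windowed Anderson step (illustrative sketch)."""
    v = np.zeros(R.shape[0])
    iterates, images = [], []  # last m+1 iterates v_i and their images T(v_i)
    for k in range(max_iter):
        g = bellman(P, R, gamma, v)
        res = np.max(np.abs(g - v))
        if res < tol:
            return g, k
        iterates = (iterates + [v])[-(m + 1):]
        images = (images + [g])[-(m + 1):]
        # Columns of F are the residuals T(v_i) - v_i over the window.
        F = np.stack([gi - vi for gi, vi in zip(images, iterates)], axis=1)
        # Minimize ||F alpha||_2 subject to sum(alpha) = 1 via the normal
        # equations, with a small ridge term for numerical stability.
        FtF = F.T @ F
        FtF += 1e-10 * np.trace(FtF) * np.eye(FtF.shape[0])
        z = np.linalg.solve(FtF, np.ones(FtF.shape[0]))
        alpha = z / z.sum()
        candidate = np.stack(images, axis=1) @ alpha
        # Safeguard (an assumption of this sketch): keep the extrapolated point
        # only if its Bellman residual shrinks at least as fast as a plain step.
        cand_res = np.max(np.abs(bellman(P, R, gamma, candidate) - candidate))
        v = candidate if cand_res <= gamma * res else g
    return v, max_iter
```

The safeguard costs one extra operator application per iteration, but it ensures the residual contracts by at least a factor $γ$ at every step, so the sketch cannot do worse than the standard value-iteration bound.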
-
How to Combine Tree-Search Methods in Reinforcement Learning
Authors:
Yonathan Efroni,
Gal Dalal,
Bruno Scherrer,
Shie Mannor
Abstract:
Finite-horizon lookahead policies are abundantly used in Reinforcement Learning and demonstrate impressive empirical success. Usually, the lookahead policies are implemented with specific planning methods such as Monte Carlo Tree Search (e.g. in AlphaZero). Referring to the planning problem as tree search, a reasonable practice in these implementations is to back up the value only at the leaves while the information obtained at the root is not leveraged other than for updating the policy. Here, we question the potency of this approach. Namely, the latter procedure is non-contractive in general, and its convergence is not guaranteed. Our proposed enhancement is straightforward and simple: use the return from the optimal tree path to back up the values at the descendants of the root. This leads to a $γ^h$-contracting procedure, where $γ$ is the discount factor and $h$ is the tree depth. To establish our results, we first introduce a notion called \emph{multiple-step greedy consistency}. We then provide convergence rates for two algorithmic instantiations of the above enhancement in the presence of noise injected to both the tree search stage and value estimation stage.
Submitted 17 February, 2019; v1 submitted 6 September, 2018;
originally announced September 2018.
-
Multiple-Step Greedy Policies in Online and Approximate Reinforcement Learning
Authors:
Yonathan Efroni,
Gal Dalal,
Bruno Scherrer,
Shie Mannor
Abstract:
Multiple-step lookahead policies have demonstrated high empirical competence in Reinforcement Learning, via the use of Monte Carlo Tree Search or Model Predictive Control. In a recent work \cite{efroni2018beyond}, multiple-step greedy policies and their use in vanilla Policy Iteration algorithms were proposed and analyzed. In this work, we study multiple-step greedy algorithms in more practical setups. We begin by highlighting a counter-intuitive difficulty, arising with soft-policy updates: even in the absence of approximations, and contrary to the 1-step-greedy case, monotonic policy improvement is not guaranteed unless the update stepsize is sufficiently large. Taking particular care about this difficulty, we formulate and analyze online and approximate algorithms that use such a multi-step greedy operator.
Submitted 20 September, 2018; v1 submitted 21 May, 2018;
originally announced May 2018.
-
Beyond the One Step Greedy Approach in Reinforcement Learning
Authors:
Yonathan Efroni,
Gal Dalal,
Bruno Scherrer,
Shie Mannor
Abstract:
The famous Policy Iteration algorithm alternates between policy improvement and policy evaluation. Implementations of this algorithm with several variants of the latter evaluation stage, e.g., $n$-step and trace-based returns, have been analyzed in previous works. However, the case of multiple-step lookahead policy improvement, despite the recent increase in empirical evidence of its strength, has to our knowledge not been carefully analyzed yet. In this work, we introduce the first such analysis. Namely, we formulate variants of multiple-step policy improvement, derive new algorithms using these definitions and prove their convergence. Moreover, we show that recent prominent Reinforcement Learning algorithms are, in fact, instances of our framework. We thus shed light on their empirical success and give a recipe for deriving new algorithms for future study.
Submitted 30 July, 2018; v1 submitted 10 February, 2018;
originally announced February 2018.
-
Diffusion MRI microstructure models with in vivo human brain Connectom data: results from a multi-group comparison
Authors:
Uran Ferizi,
Benoit Scherrer,
Torben Schneider,
Mohammad Alipoor,
Odin Eufracio,
Rutger H. J. Fick,
Rachid Deriche,
Markus Nilsson,
Ana K. Loya-Olivas,
Mariano Rivera,
Dirk H. J. Poot,
Alonso Ramirez-Manzanares,
Jose L. Marroquin,
Ariel Rokem,
Christian Pötter,
Robert F. Dougherty,
Ken Sakaie,
Claudia Wheeler-Kingshott,
Simon K. Warfield,
Thomas Witzel,
Lawrence L. Wald,
José G. Raya,
Daniel C. Alexander
Abstract:
A large number of mathematical models have been proposed to describe the measured signal in diffusion-weighted (DW) magnetic resonance imaging (MRI) and infer properties about the white matter microstructure. However, a head-to-head comparison of DW-MRI models is critically missing in the field. To address this deficiency, we organized the "White Matter Modeling Challenge" during the International Symposium on Biomedical Imaging (ISBI) 2015 conference. This competition aimed at identifying the DW-MRI models that best predict unseen DW data. In vivo DW-MRI data were acquired on the Connectom scanner at the A. A. Martinos Center (Massachusetts General Hospital) using gradient strengths of up to 300 mT/m and a broad set of diffusion times. We focused on assessing the DW signal prediction in two regions: the genu in the corpus callosum, where the fibres are relatively straight and parallel, and the fornix, where the configuration of fibres is more complex. The challenge participants had access to three-quarters of the whole dataset, and their models were ranked on their ability to predict the remaining unseen quarter of data. In this paper we provide both an overview and a more in-depth description of each evaluated model, report the challenge results, and infer trends about the model characteristics that were associated with high model ranking. This work provides a much needed benchmark for DW-MRI models. The acquired data and model details for signal prediction evaluation are provided online to encourage a larger scale assessment of diffusion models in the future.
Submitted 9 November, 2016; v1 submitted 25 April, 2016;
originally announced April 2016.
-
Rate of Convergence and Error Bounds for LSTD($λ$)
Authors:
Manel Tagorti,
Bruno Scherrer
Abstract:
We consider LSTD($λ$), the least-squares temporal-difference algorithm with eligibility traces proposed by Boyan (2002). It computes a linear approximation of the value function of a fixed policy in a large Markov Decision Process. Under a $β$-mixing assumption, we derive, for any value of $λ\in (0,1)$, a high-probability estimate of the rate of convergence of this algorithm to its limit. We deduce a high-probability bound on the error of this algorithm, that extends (and slightly improves) that derived by Lazaric et al. (2012) in the specific case where $λ=0$. In particular, our analysis sheds some light on the choice of $λ$ with respect to the quality of the chosen linear space and the number of samples, that complies with simulations.
Submitted 13 May, 2014;
originally announced May 2014.
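For readers unfamiliar with the algorithm named in the abstract above, here is a minimal sketch of the textbook on-policy LSTD($λ$) estimator computed from a single trajectory. The function name `lstd_lambda` and the array layout are assumptions of this example, and no regularization is applied, so the accumulated matrix is assumed to be invertible.

```python
import numpy as np

def lstd_lambda(features, rewards, gamma, lam):
    """LSTD(lambda) from one trajectory (illustrative sketch).

    features: array of shape (T + 1, d), the feature vectors of x_0, ..., x_T.
    rewards:  array of shape (T,), the rewards r_0, ..., r_{T-1}.
    Returns the weight vector theta such that v(x) is approximated by phi(x) @ theta.
    """
    d = features.shape[1]
    A = np.zeros((d, d))
    b = np.zeros(d)
    z = np.zeros(d)  # eligibility trace
    for t in range(len(rewards)):
        # Decay the trace and add the current feature vector.
        z = gamma * lam * z + features[t]
        # Accumulate the linear system A theta = b.
        A += np.outer(z, features[t] - gamma * features[t + 1])
        b += z * rewards[t]
    return np.linalg.solve(A, b)
```

With $λ=0$ the trace reduces to the current feature vector and the estimator coincides with plain LSTD; larger values of $λ$ propagate each temporal-difference term further back along the trajectory.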
-
Approximate Policy Iteration Schemes: A Comparison
Authors:
Bruno Scherrer
Abstract:
We consider the infinite-horizon discounted optimal control problem formalized by Markov Decision Processes. We focus on several approximate variations of the Policy Iteration algorithm: Approximate Policy Iteration (API), Conservative Policy Iteration (CPI), a natural adaptation of the Policy Search by Dynamic Programming algorithm to the infinite-horizon case (PSDP$_\infty$), and the recently proposed Non-Stationary Policy Iteration (NSPI(m)). For all algorithms, we describe performance bounds, and make a comparison by paying particular attention to the concentrability constants involved, the number of iterations and the memory required. Our analysis highlights the following points: 1) The performance guarantee of CPI can be arbitrarily better than that of API/API($α$), but this comes at the cost of a relative---exponential in $\frac{1}ε$---increase of the number of iterations. 2) PSDP$_\infty$ enjoys the best of both worlds: its performance guarantee is similar to that of CPI, but within a number of iterations similar to that of API. 3) Contrary to API, which requires constant memory, the memory needed by CPI and PSDP$_\infty$ is proportional to their number of iterations, which may be problematic when the discount factor $γ$ is close to 1 or the approximation error $ε$ is close to $0$; we show that the NSPI(m) algorithm allows an overall trade-off between memory and performance. Simulations with these schemes confirm our analysis.
Submitted 12 May, 2014;
originally announced May 2014.
-
Pt-based nanowire networks with enhanced oxygen-reduction activity
Authors:
Henning Galinski,
Thomas Ryll,
Yang Lin,
Barbara Scherrer,
Anna Evans,
Max Döbeli,
Ludwig J. Gauckler
Abstract:
Pt-Al and Pt-Y-Al thin film electrodes on yttria-stabilised zirconia electrolytes were prepared by dealloying of co-sputtered Pt-Al or Pt-Y-Al films. The selective dissolution of Al from the Pt-alloy compound causes the formation of a highly porous nanowire network with a mean branch thickness below 25 nm and a pore intercept length below 35 nm. The oxygen reduction capability of the resulting electrodes was analysed in a micro-solid oxide fuel cell setup at elevated temperatures (598-873 K). Here, we demonstrate that these nanoporous thin films outperform "state-of-the-art" fuel cell electrodes in terms of catalytic activity and thermal stability. The nanoporous Pt electrodes exhibit exchange current densities that are up to 13 times higher than conventional Pt electrodes, measured at 648 K. It is shown that the enhanced catalytic activity of these Pt electrodes is achieved through the engineering of the material's d-bands due to the addition of yttrium as a ternary constituent.
Submitted 17 July, 2014; v1 submitted 18 March, 2014;
originally announced March 2014.
-
Policy Search: Any Local Optimum Enjoys a Global Performance Guarantee
Authors:
Bruno Scherrer,
Matthieu Geist
Abstract:
Local Policy Search is a popular reinforcement learning approach for handling large state spaces. Formally, it searches locally in a parameterized policy space in order to maximize the associated value function averaged over some predefined distribution. It is probably commonly believed that the best one can hope in general from such an approach is to get a local optimum of this criterion. In this article, we show the following surprising result: \emph{any} (approximate) \emph{local optimum} enjoys a \emph{global performance guarantee}. We compare this guarantee with the one that is satisfied by Direct Policy Iteration, an approximate dynamic programming algorithm that does some form of Policy Search: if the approximation error of Local Policy Search may generally be bigger (because local search requires to consider a space of stochastic policies), we argue that the concentrability coefficient that appears in the performance bound is much nicer. Finally, we discuss several practical and theoretical consequences of our analysis.
Submitted 6 June, 2013;
originally announced June 2013.
-
On the Performance Bounds of some Policy Search Dynamic Programming Algorithms
Authors:
Bruno Scherrer
Abstract:
We consider the infinite-horizon discounted optimal control problem formalized by Markov Decision Processes. We focus on Policy Search algorithms, that compute an approximately optimal policy by following the standard Policy Iteration (PI) scheme via an $ε$-approximate greedy operator (Kakade and Langford, 2002; Lazaric et al., 2010). We describe existing and a few new performance bounds for Direct Policy Iteration (DPI) (Lagoudakis and Parr, 2003; Fern et al., 2006; Lazaric et al., 2010) and Conservative Policy Iteration (CPI) (Kakade and Langford, 2002). By paying particular attention to the concentrability constants involved in such guarantees, we notably argue that the guarantee of CPI is much better than that of DPI, but this comes at the cost of a relative---exponential in $\frac{1}ε$---increase of time complexity. We then describe an algorithm, Non-Stationary Direct Policy Iteration (NSDPI), that can either be seen as 1) a variation of Policy Search by Dynamic Programming by Bagnell et al. (2003) to the infinite horizon situation or 2) a simplified version of the Non-Stationary PI with growing period of Scherrer and Lesner (2012). We provide an analysis of this algorithm, that shows in particular that it enjoys the best of both worlds: its performance guarantee is similar to that of CPI, but within a time complexity similar to that of DPI.
Submitted 3 June, 2013;
originally announced June 2013.
-
Improved and Generalized Upper Bounds on the Complexity of Policy Iteration
Authors:
Bruno Scherrer
Abstract:
Given a Markov Decision Process (MDP) with $n$ states and a total number $m$ of actions, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge to the optimal $γ$-discounted policy. We consider two variations of PI: Howard's PI that changes the actions in all states with a positive advantage, and Simplex-PI that only changes the action in the state with maximal advantage. We show that Howard's PI terminates after at most $O\left(\frac{m}{1-γ}\log\left(\frac{1}{1-γ}\right)\right)$ iterations, improving by a factor $O(\log n)$ a result by Hansen et al., while Simplex-PI terminates after at most $O\left(\frac{nm}{1-γ}\log\left(\frac{1}{1-γ}\right)\right)$ iterations, improving by a factor $O(\log n)$ a result by Ye. Under some structural properties of the MDP, we then consider bounds that are independent of the discount factor $γ$: quantities of interest are bounds $τ_t$ and $τ_r$---uniform on all states and policies---respectively on the \emph{expected time spent in transient states} and \emph{the inverse of the frequency of visits in recurrent states} given that the process starts from the uniform distribution. Indeed, we show that Simplex-PI terminates after at most $\tilde O\left(n^3 m^2 τ_t τ_r \right)$ iterations. This extends a recent result for deterministic MDPs by Post & Ye, in which $τ_t \le 1$ and $τ_r \le n$, in particular it shows that Simplex-PI is strongly polynomial for a much larger class of MDPs. We explain why similar results seem hard to derive for Howard's PI. Finally, under the additional (restrictive) assumption that the state space is partitioned in two sets, respectively states that are transient and recurrent for all policies, we show that both Howard's PI and Simplex-PI terminate after at most $\tilde O(m(n^2τ_t+nτ_r))$ iterations.
Submitted 10 February, 2016; v1 submitted 3 June, 2013;
originally announced June 2013.
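
The two switching rules named in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the tiny random MDP, the helper names (`random_mdp`, `evaluate`, `advantages`, `policy_iteration`), the choice of switching to the maximal-advantage action, and the use of exact rational arithmetic are all assumptions made for illustration. Exact fractions are used so that the test "advantage > 0" is not affected by rounding.

```python
from fractions import Fraction
import random


def random_mdp(n, k, seed=0):
    """Hypothetical toy MDP: n states, k actions per state, rational data."""
    rng = random.Random(seed)
    P, R = [], []
    for _ in range(n):
        rows, rewards = [], []
        for _ in range(k):
            w = [rng.randint(1, 9) for _ in range(n)]
            total = sum(w)
            rows.append([Fraction(x, total) for x in w])  # P[s][a][s']
            rewards.append(Fraction(rng.randint(0, 10)))  # R[s][a]
        P.append(rows)
        R.append(rewards)
    return P, R


def evaluate(P, R, gamma, policy):
    """Solve (I - gamma * P_pi) v = r_pi exactly by Gauss-Jordan elimination."""
    n = len(policy)
    A = [[Fraction(int(i == j)) - gamma * P[i][policy[i]][j] for j in range(n)]
         + [R[i][policy[i]]] for i in range(n)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        inv = 1 / A[col][col]
        A[col] = [x * inv for x in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0:
                f = A[r][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]


def advantages(P, R, gamma, v):
    """adv[s][a] = r(s, a) + gamma * sum_j P(j | s, a) v(j) - v(s)."""
    n = len(v)
    return [[R[s][a] + gamma * sum(P[s][a][j] * v[j] for j in range(n)) - v[s]
             for a in range(len(R[s]))] for s in range(n)]


def policy_iteration(P, R, gamma, rule="howard"):
    """Return (policy, value, number of policy switches performed)."""
    n = len(R)
    policy = [0] * n
    iterations = 0
    while True:
        v = evaluate(P, R, gamma, policy)
        adv = advantages(P, R, gamma, v)
        best = [max(range(len(adv[s])), key=lambda a: adv[s][a]) for s in range(n)]
        gains = [adv[s][best[s]] for s in range(n)]
        if max(gains) <= 0:  # no improving action anywhere: policy is optimal
            return policy, v, iterations
        if rule == "howard":
            # Howard's PI: switch in every state with a positive advantage.
            policy = [best[s] if gains[s] > 0 else policy[s] for s in range(n)]
        else:
            # Simplex-PI: switch only in the state with maximal advantage.
            t = max(range(n), key=lambda u: gains[u])
            policy[t] = best[t]
        iterations += 1


P, R = random_mdp(4, 3, seed=1)
gamma = Fraction(9, 10)
_, v_howard, it_howard = policy_iteration(P, R, gamma, "howard")
_, v_simplex, it_simplex = policy_iteration(P, R, gamma, "simplex")
print(it_howard, it_simplex, v_howard == v_simplex)
```

Both rules stop at a policy with no positive advantage, so they reach the same optimal value function; only the number of iterations differs, which is the quantity the bounds above control.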
-
Tight Performance Bounds for Approximate Modified Policy Iteration with Non-Stationary Policies
Authors:
Boris Lesner,
Bruno Scherrer
Abstract:
We consider approximate dynamic programming for the infinite-horizon stationary $γ$-discounted optimal control problem formalized by Markov Decision Processes. While in the exact case it is known that there always exists an optimal policy that is stationary, we show that when using value function approximation, looking for a non-stationary policy may lead to a better performance guarantee. We define a non-stationary variant of MPI that unifies a broad family of approximate DP algorithms of the literature. For this algorithm we provide an error propagation analysis in the form of a performance bound of the resulting policies that can improve the usual performance bound by a factor $O(1-γ)$, which is significant when the discount factor $γ$ is close to 1. Doing so, our approach unifies recent results for Value and Policy Iteration. Furthermore, we show, by constructing a specific deterministic MDP, that our performance guarantee is tight.
Submitted 20 April, 2013;
originally announced April 2013.
-
Off-policy Learning with Eligibility Traces: A Survey
Authors:
Matthieu Geist,
Bruno Scherrer
Abstract:
In the framework of Markov Decision Processes, we consider off-policy learning, that is, the problem of learning a linear approximation of the value function of some fixed policy from one trajectory possibly generated by some other policy. We briefly review on-policy learning algorithms of the literature (gradient-based and least-squares-based), adopting a unified algorithmic view. Then, we highlight a systematic approach for adapting them to off-policy learning with eligibility traces. This leads to some known algorithms - off-policy LSTD(λ), LSPE(λ), TD(λ), TDC/GQ(λ) - and suggests new extensions - off-policy FPKF(λ), BRM(λ), gBRM(λ), GTD2(λ). We describe a comprehensive algorithmic derivation of all algorithms in a recursive and memory-efficient form, discuss their known convergence properties and illustrate their relative empirical behavior on Garnet problems. Our experiments suggest that the most standard algorithms, on- and off-policy LSTD(λ)/LSPE(λ) (and TD(λ) if the feature space dimension is too large for a least-squares approach), perform the best.
Submitted 15 April, 2013;
originally announced April 2013.
-
On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes
Authors:
Bruno Scherrer,
Boris Lesner
Abstract:
We consider infinite-horizon stationary $γ$-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error $ε$ at each iteration, it is well-known that one can compute stationary policies that are $\frac{2γ}{(1-γ)^2}ε$-optimal. After arguing that this guarantee is tight, we develop variations of Value and Policy Iteration for computing non-stationary policies that can be up to $\frac{2γ}{1-γ}ε$-optimal, which constitutes a significant improvement in the usual situation when $γ$ is close to 1. Surprisingly, this shows that the problem of "computing near-optimal non-stationary policies" is much simpler than that of "computing near-optimal stationary policies".
Submitted 29 November, 2012;
originally announced November 2012.
-
A Dantzig Selector Approach to Temporal Difference Learning
Authors:
Matthieu Geist,
Bruno Scherrer,
Alessandro Lazaric,
Mohammad Ghavamzadeh
Abstract:
LSTD is a popular algorithm for value function approximation. Whenever the number of features is larger than the number of samples, it must be paired with some form of regularization. In particular, L1-regularization methods tend to perform feature selection by promoting sparsity, and thus are well-suited for high-dimensional problems. However, since LSTD is not a simple regression algorithm but solves a fixed-point problem, its integration with L1-regularization is not straightforward and might come with some drawbacks (e.g., the P-matrix assumption for LASSO-TD). In this paper, we introduce a novel algorithm obtained by integrating LSTD with the Dantzig Selector. We investigate the performance of the proposed algorithm and its relationship with the existing regularized approaches, and show how it addresses some of their drawbacks.
Submitted 27 June, 2012;
originally announced June 2012.
-
Approximate Modified Policy Iteration
Authors:
Bruno Scherrer,
Victor Gabillon,
Mohammad Ghavamzadeh,
Matthieu Geist
Abstract:
Modified policy iteration (MPI) is a dynamic programming (DP) algorithm that contains the two celebrated policy and value iteration methods. Despite its generality, MPI has not been thoroughly studied, especially its approximate form, which is used when the state and/or action spaces are large or infinite. In this paper, we propose three implementations of approximate MPI (AMPI) that are extensions of well-known approximate DP algorithms: fitted-value iteration, fitted-Q iteration, and classification-based policy iteration. We provide error propagation analyses that unify those for approximate policy and value iteration. For the last, classification-based implementation, we develop a finite-sample analysis showing that MPI's main parameter allows one to control the balance between the estimation error of the classifier and the overall value function approximation.
Submitted 18 May, 2012; v1 submitted 14 May, 2012;
originally announced May 2012.
-
On the Use of Non-Stationary Policies for Infinite-Horizon Discounted Markov Decision Processes
Authors:
Bruno Scherrer
Abstract:
We consider infinite-horizon $γ$-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. We consider the algorithm Value Iteration and the sequence of policies $π_1,...,π_k$ it implicitly generates until some iteration $k$. We provide performance bounds for non-stationary policies involving the last $m$ generated policies that reduce the state-of-the-art bound for the last stationary policy $π_k$ by a factor $\frac{1-γ}{1-γ^m}$. In particular, the use of non-stationary policies allows one to reduce the usual asymptotic performance bounds of Value Iteration with errors bounded by $ε$ at each iteration from $\frac{γ}{(1-γ)^2}ε$ to $\frac{γ}{1-γ}ε$, which is significant in the usual situation when $γ$ is close to 1. Given Bellman operators that can only be computed with some error $ε$, a surprising consequence of this result is that the problem of "computing an approximately optimal non-stationary policy" is much simpler than that of "computing an approximately optimal stationary policy", and even slightly simpler than that of "approximately computing the value of some fixed policy", since this last problem only has a guarantee of $\frac{1}{1-γ}ε$.
Submitted 30 March, 2012; v1 submitted 25 March, 2012;
originally announced March 2012.
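
To get a feel for the factor $\frac{1-γ}{1-γ^m}$ quoted in this abstract, the short sketch below evaluates it numerically. The function names are hypothetical, and the only formulas used are those stated in the abstract: the stationary asymptotic bound $\frac{γ}{(1-γ)^2}ε$ and the multiplicative factor applied to it.

```python
def improvement_factor(gamma: float, m: int) -> float:
    """Factor (1 - gamma) / (1 - gamma**m) from the abstract, for 0 < gamma < 1."""
    if not 0.0 < gamma < 1.0 or m < 1:
        raise ValueError("need 0 < gamma < 1 and m >= 1")
    return (1.0 - gamma) / (1.0 - gamma ** m)


def stationary_bound(gamma: float, eps: float) -> float:
    """Usual asymptotic bound gamma / (1 - gamma)^2 * eps for the last policy."""
    return gamma / (1.0 - gamma) ** 2 * eps


def bound_with_last_m(gamma: float, eps: float, m: int) -> float:
    """Stationary bound multiplied by the improvement factor (illustrative)."""
    return improvement_factor(gamma, m) * stationary_bound(gamma, eps)


if __name__ == "__main__":
    gamma, eps = 0.99, 1.0
    for m in (1, 10, 100, 1000):
        print(m, improvement_factor(gamma, m), bound_with_last_m(gamma, eps, m))
```

With $m=1$ the factor equals 1, which recovers the stationary bound; as $m$ grows the factor decreases towards $1-γ$, so the product approaches $\frac{γ}{1-γ}ε$.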
-
Should one compute the Temporal Difference fix point or minimize the Bellman Residual? The unified oblique projection view
Authors:
Bruno Scherrer
Abstract:
We investigate projection methods for evaluating a linear approximation of the value function of a policy in a Markov Decision Process context. We consider two popular approaches, the one-step Temporal Difference fixed-point computation (TD(0)) and the Bellman Residual (BR) minimization. We describe examples where each method outperforms the other. We highlight a simple relation between the objective functions they minimize, and show that while BR enjoys a performance guarantee, TD(0) does not in general. We then propose a unified view in terms of oblique projections of the Bellman equation, which substantially simplifies and extends the characterization of (Schoknecht, 2002) and the recent analysis of (Yu & Bertsekas, 2008). Finally, we describe some simulations that suggest that although the TD(0) solution is usually slightly better than the BR solution, its inherent numerical instability makes it very bad in some cases, and thus worse on average.
Submitted 19 November, 2010;
originally announced November 2010.
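
The difference between the two approaches compared in this abstract can be seen on a toy case. The sketch below is an illustration under strong simplifying assumptions, not the paper's construction: a hypothetical two-state chain, a single feature (so the parameter is a scalar $θ$), and a fixed state weighting `xi`. Writing $ψ = φ - γPφ$, the Bellman residual of $θ$ is $r - θψ$; the TD(0) fixed point makes this residual orthogonal to $φ$, while BR minimization makes it orthogonal to $ψ$.

```python
def weighted_dot(x, y, xi):
    """Inner product <x, y> weighted by the state distribution xi."""
    return sum(w * a * b for w, a, b in zip(xi, x, y))


def td0_and_br(P, r, phi, xi, gamma):
    """Scalar TD(0) fixed point and Bellman-residual minimizer for one feature."""
    n = len(phi)
    P_phi = [sum(P[s][t] * phi[t] for t in range(n)) for s in range(n)]
    psi = [phi[s] - gamma * P_phi[s] for s in range(n)]
    # TD(0): solve <phi, r - theta * psi> = 0. The denominator can be close
    # to zero for unlucky features, in which case theta_td blows up.
    theta_td = weighted_dot(phi, r, xi) / weighted_dot(phi, psi, xi)
    # BR: minimize ||r - theta * psi||^2, an ordinary weighted least squares.
    theta_br = weighted_dot(psi, r, xi) / weighted_dot(psi, psi, xi)
    return theta_td, theta_br, psi


def bellman_residual(r, psi, theta):
    """Residual r + gamma * P * (theta * phi) - theta * phi = r - theta * psi."""
    return [r_s - theta * psi_s for r_s, psi_s in zip(r, psi)]


# Hypothetical toy data (not taken from the paper).
P = [[0.9, 0.1], [0.2, 0.8]]
r = [1.0, 0.0]
phi = [1.0, 2.0]
xi = [0.5, 0.5]
gamma = 0.9

theta_td, theta_br, psi = td0_and_br(P, r, phi, xi, gamma)
res_td = bellman_residual(r, psi, theta_td)
res_br = bellman_residual(r, psi, theta_br)
print("theta_td:", theta_td, "theta_br:", theta_br)
print("weighted residual norms:",
      weighted_dot(res_td, res_td, xi), weighted_dot(res_br, res_br, xi))
```

By construction the BR solution has the smaller Bellman residual norm; which of the two is closer to the true value function depends on the instance.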
-
Performance Bounds for Lambda Policy Iteration and Application to the Game of Tetris
Authors:
Bruno Scherrer
Abstract:
We consider the discrete-time infinite-horizon optimal control problem formalized by Markov Decision Processes. We revisit the work of Bertsekas and Ioffe, who introduced $λ$ Policy Iteration, a family of algorithms parameterized by $λ$ that generalizes the standard algorithms Value Iteration and Policy Iteration, and has some deep connections with the Temporal Differences algorithm TD($λ$) described by Sutton and Barto. We deepen the original theory developed by the authors by providing convergence rate bounds which generalize standard bounds for Value Iteration described for instance by Puterman. Then, the main contribution of this paper is to develop the theory of this algorithm when it is used in an approximate form and show that this is sound. Doing so, we extend and unify the separate analyses developed by Munos for Approximate Value Iteration and Approximate Policy Iteration. Finally, we revisit the use of this algorithm in the training of a Tetris playing controller as originally done by Bertsekas and Ioffe. We provide an original performance bound that can be applied to such an undiscounted control problem. Our empirical results are different from those of Bertsekas and Ioffe (which were originally qualified as "paradoxical" and "intriguing"), and conform much more closely to what one would expect from a learning experiment. We discuss the possible reason for such a difference.
Submitted 11 October, 2011; v1 submitted 5 November, 2007;
originally announced November 2007.
-
Modular self-organization
Authors:
Bruno Scherrer
Abstract:
The aim of this paper is to provide a sound framework for addressing a difficult problem: the automatic construction of an autonomous agent's modular architecture. We combine results from two apparently unrelated domains: autonomous planning through Markov Decision Processes and a general data clustering approach using a kernel-like method. Our fundamental idea is that the former is a good framework for addressing autonomy whereas the latter allows one to tackle self-organizing problems.
Submitted 26 September, 2006;
originally announced September 2006.