Showing 1–22 of 22 results for author: Scherrer, B

  1. arXiv:2103.09847  [pdf, other]

    cs.LG cs.AI math.OC stat.ML

    Infinite-Horizon Offline Reinforcement Learning with Linear Function Approximation: Curse of Dimensionality and Algorithm

    Authors: Lin Chen, Bruno Scherrer, Peter L. Bartlett

    Abstract: In this paper, we investigate the sample complexity of policy evaluation in infinite-horizon offline reinforcement learning (also known as the off-policy evaluation problem) with linear function approximation. We identify a hard regime $dγ^{2}>1$, where $d$ is the dimension of the feature vector and $γ$ is the discount rate. In this regime, for any $q\in[γ^{2},1]$, we can construct a hard instance…

    Submitted 17 March, 2021; originally announced March 2021.

  2. arXiv:2003.14089  [pdf, other]

    cs.LG stat.ML

    Leverage the Average: an Analysis of KL Regularization in RL

    Authors: Nino Vieillard, Tadashi Kozuno, Bruno Scherrer, Olivier Pietquin, Rémi Munos, Matthieu Geist

    Abstract: Recent Reinforcement Learning (RL) algorithms making use of Kullback-Leibler (KL) regularization as a core component have shown outstanding performance. Yet, so far, little is understood theoretically about why KL regularization helps. We study KL regularization within an approximate value iteration scheme and show that it implicitly averages q-values. Leveraging this insight, we provide a ve…

    Submitted 6 January, 2021; v1 submitted 31 March, 2020; originally announced March 2020.

    Comments: NeurIPS 2020

  3. arXiv:1910.09322  [pdf, other]

    cs.LG stat.ML

    Momentum in Reinforcement Learning

    Authors: Nino Vieillard, Bruno Scherrer, Olivier Pietquin, Matthieu Geist

    Abstract: We adapt the concept of momentum from optimization to reinforcement learning. Seeing the state-action value functions as an analog to the gradients in optimization, we interpret momentum as an average of consecutive $q$-functions. We derive Momentum Value Iteration (MoVI), a variation of Value Iteration that incorporates this momentum idea. Our analysis shows that this allows MoVI to average errors o…

    Submitted 31 March, 2020; v1 submitted 21 October, 2019; originally announced October 2019.

    Comments: AISTATS 2020

  4. arXiv:1901.11275  [pdf, other]

    cs.LG stat.ML

    A Theory of Regularized Markov Decision Processes

    Authors: Matthieu Geist, Bruno Scherrer, Olivier Pietquin

    Abstract: Many recent successful (deep) reinforcement learning algorithms make use of regularization, generally based on entropy or Kullback-Leibler divergence. We propose a general theory of regularized Markov Decision Processes that generalizes these approaches in two directions: we consider a larger class of regularizers, and we consider the general modified policy iteration approach, encompassing both p…

    Submitted 4 June, 2019; v1 submitted 31 January, 2019; originally announced January 2019.

    Comments: ICML 2019

  5. arXiv:1809.09501  [pdf, other]

    cs.LG stat.ML

    Anderson Acceleration for Reinforcement Learning

    Authors: Matthieu Geist, Bruno Scherrer

    Abstract: Anderson acceleration is an old and simple method for accelerating the computation of a fixed point. However, as far as we know and quite surprisingly, it has never been applied to dynamic programming or reinforcement learning. In this paper, we explain briefly what Anderson acceleration is and how it can be applied to value iteration, this being supported by preliminary experiments showing a sign…

    Submitted 25 September, 2018; originally announced September 2018.

    Comments: European Workshop on Reinforcement Learning (EWRL 2018)

  6. arXiv:1809.01843  [pdf, other]

    cs.LG cs.AI stat.ML

    How to Combine Tree-Search Methods in Reinforcement Learning

    Authors: Yonathan Efroni, Gal Dalal, Bruno Scherrer, Shie Mannor

    Abstract: Finite-horizon lookahead policies are abundantly used in Reinforcement Learning and demonstrate impressive empirical success. Usually, the lookahead policies are implemented with specific planning methods such as Monte Carlo Tree Search (e.g. in AlphaZero). Referring to the planning problem as tree search, a reasonable practice in these implementations is to back up the value only at the leaves wh…

    Submitted 17 February, 2019; v1 submitted 6 September, 2018; originally announced September 2018.

    Comments: AAAI 2019

  7. arXiv:1805.07956  [pdf, ps, other]

    cs.LG cs.AI stat.ML

    Multiple-Step Greedy Policies in Online and Approximate Reinforcement Learning

    Authors: Yonathan Efroni, Gal Dalal, Bruno Scherrer, Shie Mannor

    Abstract: Multiple-step lookahead policies have demonstrated high empirical competence in Reinforcement Learning, via the use of Monte Carlo Tree Search or Model Predictive Control. In a recent work (Efroni et al., 2018), multiple-step greedy policies and their use in vanilla Policy Iteration algorithms were proposed and analyzed. In this work, we study multiple-step greedy algorithms in more practical se…

    Submitted 20 September, 2018; v1 submitted 21 May, 2018; originally announced May 2018.

    Comments: NIPS 2018

  8. arXiv:1802.03654  [pdf, other]

    cs.AI cs.LG stat.ML

    Beyond the One Step Greedy Approach in Reinforcement Learning

    Authors: Yonathan Efroni, Gal Dalal, Bruno Scherrer, Shie Mannor

    Abstract: The famous Policy Iteration algorithm alternates between policy improvement and policy evaluation. Implementations of this algorithm with several variants of the latter evaluation stage, e.g., $n$-step and trace-based returns, have been analyzed in previous works. However, the case of multiple-step lookahead policy improvement, despite the recent increase in empirical evidence of its strength, has…

    Submitted 30 July, 2018; v1 submitted 10 February, 2018; originally announced February 2018.

    Comments: ICML 2018

  9. arXiv:1405.3229  [pdf, other]

    cs.LG cs.AI math.OC math.ST

    Rate of Convergence and Error Bounds for LSTD($λ$)

    Authors: Manel Tagorti, Bruno Scherrer

    Abstract: We consider LSTD($λ$), the least-squares temporal-difference algorithm with eligibility traces proposed by Boyan (2002). It computes a linear approximation of the value function of a fixed policy in a large Markov Decision Process. Under a $β$-mixing assumption, we derive, for any value of $λ\in (0,1)$, a high-probability estimate of the rate of convergence of this algorithm to its limit…

    Submitted 13 May, 2014; originally announced May 2014.

    Comments: (2014)

  10. arXiv:1405.2878  [pdf, other]

    cs.AI cs.LG stat.ML

    Approximate Policy Iteration Schemes: A Comparison

    Authors: Bruno Scherrer

    Abstract: We consider the infinite-horizon discounted optimal control problem formalized by Markov Decision Processes. We focus on several approximate variations of the Policy Iteration algorithm: Approximate Policy Iteration, Conservative Policy Iteration (CPI), a natural adaptation of the Policy Search by Dynamic Programming algorithm to the infinite-horizon case (PSDP$_\infty$), and the recently proposed…

    Submitted 12 May, 2014; originally announced May 2014.

    Comments: ICML (2014)

  11. arXiv:1306.1520  [pdf, ps, other]

    cs.LG cs.AI cs.RO math.OC

    Policy Search: Any Local Optimum Enjoys a Global Performance Guarantee

    Authors: Bruno Scherrer, Matthieu Geist

    Abstract: Local Policy Search is a popular reinforcement learning approach for handling large state spaces. Formally, it searches locally in a parameterized policy space in order to maximize the associated value function averaged over some predefined distribution. It is probably commonly believed that the best one can hope for in general from such an approach is to get a local optimum of this criterion. In th…

    Submitted 6 June, 2013; originally announced June 2013.

  12. arXiv:1306.0539  [pdf, other]

    cs.AI cs.LG

    On the Performance Bounds of some Policy Search Dynamic Programming Algorithms

    Authors: Bruno Scherrer

    Abstract: We consider the infinite-horizon discounted optimal control problem formalized by Markov Decision Processes. We focus on Policy Search algorithms, which compute an approximately optimal policy by following the standard Policy Iteration (PI) scheme via an $ε$-approximate greedy operator (Kakade and Langford, 2002; Lazaric et al., 2010). We describe existing and a few new performance bounds for Direct P…

    Submitted 3 June, 2013; originally announced June 2013.

  13. arXiv:1306.0386  [pdf, ps, other]

    math.OC cs.AI cs.DM cs.RO

    Improved and Generalized Upper Bounds on the Complexity of Policy Iteration

    Authors: Bruno Scherrer

    Abstract: Given a Markov Decision Process (MDP) with $n$ states and a total number $m$ of actions, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge to the optimal $γ$-discounted policy. We consider two variations of PI: Howard's PI that changes the actions in all states with a positive advantage, and Simplex-PI that only changes the action in the state with maximal advantag…

    Submitted 10 February, 2016; v1 submitted 3 June, 2013; originally announced June 2013.

    Comments: Markov decision processes, Dynamic Programming, Analysis of Algorithms, Mathematics of Operations Research, INFORMS, 2016

  14. arXiv:1304.5610  [pdf, other]

    math.OC cs.AI

    Tight Performance Bounds for Approximate Modified Policy Iteration with Non-Stationary Policies

    Authors: Boris Lesner, Bruno Scherrer

    Abstract: We consider approximate dynamic programming for the infinite-horizon stationary $γ$-discounted optimal control problem formalized by Markov Decision Processes. While in the exact case it is known that there always exists an optimal policy that is stationary, we show that when using value function approximation, looking for a non-stationary policy may lead to a better performance guarantee. We defi…

    Submitted 20 April, 2013; originally announced April 2013.

  15. arXiv:1304.3999  [pdf, other]

    cs.AI cs.RO

    Off-policy Learning with Eligibility Traces: A Survey

    Authors: Matthieu Geist, Bruno Scherrer

    Abstract: In the framework of Markov Decision Processes, we consider off-policy learning, that is, the problem of learning a linear approximation of the value function of some fixed policy from one trajectory possibly generated by some other policy. We briefly review on-policy learning algorithms from the literature (gradient-based and least-squares-based), adopting a unified algorithmic view. Then, we highlight a systema…

    Submitted 15 April, 2013; originally announced April 2013.

  16. arXiv:1211.6898  [pdf, ps, other]

    cs.LG cs.AI

    On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

    Authors: Bruno Scherrer, Boris Lesner

    Abstract: We consider infinite-horizon stationary $γ$-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error $ε$ at each iteration, it is well-known that one can compute stationary policies that are $\frac{2γ}{(1-γ)^2}ε$-optimal. After arguing that this guarantee is tight, we develop variations of Value and…

    Submitted 29 November, 2012; originally announced November 2012.

    Journal ref: NIPS 2012 (2012)

  17. arXiv:1206.6480  [pdf]

    cs.LG stat.ML

    A Dantzig Selector Approach to Temporal Difference Learning

    Authors: Matthieu Geist, Bruno Scherrer, Alessandro Lazaric, Mohammad Ghavamzadeh

    Abstract: LSTD is a popular algorithm for value function approximation. Whenever the number of features is larger than the number of samples, it must be paired with some form of regularization. In particular, L1-regularization methods tend to perform feature selection by promoting sparsity, and thus, are well-suited for high-dimensional problems. However, since LSTD is not a simple regression algorithm, but…

    Submitted 27 June, 2012; originally announced June 2012.

    Comments: Appears in Proceedings of the 29th International Conference on Machine Learning (ICML 2012)

  18. arXiv:1205.3054  [pdf, other]

    cs.AI

    Approximate Modified Policy Iteration

    Authors: Bruno Scherrer, Victor Gabillon, Mohammad Ghavamzadeh, Matthieu Geist

    Abstract: Modified policy iteration (MPI) is a dynamic programming (DP) algorithm that contains the two celebrated policy and value iteration methods. Despite its generality, MPI has not been thoroughly studied, especially its approximation form which is used when the state and/or action spaces are large or infinite. In this paper, we propose three implementations of approximate MPI (AMPI) that are extensio…

    Submitted 18 May, 2012; v1 submitted 14 May, 2012; originally announced May 2012.

  19. arXiv:1203.5532  [pdf, ps, other]

    cs.AI

    On the Use of Non-Stationary Policies for Infinite-Horizon Discounted Markov Decision Processes

    Authors: Bruno Scherrer

    Abstract: We consider infinite-horizon $γ$-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. We consider the algorithm Value Iteration and the sequence of policies $π_1,...,π_k$ it implicitly generates until some iteration $k$. We provide performance bounds for non-stationary policies involving the last $m$ generated policies that reduce the state-of…

    Submitted 30 March, 2012; v1 submitted 25 March, 2012; originally announced March 2012.

  20. arXiv:1011.4362  [pdf, ps, other]

    cs.AI

    Should one compute the Temporal Difference fix point or minimize the Bellman Residual? The unified oblique projection view

    Authors: Bruno Scherrer

    Abstract: We investigate projection methods for evaluating a linear approximation of the value function of a policy in a Markov Decision Process context. We consider two popular approaches, the one-step Temporal Difference fix-point computation (TD(0)) and the Bellman Residual (BR) minimization. We describe examples where each method outperforms the other. We highlight a simple relation between the object…

    Submitted 19 November, 2010; originally announced November 2010.

  21. arXiv:0711.0694  [pdf, ps, other]

    cs.AI cs.RO

    Performance Bounds for Lambda Policy Iteration and Application to the Game of Tetris

    Authors: Bruno Scherrer

    Abstract: We consider the discrete-time infinite-horizon optimal control problem formalized by Markov Decision Processes. We revisit the work of Bertsekas and Ioffe, which introduced $λ$ Policy Iteration, a family of algorithms parameterized by $λ$ that generalizes the standard algorithms Value Iteration and Policy Iteration, and has some deep connections with the Temporal Differences algorithm TD($λ$) descr…

    Submitted 11 October, 2011; v1 submitted 5 November, 2007; originally announced November 2007.

    Comments: No. RR-6348 (2011)

  22. arXiv:cs/0609142  [pdf, ps, other]

    cs.AI

    Modular self-organization

    Authors: Bruno Scherrer

    Abstract: The aim of this paper is to provide a sound framework for addressing a difficult problem: the automatic construction of an autonomous agent's modular architecture. We combine results from two apparently uncorrelated domains: Autonomous planning through Markov Decision Processes and a General Data Clustering Approach using a kernel-like method. Our fundamental idea is that the former is a good fr…

    Submitted 26 September, 2006; originally announced September 2006.