Search | arXiv e-print repository

arXiv:2008.03559 [pdf, other]

Convex Q-Learning, Part 1: Deterministic Optimal Control

Authors: Prashant G. Mehta, Sean P. Meyn

Abstract: It is well known that the extension of Watkins' algorithm to general function approximation settings is challenging: does the projected Bellman equation have a solution? If so, is the solution useful in the sense of generating a good policy? And, if the preceding questions are answered in the affirmative, is the algorithm consistent? These questions are unanswered even in the special case of Q-function approximations that are linear in the parameter. The challenge seems paradoxical, given the long history of convex analytic approaches to dynamic programming. The paper begins with a brief survey of linear programming approaches to optimal control, leading to a particular over parameterization that lends itself to applications in reinforcement learning. The main conclusions are summarized as follows: (i) The new class of convex Q-learning algorithms is introduced based on the convex relaxation of the Bellman equation. Convergence is established under general conditions, including a linear function approximation for the Q-function. (ii) A batch implementation appears similar to the famed DQN algorithm (one engine behind AlphaZero). It is shown that in fact the algorithms are very different: while convex Q-learning solves a convex program that approximates the Bellman equation, theory for DQN is no stronger than for Watkins' algorithm with function approximation: (a) it is shown that both seek solutions to the same fixed point equation, and (b) the ODE approximations for the two algorithms coincide, and little is known about the stability of this ODE.
These results are obtained for deterministic nonlinear systems with total cost criterion. Many extensions are proposed, including kernel implementation, and extension to MDP models.

Submitted 8 August, 2020; originally announced August 2020.

Comments: This pre-print is written in a tutorial style so that it is accessible to newcomers. It will be part of a handout for upcoming short courses on RL. A more compact version suitable for journal submission is in preparation.

MSC Class: 68T05 (Primary) 93E35; 49L20 (Secondary)
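
For orientation, the idea of a convex relaxation of the Bellman equation can be written in a generic form. The notation here is assumed for illustration and may differ from the paper's: dynamics $x_{k+1} = F(x_k, u_k)$, one-step cost $c$, and a non-negative weighting $\mu$ on state-input pairs. The first line is the Bellman equation for the Q-function of a deterministic total-cost problem; the second relaxes the equality to a family of inequalities that are linear in $Q$, which is the classical linear programming route to dynamic programming.

```latex
% Bellman equation (deterministic dynamics, total cost)
Q^\star(x,u) = c(x,u) + \min_{u'} Q^\star\bigl(F(x,u), u'\bigr)

% Relaxation: a program that is linear in Q
\max_{Q} \ \langle \mu, Q \rangle
\quad \text{subject to} \quad
Q(x,u) \le c(x,u) + Q\bigl(F(x,u), u'\bigr)
\quad \text{for all } x,\ u,\ u'
```

With a parameterization $Q^\theta = \theta^\top \psi$ that is linear in $\theta$, the constraints remain linear in $\theta$, which is one way to see why a convex formulation is compatible with linear function approximation.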

arXiv:2002.10301 [pdf, other]

Q-learning with Uniformly Bounded Variance: Large Discounting is Not a Barrier to Fast Learning

Authors: Adithya M. Devraj, Sean P. Meyn

Abstract: Sample complexity bounds are a common performance metric in the Reinforcement Learning literature. In the discounted cost, infinite horizon setting, all of the known bounds have a factor that is a polynomial in $1/(1-γ)$, where $γ< 1$ is the discount factor. For a large discount factor, these bounds seem to imply that a very large number of samples is required to achieve an $\varepsilon$-optimal policy. The objective of the present work is to introduce a new class of algorithms that have sample complexity uniformly bounded for all $γ< 1$. One may argue that this is impossible, due to a recent min-max lower bound. The explanation is that this previous lower bound is for a specific problem, which we modify, without compromising the ultimate objective of obtaining an $\varepsilon$-optimal policy. Specifically, we show that the asymptotic covariance of the Q-learning algorithm with an optimized step-size sequence is a quadratic function of $1/(1-γ)$; an expected, and essentially known result. The new relative Q-learning algorithm proposed here is shown to have asymptotic covariance that is a quadratic in $1/(1- ρ^* γ)$, where $1 - ρ^* > 0$ is an upper bound on the spectral gap of an optimal transition matrix.

Submitted 7 July, 2020; v1 submitted 24 February, 2020; originally announced February 2020.

Comments: 33 pages, 4 figures
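
To get a feel for the scaling described in the abstract above, the following sketch compares the two quadratic factors numerically. The function names and the value `rho_star = 0.5` are illustrative assumptions, not taken from the paper, and the constants that multiply these factors in the actual covariance expressions are omitted.

```python
def standard_factor(gamma: float) -> float:
    """Square of 1/(1 - gamma): grows without bound as gamma approaches 1."""
    return 1.0 / (1.0 - gamma) ** 2


def relative_factor(gamma: float, rho_star: float) -> float:
    """Square of 1/(1 - rho_star * gamma): stays bounded when rho_star < 1."""
    return 1.0 / (1.0 - rho_star * gamma) ** 2


if __name__ == "__main__":
    rho_star = 0.5  # hypothetical value chosen only for illustration
    for gamma in (0.9, 0.99, 0.999):
        print(
            f"gamma={gamma}: "
            f"standard={standard_factor(gamma):.1f}, "
            f"relative={relative_factor(gamma, rho_star):.3f}"
        )
```

For any fixed `rho_star < 1`, the second factor is at most `1 / (1 - rho_star) ** 2` for every discount factor below one, which is the sense in which the bound is uniform in $γ$.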

arXiv:1910.05405 [pdf, other]

Zap Q-Learning With Nonlinear Function Approximation

Authors: Shuhang Chen, Adithya M. Devraj, Fan Lu, Ana Bušić, Sean P. Meyn

Abstract: Zap Q-learning is a recent class of reinforcement learning algorithms, motivated primarily as a means to accelerate convergence. Stability theory has been absent outside of two restrictive classes: the tabular setting, and optimal stopping. This paper introduces a new framework for analysis of a more general class of recursive algorithms known as stochastic approximation. Based on this general theory, it is shown that Zap Q-learning is consistent under a non-degeneracy assumption, even when the function approximation architecture is nonlinear. Zap Q-learning with neural network function approximation emerges as a special case, and is tested on examples from OpenAI Gym. Based on multiple experiments with a range of neural network sizes, it is found that the new algorithms converge quickly and are robust to choice of function approximation architecture.

Submitted 15 July, 2020; v1 submitted 11 October, 2019; originally announced October 2019.

arXiv:1904.11538 [pdf, other]

Zap Q-Learning for Optimal Stopping Time Problems

Authors: Shuhang Chen, Adithya M. Devraj, Ana Bušić, Sean P. Meyn

Abstract: The objective in this paper is to obtain fast converging reinforcement learning algorithms to approximate solutions to the problem of discounted cost optimal stopping in an irreducible, uniformly ergodic Markov chain, evolving on a compact subset of $\mathbb{R}^n$. We build on the dynamic programming approach taken by Tsitsiklis and Van Roy, wherein they propose a Q-learning algorithm to estimate the optimal state-action value function, which then defines an optimal stopping rule. We provide insights as to why the convergence rate of this algorithm can be slow, and propose a fast-converging alternative, the "Zap-Q-learning" algorithm, designed to achieve optimal rate of convergence. For the first time, we prove the convergence of the Zap-Q-learning algorithm in the linear function approximation setting. We use ODE analysis for the proof, and the optimal asymptotic variance property of the algorithm is reflected via fast convergence in a finance example.

Submitted 30 September, 2019; v1 submitted 25 April, 2019; originally announced April 2019.

arXiv:1812.11137 [pdf, other]

Differential Temporal Difference Learning

Authors: Adithya M. Devraj, Ioannis Kontoyiannis, Sean P. Meyn

Abstract: Value functions derived from Markov decision processes arise as a central component of algorithms as well as performance metrics in many statistics and engineering applications of machine learning techniques. Computation of the solution to the associated Bellman equations is challenging in most practical cases of interest. A popular class of approximation techniques, known as Temporal Difference (TD) learning algorithms, are an important sub-class of general reinforcement learning methods. The algorithms introduced in this paper are intended to resolve two well-known difficulties of TD-learning approaches: Their slow convergence due to very high variance, and the fact that, for the problem of computing the relative value function, consistent algorithms exist only in special cases. First we show that the gradients of these value functions admit a representation that lends itself to algorithm design. Based on this result, a new class of differential TD-learning algorithms is introduced. For Markovian models on Euclidean space with smooth dynamics, the algorithms are shown to be consistent under general conditions. Numerical results show dramatic variance reduction when compared to standard methods.

Submitted 27 February, 2020; v1 submitted 28 December, 2018; originally announced December 2018.

Comments: Preliminary versions of some of the results in this article were submitted as arXiv:1604.01828

MSC Class: 93E20; 93E35; 60J20

arXiv:1707.03770 [pdf, other]

Fastest Convergence for Q-learning

Authors: Adithya M. Devraj, Sean P. Meyn

Abstract: The Zap Q-learning algorithm introduced in this paper is an improvement of Watkins' original algorithm and recent competitors in several respects. It is a matrix-gain algorithm designed so that its asymptotic variance is optimal. Moreover, an ODE analysis suggests that the transient behavior is a close match to a deterministic Newton-Raphson implementation. This is made possible by a two time-scale update equation for the matrix gain sequence. The analysis suggests that the approach will lead to stable and efficient computation even for non-ideal parameterized settings. Numerical experiments confirm the quick convergence, even in such non-ideal cases. A secondary goal of this paper is tutorial. The first half of the paper contains a survey on reinforcement learning algorithms, with a focus on minimum variance algorithms.

Submitted 21 March, 2018; v1 submitted 12 July, 2017; originally announced July 2017.
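
As a point of reference for the abstract above, here is a minimal sketch of the baseline it improves on: tabular Q-learning in the style of Watkins, written for cost minimization. It is not Zap Q-learning; the matrix gain and the two time-scale update are not shown. The two-state model, the constant step size `ALPHA`, and all names are hypothetical choices made only so that the example is small and runnable.

```python
import random

GAMMA = 0.9     # discount factor (illustrative)
ALPHA = 0.5     # constant scalar step size (illustrative; Zap uses a matrix gain)
N_STEPS = 5000  # number of observed transitions


def step(state: int, action: int) -> tuple[float, int]:
    """Hypothetical deterministic model: action 0 stays put, action 1 switches state.

    Being in state 0 costs 1 per step; being in state 1 costs nothing.
    """
    cost = 1.0 if state == 0 else 0.0
    next_state = state if action == 0 else 1 - state
    return cost, next_state


def watkins_q_learning(seed: int = 0) -> list[list[float]]:
    """Run tabular Q-learning under a uniformly random exploration policy."""
    rng = random.Random(seed)
    q = [[0.0, 0.0], [0.0, 0.0]]  # q[state][action]
    state = 0
    for _ in range(N_STEPS):
        action = rng.randrange(2)
        cost, next_state = step(state, action)
        # Temporal-difference target for a cost-minimizing agent.
        target = cost + GAMMA * min(q[next_state])
        q[state][action] += ALPHA * (target - q[state][action])
        state = next_state
    return q


q = watkins_q_learning()
# Greedy policy: the action with the smallest estimated cost-to-go in each state.
policy = [min(range(2), key=lambda a: q[s][a]) for s in range(2)]
print(q, policy)
```

In this toy model the greedy policy moves from state 0 to state 1 and then stays there. The scalar step size is the component that a matrix-gain method replaces.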

arXiv:1604.01828 [pdf, other]

Differential TD Learning for Value Function Approximation

Authors: Adithya M. Devraj, Sean P. Meyn

Abstract: Value functions arise as a component of algorithms as well as performance metrics in statistics and engineering applications. Computation of the associated Bellman equations is numerically challenging in all but a few special cases. A popular approximation technique is known as Temporal Difference (TD) learning. The algorithm introduced in this paper is intended to resolve two well-known problems with this approach: In the discounted-cost setting, the variance of the algorithm diverges as the discount factor approaches unity. Second, for the average cost setting, unbiased algorithms exist only in special cases. It is shown that the gradient of any of these value functions admits a representation that lends itself to algorithm design. Based on this result, the new differential TD method is obtained for Markovian models on Euclidean space with smooth dynamics. Numerical examples show remarkable improvements in performance. In application to speed scaling, variance is reduced by two orders of magnitude.

Submitted 23 December, 2018; v1 submitted 6 April, 2016; originally announced April 2016.

MSC Class: 93E20; 93E35; 60J20

arXiv:1502.03762 [pdf, other]

Rationally inattentive control of Markov processes

Authors: Ehsan Shafieepoorfard, Maxim Raginsky, Sean P. Meyn

Abstract: The article poses a general model for optimal control subject to information constraints, motivated in part by recent work of Sims and others on information-constrained decision-making by economic agents. In the average-cost optimal control framework, the general model introduced in this paper reduces to a variant of the linear-programming representation of the average-cost optimal control problem, subject to an additional mutual information constraint on the randomized stationary policy. The resulting optimization problem is convex and admits a decomposition based on the Bellman error, which is the object of study in approximate dynamic programming. The theory is illustrated through the example of information-constrained linear-quadratic-Gaussian (LQG) control problem. Some results on the infinite-horizon discounted-cost criterion are also presented.

Submitted 23 February, 2016; v1 submitted 12 February, 2015; originally announced February 2015.

Comments: 30 pages, 2 figures; accepted to SIAM Journal on Control and Optimization

MSC Class: 94A34; 90C40; 90C47

arXiv:1010.4820 [pdf, ps, other]

Random-Time, State-Dependent Stochastic Drift for Markov Chains and Application to Stochastic Stabilization Over Erasure Channels

Authors: Serdar Yüksel, Sean P. Meyn

Abstract: It is known that state-dependent, multi-step Lyapunov bounds lead to greatly simplified verification theorems for stability for large classes of Markov chain models. This is one component of the "fluid model" approach to stability of stochastic networks. In this paper we extend the general theory to randomized multi-step Lyapunov theory to obtain criteria for stability and steady-state performance bounds, such as finite moments. These results are applied to a remote stabilization problem, in which a controller receives measurements from an erasure channel with limited capacity. Based on the general results in the paper it is shown that stability of the closed loop system is assured provided that the channel capacity is greater than the logarithm of the unstable eigenvalue, plus an additional correction term. The existence of a finite second moment in steady-state is established under additional conditions.

Submitted 17 May, 2012; v1 submitted 22 October, 2010; originally announced October 2010.

Comments: To appear in IEEE Transactions on Automatic Control

MSC Class: 93E03; 94A15; 60J05

Showing 1–9 of 9 results for author: Meyn, S P