
Revisiting Step-Size Assumptions in Stochastic Approximation

Abstract: Many machine learning and optimization algorithms are built upon the framework of stochastic approximation (SA), for which the selection of step-size (or learning rate) is essential for success. For the sake of clarity, this paper focuses on the special case $α_n = α_0 n^{-ρ}$ at iteration $n$, with $ρ\in [0,1]$ and $α_0>0$ design parameters. It is most common in practice to take $ρ=0$ (constant step-size), while in more theoretically oriented papers a vanishing step-size is preferred. In particular, with $ρ\in (1/2, 1)$ it is known that on applying the averaging technique of Polyak and Ruppert, the mean-squared error (MSE) converges at the optimal rate of $O(1/n)$ and the covariance in the central limit theorem (CLT) is minimal in a precise sense. The paper revisits step-size selection in a general Markovian setting. Under readily verifiable assumptions, the following conclusions are obtained provided $0<ρ<1$: $\bullet$ Parameter estimates converge with probability one, and also in $L_p$ for any $p\ge 1$. $\bullet$ The MSE may converge very slowly for small $ρ$, of order $O(α_n^2)$ even with averaging. $\bullet$ For linear stochastic approximation the source of slow convergence is identified: for any $ρ\in (0,1)$, averaging results in estimates for which the error $\textit{covariance}$ vanishes at the optimal rate, and moreover the CLT covariance is optimal in the sense of Polyak and Ruppert. However, necessary and sufficient conditions are obtained under which the $\textit{bias}$ converges to zero at rate $O(α_n)$. This is the first paper to obtain such strong conclusions while allowing for $ρ\le 1/2$. A major conclusion is that the choice of $ρ=0$ or even $ρ<1/2$ is justified only in select settings -- In general, bias may preclude fast convergence.

Submitted 3 June, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

Comments: 30 pages, 5 figures

MSC Class: 62L20; 68T05
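
The step-size rule $α_n = α_0 n^{-ρ}$ and Polyak-Ruppert averaging described in the abstract of "Revisiting Step-Size Assumptions in Stochastic Approximation" can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the scalar linear recursion, the i.i.d. Gaussian noise (the paper treats a general Markovian setting), the target `theta_star`, and the function name `sa_with_averaging`.

```python
import numpy as np


def sa_with_averaging(alpha0, rho, n_iter, theta_star=2.0, seed=0):
    """Scalar linear SA with step-size alpha_n = alpha0 * n**(-rho).

    Hypothetical toy recursion (not from the paper):
        theta_n = theta_{n-1} + alpha_n * (theta_star - theta_{n-1} + noise_n)
    Returns the final iterate and the Polyak-Ruppert average of all iterates.
    """
    rng = np.random.default_rng(seed)
    theta = 0.0          # initial parameter estimate
    running_sum = 0.0    # accumulates iterates for the average
    for n in range(1, n_iter + 1):
        alpha_n = alpha0 * n ** (-rho)      # rho = 0 gives a constant step-size
        noise = rng.standard_normal()       # i.i.d. noise: a simplifying assumption
        theta += alpha_n * (theta_star - theta + noise)
        running_sum += theta
    return theta, running_sum / n_iter


if __name__ == "__main__":
    last, averaged = sa_with_averaging(alpha0=0.5, rho=0.7, n_iter=20000)
    print("last iterate:", last)
    print("averaged estimate:", averaged)
```

Because the toy noise is i.i.d. and additive, this sketch does not reproduce the bias effects the abstract attributes to Markovian disturbances; it only shows where $ρ$, $α_0$, and the averaging step enter the recursion.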

arXiv:2309.02944 [pdf, other]

The Curse of Memory in Stochastic Approximation: Extended Version

Authors: Caio Kalil Lauand, Sean Meyn

Abstract: Theory and application of stochastic approximation (SA) has grown within the control systems community since the earliest days of adaptive control. This paper takes a new look at the topic, motivated by recent results establishing remarkable performance of SA with (sufficiently small) constant step-size $α>0$. If averaging is implemented to obtain the final parameter estimate, then the estimates are asymptotically unbiased with nearly optimal asymptotic covariance. These results have been obtained for random linear SA recursions with i.i.d. coefficients. This paper obtains very different conclusions in the more common case of geometrically ergodic Markovian disturbance: (i) The $\textit{target bias}$ is identified, even in the case of non-linear SA, and is in general non-zero. The remaining results are established for linear SA recursions: (ii) the bivariate parameter-disturbance process is geometrically ergodic in a topological sense; (iii) the representation for bias has a simpler form in this case, and cannot be expected to be zero if there is multiplicative noise; (iv) the asymptotic covariance of the averaged parameters is within $O(α)$ of optimal. The error term is identified, and may be massive if mean dynamics are not well conditioned. The theory is illustrated with application to TD-learning.

Submitted 17 September, 2023; v1 submitted 6 September, 2023; originally announced September 2023.

Comments: 21 pages, 4 figures

MSC Class: 62L20; 68T05

arXiv:2002.10301 [pdf, other]

Q-learning with Uniformly Bounded Variance: Large Discounting is Not a Barrier to Fast Learning

Authors: Adithya M. Devraj, Sean P. Meyn

Abstract: Sample complexity bounds are a common performance metric in the Reinforcement Learning literature. In the discounted cost, infinite horizon setting, all of the known bounds have a factor that is a polynomial in $1/(1-γ)$, where $γ< 1$ is the discount factor. For a large discount factor, these bounds seem to imply that a very large number of samples is required to achieve an $\varepsilon$-optimal policy. The objective of the present work is to introduce a new class of algorithms that have sample complexity uniformly bounded for all $γ< 1$. One may argue that this is impossible, due to a recent min-max lower bound. The explanation is that this previous lower bound is for a specific problem, which we modify, without compromising the ultimate objective of obtaining an $\varepsilon$-optimal policy. Specifically, we show that the asymptotic covariance of the Q-learning algorithm with an optimized step-size sequence is a quadratic function of $1/(1-γ)$; an expected, and essentially known result. The new relative Q-learning algorithm proposed here is shown to have asymptotic covariance that is a quadratic in $1/(1- ρ^* γ)$, where $1 - ρ^* > 0$ is an upper bound on the spectral gap of an optimal transition matrix.

Submitted 7 July, 2020; v1 submitted 24 February, 2020; originally announced February 2020.

Comments: 33 pages, 4 figures

arXiv:2002.02584 [pdf, other]

Explicit Mean-Square Error Bounds for Monte-Carlo and Linear Stochastic Approximation

Authors: Shuhang Chen, Adithya M. Devraj, Ana Bušić, Sean Meyn

Abstract: This paper concerns error bounds for recursive equations subject to Markovian disturbances. Motivating examples abound within the fields of Markov chain Monte Carlo (MCMC) and Reinforcement Learning (RL), and many of these algorithms can be interpreted as special cases of stochastic approximation (SA). It is argued that it is not possible in general to obtain a Hoeffding bound on the error sequence, even when the underlying Markov chain is reversible and geometrically ergodic, such as the M/M/1 queue. This is motivation for the focus on mean square error bounds for parameter estimates. It is shown that mean square error achieves the optimal rate of $O(1/n)$, subject to conditions on the step-size sequence. Moreover, the exact constants in the rate are obtained, which is of great value in algorithm design.

Submitted 6 February, 2020; originally announced February 2020.

arXiv:1910.05405 [pdf, other]

Zap Q-Learning With Nonlinear Function Approximation

Authors: Shuhang Chen, Adithya M. Devraj, Fan Lu, Ana Bušić, Sean P. Meyn

Abstract: Zap Q-learning is a recent class of reinforcement learning algorithms, motivated primarily as a means to accelerate convergence. Stability theory has been absent outside of two restrictive classes: the tabular setting, and optimal stopping. This paper introduces a new framework for analysis of a more general class of recursive algorithms known as stochastic approximation. Based on this general theory, it is shown that Zap Q-learning is consistent under a non-degeneracy assumption, even when the function approximation architecture is nonlinear. Zap Q-learning with neural network function approximation emerges as a special case, and is tested on examples from OpenAI Gym. Based on multiple experiments with a range of neural network sizes, it is found that the new algorithms converge quickly and are robust to choice of function approximation architecture.

Submitted 15 July, 2020; v1 submitted 11 October, 2019; originally announced October 2019.

arXiv:1812.11137 [pdf, other]

Differential Temporal Difference Learning

Authors: Adithya M. Devraj, Ioannis Kontoyiannis, Sean P. Meyn

Abstract: Value functions derived from Markov decision processes arise as a central component of algorithms as well as performance metrics in many statistics and engineering applications of machine learning techniques. Computation of the solution to the associated Bellman equations is challenging in most practical cases of interest. A popular class of approximation techniques, known as Temporal Difference (TD) learning algorithms, are an important sub-class of general reinforcement learning methods. The algorithms introduced in this paper are intended to resolve two well-known difficulties of TD-learning approaches: Their slow convergence due to very high variance, and the fact that, for the problem of computing the relative value function, consistent algorithms exist only in special cases. First we show that the gradients of these value functions admit a representation that lends itself to algorithm design. Based on this result, a new class of differential TD-learning algorithms is introduced. For Markovian models on Euclidean space with smooth dynamics, the algorithms are shown to be consistent under general conditions. Numerical results show dramatic variance reduction when compared to standard methods.

Submitted 27 February, 2020; v1 submitted 28 December, 2018; originally announced December 2018.

Comments: Preliminary versions of some of the results in this article were submitted as arXiv:1604.01828

MSC Class: 93E20; 93E35; 60J20

arXiv:1808.01665 [pdf, other]

Diffusion approximations and control variates for MCMC

Authors: Nicolas Brosse, Alain Durmus, Sean Meyn, Eric Moulines, Anand Radhakrishnan

Abstract: A new methodology is presented for the construction of control variates to reduce the variance of additive functionals of Markov Chain Monte Carlo (MCMC) samplers. Our control variates are defined through the minimization of the asymptotic variance of the Langevin diffusion over a family of functions, which can be seen as a quadratic risk minimization procedure. The use of these control variates is theoretically justified. We show that the asymptotic variances of some well-known MCMC algorithms, including the Random Walk Metropolis and the (Metropolis) Unadjusted/Adjusted Langevin Algorithm, are close to the asymptotic variance of the Langevin diffusion. Several examples of Bayesian inference problems demonstrate that the corresponding reduction in the variance is significant.

Submitted 8 July, 2019; v1 submitted 5 August, 2018; originally announced August 2018.

arXiv:1609.00051 [pdf, other]

Estimation and Control of Quality of Service in Demand Dispatch

Authors: Yue Chen, Ana Bušić, Sean Meyn

Abstract: It is now well known that flexibility of energy consumption can be harnessed for the purposes of grid-level ancillary services. In particular, through distributed control of a collection of loads, a balancing authority regulation signal can be tracked accurately, while ensuring that the quality of service (QoS) for each load is acceptable $\textit{on average}$. In this paper it is argued that a histogram of QoS is approximately Gaussian, and consequently each load will eventually receive poor service. Statistical techniques are developed to estimate the mean and variance of QoS as a function of the power spectral density of the regulation signal. It is also shown that additional local control can eliminate risk: The histogram of QoS is $\textit{truncated}$ through this local control, so that strict bounds on service quality are guaranteed. While there is a tradeoff between the grid-level tracking performance (capacity and accuracy) and the bounds imposed on QoS, it is found that the loss of capacity is minor in typical cases.

Submitted 31 August, 2016; originally announced September 2016.

Comments: Submitted for publication, August 2016. arXiv admin note: text overlap with arXiv:1409.6941

MSC Class: 60J20; 68M20

arXiv:1604.04013 [pdf, other]

doi 10.1214/17-AAP1300

Ergodic Theory for Controlled Markov Chains with Stationary Inputs

Authors: Yue Chen, Ana Bušić, Sean Meyn

Abstract: Consider a stochastic process $\{X(t)\}$ on a finite state space ${\sf X}=\{1,\dots, d\}$. It is conditionally Markov, given a real-valued `input process' $\{ζ(t)\}$. This is assumed to be small, which is modeled through the scaling, \[ ζ_t = \varepsilon ζ^1_t, \qquad 0\le \varepsilon \le 1\,, \] where $\{ζ^1(t)\}$ is a bounded stationary process. The following conclusions are obtained, subject to smoothness assumptions on the controlled transition matrix and a mixing condition on $\{ζ(t)\}$: (i) A stationary version of the process is constructed, that is coupled with a stationary version of the Markov chain $\{X^\bullet(t)\}$ obtained with $\{ζ(t)\}\equiv 0$. The triple $(\{X(t)\}, \{X^\bullet(t)\},\{ζ(t)\})$ is a jointly stationary process satisfying \[ {\sf P}\{X(t) \neq X^\bullet(t)\} = O(\varepsilon) \] Moreover, a second-order Taylor-series approximation is obtained: \[ {\sf P}\{X(t) =i \} ={\sf P}\{X^\bullet(t) =i \} + \varepsilon^2 \varrho(i) + o(\varepsilon^2),\quad 1\le i\le d, \] with an explicit formula for the vector $\varrho\in\mathbb{R}^d$. (ii) For any $m\ge 1$ and any function $f\colon \{1,\dots,d\}\times \mathbb{R}\to\mathbb{R}^m$, the stationary stochastic process $Y(t) = f(X(t),ζ(t))$ has a power spectral density $\text{S}_f$ that admits a second order Taylor series expansion: A function $\text{S}^{(2)}_f\colon [-π,π] \to \mathbb{C}^{ m\times m}$ is constructed such that \[ \text{S}_f(θ) = \text{S}^\bullet_f(θ) + \varepsilon^2 \text{S}_f^{(2)}(θ) + o(\varepsilon^2),\quad θ\in [-π,π] . \] An explicit formula for the function $\text{S}_f^{(2)}$ is obtained, based in part on the bounds in (i). The results are illustrated using a version of the timing channel of Anantharam and Verdu.

Submitted 18 June, 2016; v1 submitted 13 April, 2016; originally announced April 2016.

MSC Class: 60J20; 60G10; 68M20; 94A15

arXiv:math/0612040 [pdf, ps, other]

doi 10.1214/00-AAP492

Computable exponential bounds for screened estimation and simulation

Authors: Ioannis Kontoyiannis, Sean P. Meyn

Abstract: Suppose the expectation $E(F(X))$ is to be estimated by the empirical averages of the values of $F$ on independent and identically distributed samples $\{X_i\}$. A sampling rule called the "screened" estimator is introduced, and its performance is studied. When the mean $E(U(X))$ of a different function $U$ is known, the estimates are "screened," in that we only consider those which correspond to times when the empirical average of the $\{U(X_i)\}$ is sufficiently close to its known mean. As long as $U$ dominates $F$ appropriately, the screened estimates admit exponential error bounds, even when $F(X)$ is heavy-tailed. The main results are several nonasymptotic, explicit exponential bounds for the screened estimates. A geometric interpretation, in the spirit of Sanov's theorem, is given for the fact that the screened estimates always admit exponential error bounds, even if the standard estimates do not. And when they do, the screened estimates' error probability has a significantly better exponent. This implies that screening can be interpreted as a variance reduction technique. Our main mathematical tools come from large deviations techniques. The results are illustrated by a detailed simulation example.

Submitted 22 August, 2008; v1 submitted 1 December, 2006; originally announced December 2006.

Comments: Published in at http://dx.doi.org/10.1214/00-AAP492 the Annals of Applied Probability (http://www.imstat.org/aap/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AAP-AAP492 MSC Class: 60C05; 60F10 (Primary) 60G05; 60E15 (Secondary)

Journal ref: Annals of Applied Probability 2008, Vol. 18, No. 4, 1491-1518
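
The screening rule described in the abstract of "Computable exponential bounds for screened estimation and simulation" can be sketched as follows. This is only one reading of the abstract's description: the function name `screened_estimates`, the tolerance parameter `tol`, and the choice of a symmetric window around the known mean are all assumptions, and the paper's actual screening condition may differ.

```python
import numpy as np


def screened_estimates(x, F, U, mean_U, tol):
    """Running averages of F, kept only at 'screened' times.

    A time n is kept when the empirical average of U(X_1..X_n) lies
    within `tol` of the known mean `mean_U` (hypothetical symmetric window).
    F and U must accept a NumPy array and return an array of the same shape.
    Returns (times, estimates) for the kept times.
    """
    x = np.asarray(x, dtype=float)
    n = np.arange(1, x.size + 1)
    avg_F = np.cumsum(F(x)) / n          # standard estimates of E[F(X)]
    avg_U = np.cumsum(U(x)) / n          # empirical averages of U
    keep = np.abs(avg_U - mean_U) <= tol # screening condition
    return n[keep], avg_F[keep]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    samples = rng.exponential(scale=1.0, size=1000)  # E[X] = 1, E[X^2] = 2
    times, estimates = screened_estimates(
        samples, F=lambda v: v, U=lambda v: v ** 2, mean_U=2.0, tol=0.2
    )
    print("screened times kept:", times.size)
```

The exponential samples in the usage example are chosen only because $E(X^2)$ is known in closed form; they are not heavy-tailed, so the example shows the mechanics of screening and not the error bounds claimed in the abstract.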

Showing 1–10 of 10 results for author: Meyn, S