Almost sure convergence of stochastic Hamiltonian descent methods

Authors: Måns Williamson, Tony Stillfjord

Abstract: Gradient normalization and soft clipping are two popular techniques for tackling instability issues and improving convergence of stochastic gradient descent (SGD) with momentum. In this article, we study these types of methods through the lens of dissipative Hamiltonian systems. Gradient normalization and certain types of soft clipping algorithms can be seen as (stochastic) implicit-explicit Euler discretizations of dissipative Hamiltonian systems, where the kinetic energy function determines the type of clipping that is applied. We make use of unified theory from dynamical systems to show that all of these schemes converge almost surely to stationary points of the objective function.

Submitted 24 June, 2024; originally announced June 2024.

arXiv:2406.16640 [pdf, other]

Analysis of a Class of Stochastic Component-Wise Soft-Clipping Schemes

Authors: Måns Williamson, Monika Eisenmann, Tony Stillfjord

Abstract: Choosing the optimization algorithm that performs best on a given machine learning problem is often delicate, and there is no guarantee that current state-of-the-art algorithms will perform well across all tasks. Consequently, the more reliable methods that one has at hand, the larger the likelihood of a good end result. To this end, we introduce and analyze a large class of stochastic so-called soft-clipping schemes with a broad range of applications. Despite the wide adoption of clipping techniques in practice, soft-clipping methods have not been analyzed to a large extent in the literature. In particular, a rigorous mathematical analysis is lacking in the general, nonlinear case. Our analysis lays a theoretical foundation for a large class of such schemes, and motivates their usage. In particular, under standard assumptions such as Lipschitz continuous gradients of the objective function, we give rigorous proofs of convergence in expectation. These include rates in both the convex and the non-convex case, as well as almost sure convergence to a stationary point in the non-convex case. The computational cost of the analyzed schemes is essentially the same as that of stochastic gradient descent.

Submitted 24 June, 2024; originally announced June 2024.
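
Illustration (not from the paper): the abstract above does not state an update formula, so the following is a minimal sketch under an assumed soft-clipping function, x_i <- x_i - lr * g_i / (1 + lr * |g_i|), applied to each component separately. The toy objective, the noise level, and all function names are hypothetical choices made for this example only.

```python
import random


def soft_clip_step(x, grad, lr):
    """One component-wise soft-clipped gradient step.

    Assumed update (illustrative): x_i <- x_i - lr * g_i / (1 + lr * |g_i|).
    Each component moves by less than 1, however large g_i is.
    """
    return [xi - lr * gi / (1.0 + lr * abs(gi)) for xi, gi in zip(x, grad)]


def noisy_grad(x, rng, curvatures=(1.0, 100.0), noise=0.1):
    """Noisy gradient of the toy objective 0.5 * sum(c_i * x_i**2)."""
    return [c * xi + rng.gauss(0.0, noise) for c, xi in zip(curvatures, x)]


def run_soft_clipped_sgd(steps=2000, seed=0):
    rng = random.Random(seed)
    x = [5.0, 5.0]
    for k in range(1, steps + 1):
        lr = 1.0 / k  # decreasing step size
        x = soft_clip_step(x, noisy_grad(x, rng), lr)
    return x
```

The per-step cost is one gradient evaluation plus a few element-wise operations, which is consistent with the abstract's remark that the cost is essentially that of stochastic gradient descent.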

arXiv:2402.13656 [pdf, other]

Numerical methods for closed-loop systems with non-autonomous data

Authors: B. Baran, P. Benner, J. Saak, T. Stillfjord

Abstract: By computing a feedback control via the linear quadratic regulator (LQR) approach and simulating a non-linear non-autonomous closed-loop system using this feedback, we combine two numerically challenging tasks. For the first task, the computation of the feedback control, we use the non-autonomous generalized differential Riccati equation (DRE), whose solution determines the time-varying feedback gain matrix. Regarding the second task, we want to be able to simulate non-linear closed-loop systems for which it is known that the regulator is only valid for sufficiently small perturbations. Thus, one easily runs into numerical issues in the integrators when the closed-loop control varies greatly. For these systems, e.g., the A-stable implicit Euler method fails. On the one hand, we implement non-autonomous versions of splitting schemes and BDF methods for the solution of our non-autonomous DREs. These are well-established DRE solvers in the autonomous case. On the other hand, to tackle the numerical issues in the simulation of the non-linear closed-loop system, we apply a fractional-step-theta scheme with time-adaptivity tuned specifically to this kind of challenge. That is, we additionally base the time-adaptivity on the activity of the control. We compare this approach to the more classical error-based time-adaptivity. We describe techniques to make these two tasks computable in a reasonable amount of time and are able to simulate closed-loop systems with strongly varying controls, while avoiding numerical issues. Our time-adaptivity approach requires fewer time steps than the error-based alternative and is more reliable.

Submitted 21 February, 2024; originally announced February 2024.

MSC Class: 65F45; 93A15; 93B52; 93C10

arXiv:2310.13462 [pdf, other]

Computing the matrix exponential and the Cholesky factor of a related finite horizon Gramian

Authors: Tony Stillfjord, Filip Tronarp

Abstract: In this article, an efficient numerical method for computing finite-horizon controllability Gramians in Cholesky-factored form is proposed. The method is applicable to general dense matrices of moderate size and produces a Cholesky factor of the Gramian without computing the full product. In contrast to other methods applicable to this task, the proposed method is a generalization of the scaling-and-squaring approach for approximating the matrix exponential. It exploits a similar doubling formula for the Gramian, and thereby keeps the required computational effort modest. Most importantly, a rigorous backward error analysis is provided, which guarantees that the approximation is accurate to the round-off error level in double precision. This accuracy is illustrated in practice on a large number of standard test examples. The method has been implemented in the Julia package FiniteHorizonGramians.jl, which is available online under the MIT license. Code for reproducing the experimental results is included in this package, as well as code for determining the optimal method parameters. The analysis can thus easily be adapted to a different finite-precision arithmetic.

Submitted 30 April, 2024; v1 submitted 20 October, 2023; originally announced October 2023.

arXiv:2210.05375 [pdf, other]

A randomized operator splitting scheme inspired by stochastic optimization methods

Authors: Monika Eisenmann, Tony Stillfjord

Abstract: In this paper, we combine the operator splitting methodology for abstract evolution equations with that of stochastic methods for large-scale optimization problems. The combination results in a randomized splitting scheme, which in a given time step does not necessarily use all the parts of the split operator. This is in contrast to deterministic splitting schemes which always use every part at least once, and often several times. As a result, the computational cost can be significantly decreased in comparison to such methods. We rigorously define a randomized operator splitting scheme in an abstract setting and provide an error analysis where we prove that the temporal convergence order of the scheme is at least 1/2. We illustrate the theory by numerical experiments on both linear and quasilinear diffusion problems, using a randomized domain decomposition approach. We conclude that choosing the randomization in certain ways may improve the order to 1. This is as accurate as applying e.g. backward (implicit) Euler to the full problem, without splitting.

Submitted 11 October, 2022; originally announced October 2022.

MSC Class: 65M12 (Primary) 65C99; 90C15; 65M55 (Secondary)

arXiv:2201.12782 [pdf, other]

SRKCD: a stabilized Runge-Kutta method for stochastic optimization

Authors: Tony Stillfjord, Måns Williamson

Abstract: We introduce a family of stochastic optimization methods based on the Runge-Kutta-Chebyshev (RKC) schemes. The RKC methods are explicit methods originally designed for solving stiff ordinary differential equations by ensuring that their stability regions are of maximal size. In the optimization context, this allows for larger step sizes (learning rates) and better robustness compared to e.g. the popular stochastic gradient descent method. Our main contribution is a convergence proof for essentially all stochastic Runge-Kutta optimization methods. This shows convergence in expectation with an optimal sublinear rate under standard assumptions of strong convexity and Lipschitz-continuous gradients. For non-convex objectives, we get convergence to zero in expectation of the gradients. The proof requires certain natural conditions on the Runge-Kutta coefficients, and we further demonstrate that the RKC schemes satisfy these. Finally, we illustrate the improved stability properties of the methods in practice by performing numerical experiments on both a small-scale test example and on a problem arising from an image classification application in machine learning.

Submitted 30 January, 2022; originally announced January 2022.

MSC Class: 90C15; 65K05; 65L20

arXiv:2106.09286 [pdf, other]

Sub-linear convergence of a tamed stochastic gradient descent method in Hilbert space

Authors: Monika Eisenmann, Tony Stillfjord

Abstract: In this paper, we introduce the tamed stochastic gradient descent method (TSGD) for optimization problems. Inspired by the tamed Euler scheme, which is a commonly used method within the context of stochastic differential equations, TSGD is an explicit scheme that exhibits stability properties similar to those of implicit schemes. As its computational cost is essentially equivalent to that of the well-known stochastic gradient descent method (SGD), it constitutes a very competitive alternative to such methods. We rigorously prove (optimal) sub-linear convergence of the scheme for strongly convex objective functions on an abstract Hilbert space. The analysis only requires very mild step size restrictions, which illustrates the good stability properties. The analysis is based on a priori estimates more frequently encountered in a time integration context than in optimization, and this alternative approach provides a different perspective also on the convergence of SGD. Finally, we demonstrate the usability of the scheme on a problem arising in a context of supervised learning.

Submitted 17 June, 2021; originally announced June 2021.

MSC Class: 46N10; 65K10; 90C15
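
Illustration (not from the paper): to make the stability claim in the abstract above concrete, the sketch below compares a plain SGD step with a tamed step on a stiff toy quadratic. The taming formula, which divides the increment by 1 + lr * ||g|| in analogy with the tamed Euler scheme, is an assumption, as are the test problem, the step size 1/k, and every name in the code.

```python
import math
import random


def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))


def sgd_step(x, g, lr):
    """Plain explicit step: x - lr * g."""
    return [xi - lr * gi for xi, gi in zip(x, g)]


def tamed_step(x, g, lr):
    """Assumed taming: the whole increment is divided by 1 + lr * ||g||."""
    scale = lr / (1.0 + lr * norm(g))
    return [xi - scale * gi for xi, gi in zip(x, g)]


def peak_and_final_norm(step, steps=2000, seed=0):
    """Run `step` on a stiff toy quadratic; return (largest norm seen, final norm)."""
    rng = random.Random(seed)
    curvatures = (1.0, 99.5)  # hypothetical stiff problem
    x = [1.0, 1.0]
    peak = 0.0
    for k in range(1, steps + 1):
        g = [c * xi + rng.gauss(0.0, 0.1) for c, xi in zip(curvatures, x)]
        x = step(x, g, 1.0 / k)  # same decreasing step size for both methods
        peak = max(peak, norm(x))
    return peak, norm(x)
```

Calling `peak_and_final_norm(sgd_step)` and `peak_and_final_norm(tamed_step)` with the same seed lets one compare how far the iterates stray before the step size becomes small enough for the explicit method; the tamed increment has norm below 1 by construction.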

arXiv:2010.12348 [pdf, other]

Sub-linear convergence of a stochastic proximal iteration method in Hilbert space

Authors: Monika Eisenmann, Tony Stillfjord, Måns Williamson

Abstract: We consider a stochastic version of the proximal point algorithm for optimization problems posed on a Hilbert space. A typical application of this is supervised learning. While the method is not new, it has not been extensively analyzed in this form. Indeed, most related results are confined to the finite-dimensional setting, where error bounds could depend on the dimension of the space. On the other hand, the few existing results in the infinite-dimensional setting only prove very weak types of convergence, owing to weak assumptions on the problem. In particular, there are no results that show convergence with a rate. In this article, we bridge these two worlds by assuming more regularity of the optimization problem, which allows us to prove convergence with an (optimal) sub-linear rate also in an infinite-dimensional setting. In particular, we assume that the objective function is the expected value of a family of convex differentiable functions. While we require that the full objective function is strongly convex, we do not assume that its constituent parts are so. Further, we require that the gradient satisfies a weak local Lipschitz continuity property, where the Lipschitz constant may grow polynomially given certain guarantees on the variance and higher moments near the minimum. We illustrate these results by discretizing a concrete infinite-dimensional classification problem with varying degrees of accuracy.

Submitted 27 September, 2021; v1 submitted 23 October, 2020; originally announced October 2020.

Comments: Adjusted the setting, corrected minor typos and made several arguments and motivations more clear

MSC Class: 46N10; 65K10; 90C15
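
Illustration (not from the paper): a minimal finite-dimensional sketch of a stochastic proximal iteration, assuming a toy least-squares learning problem in which each sampled function is f(y) = 0.5 * (a.y - b)^2. For this particular f the proximal step has a closed form, so no inner solver is needed. Each sampled f is convex but not strongly convex, while its expectation is, which mirrors the assumption described in the abstract above. The data model, the step size 1/k, and all names are hypothetical.

```python
import random


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def prox_step(x, a, b, lr):
    """Exact proximal step for f(y) = 0.5 * (a.y - b)**2.

    Minimizes f(y) + ||y - x||^2 / (2 * lr); for this f the minimizer is
    y = x - lr * (a.x - b) / (1 + lr * ||a||^2) * a.
    """
    coeff = lr * (dot(a, x) - b) / (1.0 + lr * dot(a, a))
    return [xi - coeff * ai for xi, ai in zip(x, a)]


def run_stochastic_proximal(w_true=(2.0, -1.0), steps=5000, seed=0):
    rng = random.Random(seed)
    x = [0.0] * len(w_true)
    for k in range(1, steps + 1):
        a = [rng.gauss(0.0, 1.0) for _ in w_true]  # random feature vector
        b = dot(a, w_true)                          # noise-free label
        x = prox_step(x, a, b, 1.0 / k)             # implicit (proximal) update
    return x
```

The implicit character shows up in the identity y = x - lr * (a.y - b) * a: the gradient is evaluated at the new point y rather than at x.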

arXiv:2006.05370 [pdf, ps, other]

doi 10.1093/imanum/drab033

A linear implicit Euler method for the finite element discretization of a controlled stochastic heat equation

Authors: Peter Benner, Tony Stillfjord, Christoph Trautwein

Abstract: We consider a numerical approximation of a linear quadratic control problem constrained by the stochastic heat equation with non-homogeneous Neumann boundary conditions. This involves a combination of distributed and boundary control, as well as both distributed and boundary noise. We apply the finite element method for the spatial discretization and the linear implicit Euler method for the temporal discretization. Due to the low regularity induced by the boundary noise, convergence orders above 1/2 in space and 1/4 in time cannot be expected. We prove such optimal convergence orders for our full discretization when the distributed noise and the initial condition are sufficiently smooth. Under less smooth conditions, the convergence order is further decreased. Our results only assume that the related (deterministic) differential Riccati equation can be approximated with a certain convergence order, which is easy to achieve in practice. We confirm these theoretical results through a numerical experiment in a two-dimensional domain.

Submitted 9 June, 2020; originally announced June 2020.

MSC Class: 65M12 (Primary) 65C30; 49N10 (Secondary)

arXiv:1805.08990 [pdf, ps, other]

GPU acceleration of splitting schemes applied to differential matrix equations

Authors: Hermann Mena, Lena-Maria Pfurtscheller, Tony Stillfjord

Abstract: We consider differential Lyapunov and Riccati equations, and generalized versions thereof. Such equations arise in many different areas and are especially important within the field of optimal control. In order to approximate their solution, one may use several different kinds of numerical methods. Of these, splitting schemes are often a very competitive choice. In this article, we investigate the use of graphical processing units (GPUs) to parallelize such schemes and thereby further increase their effectiveness. According to our numerical experiments, large speed-ups are often observed for sufficiently large matrices. We also provide a comparison between different splitting strategies, demonstrating that splitting the equations into a moderate number of subproblems is generally optimal.

Submitted 22 October, 2018; v1 submitted 23 May, 2018; originally announced May 2018.

Comments: 21 pages, 17 figures

MSC Class: 65F30; 65Y05; 65F60

arXiv:1804.02197 [pdf, ps, other]

Singular value decay of operator-valued differential Lyapunov and Riccati equations

Authors: Tony Stillfjord

Abstract: We consider operator-valued differential Lyapunov and Riccati equations, where the operators $B$ and $C$ may be relatively unbounded with respect to $A$ (in the standard notation). In this setting, we prove that the singular values of the solutions decay fast under certain conditions. In fact, the decay is exponential in the negative square root if $A$ generates an analytic semigroup and the range of $C$ has finite dimension. This extends previous similar results for algebraic equations to the differential case. When the initial condition is zero, we also show that the singular values converge to zero as time goes to zero, with a certain rate that depends on the degree of unboundedness of $C$. A fast decay of the singular values corresponds to a low numerical rank, which is a critical feature in large-scale applications. The results reported here provide a theoretical foundation for the observation that, in practice, a low-rank factorization usually exists.

Submitted 25 June, 2018; v1 submitted 6 April, 2018; originally announced April 2018.

Comments: Corrected some misconceptions, which lead to more general results (e.g. exponential stability is no longer required). Also fixed some off-by-one errors, improved the presentation, and added/extended several remarks on possible generalizations. Now 22 pages, 8 figures

MSC Class: 47A62; 47A11; 49N10

arXiv:1706.04380 [pdf, other]

doi 10.1137/17M1134500

Multiscale differential Riccati equations for linear quadratic regulator problems

Authors: Axel Målqvist, Anna Persson, Tony Stillfjord

Abstract: We consider approximations to the solutions of differential Riccati equations in the context of linear quadratic regulator problems, where the state equation is governed by a multiscale operator. Similarly to elliptic and parabolic problems, standard finite element discretizations perform poorly in this setting unless the grid resolves the fine-scale features of the problem. This results in unfeasible amounts of computation and high memory requirements. In this paper, we demonstrate how the localized orthogonal decomposition method may be used to acquire accurate results also for coarse discretizations, at the low cost of solving a series of small, localized elliptic problems. We prove second-order convergence (except for a logarithmic factor) in the $L^2$ operator norm, and first-order convergence in the corresponding energy norm. These results are both independent of the multiscale variations in the state equation. In addition, we provide a detailed derivation of the fully discrete matrix-valued equations, and show how they can be handled in a low-rank setting for large-scale computations. In connection to this, we also show how to efficiently compute the relevant operator-norm errors. Finally, our theoretical results are validated by several numerical experiments.

Submitted 18 June, 2018; v1 submitted 14 June, 2017; originally announced June 2017.

Comments: Accepted for publication in SIAM J. Sci. Comput. This version differs from the previous one only by the addition of Remark 7.2 and minor changes in formatting. 21 pages, 12 figures

MSC Class: 49N10; 65N12; 65N30; 93C20

Journal ref: SIAM J. Sci. Comput. 40(4) (2018), pp. A2406--A2426

arXiv:1612.00677 [pdf, other]

doi 10.1007/s11075-017-0416-8

Adaptive high-order splitting schemes for large-scale differential Riccati equations

Authors: Tony Stillfjord

Abstract: We consider high-order splitting schemes for large-scale differential Riccati equations. Such equations arise in many different areas and are especially important within the field of optimal control. In the large-scale case, it is critical to employ structural properties of the matrix-valued solution, or the computational cost and storage requirements become infeasible. Our main contribution is therefore to formulate these high-order splitting schemes in an efficient way by utilizing a low-rank factorization. Previous results indicated that this was impossible for methods of order higher than 2, but our new approach overcomes these difficulties. In addition, we demonstrate that the proposed methods contain natural embedded error estimates. These may be used e.g. for time step adaptivity, and our numerical experiments in this direction show promising results.

Submitted 26 June, 2017; v1 submitted 2 December, 2016; originally announced December 2016.

Comments: 23 pages, 7 figures

MSC Class: 15A24; 49N10; 65L05; 93A15

Journal ref: Numer. Algor. 78(4) (2018), pp. 1129--1151

arXiv:1606.00345 [pdf, other]

doi 10.1007/s10543-017-0653-1

Finite element convergence analysis for the thermoviscoelastic Joule heating problem

Authors: Axel Målqvist, Tony Stillfjord

Abstract: We consider a system of equations that model the temperature, electric potential and deformation of a thermoviscoelastic body. A typical application is a thermistor, an electrical component that can be used e.g. as a surge protector, temperature sensor or for very precise positioning. We introduce a full discretization based on standard finite elements in space and a semi-implicit Euler-type method in time. For this method we prove optimal convergence orders, i.e. second-order in space and first-order in time. The theoretical results are verified by several numerical experiments in two and three dimensions.

Submitted 20 February, 2017; v1 submitted 1 June, 2016; originally announced June 2016.

Comments: 20 pages, 6 figures, 2 tables

MSC Class: 65M12; 65M60; 74D05; 74H15

Journal ref: BIT 57(3) (2017), pp. 787--810

Showing 1–14 of 14 results for author: Stillfjord, T