Search | arXiv e-print repository

arXiv:2407.08100 [pdf, ps, other]

Non-convergence of Adam and other adaptive stochastic gradient descent optimization methods for non-vanishing learning rates

Authors: Steffen Dereich, Robin Graeber, Arnulf Jentzen

Abstract: Deep learning algorithms - typically consisting of a class of deep neural networks trained by a stochastic gradient descent (SGD) optimization method - are nowadays the key ingredients in many artificial intelligence (AI) systems and have revolutionized our ways of working and living in modern societies. For example, SGD methods are used to train powerful large language models (LLMs) such as versi… ▽ More Deep learning algorithms - typically consisting of a class of deep neural networks trained by a stochastic gradient descent (SGD) optimization method - are nowadays the key ingredients in many artificial intelligence (AI) systems and have revolutionized our ways of working and living in modern societies. For example, SGD methods are used to train powerful large language models (LLMs) such as versions of ChatGPT and Gemini, SGD methods are employed to create successful generative AI based text-to-image creation models such as Midjourney, DALL-E, and Stable Diffusion, but SGD methods are also used to train DNNs to approximately solve scientific models such as partial differential equation (PDE) models from physics and biology and optimal control and stop** problems from engineering. It is known that the plain vanilla standard SGD method fails to converge even in the situation of several convex optimization problems if the learning rates are bounded away from zero. However, in many practical relevant training scenarios, often not the plain vanilla standard SGD method but instead adaptive SGD methods such as the RMSprop and the Adam optimizers, in which the learning rates are modified adaptively during the training process, are employed. This naturally rises the question whether such adaptive optimizers, in which the learning rates are modified adaptively during the training process, do converge in the situation of non-vanishing learning rates. In this work we answer this question negatively by proving that adaptive SGD methods such as the popular Adam optimizer fail to converge to any possible random limit point if the learning rates are asymptotically bounded away from zero. In our proof of this non-convergence result we establish suitable pathwise a priori bounds for a class of accelerated and adaptive SGD methods, which are also of independent interest. △ Less

Submitted 10 July, 2024; originally announced July 2024.

Comments: 54 pages

MSC Class: 60J22 (Primary); 65K10; 60J20; 65C40 (Secondary) ACM Class: G.1.6; F.2.0; G.3

arXiv:2406.14340 [pdf, other]

Learning rate adaptive stochastic gradient descent optimization methods: numerical simulations for deep learning methods for partial differential equations and convergence analyses

Authors: Steffen Dereich, Arnulf Jentzen, Adrian Riekert

Abstract: It is known that the standard stochastic gradient descent (SGD) optimization method, as well as accelerated and adaptive SGD optimization methods such as the Adam optimizer fail to converge if the learning rates do not converge to zero (as, for example, in the situation of constant learning rates). Numerical simulations often use human-tuned deterministic learning rate schedules or small constant… ▽ More It is known that the standard stochastic gradient descent (SGD) optimization method, as well as accelerated and adaptive SGD optimization methods such as the Adam optimizer fail to converge if the learning rates do not converge to zero (as, for example, in the situation of constant learning rates). Numerical simulations often use human-tuned deterministic learning rate schedules or small constant learning rates. The default learning rate schedules for SGD optimization methods in machine learning implementation frameworks such as TensorFlow and Pytorch are constant learning rates. In this work we propose and study a learning-rate-adaptive approach for SGD optimization methods in which the learning rate is adjusted based on empirical estimates for the values of the objective function of the considered optimization problem (the function that one intends to minimize). In particular, we propose a learning-rate-adaptive variant of the Adam optimizer and implement it in case of several neural network learning problems, particularly, in the context of deep learning approximation methods for partial differential equations such as deep Kolmogorov methods, physics-informed neural networks, and deep Ritz methods. In each of the presented learning problems the proposed learning-rate-adaptive variant of the Adam optimizer faster reduces the value of the objective function than the Adam optimizer with the default learning rate. For a simple class of quadratic minimization problems we also rigorously prove that a learning-rate-adaptive variant of the SGD optimization method converges to the minimizer of the considered minimization problem. Our convergence proof is based on an analysis of the laws of invariant measures of the SGD method as well as on a more general convergence analysis for SGD with random but predictable learning rates which we develop in this work. △ Less

Submitted 20 June, 2024; originally announced June 2024.

Comments: 68 pages, 8 figures

arXiv:2406.10876 [pdf, ps, other]

Deep neural networks with ReLU, leaky ReLU, and softplus activation provably overcome the curse of dimensionality for space-time solutions of semilinear partial differential equations

Authors: Julia Ackermann, Arnulf Jentzen, Benno Kuckuck, Joshua Lee Padgett

Abstract: It is a challenging topic in applied mathematics to solve high-dimensional nonlinear partial differential equations (PDEs). Standard approximation methods for nonlinear PDEs suffer under the curse of dimensionality (COD) in the sense that the number of computational operations of the approximation method grows at least exponentially in the PDE dimension and with such methods it is essentially impo… ▽ More It is a challenging topic in applied mathematics to solve high-dimensional nonlinear partial differential equations (PDEs). Standard approximation methods for nonlinear PDEs suffer under the curse of dimensionality (COD) in the sense that the number of computational operations of the approximation method grows at least exponentially in the PDE dimension and with such methods it is essentially impossible to approximately solve high-dimensional PDEs even when the fastest currently available computers are used. However, in the last years great progress has been made in this area of research through suitable deep learning (DL) based methods for PDEs in which deep neural networks (DNNs) are used to approximate solutions of PDEs. Despite the remarkable success of such DL methods in simulations, it remains a fundamental open problem of research to prove (or disprove) that such methods can overcome the COD in the approximation of PDEs. However, there are nowadays several partial error analysis results for DL methods for high-dimensional nonlinear PDEs in the literature which prove that DNNs can overcome the COD in the sense that the number of parameters of the approximating DNN grows at most polynomially in both the reciprocal of the prescribed approximation accuracy $\varepsilon>0$ and the PDE dimension $d\in\mathbb{N}$. In the main result of this article we prove that for all $T,p\in(0,\infty)$ it holds that solutions $u_d\colon[0,T]\times\mathbb{R}^d\to\mathbb{R}$, $d\in\mathbb{N}$, of semilinear heat equations with Lipschitz continuous nonlinearities can be approximated in the $L^p$-sense on space-time regions without the COD by DNNs with the rectified linear unit (ReLU), the leaky ReLU, or the softplus activation function. In previous articles similar results have been established not for space-time regions but for the solutions $u_d(T,\cdot)$, $d\in\mathbb{N}$, at the terminal time $T$. △ Less

Submitted 16 June, 2024; originally announced June 2024.

Comments: 64 pages. arXiv admin note: text overlap with arXiv:2309.13722, arXiv:2310.20360

MSC Class: 65M15; 65C05; 68T07 (Primary) 60H35 (Secondary)

arXiv:2402.05155 [pdf, other]

Non-convergence to global minimizers for Adam and stochastic gradient descent optimization and constructions of local minimizers in the training of artificial neural networks

Authors: Arnulf Jentzen, Adrian Riekert

Abstract: Stochastic gradient descent (SGD) optimization methods such as the plain vanilla SGD method and the popular Adam optimizer are nowadays the method of choice in the training of artificial neural networks (ANNs). Despite the remarkable success of SGD methods in the ANN training in numerical simulations, it remains in essentially all practical relevant scenarios an open problem to rigorously explain… ▽ More Stochastic gradient descent (SGD) optimization methods such as the plain vanilla SGD method and the popular Adam optimizer are nowadays the method of choice in the training of artificial neural networks (ANNs). Despite the remarkable success of SGD methods in the ANN training in numerical simulations, it remains in essentially all practical relevant scenarios an open problem to rigorously explain why SGD methods seem to succeed to train ANNs. In particular, in most practically relevant supervised learning problems, it seems that SGD methods do with high probability not converge to global minimizers in the optimization landscape of the ANN training problem. Nevertheless, it remains an open problem of research to disprove the convergence of SGD methods to global minimizers. In this work we solve this research problem in the situation of shallow ANNs with the rectified linear unit (ReLU) and related activations with the standard mean square error loss by disproving in the training of such ANNs that SGD methods (such as the plain vanilla SGD, the momentum SGD, the AdaGrad, the RMSprop, and the Adam optimizers) can find a global minimizer with high probability. Even stronger, we reveal in the training of such ANNs that SGD methods do with high probability fail to converge to global minimizers in the optimization landscape. The findings of this work do, however, not disprove that SGD methods succeed to train ANNs since they do not exclude the possibility that SGD methods find good local minimizers whose risk values are close to the risk values of the global minimizers. In this context, another key contribution of this work is to establish the existence of a hierarchical structure of local minimizers with distinct risk values in the optimization landscape of ANN training problems with ReLU and related activations. △ Less

Submitted 7 February, 2024; originally announced February 2024.

Comments: 36 pages

arXiv:2310.20360 [pdf, other]

Mathematical Introduction to Deep Learning: Methods, Implementations, and Theory

Authors: Arnulf Jentzen, Benno Kuckuck, Philippe von Wurstemberger

Abstract: This book aims to provide an introduction to the topic of deep learning algorithms. We review essential components of deep learning algorithms in full mathematical detail including different artificial neural network (ANN) architectures (such as fully-connected feedforward ANNs, convolutional ANNs, recurrent ANNs, residual ANNs, and ANNs with batch normalization) and different optimization algorit… ▽ More This book aims to provide an introduction to the topic of deep learning algorithms. We review essential components of deep learning algorithms in full mathematical detail including different artificial neural network (ANN) architectures (such as fully-connected feedforward ANNs, convolutional ANNs, recurrent ANNs, residual ANNs, and ANNs with batch normalization) and different optimization algorithms (such as the basic stochastic gradient descent (SGD) method, accelerated methods, and adaptive methods). We also cover several theoretical aspects of deep learning algorithms such as approximation capacities of ANNs (including a calculus for ANNs), optimization theory (including Kurdyka-Łojasiewicz inequalities), and generalization errors. In the last part of the book some deep learning approximation methods for PDEs are reviewed including physics-informed neural networks (PINNs) and deep Galerkin methods. We hope that this book will be useful for students and scientists who do not yet have any background in deep learning at all and would like to gain a solid foundation as well as for practitioners who would like to obtain a firmer mathematical understanding of the objects and methods considered in deep learning. △ Less

Submitted 31 October, 2023; originally announced October 2023.

Comments: 601 pages, 36 figures, 45 source codes

MSC Class: 68T07

arXiv:2309.13722 [pdf, ps, other]

Deep neural networks with ReLU, leaky ReLU, and softplus activation provably overcome the curse of dimensionality for Kolmogorov partial differential equations with Lipschitz nonlinearities in the $L^p$-sense

Authors: Julia Ackermann, Arnulf Jentzen, Thomas Kruse, Benno Kuckuck, Joshua Lee Padgett

Abstract: Recently, several deep learning (DL) methods for approximating high-dimensional partial differential equations (PDEs) have been proposed. The interest that these methods have generated in the literature is in large part due to simulations which appear to demonstrate that such DL methods have the capacity to overcome the curse of dimensionality (COD) for PDEs in the sense that the number of computa… ▽ More Recently, several deep learning (DL) methods for approximating high-dimensional partial differential equations (PDEs) have been proposed. The interest that these methods have generated in the literature is in large part due to simulations which appear to demonstrate that such DL methods have the capacity to overcome the curse of dimensionality (COD) for PDEs in the sense that the number of computational operations they require to achieve a certain approximation accuracy $\varepsilon\in(0,\infty)$ grows at most polynomially in the PDE dimension $d\in\mathbb N$ and the reciprocal of $\varepsilon$. While there is thus far no mathematical result that proves that one of such methods is indeed capable of overcoming the COD, there are now a number of rigorous results in the literature that show that deep neural networks (DNNs) have the expressive power to approximate PDE solutions without the COD in the sense that the number of parameters used to describe the approximating DNN grows at most polynomially in both the PDE dimension $d\in\mathbb N$ and the reciprocal of the approximation accuracy $\varepsilon>0$. Roughly speaking, in the literature it is has been proved for every $T>0$ that solutions $u_d\colon [0,T]\times\mathbb R^d\to \mathbb R$, $d\in\mathbb N$, of semilinear heat PDEs with Lipschitz continuous nonlinearities can be approximated by DNNs with ReLU activation at the terminal time in the $L^2$-sense without the COD provided that the initial value functions $\mathbb R^d\ni x\mapsto u_d(0,x)\in\mathbb R$, $d\in\mathbb N$, can be approximated by ReLU DNNs without the COD. It is the key contribution of this work to generalize this result by establishing this statement in the $L^p$-sense with $p\in(0,\infty)$ and by allowing the activation function to be more general covering the ReLU, the leaky ReLU, and the softplus activation functions as special cases. △ Less

Submitted 24 September, 2023; originally announced September 2023.

Comments: 52 pages

MSC Class: 65M15; 65C05; 68T07 (Primary) 60H35 (Secondary)

arXiv:2303.03390 [pdf, ps, other]

Nonlinear Monte Carlo methods with polynomial runtime for Bellman equations of discrete time high-dimensional stochastic optimal control problems

Authors: Christian Beck, Arnulf Jentzen, Konrad Kleinberg, Thomas Kruse

Abstract: Discrete time stochastic optimal control problems and Markov decision processes (MDPs), respectively, serve as fundamental models for problems that involve sequential decision making under uncertainty and as such constitute the theoretical foundation of reinforcement learning. In this article we study the numerical approximation of MDPs with infinite time horizon, finite control set, and general s… ▽ More Discrete time stochastic optimal control problems and Markov decision processes (MDPs), respectively, serve as fundamental models for problems that involve sequential decision making under uncertainty and as such constitute the theoretical foundation of reinforcement learning. In this article we study the numerical approximation of MDPs with infinite time horizon, finite control set, and general state spaces. Our set-up in particular covers infinite-horizon optimal stop** problems of discrete time Markov processes. A key tool to solve MDPs are Bellman equations which characterize the value functions of the MDPs and determine the optimal control strategies. By combining ideas from the full-history recursive multilevel Picard approximation method, which was recently introduced to solve certain nonlinear partial differential equations, and ideas from $Q$-learning we introduce a class of suitable nonlinear Monte Carlo methods and prove that the proposed methods do overcome the curse of dimensionality in the numerical approximation of the solutions of Bellman equations and the associated discrete time stochastic optimal control problems. △ Less

Submitted 3 March, 2023; originally announced March 2023.

MSC Class: 90C40; 90C39; 60J05; 93E20; 65C05

arXiv:2302.14690 [pdf, other]

On the existence of minimizers in shallow residual ReLU neural network optimization landscapes

Authors: Steffen Dereich, Arnulf Jentzen, Sebastian Kassing

Abstract: Many mathematical convergence results for gradient descent (GD) based algorithms employ the assumption that the GD process is (almost surely) bounded and, also in concrete numerical simulations, divergence of the GD process may slow down, or even completely rule out, convergence of the error function. In practical relevant learning problems, it thus seems to be advisable to design the ANN architec… ▽ More Many mathematical convergence results for gradient descent (GD) based algorithms employ the assumption that the GD process is (almost surely) bounded and, also in concrete numerical simulations, divergence of the GD process may slow down, or even completely rule out, convergence of the error function. In practical relevant learning problems, it thus seems to be advisable to design the ANN architectures in a way so that GD optimization processes remain bounded. The property of the boundedness of GD processes for a given learning problem seems, however, to be closely related to the existence of minimizers in the optimization landscape and, in particular, GD trajectories may escape to infinity if the infimum of the error function (objective function) is not attained in the optimization landscape. This naturally raises the question of the existence of minimizers in the optimization landscape and, in the situation of shallow residual ANNs with multi-dimensional input layers and multi-dimensional hidden layers with the ReLU activation, the main result of this work answers this question affirmatively for a general class of loss functions and all continuous target functions. In our proof of this statement, we propose a kind of closure of the search space, where the limits are called generalized responses, and, thereafter, we provide sufficient criteria for the loss function and the underlying probability distribution which ensure that all additional artificial generalized responses are suboptimal which finally allows us to conclude the existence of minimizers in the optimization landscape. △ Less

Submitted 28 February, 2023; originally announced February 2023.

MSC Class: Primary 68T07; Secondary 68T05; 41A50

arXiv:2302.03286 [pdf, other]

Algorithmically Designed Artificial Neural Networks (ADANNs): Higher order deep operator learning for parametric partial differential equations

Authors: Arnulf Jentzen, Adrian Riekert, Philippe von Wurstemberger

Abstract: In this article we propose a new deep learning approach to approximate operators related to parametric partial differential equations (PDEs). In particular, we introduce a new strategy to design specific artificial neural network (ANN) architectures in conjunction with specific ANN initialization schemes which are tailor-made for the particular approximation problem under consideration. In the pro… ▽ More In this article we propose a new deep learning approach to approximate operators related to parametric partial differential equations (PDEs). In particular, we introduce a new strategy to design specific artificial neural network (ANN) architectures in conjunction with specific ANN initialization schemes which are tailor-made for the particular approximation problem under consideration. In the proposed approach we combine efficient classical numerical approximation techniques with deep operator learning methodologies. Specifically, we introduce customized adaptions of existing ANN architectures together with specialized initializations for these ANN architectures so that at initialization we have that the ANNs closely mimic a chosen efficient classical numerical algorithm for the considered approximation problem. The obtained ANN architectures and their initialization schemes are thus strongly inspired by numerical algorithms as well as by popular deep learning methodologies from the literature and in that sense we refer to the introduced ANNs in conjunction with their tailor-made initialization schemes as Algorithmically Designed Artificial Neural Networks (ADANNs). We numerically test the proposed ADANN methodology in the case of several parametric PDEs. In the tested numerical examples the ADANN methodology significantly outperforms existing traditional approximation algorithms as well as existing deep operator learning methodologies from the literature. △ Less

Submitted 29 May, 2024; v1 submitted 7 February, 2023; originally announced February 2023.

Comments: 39 pages, 16 Figures

arXiv:2301.08284 [pdf, ps, other]

The necessity of depth for artificial neural networks to approximate certain classes of smooth and bounded functions without the curse of dimensionality

Authors: Lukas Gonon, Robin Graeber, Arnulf Jentzen

Abstract: In this article we study high-dimensional approximation capacities of shallow and deep artificial neural networks (ANNs) with the rectified linear unit (ReLU) activation. In particular, it is a key contribution of this work to reveal that for all $a,b\in\mathbb{R}$ with $b-a\geq 7$ we have that the functions $[a,b]^d\ni x=(x_1,\dots,x_d)\mapsto\prod_{i=1}^d x_i\in\mathbb{R}$ for $d\in\mathbb{N}$ a… ▽ More In this article we study high-dimensional approximation capacities of shallow and deep artificial neural networks (ANNs) with the rectified linear unit (ReLU) activation. In particular, it is a key contribution of this work to reveal that for all $a,b\in\mathbb{R}$ with $b-a\geq 7$ we have that the functions $[a,b]^d\ni x=(x_1,\dots,x_d)\mapsto\prod_{i=1}^d x_i\in\mathbb{R}$ for $d\in\mathbb{N}$ as well as the functions $[a,b]^d\ni x =(x_1,\dots, x_d)\mapsto\sin(\prod_{i=1}^d x_i) \in \mathbb{R} $ for $ d \in \mathbb{N} $ can neither be approximated without the curse of dimensionality by means of shallow ANNs nor insufficiently deep ANNs with ReLU activation but can be approximated without the curse of dimensionality by sufficiently deep ANNs with ReLU activation. We show that the product functions and the sine of the product functions are polynomially tractable approximation problems among the approximating class of deep ReLU ANNs with the number of hidden layers being allowed to grow in the dimension $ d \in \mathbb{N} $. We establish the above outlined statements not only for the product functions and the sine of the product functions but also for other classes of target functions, in particular, for classes of uniformly globally bounded $ C^{ \infty } $-functions with compact support on any $[a,b]^d$ with $a\in\mathbb{R}$, $b\in(a,\infty)$. Roughly speaking, in this work we lay open that simple approximation problems such as approximating the sine or cosine of products cannot be solved in standard implementation frameworks by shallow or insufficiently deep ANNs with ReLU activation in polynomial time, but can be approximated by sufficiently deep ReLU ANNs with the number of parameters growing at most polynomially. △ Less

Submitted 19 January, 2023; originally announced January 2023.

Comments: 101 pages, 1 figure. arXiv admin note: substantial text overlap with arXiv:2112.14523

MSC Class: 65D40; 68T07

arXiv:2212.13111 [pdf, other]

Convergence to good non-optimal critical points in the training of neural networks: Gradient descent optimization with one random initialization overcomes all bad non-global local minima with high probability

Authors: Shokhrukh Ibragimov, Arnulf Jentzen, Adrian Riekert

Abstract: Gradient descent (GD) methods for the training of artificial neural networks (ANNs) belong nowadays to the most heavily employed computational schemes in the digital world. Despite the compelling success of such methods, it remains an open problem to provide a rigorous theoretical justification for the success of GD methods in the training of ANNs. The main difficulty is that the optimization risk… ▽ More Gradient descent (GD) methods for the training of artificial neural networks (ANNs) belong nowadays to the most heavily employed computational schemes in the digital world. Despite the compelling success of such methods, it remains an open problem to provide a rigorous theoretical justification for the success of GD methods in the training of ANNs. The main difficulty is that the optimization risk landscapes associated to ANNs usually admit many non-optimal critical points (saddle points as well as non-global local minima) whose risk values are strictly larger than the optimal risk value. It is a key contribution of this article to overcome this obstacle in certain simplified shallow ANN training situations. In such simplified ANN training scenarios we prove that the gradient flow (GF) dynamics with only one random initialization overcomes with high probability all bad non-global local minima (all non-global local minima whose risk values are much larger than the risk value of the global minima) and converges with high probability to a good critical point (a critical point whose risk value is very close to the optimal risk value of the global minima). This analysis allows us to establish convergence in probability to zero of the risk value of the GF trajectories with convergence rates as the ANN training time and the width of the ANN increase to infinity. We complement the analytical findings of this work with extensive numerical simulations for shallow and deep ANNs: All these numerical simulations strongly suggest that with high probability the considered GD method (stochastic GD or Adam) overcomes all bad non-global local minima, does not converge to a global minimum, but does converge to a good non-optimal critical point whose risk value is very close to the optimal risk value. △ Less

Submitted 26 December, 2022; originally announced December 2022.

Comments: 98 pages, 15 figures, 10 Python codes

MSC Class: 65K10; 65C50; 68T05; 60H35

arXiv:2211.15641 [pdf, ps, other]

Blow up phenomena for gradient descent optimization methods in the training of artificial neural networks

Authors: Davide Gallon, Arnulf Jentzen, Felix Lindner

Abstract: In this article we investigate blow up phenomena for gradient descent optimization methods in the training of artificial neural networks (ANNs). Our theoretical analysis is focused on shallow ANNs with one neuron on the input layer, one neuron on the output layer, and one hidden layer. For ANNs with ReLU activation and at least two neurons on the hidden layer we establish the existence of a target… ▽ More In this article we investigate blow up phenomena for gradient descent optimization methods in the training of artificial neural networks (ANNs). Our theoretical analysis is focused on shallow ANNs with one neuron on the input layer, one neuron on the output layer, and one hidden layer. For ANNs with ReLU activation and at least two neurons on the hidden layer we establish the existence of a target function such that there exists a lower bound for the risk values of the critical points of the associated risk function which is strictly greater than the infimum of the image of the risk function. This allows us to demonstrate that every gradient flow trajectory with an initial risk smaller than this lower bound diverges. Furthermore, we analyze and compare various popular types of activation functions with regard to the divergence of gradient flow trajectories and gradient descent trajectories in the training of ANNs and with regard to the closely related question concerning the existence of global minimum points of the risk function. △ Less

Submitted 28 November, 2022; originally announced November 2022.

Comments: 84 pages, one figure

arXiv:2210.13530 [pdf, other]

doi 10.1016/j.cnsns.2023.107438

An efficient Monte Carlo scheme for Zakai equations

Authors: Christian Beck, Sebastian Becker, Patrick Cheridito, Arnulf Jentzen, Ariel Neufeld

Abstract: In this paper we develop a numerical method for efficiently approximating solutions of certain Zakai equations in high dimensions. The key idea is to transform a given Zakai SPDE into a PDE with random coefficients. We show that under suitable regularity assumptions on the coefficients of the Zakai equation, the corresponding random PDE admits a solution random field which, for almost all realizat… ▽ More In this paper we develop a numerical method for efficiently approximating solutions of certain Zakai equations in high dimensions. The key idea is to transform a given Zakai SPDE into a PDE with random coefficients. We show that under suitable regularity assumptions on the coefficients of the Zakai equation, the corresponding random PDE admits a solution random field which, for almost all realizations of the random coefficients, can be written as a classical solution of a linear parabolic PDE. This makes it possible to apply the Feynman--Kac formula to obtain an efficient Monte Carlo scheme for computing approximate solutions of Zakai equations. The approach achieves good results in up to 25 dimensions with fast run times. △ Less

Submitted 20 August, 2023; v1 submitted 24 October, 2022; originally announced October 2022.

MSC Class: 65C05; 65M75; 60H15; 62M20

arXiv:2208.02083 [pdf, ps, other]

Gradient descent provably escapes saddle points in the training of shallow ReLU networks

Authors: Patrick Cheridito, Arnulf Jentzen, Florian Rossmannek

Abstract: Dynamical systems theory has recently been applied in optimization to prove that gradient descent algorithms avoid so-called strict saddle points of the loss function. However, in many modern machine learning applications, the required regularity conditions are not satisfied. In particular, this is the case for rectified linear unit (ReLU) networks. In this paper, we prove a variant of the relevan… ▽ More Dynamical systems theory has recently been applied in optimization to prove that gradient descent algorithms avoid so-called strict saddle points of the loss function. However, in many modern machine learning applications, the required regularity conditions are not satisfied. In particular, this is the case for rectified linear unit (ReLU) networks. In this paper, we prove a variant of the relevant dynamical systems result, a center-stable manifold theorem, in which we relax some of the regularity requirements. Then, we verify that shallow ReLU networks fit into the new framework. Building on a classification of critical points of the square integral loss of shallow ReLU networks measured against an affine target function, we deduce that gradient descent avoids most saddle points. We proceed to prove convergence to global minima if the initialization is sufficiently good, which is expressed by an explicit threshold on the limiting loss. △ Less

Submitted 3 August, 2022; originally announced August 2022.

MSC Class: 68T07; 37D10 ACM Class: I.2.6; G.1.6

arXiv:2207.06246 [pdf, ps, other]

Normalized gradient flow optimization in the training of ReLU artificial neural networks

Authors: Simon Eberle, Arnulf Jentzen, Adrian Riekert, Georg Weiss

Abstract: The training of artificial neural networks (ANNs) is nowadays a highly relevant algorithmic procedure with many applications in science and industry. Roughly speaking, ANNs can be regarded as iterated compositions between affine linear functions and certain fixed nonlinear functions, which are usually multidimensional versions of a one-dimensional so-called activation function. The most popular ch… ▽ More The training of artificial neural networks (ANNs) is nowadays a highly relevant algorithmic procedure with many applications in science and industry. Roughly speaking, ANNs can be regarded as iterated compositions between affine linear functions and certain fixed nonlinear functions, which are usually multidimensional versions of a one-dimensional so-called activation function. The most popular choice of such a one-dimensional activation function is the rectified linear unit (ReLU) activation function which maps a real number to its positive part $ \mathbb{R} \ni x \mapsto \max\{ x, 0 \} \in \mathbb{R} $. In this article we propose and analyze a modified variant of the standard training procedure of such ReLU ANNs in the sense that we propose to restrict the negative gradient flow dynamics to a large submanifold of the ANN parameter space, which is a strict $ C^{ \infty } $-submanifold of the entire ANN parameter space that seems to enjoy better regularity properties than the entire ANN parameter space but which is also sufficiently large and sufficiently high dimensional so that it can represent all ANN realization functions that can be represented through the entire ANN parameter space. In the special situation of shallow ANNs with just one-dimensional ANN layers we also prove for every Lipschitz continuous target function that every gradient flow trajectory on this large submanifold of the ANN parameter space is globally bounded. For the standard gradient flow on the entire ANN parameter space with Lipschitz continuous target functions it remains an open problem of research to prove or disprove the global boundedness of gradient flow trajectories even in the situation of shallow ANNs with just one-dimensional ANN layers. △ Less

Submitted 13 July, 2022; originally announced July 2022.

Comments: 26 pages, 1 figure

arXiv:2206.13646 [pdf, ps, other]

On bounds for norms of reparameterized ReLU artificial neural network parameters: sums of fractional powers of the Lipschitz norm control the network parameter vector

Authors: Arnulf Jentzen, Timo Kröger

Abstract: It is an elementary fact in the scientific literature that the Lipschitz norm of the realization function of a feedforward fully-connected rectified linear unit (ReLU) artificial neural network (ANN) can, up to a multiplicative constant, be bounded from above by sums of powers of the norm of the ANN parameter vector. Roughly speaking, in this work we reveal in the case of shallow ANNs that the con… ▽ More It is an elementary fact in the scientific literature that the Lipschitz norm of the realization function of a feedforward fully-connected rectified linear unit (ReLU) artificial neural network (ANN) can, up to a multiplicative constant, be bounded from above by sums of powers of the norm of the ANN parameter vector. Roughly speaking, in this work we reveal in the case of shallow ANNs that the converse inequality is also true. More formally, we prove that the norm of the equivalence class of ANN parameter vectors with the same realization function is, up to a multiplicative constant, bounded from above by the sum of powers of the Lipschitz norm of the ANN realization function (with the exponents $ 1/2 $ and $ 1 $). Moreover, we prove that this upper bound only holds when employing the Lipschitz norm but does neither hold for Hölder norms nor for Sobolev-Slobodeckij norms. Furthermore, we prove that this upper bound only holds for sums of powers of the Lipschitz norm with the exponents $ 1/2 $ and $ 1 $ but does not hold for the Lipschitz norm alone. △ Less

Submitted 27 June, 2022; originally announced June 2022.

Comments: 39 pages, 1 figure

arXiv:2205.03672 [pdf, other]

Deep learning approximations for non-local nonlinear PDEs with Neumann boundary conditions

Authors: Victor Boussange, Sebastian Becker, Arnulf Jentzen, Benno Kuckuck, Loïc Pellissier

Abstract: Nonlinear partial differential equations (PDEs) are used to model dynamical processes in a large number of scientific fields, ranging from finance to biology. In many applications standard local models are not sufficient to accurately account for certain non-local phenomena such as, e.g., interactions at a distance. In order to properly capture these phenomena non-local nonlinear PDE models are fr… ▽ More Nonlinear partial differential equations (PDEs) are used to model dynamical processes in a large number of scientific fields, ranging from finance to biology. In many applications standard local models are not sufficient to accurately account for certain non-local phenomena such as, e.g., interactions at a distance. In order to properly capture these phenomena non-local nonlinear PDE models are frequently employed in the literature. In this article we propose two numerical methods based on machine learning and on Picard iterations, respectively, to approximately solve non-local nonlinear PDEs. The proposed machine learning-based method is an extended variant of a deep learning-based splitting-up type approximation method previously introduced in the literature and utilizes neural networks to provide approximate solutions on a subset of the spatial domain of the solution. The Picard iterations-based method is an extended variant of the so-called full history recursive multilevel Picard approximation scheme previously introduced in the literature and provides an approximate solution for a single point of the domain. Both methods are mesh-free and allow non-local nonlinear PDEs with Neumann boundary conditions to be solved in high dimensions. In the two methods, the numerical difficulties arising due to the dimensionality of the PDEs are avoided by (i) using the correspondence between the expected trajectory of reflected stochastic processes and the solution of PDEs (given by the Feynman-Kac formula) and by (ii) using a plain vanilla Monte Carlo integration to handle the non-local term. We evaluate the performance of the two methods on five different PDEs arising in physics and biology. In all cases, the methods yield good results in up to 10 dimensions with short run times. Our work extends recently developed methods to overcome the curse of dimensionality in solving PDEs. △ Less

Submitted 7 May, 2022; originally announced May 2022.

Comments: 59 pages

MSC Class: 35R09 (Primary) 65M75; 45K05; 35K20; 65C05; 65M22; 68T07 (Secondary)

arXiv:2202.11481 [pdf, other]

On the existence of infinitely many realization functions of non-global local minima in the training of artificial neural networks with ReLU activation

Authors: Shokhrukh Ibragimov, Arnulf Jentzen, Timo Kröger, Adrian Riekert

Abstract: Gradient descent (GD) type optimization schemes are the standard instruments to train fully connected feedforward artificial neural networks (ANNs) with rectified linear unit (ReLU) activation and can be considered as temporal discretizations of solutions of gradient flow (GF) differential equations. It has recently been proved that the risk of every bounded GF trajectory converges in the training… ▽ More Gradient descent (GD) type optimization schemes are the standard instruments to train fully connected feedforward artificial neural networks (ANNs) with rectified linear unit (ReLU) activation and can be considered as temporal discretizations of solutions of gradient flow (GF) differential equations. It has recently been proved that the risk of every bounded GF trajectory converges in the training of ANNs with one hidden layer and ReLU activation to the risk of a critical point. Taking this into account it is one of the key research issues in the mathematical convergence analysis of GF trajectories and GD type optimization schemes, respectively, to study sufficient and necessary conditions for critical points of the risk function and, thereby, to obtain an understanding about the appearance of critical points in dependence of the problem parameters such as the target function. In the first main result of this work we prove in the training of ANNs with one hidden layer and ReLU activation that for every $ a, b \in \mathbb{R} $ with $ a < b $ and every arbitrarily large $ δ> 0 $ we have that there exists a Lipschitz continuous target function $ f \colon [a,b] \to \mathbb{R} $ such that for every number $ H > 1 $ of neurons on the hidden layer we have that the risk function has uncountably many different realization functions of non-global local minimum points whose risks are strictly larger than the sum of the risk of the global minimum points and the arbitrarily large $ δ$. In the second main result of this work we show in the training of ANNs with one hidden layer and ReLU activation in the special situation where there is only one neuron on the hidden layer and where the target function is continuous and piecewise polynomial that there exist at most finitely many different realization functions of critical points. △ Less

Submitted 23 February, 2022; originally announced February 2022.

Comments: 49 pages, 1 figure

MSC Class: 68T07

arXiv:2202.02717 [pdf, other]

Learning the random variables in Monte Carlo simulations with stochastic gradient descent: Machine learning for parametric PDEs and financial derivative pricing

Authors: Sebastian Becker, Arnulf Jentzen, Marvin S. Müller, Philippe von Wurstemberger

Abstract: In financial engineering, prices of financial products are computed approximately many times each trading day with (slightly) different parameters in each calculation. In many financial models such prices can be approximated by means of Monte Carlo (MC) simulations. To obtain a good approximation the MC sample size usually needs to be considerably large resulting in a long computing time to obtain… ▽ More In financial engineering, prices of financial products are computed approximately many times each trading day with (slightly) different parameters in each calculation. In many financial models such prices can be approximated by means of Monte Carlo (MC) simulations. To obtain a good approximation the MC sample size usually needs to be considerably large resulting in a long computing time to obtain a single approximation. In this paper we introduce a new approximation strategy for parametric approximation problems including the parametric financial pricing problems described above. A central aspect of the approximation strategy proposed in this article is to combine MC algorithms with machine learning techniques to, roughly speaking, learn the random variables (LRV) in MC simulations. In other words, we employ stochastic gradient descent (SGD) optimization methods not to train parameters of standard artificial neural networks (ANNs) but to learn random variables appearing in MC approximations. We numerically test the LRV strategy on various parametric problems with convincing results when compared with standard MC simulations, Quasi-Monte Carlo simulations, SGD-trained shallow ANNs, and SGD-trained deep ANNs. Our numerical simulations strongly indicate that the LRV strategy might be capable to overcome the curse of dimensionality in the $L^\infty$-norm in several cases where the standard deep learning approach has been proven not to be able to do so. This is not a contradiction to lower bounds established in the scientific literature because this new LRV strategy is outside of the class of algorithms for which lower bounds have been established in the scientific literature. The proposed LRV strategy is of general nature and not only restricted to the parametric financial pricing problems described above, but applicable to a large class of approximation problems. △ Less

Submitted 8 June, 2023; v1 submitted 6 February, 2022; originally announced February 2022.

Comments: 71 pages, 4 Figures, 14 Tables; to appear in Math. Finance

MSC Class: 35K15; 65C05; 65M75; 68T99; 91G20

arXiv:2112.14523 [pdf, ps, other]

Deep neural network approximation theory for high-dimensional functions

Authors: Pierfrancesco Beneventano, Patrick Cheridito, Robin Graeber, Arnulf Jentzen, Benno Kuckuck

Abstract: The purpose of this article is to develop machinery to study the capacity of deep neural networks (DNNs) to approximate high-dimensional functions. In particular, we show that DNNs have the expressive power to overcome the curse of dimensionality in the approximation of a large class of functions. More precisely, we prove that these functions can be approximated by DNNs on compact sets such that t… ▽ More The purpose of this article is to develop machinery to study the capacity of deep neural networks (DNNs) to approximate high-dimensional functions. In particular, we show that DNNs have the expressive power to overcome the curse of dimensionality in the approximation of a large class of functions. More precisely, we prove that these functions can be approximated by DNNs on compact sets such that the number of parameters necessary to represent the approximating DNNs grows at most polynomially in the reciprocal $1/\varepsilon$ of the approximation accuracy $\varepsilon>0$ and in the input dimension $d\in \mathbb{N} =\{1,2,3,\dots\}$. To this end, we introduce certain approximation spaces, consisting of sequences of functions that can be efficiently approximated by DNNs. We then establish closure properties which we combine with known and new bounds on the number of parameters necessary to approximate locally Lipschitz continuous functions, maximum functions, and product functions by DNNs. The main result of this article demonstrates that DNNs have sufficient expressiveness to approximate certain sequences of functions which can be constructed by means of a finite number of compositions using locally Lipschitz continuous functions, maxima, and products without the curse of dimensionality. △ Less

Submitted 29 December, 2021; originally announced December 2021.

Comments: 82 pages, 1 figure

arXiv:2112.09684 [pdf, other]

doi 10.4208/jml.220114a

On the existence of global minima and convergence analyses for gradient descent methods in the training of deep neural networks

Authors: Arnulf Jentzen, Adrian Riekert

Abstract: In this article we study fully-connected feedforward deep ReLU ANNs with an arbitrarily large number of hidden layers and we prove convergence of the risk of the GD optimization method with random initializations in the training of such ANNs under the assumption that the unnormalized probability density function of the probability distribution of the input data of the considered supervised learnin… ▽ More In this article we study fully-connected feedforward deep ReLU ANNs with an arbitrarily large number of hidden layers and we prove convergence of the risk of the GD optimization method with random initializations in the training of such ANNs under the assumption that the unnormalized probability density function of the probability distribution of the input data of the considered supervised learning problem is piecewise polynomial, under the assumption that the target function (describing the relationship between input data and the output data) is piecewise polynomial, and under the assumption that the risk function of the considered supervised learning problem admits at least one regular global minimum. In addition, in the special situation of shallow ANNs with just one hidden layer and one-dimensional input we also verify this assumption by proving in the training of such shallow ANNs that for every Lipschitz continuous target function there exists a global minimum in the risk landscape. Finally, in the training of deep ANNs with ReLU activation we also study solutions of gradient flow (GF) differential equations and we prove that every non-divergent GF trajectory converges with a polynomial rate of convergence to a critical point (in the sense of limiting Fréchet subdifferentiability). Our mathematical convergence analysis builds up on ideas from our previous article Eberle et al., on tools from real algebraic geometry such as the concept of semi-algebraic functions and generalized Kurdyka-Lojasiewicz inequalities, on tools from functional analysis such as the Arzelà-Ascoli theorem, on tools from nonsmooth analysis such as the concept of limiting Fréchet subgradients, as well as on the fact that the set of realization functions of shallow ReLU ANNs with fixed architecture forms a closed subset of the set of continuous functions revealed by Petersen et al. △ Less

Submitted 13 July, 2022; v1 submitted 17 December, 2021; originally announced December 2021.

Comments: 89 pages, 15 figures

Journal ref: Journal of Machine Learning, 1 (2022), pp. 141-246

arXiv:2112.07369 [pdf, other]

Convergence proof for stochastic gradient descent in the training of deep neural networks with ReLU activation for constant target functions

Authors: Martin Hutzenthaler, Arnulf Jentzen, Katharina Pohl, Adrian Riekert, Luca Scarpa

Abstract: In many numerical simulations stochastic gradient descent (SGD) type optimization methods perform very effectively in the training of deep neural networks (DNNs) but till this day it remains an open problem of research to provide a mathematical convergence analysis which rigorously explains the success of SGD type optimization methods in the training of DNNs. In this work we study SGD type optimiz… ▽ More In many numerical simulations stochastic gradient descent (SGD) type optimization methods perform very effectively in the training of deep neural networks (DNNs) but till this day it remains an open problem of research to provide a mathematical convergence analysis which rigorously explains the success of SGD type optimization methods in the training of DNNs. In this work we study SGD type optimization methods in the training of fully-connected feedforward DNNs with rectified linear unit (ReLU) activation. We first establish general regularity properties for the risk functions and their generalized gradient functions appearing in the training of such DNNs and, thereafter, we investigate the plain vanilla SGD optimization method in the training of such DNNs under the assumption that the target function under consideration is a constant function. Specifically, we prove under the assumption that the learning rates (the step sizes of the SGD optimization method) are sufficiently small but not $L^1$-summable and under the assumption that the target function is a constant function that the expectation of the riskof the considered SGD process converges in the training of such DNNs to zero as the number of SGD steps increases to infinity. △ Less

Submitted 22 June, 2023; v1 submitted 13 December, 2021; originally announced December 2021.

Comments: 71 pages, 5 figures, 2 tables, 4 Python source codes. To appear in Electronic Research Archive

arXiv:2110.08297 [pdf, ps, other]

Strong $L^p$-error analysis of nonlinear Monte Carlo approximations for high-dimensional semilinear partial differential equations

Authors: Martin Hutzenthaler, Arnulf Jentzen, Benno Kuckuck, Joshua Lee Padgett

Abstract: Full-history recursive multilevel Picard (MLP) approximation schemes have been shown to overcome the curse of dimensionality in the numerical approximation of high-dimensional semilinear partial differential equations (PDEs) with general time horizons and Lipschitz continuous nonlinearities. However, each of the error analyses for MLP approximation schemes in the existing literature studies the… ▽ More Full-history recursive multilevel Picard (MLP) approximation schemes have been shown to overcome the curse of dimensionality in the numerical approximation of high-dimensional semilinear partial differential equations (PDEs) with general time horizons and Lipschitz continuous nonlinearities. However, each of the error analyses for MLP approximation schemes in the existing literature studies the $L^2$-root-mean-square distance between the exact solution of the PDE under consideration and the considered MLP approximation and none of the error analyses in the existing literature provides an upper bound for the more general $L^p$-distance between the exact solution of the PDE under consideration and the considered MLP approximation. It is the key contribution of this article to extend the $L^2$-error analysis for MLP approximation schemes in the literature to a more general $L^p$-error analysis with $p\in (0,\infty)$. In particular, the main result of this article proves that the proposed MLP approximation scheme indeed overcomes the curse of dimensionality in the numerical approximation of high-dimensional semilinear PDEs with the approximation error measured in the $L^p$-sense with $p \in (0,\infty)$. △ Less

Submitted 15 October, 2021; originally announced October 2021.

Comments: 42 pages.

arXiv:2108.10602 [pdf, ps, other]

Overcoming the curse of dimensionality in the numerical approximation of backward stochastic differential equations

Authors: Martin Hutzenthaler, Arnulf Jentzen, Thomas Kruse, Tuan Anh Nguyen

Abstract: Backward stochastic differential equations (BSDEs) belong nowadays to the most frequently studied equations in stochastic analysis and computational stochastics. BSDEs in applications are often nonlinear and high-dimensional. In nearly all cases such nonlinear high-dimensional BSDEs cannot be solved explicitly and it has been and still is a very active topic of research to design and analyze numer… ▽ More Backward stochastic differential equations (BSDEs) belong nowadays to the most frequently studied equations in stochastic analysis and computational stochastics. BSDEs in applications are often nonlinear and high-dimensional. In nearly all cases such nonlinear high-dimensional BSDEs cannot be solved explicitly and it has been and still is a very active topic of research to design and analyze numerical approximation methods to approximatively solve nonlinear high-dimensional BSDEs. Although there are a large number of research articles in the scientific literature which analyze numerical approximation methods for nonlinear BSDEs, until today there has been no numerical approximation method in the scientific literature which has been proven to overcome the curse of dimensionality in the numerical approximation of nonlinear BSDEs in the sense that the number of computational operations of the numerical approximation method to approximatively compute one sample path of the BSDE solution grows at most polynomially in both the reciprocal $1/ \varepsilon$ of the prescribed approximation accuracy $\varepsilon \in (0,\infty)$ and the dimension $d\in \mathbb N=\{1,2,3,\ldots\}$ of the BSDE. It is the key contribution of this article to overcome this obstacle by introducing a new Monte Carlo-type numerical approximation method for high-dimensional BSDEs and by proving that this Monte Carlo-type numerical approximation method does indeed overcome the curse of dimensionality in the approximative computation of solution paths of BSDEs. △ Less

Submitted 24 August, 2021; originally announced August 2021.

arXiv:2108.08106 [pdf, other]

doi 10.3934/era.2023128

Existence, uniqueness, and convergence rates for gradient flows in the training of artificial neural networks with ReLU activation

Authors: Simon Eberle, Arnulf Jentzen, Adrian Riekert, Georg S. Weiss

Abstract: The training of artificial neural networks (ANNs) with rectified linear unit (ReLU) activation via gradient descent (GD) type optimization schemes is nowadays a common industrially relevant procedure. Till this day in the scientific literature there is in general no mathematical convergence analysis which explains the numerical success of GD type optimization schemes in the training of ANNs with R… ▽ More The training of artificial neural networks (ANNs) with rectified linear unit (ReLU) activation via gradient descent (GD) type optimization schemes is nowadays a common industrially relevant procedure. Till this day in the scientific literature there is in general no mathematical convergence analysis which explains the numerical success of GD type optimization schemes in the training of ANNs with ReLU activation. GD type optimization schemes can be regarded as temporal discretization methods for the gradient flow (GF) differential equations associated to the considered optimization problem and, in view of this, it seems to be a natural direction of research to first aim to develop a mathematical convergence theory for time-continuous GF differential equations and, thereafter, to aim to extend such a time-continuous convergence theory to implementable time-discrete GD type optimization methods. In this article we establish two basic results for GF differential equations in the training of fully-connected feedforward ANNs with one hidden layer and ReLU activation. In the first main result of this article we establish in the training of such ANNs under the assumption that the probability distribution of the input data of the considered supervised learning problem is absolutely continuous with a bounded density function that every GF differential equation admits for every initial value a solution which is also unique among a suitable class of solutions. In the second main result of this article we prove in the training of such ANNs under the assumption that the target function and the density function of the probability distribution of the input data are piecewise polynomial that every non-divergent GF trajectory converges with an appropriate rate of convergence to a critical point and that the risk of the non-divergent GF trajectory converges with rate 1 to the risk of the critical point. △ Less

Submitted 18 August, 2021; originally announced August 2021.

Comments: 30 pages. arXiv admin note: text overlap with arXiv:2107.04479, arXiv:2108.04620

Journal ref: Electronic Research Archive 2023, Volume 31, Issue 5: 2519-2554

arXiv:2108.04620 [pdf, other]

A proof of convergence for the gradient descent optimization method with random initializations in the training of neural networks with ReLU activation for piecewise linear target functions

Authors: Arnulf Jentzen, Adrian Riekert

Abstract: Gradient descent (GD) type optimization methods are the standard instrument to train artificial neural networks (ANNs) with rectified linear unit (ReLU) activation. Despite the great success of GD type optimization methods in numerical simulations for the training of ANNs with ReLU activation, it remains - even in the simplest situation of the plain vanilla GD optimization method with random initi… ▽ More Gradient descent (GD) type optimization methods are the standard instrument to train artificial neural networks (ANNs) with rectified linear unit (ReLU) activation. Despite the great success of GD type optimization methods in numerical simulations for the training of ANNs with ReLU activation, it remains - even in the simplest situation of the plain vanilla GD optimization method with random initializations and ANNs with one hidden layer - an open problem to prove (or disprove) the conjecture that the risk of the GD optimization method converges in the training of such ANNs to zero as the width of the ANNs, the number of independent random initializations, and the number of GD steps increase to infinity. In this article we prove this conjecture in the situation where the probability distribution of the input data is equivalent to the continuous uniform distribution on a compact interval, where the probability distributions for the random initializations of the ANN parameters are standard normal distributions, and where the target function under consideration is continuous and piecewise affine linear. Roughly speaking, the key ingredients in our mathematical convergence analysis are (i) to prove that suitable sets of global minima of the risk functions are \emph{twice continuously differentiable submanifolds of the ANN parameter spaces}, (ii) to prove that the Hessians of the risk functions on these sets of global minima satisfy an appropriate \emph{maximal rank condition}, and, thereafter, (iii) to apply the machinery in [Fehrman, B., Gess, B., Jentzen, A., Convergence rates for the stochastic gradient descent method for non-convex objective functions. J. Mach. Learn. Res. 21(136): 1--48, 2020] to establish convergence of the GD optimization method with random initializations. △ Less

Submitted 10 August, 2021; originally announced August 2021.

Comments: 44 pages. arXiv admin note: text overlap with arXiv:2107.04479

Journal ref: Journal of Machine Learning Research 23, 260 (2022), pp. 1-50

arXiv:2107.04479 [pdf, ps, other]

doi 10.1016/j.jmaa.2022.126601

Convergence analysis for gradient flows in the training of artificial neural networks with ReLU activation

Authors: Arnulf Jentzen, Adrian Riekert

Abstract: Gradient descent (GD) type optimization schemes are the standard methods to train artificial neural networks (ANNs) with rectified linear unit (ReLU) activation. Such schemes can be considered as discretizations of gradient flows (GFs) associated to the training of ANNs with ReLU activation and most of the key difficulties in the mathematical convergence analysis of GD type optimization schemes in… ▽ More Gradient descent (GD) type optimization schemes are the standard methods to train artificial neural networks (ANNs) with rectified linear unit (ReLU) activation. Such schemes can be considered as discretizations of gradient flows (GFs) associated to the training of ANNs with ReLU activation and most of the key difficulties in the mathematical convergence analysis of GD type optimization schemes in the training of ANNs with ReLU activation seem to be already present in the dynamics of the corresponding GF differential equations. It is the key subject of this work to analyze such GF differential equations in the training of ANNs with ReLU activation and three layers (one input layer, one hidden layer, and one output layer). In particular, in this article we prove in the case where the target function is possibly multi-dimensional and continuous and in the case where the probability distribution of the input data is absolutely continuous with respect to the Lebesgue measure that the risk of every bounded GF trajectory converges to the risk of a critical point. In addition, in this article we show in the case of a 1-dimensional affine linear target function and in the case where the probability distribution of the input data coincides with the standard uniform distribution that the risk of every bounded GF trajectory converges to zero if the initial risk is sufficiently small. Finally, in the special situation where there is only one neuron on the hidden layer (1-dimensional hidden layer) we strengthen the above named result for affine linear target functions by proving that that the risk of every (not necessarily bounded) GF trajectory converges to zero if the initial risk is sufficiently small. △ Less

Submitted 9 July, 2021; originally announced July 2021.

Comments: 37 pages

Journal ref: Journal of Mathematical Analysis and Applications 517, 2 (2023)

arXiv:2104.00277 [pdf, ps, other]

doi 10.1007/s00033-022-01716-w

A proof of convergence for stochastic gradient descent in the training of artificial neural networks with ReLU activation for constant target functions

Authors: Arnulf Jentzen, Adrian Riekert

Abstract: In this article we study the stochastic gradient descent (SGD) optimization method in the training of fully-connected feedforward artificial neural networks with ReLU activation. The main result of this work proves that the risk of the SGD process converges to zero if the target function under consideration is constant. In the established convergence result the considered artificial neural network… ▽ More In this article we study the stochastic gradient descent (SGD) optimization method in the training of fully-connected feedforward artificial neural networks with ReLU activation. The main result of this work proves that the risk of the SGD process converges to zero if the target function under consideration is constant. In the established convergence result the considered artificial neural networks consist of one input layer, one hidden layer, and one output layer (with $d \in \mathbb{N}$ neurons on the input layer, $H \in \mathbb{N}$ neurons on the hidden layer, and one neuron on the output layer). The learning rates of the SGD process are assumed to be sufficiently small and the input data used in the SGD process to train the artificial neural networks is assumed to be independent and identically distributed. △ Less

Submitted 1 April, 2021; originally announced April 2021.

Comments: 29 pages

Journal ref: Zeitschrift für angewandte Mathematik und Physik 73 (2022)

arXiv:2103.10922 [pdf, other]

doi 10.1007/s00332-022-09823-8

Landscape analysis for shallow neural networks: complete classification of critical points for affine target functions

Authors: Patrick Cheridito, Arnulf Jentzen, Florian Rossmannek

Abstract: In this paper, we analyze the landscape of the true loss of neural networks with one hidden layer and ReLU, leaky ReLU, or quadratic activation. In all three cases, we provide a complete classification of the critical points in the case where the target function is affine and one-dimensional. In particular, we show that there exist no local maxima and clarify the structure of saddle points. Moreov… ▽ More In this paper, we analyze the landscape of the true loss of neural networks with one hidden layer and ReLU, leaky ReLU, or quadratic activation. In all three cases, we provide a complete classification of the critical points in the case where the target function is affine and one-dimensional. In particular, we show that there exist no local maxima and clarify the structure of saddle points. Moreover, we prove that non-global local minima can only be caused by `dead' ReLU neurons. In particular, they do not appear in the case of leaky ReLU or quadratic activation. Our approach is of a combinatorial nature and builds on a careful analysis of the different types of hidden neurons that can occur. △ Less

Submitted 6 July, 2022; v1 submitted 19 March, 2021; originally announced March 2021.

MSC Class: 68T07 ACM Class: I.2.6

Journal ref: J Nonlinear Sci 32, 64 (2022)

arXiv:2103.04488 [pdf, ps, other]

Lower bounds for artificial neural network approximations: A proof that shallow neural networks fail to overcome the curse of dimensionality

Authors: Philipp Grohs, Shokhrukh Ibragimov, Arnulf Jentzen, Sarah Koppensteiner

Abstract: Artificial neural networks (ANNs) have become a very powerful tool in the approximation of high-dimensional functions. Especially, deep ANNs, consisting of a large number of hidden layers, have been very successfully used in a series of practical relevant computational problems involving high-dimensional input data ranging from classification tasks in supervised learning to optimal decision proble… ▽ More Artificial neural networks (ANNs) have become a very powerful tool in the approximation of high-dimensional functions. Especially, deep ANNs, consisting of a large number of hidden layers, have been very successfully used in a series of practical relevant computational problems involving high-dimensional input data ranging from classification tasks in supervised learning to optimal decision problems in reinforcement learning. There are also a number of mathematical results in the scientific literature which study the approximation capacities of ANNs in the context of high-dimensional target functions. In particular, there are a series of mathematical results in the scientific literature which show that sufficiently deep ANNs have the capacity to overcome the curse of dimensionality in the approximation of certain target function classes in the sense that the number of parameters of the approximating ANNs grows at most polynomially in the dimension $d \in \mathbb{N}$ of the target functions under considerations. In the proofs of several of such high-dimensional approximation results it is crucial that the involved ANNs are sufficiently deep and consist a sufficiently large number of hidden layers which grows in the dimension of the considered target functions. It is the topic of this work to look a bit more detailed to the deepness of the involved ANNs in the approximation of high-dimensional target functions. In particular, the main result of this work proves that there exists a concretely specified sequence of functions which can be approximated without the curse of dimensionality by sufficiently deep ANNs but which cannot be approximated without the curse of dimensionality if the involved ANNs are shallow or not deep enough. △ Less

Submitted 7 March, 2021; originally announced March 2021.

Comments: 53 pages

arXiv:2103.02350 [pdf, ps, other]

Full history recursive multilevel Picard approximations for ordinary differential equations with expectations

Authors: Christian Beck, Martin Hutzenthaler, Arnulf Jentzen, Emilia Magnani

Abstract: We consider ordinary differential equations (ODEs) which involve expectations of a random variable. These ODEs are special cases of McKean-Vlasov stochastic differential equations (SDEs). A plain vanilla Monte Carlo approximation method for such ODEs requires a computational cost of order $\varepsilon^{-3}$ to achieve a root-mean-square error of size $\varepsilon$. In this work we adapt recently i… ▽ More We consider ordinary differential equations (ODEs) which involve expectations of a random variable. These ODEs are special cases of McKean-Vlasov stochastic differential equations (SDEs). A plain vanilla Monte Carlo approximation method for such ODEs requires a computational cost of order $\varepsilon^{-3}$ to achieve a root-mean-square error of size $\varepsilon$. In this work we adapt recently introduced full history recursive multilevel Picard (MLP) algorithms to reduce this computational complexity. Our main result shows for every $δ>0$ that the proposed MLP approximation algorithm requires only a computational effort of order $\varepsilon^{-(2+δ)}$ to achieve a root-mean-square error of size $\varepsilon$. △ Less

Submitted 3 March, 2021; originally announced March 2021.

Comments: 24 pages. arXiv admin note: substantial text overlap with arXiv:1903.05985

MSC Class: 65Lxx; 65Mxx; 65Cxx; 65M75 ACM Class: G.1.0; G.1.7; G.1.m; G.3

arXiv:2102.11840 [pdf, ps, other]

Convergence rates for gradient descent in the training of overparameterized artificial neural networks with biases

Authors: Arnulf Jentzen, Timo Kröger

Abstract: In recent years, artificial neural networks have developed into a powerful tool for dealing with a multitude of problems for which classical solution approaches reach their limits. However, it is still unclear why randomly initialized gradient descent optimization algorithms, such as the well-known batch gradient descent, are able to achieve zero training loss in many situations even though the ob… ▽ More In recent years, artificial neural networks have developed into a powerful tool for dealing with a multitude of problems for which classical solution approaches reach their limits. However, it is still unclear why randomly initialized gradient descent optimization algorithms, such as the well-known batch gradient descent, are able to achieve zero training loss in many situations even though the objective function is non-convex and non-smooth. One of the most promising approaches to solving this problem in the field of supervised learning is the analysis of gradient descent optimization in the so-called overparameterized regime. In this article we provide a further contribution to this area of research by considering overparameterized fully-connected rectified artificial neural networks with biases. Specifically, we show that for a fixed number of training data the mean squared error using batch gradient descent optimization applied to such a randomly initialized artificial neural network converges to zero at a linear convergence rate as long as the width of the artificial neural network is large enough, the learning rate is small enough, and the training input data are pairwise linearly independent. △ Less

Submitted 23 February, 2021; originally announced February 2021.

Comments: 38 pages

arXiv:2102.09924 [pdf, ps, other]

doi 10.1016/j.jco.2022.101646

A proof of convergence for gradient descent in the training of artificial neural networks for constant target functions

Authors: Patrick Cheridito, Arnulf Jentzen, Adrian Riekert, Florian Rossmannek

Abstract: Gradient descent optimization algorithms are the standard ingredients that are used to train artificial neural networks (ANNs). Even though a huge number of numerical simulations indicate that gradient descent optimization methods do indeed convergence in the training of ANNs, until today there is no rigorous theoretical analysis which proves (or disproves) this conjecture. In particular, even in… ▽ More Gradient descent optimization algorithms are the standard ingredients that are used to train artificial neural networks (ANNs). Even though a huge number of numerical simulations indicate that gradient descent optimization methods do indeed convergence in the training of ANNs, until today there is no rigorous theoretical analysis which proves (or disproves) this conjecture. In particular, even in the case of the most basic variant of gradient descent optimization algorithms, the plain vanilla gradient descent method, it remains an open problem to prove or disprove the conjecture that gradient descent converges in the training of ANNs. In this article we solve this problem in the special situation where the target function under consideration is a constant function. More specifically, in the case of constant target functions we prove in the training of rectified fully-connected feedforward ANNs with one-hidden layer that the risk function of the gradient descent method does indeed converge to zero. Our mathematical analysis strongly exploits the property that the rectifier function is the activation function used in the considered ANNs. A key contribution of this work is to explicitly specify a Lyapunov function for the gradient flow system of the ANN parameters. This Lyapunov function is the central tool in our convergence proof of the gradient descent method. △ Less

Submitted 19 February, 2021; originally announced February 2021.

Comments: 23 pages

Journal ref: Journal of Complexity (2022)

arXiv:2012.12348 [pdf, ps, other]

doi 10.3934/dcdsb.2022238

An overview on deep learning-based approximation methods for partial differential equations

Authors: Christian Beck, Martin Hutzenthaler, Arnulf Jentzen, Benno Kuckuck

Abstract: It is one of the most challenging problems in applied mathematics to approximatively solve high-dimensional partial differential equations (PDEs). Recently, several deep learning-based approximation algorithms for attacking this problem have been proposed and tested numerically on a number of examples of high-dimensional PDEs. This has given rise to a lively field of research in which deep learnin… ▽ More It is one of the most challenging problems in applied mathematics to approximatively solve high-dimensional partial differential equations (PDEs). Recently, several deep learning-based approximation algorithms for attacking this problem have been proposed and tested numerically on a number of examples of high-dimensional PDEs. This has given rise to a lively field of research in which deep learning-based methods and related Monte Carlo methods are applied to the approximation of high-dimensional PDEs. In this article we offer an introduction to this field of research by revisiting selected mathematical results related to deep learning approximation methods for PDEs and reviewing the main ideas of their proofs. We also provide a short overview of the recent literature in this area of research. △ Less

Submitted 18 November, 2022; v1 submitted 22 December, 2020; originally announced December 2020.

Comments: 49 pages. Compared to the first version, the manuscript has been significantly expanded. In particular, Python source code implementing several of the presented methods using PyTorch, as well as numerical simulations have been added

MSC Class: 65M99 (Primary); 35-02; 65-02; 68T07 (Secondary)

Journal ref: Discrete Contin. Dyn. Syst. Ser. B 28 (2023), no. 6, 3697-3746

arXiv:2012.08443 [pdf, ps, other]

doi 10.1007/s40304-022-00292-9

Strong overall error analysis for the training of artificial neural networks via random initializations

Authors: Arnulf Jentzen, Adrian Riekert

Abstract: Although deep learning based approximation algorithms have been applied very successfully to numerous problems, at the moment the reasons for their performance are not entirely understood from a mathematical point of view. Recently, estimates for the convergence of the overall error have been obtained in the situation of deep supervised learning, but with an extremely slow rate of convergence. In… ▽ More Although deep learning based approximation algorithms have been applied very successfully to numerous problems, at the moment the reasons for their performance are not entirely understood from a mathematical point of view. Recently, estimates for the convergence of the overall error have been obtained in the situation of deep supervised learning, but with an extremely slow rate of convergence. In this note we partially improve on these estimates. More specifically, we show that the depth of the neural network only needs to increase much slower in order to obtain the same rate of approximation. The results hold in the case of an arbitrary stochastic optimization algorithm with i.i.d.\ random initializations. △ Less

Submitted 15 December, 2020; originally announced December 2020.

Comments: 40 pages

Journal ref: Communications in Mathematics and Statistics (2023)

arXiv:2012.04326 [pdf, other]

High-dimensional approximation spaces of artificial neural networks and applications to partial differential equations

Authors: Pierfrancesco Beneventano, Patrick Cheridito, Arnulf Jentzen, Philippe von Wurstemberger

Abstract: In this paper we develop a new machinery to study the capacity of artificial neural networks (ANNs) to approximate high-dimensional functions without suffering from the curse of dimensionality. Specifically, we introduce a concept which we refer to as approximation spaces of artificial neural networks and we present several tools to handle those spaces. Roughly speaking, approximation spaces consi… ▽ More In this paper we develop a new machinery to study the capacity of artificial neural networks (ANNs) to approximate high-dimensional functions without suffering from the curse of dimensionality. Specifically, we introduce a concept which we refer to as approximation spaces of artificial neural networks and we present several tools to handle those spaces. Roughly speaking, approximation spaces consist of sequences of functions which can, in a suitable way, be approximated by ANNs without curse of dimensionality in the sense that the number of required ANN parameters to approximate a function of the sequence with an accuracy $\varepsilon > 0$ grows at most polynomially both in the reciprocal $1/\varepsilon$ of the required accuracy and in the dimension $d \in \mathbb{N} = \{1, 2, 3, \ldots \}$ of the function. We show under suitable assumptions that these approximation spaces are closed under various operations including linear combinations, formations of limits, and infinite compositions. To illustrate the power of the machinery proposed in this paper, we employ the developed theory to prove that ANNs have the capacity to overcome the curse of dimensionality in the numerical approximation of certain first order transport partial differential equations (PDEs). Under suitable conditions we even prove that approximation spaces are closed under flows of first order transport PDEs. △ Less

Submitted 8 December, 2020; originally announced December 2020.

Comments: 32 pages

arXiv:2012.01194 [pdf, ps, other]

Deep learning based numerical approximation algorithms for stochastic partial differential equations and high-dimensional nonlinear filtering problems

Authors: Christian Beck, Sebastian Becker, Patrick Cheridito, Arnulf Jentzen, Ariel Neufeld

Abstract: In this article we introduce and study a deep learning based approximation algorithm for solutions of stochastic partial differential equations (SPDEs). In the proposed approximation algorithm we employ a deep neural network for every realization of the driving noise process of the SPDE to approximate the solution process of the SPDE under consideration. We test the performance of the proposed app… ▽ More In this article we introduce and study a deep learning based approximation algorithm for solutions of stochastic partial differential equations (SPDEs). In the proposed approximation algorithm we employ a deep neural network for every realization of the driving noise process of the SPDE to approximate the solution process of the SPDE under consideration. We test the performance of the proposed approximation algorithm in the case of stochastic heat equations with additive noise, stochastic heat equations with multiplicative noise, stochastic Black--Scholes equations with multiplicative noise, and Zakai equations from nonlinear filtering. In each of these SPDEs the proposed approximation algorithm produces accurate results with short run times in up to 50 space dimensions. △ Less

Submitted 2 December, 2020; originally announced December 2020.

arXiv:2009.13989 [pdf, ps, other]

Nonlinear Monte Carlo methods with polynomial runtime for high-dimensional iterated nested expectations

Authors: Christian Beck, Arnulf Jentzen, Thomas Kruse

Abstract: The approximative calculation of iterated nested expectations is a recurring challenging problem in applications. Nested expectations appear, for example, in the numerical approximation of solutions of backward stochastic differential equations (BSDEs), in the numerical approximation of solutions of semilinear parabolic partial differential equations (PDEs), in statistical physics, in optimal stop… ▽ More The approximative calculation of iterated nested expectations is a recurring challenging problem in applications. Nested expectations appear, for example, in the numerical approximation of solutions of backward stochastic differential equations (BSDEs), in the numerical approximation of solutions of semilinear parabolic partial differential equations (PDEs), in statistical physics, in optimal stop** problems such as the approximative pricing of American or Bermudan options, in risk measure estimation in mathematical finance, or in decision-making under uncertainty. Nested expectations which arise in the above named applications often consist of a large number of nestings. However, the computational effort of standard nested Monte Carlo approximations for iterated nested expectations grows exponentially in the number of nestings and it remained an open question whether it is possible to approximately calculate multiply iterated high-dimensional nested expectations in polynomial time. In this article we tackle this problem by proposing and studying a new class of full-history recursive multilevel Picard (MLP) approximation schemes for iterated nested expectations. In particular, we prove under suitable assumptions that these MLP approximation schemes can approximately calculate multiply iterated nested expectations with a computational effort growing at most polynomially in the number of nestings $ K \in \mathbb{N} = \{1, 2, 3, \ldots \} $, in the problem dimension $ d \in \mathbb{N} $, and in the reciprocal $\frac{1}{\varepsilon}$ of the desired approximation accuracy $ \varepsilon \in (0, \infty) $. △ Less

Submitted 29 September, 2020; originally announced September 2020.

Comments: 47 pages

MSC Class: 65C05 (Primary) 65M75; 68Q25 (Secondary)

arXiv:2009.02484 [pdf, ps, other]

Multilevel Picard approximations for high-dimensional semilinear second-order PDEs with Lipschitz nonlinearities

Authors: Martin Hutzenthaler, Arnulf Jentzen, Thomas Kruse, Tuan Anh Nguyen

Abstract: The recently introduced full-history recursive multilevel Picard (MLP) approximation methods have turned out to be quite successful in the numerical approximation of solutions of high-dimensional nonlinear PDEs. In particular, there are mathematical convergence results in the literature which prove that MLP approximation methods do overcome the curse of dimensionality in the numerical approximatio… ▽ More The recently introduced full-history recursive multilevel Picard (MLP) approximation methods have turned out to be quite successful in the numerical approximation of solutions of high-dimensional nonlinear PDEs. In particular, there are mathematical convergence results in the literature which prove that MLP approximation methods do overcome the curse of dimensionality in the numerical approximation of nonlinear second-order PDEs in the sense that the number of computational operations of the proposed MLP approximation method grows at most polynomially in both the reciprocal $1/ε$ of the prescribed approximation accuracy $ε>0$ and the PDE dimension $d\in \mathbb{N}=\{1,2,3, \ldots\}$. However, in each of the convergence results for MLP approximation methods in the literature it is assumed that the coefficient functions in front of the second-order differential operator are affine linear. In particular, until today there is no result in the scientific literature which proves that any semilinear second-order PDE with a general time horizon and a non affine linear coefficient function in front of the second-order differential operator can be approximated without the curse of dimensionality. It is the key contribution of this article to overcome this obstacle and to propose and analyze a new type of MLP approximation method for semilinear second-order PDEs with possibly nonlinear coefficient functions in front of the second-order differential operators. In particular, the main result of this article proves that this new MLP approximation method does indeed overcome the curse of dimensionality in the numerical approximation of semilinear second-order PDEs. △ Less

Submitted 9 October, 2020; v1 submitted 5 September, 2020; originally announced September 2020.

arXiv:2008.13333 [pdf, other]

doi 10.1088/1361-6544/ac337f

Algorithms for Solving High Dimensional PDEs: From Nonlinear Monte Carlo to Machine Learning

Authors: Weinan E, Jiequn Han, Arnulf Jentzen

Abstract: In recent years, tremendous progress has been made on numerical algorithms for solving partial differential equations (PDEs) in a very high dimension, using ideas from either nonlinear (multilevel) Monte Carlo or deep learning. They are potentially free of the curse of dimensionality for many different applications and have been proven to be so in the case of some nonlinear Monte Carlo methods for… ▽ More In recent years, tremendous progress has been made on numerical algorithms for solving partial differential equations (PDEs) in a very high dimension, using ideas from either nonlinear (multilevel) Monte Carlo or deep learning. They are potentially free of the curse of dimensionality for many different applications and have been proven to be so in the case of some nonlinear Monte Carlo methods for nonlinear parabolic PDEs. In this paper, we review these numerical and theoretical advances. In addition to algorithms based on stochastic reformulations of the original problem, such as the multilevel Picard iteration and the Deep BSDE method, we also discuss algorithms based on the more traditional Ritz, Galerkin, and least square formulations. We hope to demonstrate to the reader that studying PDEs as well as control and variational problems in very high dimensions might very well be among the most promising new directions in mathematics and scientific computing in the near future. △ Less

Submitted 11 September, 2020; v1 submitted 30 August, 2020; originally announced August 2020.

MSC Class: 65C05; 65K10; 65M75; 90C06

Journal ref: Nonlinearity 35 (2022) 278-310

arXiv:2007.02723 [pdf, ps, other]

Weak error analysis for stochastic gradient descent optimization algorithms

Authors: Aritz Bercher, Lukas Gonon, Arnulf Jentzen, Diyora Salimova

Abstract: Stochastic gradient descent (SGD) type optimization schemes are fundamental ingredients in a large number of machine learning based algorithms. In particular, SGD type optimization schemes are frequently employed in applications involving natural language processing, object and face recognition, fraud detection, computational advertisement, and numerical approximations of partial differential equa… ▽ More Stochastic gradient descent (SGD) type optimization schemes are fundamental ingredients in a large number of machine learning based algorithms. In particular, SGD type optimization schemes are frequently employed in applications involving natural language processing, object and face recognition, fraud detection, computational advertisement, and numerical approximations of partial differential equations. In mathematical convergence results for SGD type optimization schemes there are usually two types of error criteria studied in the scientific literature, that is, the error in the strong sense and the error with respect to the objective function. In applications one is often not only interested in the size of the error with respect to the objective function but also in the size of the error with respect to a test function which is possibly different from the objective function. The analysis of the size of this error is the subject of this article. In particular, the main result of this article proves under suitable assumptions that the size of this error decays at the same speed as in the special case where the test function coincides with the objective function. △ Less

Submitted 21 July, 2020; v1 submitted 3 July, 2020; originally announced July 2020.

Comments: 123 pages

arXiv:2006.07075 [pdf, ps, other]

doi 10.1016/j.jco.2020.101540

Non-convergence of stochastic gradient descent in the training of deep neural networks

Authors: Patrick Cheridito, Arnulf Jentzen, Florian Rossmannek

Abstract: Deep neural networks have successfully been trained in various application areas with stochastic gradient descent. However, there exists no rigorous mathematical explanation why this works so well. The training of neural networks with stochastic gradient descent has four different discretization parameters: (i) the network architecture; (ii) the amount of training data; (iii) the number of gradien… ▽ More Deep neural networks have successfully been trained in various application areas with stochastic gradient descent. However, there exists no rigorous mathematical explanation why this works so well. The training of neural networks with stochastic gradient descent has four different discretization parameters: (i) the network architecture; (ii) the amount of training data; (iii) the number of gradient steps; and (iv) the number of randomly initialized gradient trajectories. While it can be shown that the approximation error converges to zero if all four parameters are sent to infinity in the right order, we demonstrate in this paper that stochastic gradient descent fails to converge for ReLU networks if their depth is much larger than their width and the number of random initializations does not increase to infinity fast enough. △ Less

Submitted 29 January, 2021; v1 submitted 12 June, 2020; originally announced June 2020.

MSC Class: 68T07 (Primary) 65D15 (Secondary) ACM Class: I.2.6

Journal ref: J. Complexity 64 (2021)

arXiv:2006.02199 [pdf, ps, other]

Space-time deep neural network approximations for high-dimensional partial differential equations

Authors: Fabian Hornung, Arnulf Jentzen, Diyora Salimova

Abstract: It is one of the most challenging issues in applied mathematics to approximately solve high-dimensional partial differential equations (PDEs) and most of the numerical approximation methods for PDEs in the scientific literature suffer from the so-called curse of dimensionality in the sense that the number of computational operations employed in the corresponding approximation scheme to obtain an a… ▽ More It is one of the most challenging issues in applied mathematics to approximately solve high-dimensional partial differential equations (PDEs) and most of the numerical approximation methods for PDEs in the scientific literature suffer from the so-called curse of dimensionality in the sense that the number of computational operations employed in the corresponding approximation scheme to obtain an approximation precision $\varepsilon>0$ grows exponentially in the PDE dimension and/or the reciprocal of $\varepsilon$. Recently, certain deep learning based approximation methods for PDEs have been proposed and various numerical simulations for such methods suggest that deep neural network (DNN) approximations might have the capacity to indeed overcome the curse of dimensionality in the sense that the number of real parameters used to describe the approximating DNNs grows at most polynomially in both the PDE dimension $d\in\mathbb{N}$ and the reciprocal of the prescribed accuracy $\varepsilon>0$. There are now also a few rigorous results in the scientific literature which substantiate this conjecture by proving that DNNs overcome the curse of dimensionality in approximating solutions of PDEs. Each of these results establishes that DNNs overcome the curse of dimensionality in approximating suitable PDE solutions at a fixed time point $T>0$ and on a compact cube $[a,b]^d$ in space but none of these results provides an answer to the question whether the entire PDE solution on $[0,T]\times [a,b]^d$ can be approximated by DNNs without the curse of dimensionality. It is precisely the subject of this article to overcome this issue. More specifically, the main result of this work in particular proves for every $a\in\mathbb{R}$, $ b\in (a,\infty)$ that solutions of certain Kolmogorov PDEs can be approximated by DNNs on the space-time region $[0,T]\times [a,b]^d$ without the curse of dimensionality. △ Less

Submitted 3 June, 2024; v1 submitted 3 June, 2020; originally announced June 2020.

Comments: 56 pages, 2 figures

arXiv:2005.10206 [pdf, other]

doi 10.4208/cicp.OA-2020-0130

Numerical simulations for full history recursive multilevel Picard approximations for systems of high-dimensional partial differential equations

Authors: Sebastian Becker, Ramon Braunwarth, Martin Hutzenthaler, Arnulf Jentzen, Philippe von Wurstemberger

Abstract: One of the most challenging issues in applied mathematics is to develop and analyze algorithms which are able to approximately compute solutions of high-dimensional nonlinear partial differential equations (PDEs). In particular, it is very hard to develop approximation algorithms which do not suffer under the curse of dimensionality in the sense that the number of computational operations needed b… ▽ More One of the most challenging issues in applied mathematics is to develop and analyze algorithms which are able to approximately compute solutions of high-dimensional nonlinear partial differential equations (PDEs). In particular, it is very hard to develop approximation algorithms which do not suffer under the curse of dimensionality in the sense that the number of computational operations needed by the algorithm to compute an approximation of accuracy $ε> 0$ grows at most polynomially in both the reciprocal $1/ε$ of the required accuracy and the dimension $d \in \mathbb{N}$ of the PDE. Recently, a new approximation method, the so-called full history recursive multilevel Picard (MLP) approximation method, has been introduced and, until today, this approximation scheme is the only approximation method in the scientific literature which has been proven to overcome the curse of dimensionality in the numerical approximation of semilinear PDEs with general time horizons. It is a key contribution of this article to extend the MLP approximation method to systems of semilinear PDEs and to numerically test it on several example PDEs. More specifically, we apply the proposed MLP approximation method in the case of Allen-Cahn PDEs, Sine-Gordon-type PDEs, systems of coupled semilinear heat PDEs, and semilinear Black-Scholes PDEs in up to 1000 dimensions. The presented numerical simulation results suggest in the case of each of these example PDEs that the proposed MLP approximation method produces very accurate results in short runtimes and, in particular, the presented numerical simulation results indicate that the proposed MLP approximation scheme significantly outperforms certain deep learning based approximation methods for high-dimensional semilinear PDEs. △ Less

Submitted 25 May, 2020; v1 submitted 20 May, 2020; originally announced May 2020.

Comments: 21 pages

MSC Class: 65M75 ACM Class: G.1.8

Journal ref: Commun. Comput. Phys. 28 (2020), no. 5, 2109-2138

arXiv:2004.03389 [pdf, ps, other]

doi 10.1142/S0219493721500489

On nonlinear Feynman-Kac formulas for viscosity solutions of semilinear parabolic partial differential equations

Authors: Christian Beck, Martin Hutzenthaler, Arnulf Jentzen

Abstract: The classical Feynman-Kac identity builds a bridge between stochastic analysis and partial differential equations (PDEs) by providing stochastic representations for classical solutions of linear Kolmogorov PDEs. This opens the door for the derivation of sampling based Monte Carlo approximation methods, which can be meshfree and thereby stand a chance to approximate solutions of PDEs without suffer… ▽ More The classical Feynman-Kac identity builds a bridge between stochastic analysis and partial differential equations (PDEs) by providing stochastic representations for classical solutions of linear Kolmogorov PDEs. This opens the door for the derivation of sampling based Monte Carlo approximation methods, which can be meshfree and thereby stand a chance to approximate solutions of PDEs without suffering from the curse of dimensionality. In this article we extend the classical Feynman-Kac formula to certain semilinear Kolmogorov PDEs. More specifically, we identify suitable solutions of stochastic fixed point equations (SFPEs), which arise when the classical Feynman-Kac identity is formally applied to semilinear Kolmorogov PDEs, as viscosity solutions of the corresponding PDEs. This justifies, in particular, employing full-history recursive multilevel Picard (MLP) approximation algorithms, which have recently been shown to overcome the curse of dimensionality in the numerical approximation of solutions of SFPEs, in the numerical approximation of semilinear Kolmogorov PDEs. △ Less

Submitted 16 April, 2020; v1 submitted 6 April, 2020; originally announced April 2020.

Comments: 54 pages

Journal ref: Stochastics and Dynamics (2021), 2150048, 68 pages

arXiv:2003.01291 [pdf, other]

Overall error analysis for the training of deep neural networks via stochastic gradient descent with random initialisation

Authors: Arnulf Jentzen, Timo Welti

Abstract: In spite of the accomplishments of deep learning based algorithms in numerous applications and very broad corresponding research interest, at the moment there is still no rigorous understanding of the reasons why such algorithms produce useful results in certain situations. A thorough mathematical analysis of deep learning based algorithms seems to be crucial in order to improve our understanding… ▽ More In spite of the accomplishments of deep learning based algorithms in numerous applications and very broad corresponding research interest, at the moment there is still no rigorous understanding of the reasons why such algorithms produce useful results in certain situations. A thorough mathematical analysis of deep learning based algorithms seems to be crucial in order to improve our understanding and to make their implementation more effective and efficient. In this article we provide a mathematically rigorous full error analysis of deep learning based empirical risk minimisation with quadratic loss function in the probabilistically strong sense, where the underlying deep neural networks are trained using stochastic gradient descent with random initialisation. The convergence speed we obtain is presumably far from optimal and suffers under the curse of dimensionality. To the best of our knowledge, we establish, however, the first full error analysis in the scientific literature for a deep learning based algorithm in the probabilistically strong sense and, moreover, the first full error analysis in the scientific literature for a deep learning based algorithm where stochastic gradient descent with random initialisation is the employed optimisation method. △ Less

Submitted 2 March, 2020; originally announced March 2020.

Comments: 51 pages

MSC Class: 62M45; 68T05; 62L20; 60H30

arXiv:2003.00596 [pdf, ps, other]

Overcoming the curse of dimensionality in the numerical approximation of high-dimensional semilinear elliptic partial differential equations

Authors: Christian Beck, Lukas Gonon, Arnulf Jentzen

Abstract: Recently, so-called full-history recursive multilevel Picard (MLP) approximation schemes have been introduced and shown to overcome the curse of dimensionality in the numerical approximation of semilinear parabolic partial differential equations (PDEs) with Lipschitz nonlinearities. The key contribution of this article is to introduce and analyze a new variant of MLP approximation schemes for cert… ▽ More Recently, so-called full-history recursive multilevel Picard (MLP) approximation schemes have been introduced and shown to overcome the curse of dimensionality in the numerical approximation of semilinear parabolic partial differential equations (PDEs) with Lipschitz nonlinearities. The key contribution of this article is to introduce and analyze a new variant of MLP approximation schemes for certain semilinear elliptic PDEs with Lipschitz nonlinearities and to prove that the proposed approximation schemes overcome the curse of dimensionality in the numerical approximation of such semilinear elliptic PDEs. △ Less

Submitted 1 March, 2020; originally announced March 2020.

Comments: 50 pages

MSC Class: 65Cxx; 65Mxx

arXiv:2001.03472 [pdf, ps, other]

doi 10.3934/dcdsb.2021203

Counterexamples to local Lipschitz and local Hölder continuity with respect to the initial values for additive noise driven SDEs with smooth drift coefficient functions with at most polynomially growing derivatives

Authors: Arnulf Jentzen, Benno Kuckuck, Thomas Müller-Gronbach, Larisa Yaroslavtseva

Abstract: In the recent article [A. Jentzen, B. Kuckuck, T. Müller-Gronbach, and L. Yaroslavtseva, arXiv:1904.05963 (2019)] it has been proved that the solutions to every additive noise driven stochastic differential equation (SDE) which has a drift coefficient function with at most polynomially growing first order partial derivatives and which admits a Lyapunov-type condition (ensuring the the existence of… ▽ More In the recent article [A. Jentzen, B. Kuckuck, T. Müller-Gronbach, and L. Yaroslavtseva, arXiv:1904.05963 (2019)] it has been proved that the solutions to every additive noise driven stochastic differential equation (SDE) which has a drift coefficient function with at most polynomially growing first order partial derivatives and which admits a Lyapunov-type condition (ensuring the the existence of a unique solution to the SDE) depend in a logarithmically Hölder continuous way on their initial values. One might then wonder whether this result can be sharpened and whether in fact, SDEs from this class necessarily have solutions which depend locally Lipschitz continuously on their initial value. The key contribution of this article is to establish that this is not the case. More precisely, we supply a family of examples of additive noise driven SDEs which have smooth drift coefficient functions with at most polynomially growing derivatives whose solutions do not depend on their initial value in a locally Lipschitz continuous, nor even in a locally Hölder continuous way. △ Less

Submitted 10 January, 2020; originally announced January 2020.

Comments: 27 pages

Journal ref: Discrete Contin. Dyn. Syst. Ser. B 27 (2022), no. 7, 3707-3724

arXiv:1912.11060 [pdf, other]

doi 10.3390/jrfm13070158

Pricing and hedging American-style options with deep learning

Authors: Sebastian Becker, Patrick Cheridito, Arnulf Jentzen

Abstract: In this paper we introduce a deep learning method for pricing and hedging American-style options. It first computes a candidate optimal stop** policy. From there it derives a lower bound for the price. Then it calculates an upper bound, a point estimate and confidence intervals. Finally, it constructs an approximate dynamic hedging strategy. We test the approach on different specifications of a… ▽ More In this paper we introduce a deep learning method for pricing and hedging American-style options. It first computes a candidate optimal stop** policy. From there it derives a lower bound for the price. Then it calculates an upper bound, a point estimate and confidence intervals. Finally, it constructs an approximate dynamic hedging strategy. We test the approach on different specifications of a Bermudan max-call option. In all cases it produces highly accurate prices and dynamic hedging strategies with small replication errors. △ Less

Submitted 18 July, 2020; v1 submitted 23 December, 2019; originally announced December 2019.

Journal ref: Journal of Risk and Financial Management 13, 7 (2020)

arXiv:1912.04310 [pdf, other]

doi 10.1109/TNNLS.2021.3049719

Efficient approximation of high-dimensional functions with neural networks

Authors: Patrick Cheridito, Arnulf Jentzen, Florian Rossmannek

Abstract: In this paper, we develop a framework for showing that neural networks can overcome the curse of dimensionality in different high-dimensional approximation problems. Our approach is based on the notion of a catalog network, which is a generalization of a standard neural network in which the nonlinear activation functions can vary from layer to layer as long as they are chosen from a predefined cat… ▽ More In this paper, we develop a framework for showing that neural networks can overcome the curse of dimensionality in different high-dimensional approximation problems. Our approach is based on the notion of a catalog network, which is a generalization of a standard neural network in which the nonlinear activation functions can vary from layer to layer as long as they are chosen from a predefined catalog of functions. As such, catalog networks constitute a rich family of continuous functions. We show that under appropriate conditions on the catalog, catalog networks can efficiently be approximated with rectified linear unit-type networks and provide precise estimates on the number of parameters needed for a given approximation accuracy. As special cases of the general results, we obtain different classes of functions that can be approximated with ReLU networks without the curse of dimensionality. △ Less

Submitted 29 January, 2021; v1 submitted 9 December, 2019; originally announced December 2019.

MSC Class: 68T07 ACM Class: I.2.0

Journal ref: IEEE Trans. Neural Netw. Learn. Syst. (2021)

Showing 1–50 of 127 results for author: Jentzen, A