Fast Iterative Solver for Neural Network Method:
II. 1D diffusion-reaction problems and data fitting
This work was supported in part by the National Science Foundation under grant DMS-2110571. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-JRNL-865920).
Abstract
This paper expands the damped block Newton (dBN) method introduced recently in [4] for 1D diffusion-reaction equations and least-squares data fitting problems. To determine the linear parameters (the weights and bias of the output layer) of the neural network (NN), the dBN method requires solving systems of linear equations involving the mass matrix. While the mass matrix for local hat basis functions is tri-diagonal and well-conditioned, the mass matrix for NNs is dense and ill-conditioned. For example, the condition number of the NN mass matrix for quasi-uniform meshes is at least . We present a factorization of the mass matrix that enables solving the systems of linear equations in operations. To determine the non-linear parameters (the weights and bias of the hidden layer), one step of a damped Newton method is employed at each iteration. A Gauss-Newton method is used in place of Newton for the instances in which the Hessian matrices are singular. This modified dBN is referred to as dBGN. For both methods, the computational cost per iteration is . Numerical results demonstrate the ability of dBN and dBGN to efficiently achieve accurate results and outperform BFGS for select examples.
keywords:
Fast iterative solvers, Neural network, Ritz formulation, ReLU activation, Diffusion-Reaction problems, Data fitting, Newton’s method, Gauss-Newton’s method
1 Introduction
Using neural networks to solve partial differential equations (PDEs) has recently gained traction in the iterative solvers community (see, e.g., [1, 2, 6, 7, 11, 12]). In particular, the damped block Newton (dBN) method presented in [4] is a fast iterative solver for 1D diffusion problems. The discretization from the Ritz formulation of the one-dimensional diffusion equation introduces a high-dimensional, non-convex minimization problem. The dBN method numerically solves this problem using the block Gauss-Seidel method for the linear and non-linear parameters as an outer iteration. For the inner iteration, the corresponding coefficient and Hessian matrices are inverted exactly. The computational cost of the dBN method is per iteration, which is an improvement over for common second order methods. This paper extends the methods in [4] to a broader class of problems, while maintaining the efficiency achieved in [4].
For elliptic PDEs beyond diffusion problems, as well as data fitting problems, the mass matrix must be inverted to solve for the linear parameter. Just as for the coefficient matrix in [4], the mass matrix depends on the non-linear parameter. However, the mass matrix is dense and much more ill-conditioned than the coefficient matrix. Whereas the coefficient matrix has condition number bounded by and has a tri-diagonal inverse [4], the mass matrix has condition number bounded by (see Lemma 2.3). Here, is the number of neurons and is the smallest distance between two neighboring breakpoints. This is completely different from the finite element method, in which the mass matrix is tri-diagonal and the condition number is for local hat basis functions on quasi-uniform meshes. Yet solving the linear systems efficiently is still possible; two representations of the mass matrix in terms of simpler matrices are presented in Section 2. Both methods make the inversion less computationally expensive.
The non-linear parameters for this broader class of problems present further challenges. Unlike in diffusion problems, the Hessian matrices for both diffusion-reaction and non-linear least squares problems are no longer diagonal and depend on the coefficient matrix. However, a factorization is used to compute the inverse of the Hessian efficiently, utilizing the explicit formula for the inverse of the coefficient matrix from [4]. Furthermore, for the cases in which the Hessian matrices are non-invertible, a damped block Gauss-Newton (dBGN) method is presented. The Gauss-Newton matrix is positive-definite, and its inverse is tri-diagonal. Whether using dBN or dBGN, the computational cost per iteration remains , as in [4]. Even faster convergence for the non-linear parameter is possible for diffusion-reaction problems when adding adaptive neuron enhancement (ANE) [10]. Numerical examples demonstrate the ability of the aforementioned methods to move the breakpoints quickly and efficiently and to outperform BFGS for select examples.
The paper is structured as follows: Section 2 introduces the notation for shallow neural networks and the corresponding mass matrix. The condition numbers for both neural network and finite element mass matrices are presented and compared. This is followed by a discussion of two ways in which to decompose the mass matrix in order to more efficiently invert it. Then the problems in which the mass matrix arises are presented in Section 3.1 and Section 3.2. The non-linear least-squares optimization problem using shallow neural networks is presented in Section 3.1. Then in Section 3.2 the diffusion-reaction equation and the modified Ritz formulation are introduced. Next, the dBN method is reiterated in Section 4, emphasizing the modifications that need to be made to the dBN in [4] in order for it to work for the broader class of problems presented in this paper. For cases in which the Hessian for the non-linear parameter is non-invertible, the dBGN method is outlined. This is followed by Section 4.1, in which we recall the adaptivity scheme (AdBN) from [4], which can also be used for diffusion-reaction problems. Lastly, numerical results are presented in Section 5, demonstrating the performance of the aforementioned methods, as compared to BFGS, for select example problems. The examples in Section 5 highlight the ability of these methods to move mesh points to enhance the approximation. In particular, the results in Section 5.3 demonstrate the ability of dBN to solve the singularly perturbed reaction-diffusion equation.
2 Mass Matrix for Shallow Neural Network
This section studies the mass matrix resulting from a shallow ReLU neural network and the computation of its inverse.
As in [4], the set of approximating functions generated by the shallow ReLU neural network with neurons is denoted by
where and is the ReLU activation function. Let be a real-valued function defined on and bounded below by a positive constant almost everywhere.
Consider the following mass matrix associated with the weight function given by
(1) |
and the coefficient matrix associated with given by
for , where is the Heaviside (unit) step function and is the non-linear parameter.
While the coefficient matrix is dense, its inverse is a tri-diagonal matrix with an explicit algebraic formula (see [4]). This property holds for a class of matrices with a special structure.
Lemma 2.1.
For , assume that
for all . Then the matrix
(2) |
is invertible. Moreover, its inverse is symmetric and tri-diagonal with non-zero entries given by
where and .
Proof 2.2.
It is easy to verify that .
The coefficient matrix has the same structure as with , , and . In the case of the mass matrix , it remains dense due to the global support of neurons, and its condition number is very large (see Section 2.1). This section derives inverse formulas for the mass matrix, whose application needs operations. The derivation is given both algebraically in Section 2.2 and geometrically in Section 2.3.
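To make the tri-diagonal-inverse property concrete, the following minimal sketch treats one special case under stated assumptions: the diffusion coefficient is taken to be identically 1 on the interval (0, 1) and the breakpoints are sorted, so that the coefficient matrix entries reduce to `1 - max(b_i, b_j)`. This closed form, the interval, and the function names `solve_max_structured` and `dense_matrix` are illustrative assumptions, not the notation of the text. In this case the matrix factors as `T diag(d) T^T` with `T` the upper-triangular matrix of ones, so a linear solve reduces to two difference sweeps and one diagonal scaling.

```python
def solve_max_structured(c, f):
    """Solve K x = f in O(n), where K[i][j] = c[max(i, j)].

    Assumes c is strictly decreasing and positive (e.g. c[i] = 1 - b[i]
    for sorted breakpoints b in [0, 1)), which is an illustrative case.
    """
    n = len(c)
    # d[k] = c[k] - c[k+1] with c[n] := 0, so K = T diag(d) T^T,
    # where T is the upper-triangular matrix of ones.
    d = [c[k] - (c[k + 1] if k + 1 < n else 0.0) for k in range(n)]
    # y = T^{-1} f: forward differences of f.
    y = [f[k] - (f[k + 1] if k + 1 < n else 0.0) for k in range(n)]
    # z = diag(d)^{-1} y.
    z = [y[k] / d[k] for k in range(n)]
    # x = T^{-T} z: backward differences of z.
    return [z[k] - (z[k - 1] if k > 0 else 0.0) for k in range(n)]


def dense_matrix(c):
    """Assemble the dense matrix K[i][j] = c[max(i, j)] for checking only."""
    n = len(c)
    return [[c[max(i, j)] for j in range(n)] for i in range(n)]


b = [0.1, 0.25, 0.5, 0.8]          # hypothetical sorted breakpoints
c = [1.0 - bi for bi in b]         # entries 1 - max(b_i, b_j)
f = [1.0, -2.0, 0.5, 3.0]          # arbitrary right-hand side
x = solve_max_structured(c, f)
```

The dense matrix is assembled here only to check the result; the solve itself never forms it, which is why the cost stays linear in the number of neurons.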
2.1 Condition Number
Let for , and set
It was shown in [4] that the condition number of is bounded by for . The next lemma provides an upper bound for the condition number of the mass matrix.
Lemma 2.3.
Let , then the condition number of the mass matrix is bounded by .
Proof 2.4.
For any vector , denote its magnitude by . By the Cauchy-Schwarz inequality and the fact that for , we have
(3) |
To estimate the lower bound of , let
for . Then . Since is a quadratic function in each sub-interval , Simpson’s Rule implies
where . It is easy to see that , where is a -order lower tri-diagonal matrix given by
It is easy to verify that has spectral norm bounded by
Hence,
which, together with the upper bound in Eq. 3, implies the validity of the lemma.
Lemma 2.5.
Under the assumption on the weight function , the condition number of the mass matrix is bounded by .
Proof 2.6.
Since and almost everywhere, in a similar fashion as the proof of Lemma 2.3, we have
which implies the validity of the lemma.
Whereas the mass matrix associated with the ReLU neural network is very ill-conditioned, it is well known that the mass matrix for the finite element (FE) method is much better conditioned (see, e.g., [8]). The following Lemma 2.7 restates the result in [8], with an alternative proof similar to that of Lemma 2.3.
Assume that , and set
For the partition , denote the hat basis functions for by
Next let . Then the corresponding FE mass matrix for this partition is denoted by
Lemma 2.7.
The condition number of the finite element mass matrix is bounded by
.
Proof 2.8.
For any vector , in a similar fashion as that of Lemma 2.3, we get the equality
with , which leads to the inequalities
This completes the proof of the lemma.
2.2 Algebraic Approach
This section derives an inverse formula of the mass matrix through a decomposition into two matrices. The decomposition is based on the fact that matrices with the structure of in Eq. 2 have tri-diagonal inverses.
For , let be the -element of the mass matrix , then
which implies the following decomposition
Both and have the same structure as in Eq. 2 with
where
Proposition 2.9.
The inverse of the mass matrix is given by
(4) |
Proof 2.10.
Remark 1.
Since and are tri-diagonal, so is . Hence, in Eq. 4 applied to any vector can be computed in operations.
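As a hedged illustration of how such a two-term splitting can arise, the sketch below assumes a unit weight function on the interval (0, 1), so that the mass matrix entry for breakpoints `b_i <= b_j` is the integral of `(x - b_i)(x - b_j)` over `[b_j, 1]`. Writing `u = 1 - b_j`, this evaluates to `u^3/3 + (b_j - b_i) u^2/2`, which separates into a term depending only on the larger breakpoint and a product of a factor in the smaller breakpoint with a factor in the larger one. Whether this coincides with the decomposition used in the text is an assumption; the helper names are hypothetical.

```python
def relu(t):
    return t if t > 0.0 else 0.0


def mass_entry_closed_form(bi, bj):
    """Closed form of int_0^1 relu(x - bi) * relu(x - bj) dx (unit weight assumed)."""
    lo, hi = min(bi, bj), max(bi, bj)
    u = 1.0 - hi
    return u**3 / 3.0 + (hi - lo) * u**2 / 2.0


def mass_entry_simpson(bi, bj):
    """Same integral by one Simpson panel on [max(bi, bj), 1].

    The integrand vanishes to the left of the larger breakpoint and is a
    quadratic to the right, so a single Simpson panel is exact.
    """
    hi = max(bi, bj)
    g = lambda x: relu(x - bi) * relu(x - bj)
    mid = 0.5 * (hi + 1.0)
    return (1.0 - hi) / 6.0 * (g(hi) + 4.0 * g(mid) + g(1.0))


def split_entry(bi, bj):
    """Split the entry into (term in the larger breakpoint only, product term)."""
    lo, hi = min(bi, bj), max(bi, bj)
    u = 1.0 - hi
    p = u**3 / 3.0 + hi * u**2 / 2.0   # depends only on the larger breakpoint
    q = u**2 / 2.0                     # factor in the larger breakpoint
    return p, -lo * q                  # second term: (smaller factor) * (larger factor)


bs = [0.0, 0.2, 0.55, 0.9]             # hypothetical sorted breakpoints
entries = [[mass_entry_closed_form(bi, bj) for bj in bs] for bi in bs]
```

Each of the two pieces has entries of the form (function of the smaller index) times (function of the larger index), which is the kind of structure for which tri-diagonal inverses are typical.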
2.3 Geometric Approach
This section presents another way to invert the mass matrix, based on a factorization of into the product of three tri-diagonal matrices. The factorization arises from expressing the global ReLU basis functions in terms of local discontinuous basis functions.
To this end, for , let and define the local basis functions
Since in , we have
(5) |
Set
(6) |
where ; and let ,
Lemma 2.11.
There exist mappings and such that
(7) |
Moreover, we have
where is the -order identity matrix.
Proof 2.12.
Eq. 5 implies that there exist and such that Eq. 7 is valid. To determine and , for any , let , then
On each , using the facts that and are constants, we have
which, together with arbitrariness of , implies that .
By the definitions of and and the fact that , we have
which, together with arbitrariness of , implies that . This completes the proof of the lemma.
For , let
For , let
(8) |
Then, together with , it is easy to see that
Theorem 2.
Proof 2.13.
Remark 3.
Clearly, is tri-diagonal. Hence, and hence applied to any vector can be computed in operations.
3 Applications
This section considers two applications: the least-squares data fitting and the diffusion-reaction equation in one dimension. When using the shallow ReLU neural network, the resulting discretization requires inversion of the corresponding mass matrix.
3.1 Least-Squares Approximation
The first problem type in which the mass matrix arises is least-squares data fitting. Given a function , the best least-squares approximation to in is to find and such that
(11) |
where is the weighted continuous least-squares loss functional given by
Let be a solution of Eq. 11 having the form of
Clearly, the optimality condition on the linear parameter gives
(12) |
where is the mass matrix defined in Eq. 1 and is given by
where is defined in Eq. 6.
Let be a diagonal matrix with the linear parameter, then the optimality condition on the non-linear parameter leads to
(13) |
Eq. 13 is a system of non-linear algebraic equations and will be solved by Newton’s method. Let for . In one dimension, Lemma 4.1 in [3] implies that the corresponding Hessian matrix is of the form
(14) |
where is a diagonal matrix given by
3.2 Diffusion-Reaction Problem
The second application that we consider is the following diffusion-reaction equation in one dimension:
(15) |
where the diffusion coefficient , the reaction coefficient , and are given real-valued functions defined on . Assume that and are bounded below by the respective positive constant and non-negative constant almost everywhere on .
As in [4], the modified Ritz formulation of problem (15) is to find such that
(16) |
where the modified energy functional is given by
(17) |
Here, is a penalization constant. Then the Ritz neural network approximation is to find such that
(18) |
The corresponding bilinear form of the modified energy functional is given by
for any . Denote by the induced norm of the bilinear form.
Proposition 3.1.
Proof 3.2.
3.2.1 System of Algebraic Equations
Let be the solution of problem Eq. 18, then the linear parameter and non-linear parameter satisfy the following optimality conditions
(22) |
where and denote the gradients with respect to and , respectively.
Denote the right-hand side vector by
and let . By the same derivation in [4], the first equation in Eq. 22 becomes
(23) |
Compared with (3.2) in [4], the additional term in Eq. 23 results from the reaction term.
For , let
Let be the diagonal matrix with the -th diagonal elements .
Lemma 3.3.
The Hessian matrix has the form
(24) |
Proof 3.4.
4 Damped Block Newton and Gauss-Newton Methods
Optimality conditions of the minimization problems in Eq. 11 and Eq. 18 lead to systems of non-linear algebraic equations of the form
(25) |
for the linear and non-linear parameters, respectively, where the first equation is given in Eq. 12 for the least-squares (LS) approximation and in Eq. 23 for the diffusion-reaction (DR) equation with
The respective Hessian matrix is given in Eq. 14 and Eq. 24 with
(26) |
In a similar fashion as in [3], the Gauss-Newton matrix is given by
(27) |
In the case that in Eq. 26 is invertible, the non-linear system in Eq. 25 can be solved by the damped block Newton (dBN) method described in Algorithm 4.1 of [4]. The method employs the block Gauss-Seidel method as an outer iteration for the linear and non-linear parameters. In each outer iteration, the linear and the non-linear parameters are updated by exact inversion and one step of a damped Newton method, respectively.
To efficiently invert , we use the factorizations of and given in Eq. 9 and Eq. 10, respectively. That is,
(28) |
Since and are tri-diagonal, the action of their inverses on any vector can be computed in operations, and so can the action of . For the diffusion-reaction problem, the Sherman-Morrison formula is needed for a rank-one update.
In the case that in Eq. 26 is singular, the non-linear system in Eq. 25 can be solved by the structure-guided Gauss-Newton (SgGN) method described in Algorithm 4.1 of [3]. This is because the layer Gauss-Newton matrix is always symmetric positive-definite and its inverse is tri-diagonal (see [4]). The SgGN method is essentially the damped block Gauss-Newton (dBGN) method, which replaces in the dBN method by .
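The outer/inner structure described above can be sketched on a deliberately simple stand-in problem. The example below is not the ReLU network of this paper: it fits the hypothetical model `c * exp(-b * x)` to synthetic data, with one linear parameter `c` and one non-linear parameter `b`. Each outer iteration solves for `c` exactly and then takes one damped Gauss-Newton step in `b`. The damping rule (step halving until the loss does not increase) is an assumption made for illustration; the rule actually used is the one given in the cited algorithms.

```python
import math


def fit_separable(xs, ys, b0, iters=200):
    """Toy block iteration: exact linear solve, then one damped Gauss-Newton step."""

    def phi(b):
        return [math.exp(-b * x) for x in xs]

    def solve_linear(b):
        # Exact minimizer of the least-squares loss in c for fixed b.
        p = phi(b)
        return sum(pi * yi for pi, yi in zip(p, ys)) / sum(pi * pi for pi in p)

    def loss(c, b):
        return 0.5 * sum((c * pi - yi) ** 2 for pi, yi in zip(phi(b), ys))

    b = b0
    for _ in range(iters):
        c = solve_linear(b)                              # linear block
        p = phi(b)
        r = [c * pi - yi for pi, yi in zip(p, ys)]       # residuals
        jac = [-c * x * pi for x, pi in zip(xs, p)]      # d(residual)/db
        g = sum(j * ri for j, ri in zip(jac, r))         # gradient in b
        h = sum(j * j for j in jac)                      # Gauss-Newton "matrix" (scalar here)
        if h == 0.0:
            break
        step = -g / h
        eta, base = 1.0, loss(c, b)
        # Damping by step halving (an illustrative choice).
        while eta > 1e-8 and loss(c, b + eta * step) > base:
            eta *= 0.5
        b += eta * step                                  # non-linear block
    return solve_linear(b), b


xs = [k / 20.0 for k in range(21)]
ys = [2.0 * math.exp(-3.0 * x) for x in xs]              # synthetic data, c = 2, b = 3
c_fit, b_fit = fit_separable(xs, ys, b0=1.0)
```

In this scalar setting the Gauss-Newton matrix is a non-negative number, mirroring the positive-definiteness that motivates the Gauss-Newton variant when the Hessian is singular.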
Lemma 4.1.
Assume that for all . Then is invertible if and only if is invertible. Moreover, we have
(29) |
Proof 4.2.
Lemma 4.1, together with the fact that is tri-diagonal and the Sherman-Morrison formula, implies that the action of applied to any vector can be computed in operations.
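The combination of a tri-diagonal solve with a rank-one correction can be sketched generically as follows. The code solves `(T + u v^T) x = f` for a tri-diagonal `T` using the Thomas algorithm twice and the Sherman-Morrison formula once, so the total work is linear in the dimension. The matrix and vectors below are arbitrary test data, not the matrices of this paper, and the function names are hypothetical.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a tri-diagonal system in O(n) without pivoting.

    lower[i-1] is entry (i, i-1), diag[i] is entry (i, i), upper[i] is
    entry (i, i+1). Assumes the elimination is stable (for example, a
    diagonally dominant matrix).
    """
    n = len(diag)
    cp = [0.0] * n
    dp = [0.0] * n
    cp[0] = upper[0] / diag[0] if n > 1 else 0.0
    dp[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i - 1] * cp[i - 1]
        cp[i] = upper[i] / denom if i < n - 1 else 0.0
        dp[i] = (rhs[i] - lower[i - 1] * dp[i - 1]) / denom
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def solve_rank_one_update(lower, diag, upper, u, v, f):
    """Solve (T + u v^T) x = f via Sherman-Morrison with two tri-diagonal solves."""
    y = thomas_solve(lower, diag, upper, f)   # y = T^{-1} f
    z = thomas_solve(lower, diag, upper, u)   # z = T^{-1} u
    vy = sum(vi * yi for vi, yi in zip(v, y))
    vz = sum(vi * zi for vi, zi in zip(v, z))
    scale = vy / (1.0 + vz)                   # requires 1 + v^T T^{-1} u != 0
    return [yi - scale * zi for yi, zi in zip(y, z)]


n = 5
lower = [-1.0] * (n - 1)
diag = [4.0] * n
upper = [-1.0] * (n - 1)
u = [1.0] * n
v = [0.5] * n
f = [1.0, 2.0, 3.0, 4.0, 5.0]
x = solve_rank_one_update(lower, diag, upper, u, v, f)
```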
4.1 An Adaptivity Scheme
For a fixed number of neurons, the dBN method for the diffusion-reaction equation moves the initial uniformly distributed breakpoints very efficiently to nearly optimal locations as shown in Section 5. However, it was shown in [4] that introducing adaptivity results in a better convergence rate.
In fact, the adaptive neuron enhancement (ANE) method [9, 10] was employed in [4]. The ANE method starts with a relatively small neural network and adaptively adds new neurons based on the previous approximation. Moreover, the newly added neurons are initialized where the previous approximation is inaccurate. At each adaptive step, we use the dBN method to numerically solve the minimization problem in Eq. 18. Section 5 in [4] describes how to introduce adaptivity, and Algorithm 5.1 in [4] describes the adaptive block Newton (AdBN) method.
Here, the only modification is the local indicator. Letting be a subinterval, a modified local indicator of the ZZ type on (see, e.g., [5]) is defined by
where is the projection of onto the space of the continuous piecewise linear functions.
5 Numerical Experiments
This section first presents numerical results of the dBN and dBGN methods for solving Eq. 11. Afterwards, results of the dBN, dBGN and AdBN methods for solving Eq. 15 are shown in Section 5.2 and Section 5.3. For diffusion-reaction problems, the penalization parameter was set to . For the AdBN method, a refinement occurred when the difference of the total estimators for two consecutive iterates was less than .
For each test problem of the diffusion-reaction equation, let and be the exact solution and its approximation in , respectively. Denote the relative error by
5.1 Least-Squares Problem
The first test problem is the function
(30) |
as the target function for problem Eq. 11, with . We aim to test the performance of dBN and dBGN for least-squares data fitting problems. LABEL:example3BFGSdBN presents a comparison between dBN, dBGN and BFGS. In this comparison, we utilized a Python BFGS implementation from ‘scipy.optimize’. The initial network parameters for the three algorithms were set to be the uniform mesh for and given by solving Eq. 12. Recall that the computational cost per iteration of dBN and dBGN is , while each iteration of BFGS has a cost of . In this example our solvers outperform BFGS, achieving smaller losses in fewer and cheaper iterations.
LABEL:example2DF (a) illustrates the neural network approximation of the function in Eq. 30, obtained using uniform breakpoints and determining the linear parameter through the solution of Eq. 12. Clearly, it is better to concentrate more mesh points on the left side, where the curve is steeper. The dBN method is capable of making this adjustment, as illustrated in LABEL:example2DF (b). The loss functions confirm that the approximation improves substantially when the breakpoints are allocated according to the steepness of the function.
5.2 Exponential Solution
Similarly to LABEL:example3BFGSdBN, we start by comparing our two solvers with BFGS. The initial network parameters for all algorithms were set to be the uniform mesh for , with given by the exact solution of equation Eq. 23. We observe in LABEL:example1BFGSdB that in about 25 iterations, both dBN and dBGN achieve an accuracy that BFGS cannot attain.
LABEL:ex1Figure (a) shows the initial neural network approximation of the function in Eq. 31, obtained by using uniform breakpoints and determining the linear parameter through the solution of Eq. 23. The approximation generated by dBN is shown in LABEL:ex1Figure (b), while LABEL:ex1Figure (c) illustrates the approximation obtained by employing dBN with adaptivity. Notably, in both cases, the breakpoints are moved, and the resulting approximation improves on the initial one.
Theoretically, from Eq. 21, is the order of convergence of approximating a solution Eq. 31 by functions in . However, since Eq. 18 is a non-convex optimization problem, the existence of local minima makes it challenging to achieve this order. Therefore, given the neural network approximation to provided by the dBN method, assume that
for some . As in [4], we can use the AdBN method to improve the order of convergence of the dBN method (achieve an closer to 1).
Table 1 illustrates adaptive dBN (AdBN) starting with 20 neurons, refining 8 times, and reaching a final count of 194 neurons. The stopping tolerance was set to . The recorded data in Table 1 includes the relative seminorm error and the relative error estimator for each iteration of the adaptive process. Additionally, Table 1 provides the results for dBN with fixed 144 and 194 neurons. Comparing these results to the adaptive run with the same number of neurons, we observe a significant improvement in rate, error estimator, and seminorm error within the adaptive run.
NN ( neurons) | |||
---|---|---|---|
Adaptive (20) | 0.545 | 0.766 | |
Adaptive (27) | 0.342 | 0.834 | |
Adaptive (32) | 0.259 | 0.830 | |
Adaptive (46) | 0.193 | 0.854 | |
Adaptive (52) | 0.146 | 0.851 | |
Adaptive (71) | 0.107 | 0.853 | |
Adaptive (99) | 0.079 | 0.862 | |
Adaptive (144) | 0.052 | 0.872 | |
Adaptive (194) | 0.037 | 0.898 | |
Fixed (144) | 0.075 | 0.798 | |
Fixed (194) | 0.057 | 0.797 |
5.3 Singularly Perturbed Reaction-Diffusion Equation
The third test problem is a singularly perturbed reaction-diffusion equation:
(32) |
For , problem Eq. 32 has the following exact solution
(33) |
For some , these problems exhibit interior layers that make them challenging for mesh-based methods such as finite element and finite difference, leading to overshooting and oscillations. For , LABEL:example2DR illustrates the neural network approximation of the function described in Eq. 33, using uniform breakpoints (a) and employing dBN to adjust the breakpoints (b). An interesting observation is that the resulting approximation from dBN does not exhibit overshooting or oscillations. This confirms that dBN is capable of successfully adjusting the breakpoints and may have the potential to accurately approximate solutions with boundary and/or interior layers.
It is worth mentioning that the relative -norm error of the approximation depicted in LABEL:example2DR (b) is . In [2], similar errors were obtained using deep neural networks with parameters. In our case, the number of parameters is only .
The resulting relative errors obtained after using dBN for various values of are shown in Table 2. For each value of , dBN considerably improves the initial approximation, and the error does not vary significantly with different values of .
(initial) | (dBN) | |
---|---|---|
We also present the results of using adaptive mesh refinement. LABEL:example22DR shows the neural network approximation obtained by starting with 12 uniform breakpoints. Refinements are performed using the average marking strategy (see equation (5.2) in [4]) to achieve a similar error as the approximation in LABEL:example2DR (b). After each refinement, the linear parameter was computed by solving equation Eq. 23. In LABEL:example22DR (a), the breakpoints were not moved, whereas LABEL:example22DR (b) illustrates the AdBN method where the breakpoints were moved after each refinement.
6 Discussion and Conclusion
The mass matrix of the shallow ReLU neural network arises in applications such as diffusion-reaction equations and least-squares data fitting. Unlike the finite element mass matrix, the NN mass matrix is dense and very ill-conditioned (see Lemma 2.3). These features hinder the efficiency of commonly used numerical methods for solving the resulting systems of linear equations.
This difficulty is overcome in one dimension through a special factorization of the mass matrix, which was done using both algebraic and geometric approaches. This factorization enables the computational cost for the inversion of the mass matrix. Combining this with the fact that the inverse of the coefficient matrix is tri-diagonal, the resulting damped block Newton (dBN) method is implemented with a computational cost of just per iteration, provided that the corresponding Hessian matrix is invertible. The quadratic form of the objective functions for certain problems allows the construction of damped block Gauss-Newton (dBGN) methods, which benefit from having symmetric positive-definite Gauss-Newton matrices. For diffusion-reaction problems in particular, the addition of adaptive network enhancement (ANE) improves the rate of convergence.
Overall, the numerical results demonstrate the efficiency of the various methods in terms of not only the number of iterations but also the cost per iteration, making a compelling case to pursue the construction of similar solvers for higher dimensional problems. Of particular interest is the application of dBN methods to the singularly perturbed reaction-diffusion problem. For a fixed number of mesh points , dBN appears to achieve an accuracy independent of the diffusion coefficient . Furthermore, when adding in adaptivity, AdBN seems to be comparable to FE methods using mesh refinement.
References
- [1] J. Berg and K. Nyström. A unified deep artificial neural network approach to partial differential equations in complex geometries. Neurocomputing, 317:28–41, 2018.
- [2] Z. Cai, J. Chen, M. Liu, and X. Liu. Deep least-squares methods: An unsupervised learning-based numerical method for solving elliptic PDEs. Journal of Computational Physics, 420:109707, 2020.
- [3] Z. Cai, T. Ding, M. Liu, X. Liu, and J. Xia. A structure-guided Gauss-Newton method for shallow ReLU neural network. arXiv:2404.05064v1 [cs.LG], 2024.
- [4] Z. Cai, A. Doktorova, R. D. Falgout, and C. Herrera. Fast iterative solver for neural network method: I. 1D diffusion problems. arXiv:2404.17750 [math.NA], 2024.
- [5] Z. Cai and S. Zhang. Recovery-based error estimators for interface problems: conforming linear elements. SIAM Journal on Numerical Analysis, 47(3):2132–2156, 2009.
- [6] T. Dockhorn. A discussion on solving partial differential equations using neural networks. arXiv:1904.07200 [cs.LG], 2019.
- [7] W. E and B. Yu. The deep Ritz method: A deep learning-based numerical algorithm for solving variational problems. Communications in Mathematics and Statistics, 6(1):1–12, March 2018.
- [8] I. Fried. The and condition numbers of the finite element stiffness and mass matrices, and the pointwise convergence of the method. In J.R. Whiteman, editor, The Mathematics of Finite Elements and Applications, pages 163–174. Academic Press, 1973.
- [9] M. Liu and Z. Cai. Adaptive two-layer ReLU neural network: II. Ritz approximation to elliptic PDEs. Computers & Mathematics with Applications, 113:103–116, May 2022.
- [10] M. Liu, Z. Cai, and J. Chen. Adaptive two-layer ReLU neural network: I. best least-squares approximation. Computers & Mathematics with Applications, 113:34–44, May 2022.
- [11] M. Raissi, P. Perdikaris, and G.E. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019.
- [12] J. Sirignano and K. Spiliopoulos. DGM: A deep learning algorithm for solving partial differential equations. Journal of Computational Physics, 375:1339–1364, 2018.