
Fast Iterative Solver for Neural Network Method:
II. 1D diffusion-reaction problems and data fitting

This work was supported in part by the National Science Foundation under grant DMS-2110571. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-JRNL-865920).

Zhiqiang Cai (Department of Mathematics, Purdue University, West Lafayette, IN), Anastassia Doktorova (Department of Mathematics, Purdue University, West Lafayette, IN), Robert D. Falgout (Lawrence Livermore National Laboratory, Livermore, CA), César Herrera (Department of Mathematics, Purdue University, West Lafayette, IN)

Abstract

This paper extends the damped block Newton (dBN) method, introduced recently in [4], to 1D diffusion-reaction equations and least-squares data fitting problems. To determine the linear parameters (the weights and bias of the output layer) of the neural network (NN), the dBN method requires solving systems of linear equations involving the mass matrix. While the mass matrix for local hat basis functions is tri-diagonal and well-conditioned, the mass matrix for NNs is dense and ill-conditioned. For example, the condition number of the NN mass matrix for quasi-uniform meshes is at least ${\cal O}(n^{4})$. We present a factorization of the mass matrix that enables solving the systems of linear equations in ${\cal O}(n)$ operations. To determine the non-linear parameters (the weights and bias of the hidden layer), one step of a damped Newton method is employed at each iteration. A Gauss-Newton method is used in place of Newton for the instances in which the Hessian matrices are singular. This modified dBN is referred to as dBGN. For both methods, the computational cost per iteration is ${\cal O}(n)$. Numerical results demonstrate the ability of dBN and dBGN to efficiently achieve accurate results and outperform BFGS for select examples.

keywords:

Fast iterative solvers, Neural network, Ritz formulation, ReLU activation, Diffusion-Reaction problems, Data fitting, Newton’s method, Gauss-Newton’s method

1 Introduction

Using neural networks to solve partial differential equations (PDEs) has recently gained traction in the iterative solvers community (see, e.g., [1, 2, 6, 7, 11, 12]). In particular, the damped block Newton (dBN) method presented in [4] is a fast iterative solver for 1D diffusion problems. The discretization from the Ritz formulation of the one-dimensional diffusion equation introduces a high-dimensional, non-convex minimization problem. The dBN method numerically solves this problem using the block Gauss-Seidel method for the linear and non-linear parameters as an outer iteration. For the inner iteration, the corresponding coefficient and Hessian matrices are inverted exactly. The computational cost of the dBN method is $\mathcal{O}(n)$ per iteration, which is an improvement over $\mathcal{O}(n^{2})$ for common second-order methods. This paper extends the methods in [4] to a broader class of problems, while maintaining the efficiency achieved in [4].

For elliptic PDEs beyond diffusion problems, as well as data fitting problems, the mass matrix must be inverted to solve for the linear parameter. Just as for the coefficient matrix in [4], the mass matrix depends on the non-linear parameter. However, the mass matrix is dense and much more ill-conditioned than the coefficient matrix. Whereas the coefficient matrix has condition number bounded by $\mathcal{O}(nh_{min}^{-1})$ and has a tri-diagonal inverse [4], the mass matrix has condition number bounded by $\mathcal{O}(nh_{min}^{-3})$ (see Lemma 2.3). Here, $n$ is the number of neurons and $h_{min}$ is the smallest distance between two neighboring breakpoints. This is completely different from the finite element method, in which the mass matrix is tri-diagonal and the condition number is $\mathcal{O}(1)$ for local hat basis functions on quasi-uniform meshes. Yet solving the linear systems efficiently is still possible; two representations of the mass matrix in terms of simpler matrices are presented in Section 2. Both methods make the inversion less computationally expensive.

The non-linear parameters for this broader class of problems present further challenges. Unlike in diffusion problems, the Hessian matrices for both diffusion-reaction and non-linear least squares problems are no longer diagonal and depend on the coefficient matrix. However, a factorization is used to compute the inverse of the Hessian efficiently, utilizing the explicit formula for the inverse of the coefficient matrix from [4]. Furthermore, for the cases in which the Hessian matrices are non-invertible, a damped block Gauss-Newton (dBGN) method is presented. The Gauss-Newton matrix is positive-definite, and its inverse is tri-diagonal. Whether using dBN or dBGN, the computational cost per iteration remains $\mathcal{O}(n)$ , as in [4]. Even faster convergence for the non-linear parameter is possible for diffusion-reaction problems when adding adaptive neuron enhancement (ANE) [10]. Numerical examples demonstrate the ability of the aforementioned methods to move the breakpoints quickly and efficiently and to outperform BFGS for select examples.

The paper is structured as follows: Section 2 introduces the notation for shallow neural networks and the corresponding mass matrix. The condition numbers for both neural network and finite element mass matrices are presented and compared. This is followed by a discussion of two ways in which to decompose the mass matrix in order to more efficiently invert it. Then the problems in which the mass matrix arises are presented in Section 3.1 and Section 3.2. The non-linear least-squares optimization problem using shallow neural networks is presented in Section 3.1. Then in Section 3.2 the diffusion-reaction equation and the modified Ritz formulation are introduced. Next, the dBN method is reiterated in Section 4, emphasizing the modifications that need to be made to the dBN in [4] in order for it to work for the broader class of problems presented in this paper. For cases in which the Hessian for the non-linear parameter is non-invertible, the dBGN method is outlined. This is followed by Section 4.1, in which we recall the adaptivity scheme (AdBN) from [4], which can also be used for diffusion-reaction problems. Lastly, numerical results are presented in Section 5, demonstrating the performance of the aforementioned methods, as compared to BFGS, for select example problems. The examples in Section 5 highlight the ability of these methods to move mesh points to enhance the approximation. In particular, the results in Section 5.3 demonstrate the ability of dBN to solve the singularly perturbed reaction-diffusion equation.

2 Mass Matrix for Shallow Neural Network

This section studies the mass matrix resulting from a shallow ReLU neural network and the computation of its inverse.

As in [4], the set of approximating functions generated by the shallow ReLU neural network with $n$ neurons is denoted by

{\cal M}_{n}(\Omega)=\left\{c_{0}+\sum_{i=1}^{n}c_{i}\sigma(x-b_{i})\,:\,c_{i}\in\mathbb{R},\,0=b_{0}\leq b_{1}<\cdots<b_{n}<b_{n+1}=1\right\},

where $\Omega=(0,1)$ and $\sigma(t)=\max\{0,t\}$ is the ReLU activation function. Let $r(x)\in L^{\infty}(\Omega)$ be a real-valued function defined on $\Omega$ and bounded below by a positive constant $r_{0}>0$ almost everywhere.

Consider the following mass matrix associated with the weight function $r$ given by

(1)

M_{r}({\bf b})=\big{(}m_{ij}\big{)}_{n\times n}\quad\mbox{with }\,m_{ij}=\int_{0}^{1}r(x)\sigma(x-b_{i})\sigma(x-b_{j})dx

and the coefficient matrix associated with $r$ given by

A_{r}({\bf b})=\big{(}a_{ij}\big{)}_{n\times n}\quad\mbox{with }\,a_{ij}=\int_{0}^{1}r(x)H(x-b_{i})H(x-b_{j})dx

for $i,j=1,\ldots,n$ , where $H(t)=\sigma^{\prime}(t)$ is the Heaviside (unit) step function and ${\bf b}=\left(b_{1},\ldots,b_{n}\right)^{T}$ is the non-linear parameter.

While the coefficient matrix $A_{r}({\bf b})$ is dense, its inverse is a tri-diagonal matrix with an explicit algebraic formula (see [4]). This property holds for a class of matrices with a special structure.

Lemma 2.1.

For $\left\{\alpha_{i}\right\}_{i=1}^{k},\,\left\{\beta_{i}\right\}_{i=1}^{k}% \subset\mathbb{R}$ , assume that

\alpha_{1}\neq 0,\quad\beta_{k}\neq 0,\quad\mbox{and}\quad\alpha_{i+1}\beta_{i% }-\alpha_{i}\beta_{i+1}\neq 0

for all $i=1,\dots,k-1$ . Then the matrix

(2)

{\cal M}=\begin{pmatrix}\alpha_{1}\beta_{1}&\alpha_{1}\beta_{2}&\alpha_{1}% \beta_{3}&\ldots&\alpha_{1}\beta_{k}\\ \alpha_{1}\beta_{2}&\alpha_{2}\beta_{2}&\alpha_{2}\beta_{3}&\ldots&\alpha_{2}% \beta_{k}\\ \alpha_{1}\beta_{3}&\alpha_{2}\beta_{3}&\alpha_{3}\beta_{3}&\ldots&\alpha_{3}% \beta_{k}\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ \alpha_{1}\beta_{k}&\alpha_{2}\beta_{k}&\alpha_{3}\beta_{k}&\ldots&\alpha_{k}% \beta_{k}\end{pmatrix}

is invertible. Moreover, its inverse is symmetric and tri-diagonal with non-zero entries given by

{\cal M}^{-1}_{ii}=\displaystyle\frac{\alpha_{i+1}\beta_{i-1}-\alpha_{i-1}% \beta_{i+1}}{(\alpha_{i}\beta_{i-1}-\alpha_{i-1}\beta_{i})(\alpha_{i+1}\beta_{% i}-\alpha_{i}\beta_{i+1})}\quad\mbox{and}\quad{\cal M}^{-1}_{i,i+1}={\cal M}^{% -1}_{i+1,i}=\displaystyle\frac{-1}{\alpha_{i+1}\beta_{i}-\alpha_{i}\beta_{i+1}},

where $\alpha_{0}=\beta_{k+1}=0$ and $\alpha_{k+1}=\beta_{0}=1$ .

Proof 2.2.

It is easy to verify that ${\cal M}{\cal M}^{-1}=I$ .
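The verification can also be carried out numerically. The following is a minimal sketch in exact rational arithmetic; the helper names (`structured_matrix`, `tridiagonal_inverse`, `times_tridiagonal`) and the sample data are illustrative assumptions and do not come from [4]. The sample data correspond to $\alpha_{i}=1$ and $\beta_{i}=1-b_{i}$ with ${\bf b}=(1/4,1/2,3/4)^{T}$.

```python
from fractions import Fraction


def structured_matrix(alpha, beta):
    """Dense matrix of Eq. 2: entry (i, j) is alpha[min(i,j)] * beta[max(i,j)]."""
    k = len(alpha)
    return [[alpha[min(i, j)] * beta[max(i, j)] for j in range(k)] for i in range(k)]


def tridiagonal_inverse(alpha, beta):
    """Diagonal and off-diagonal of the inverse, following Lemma 2.1."""
    k = len(alpha)
    # Pad with alpha_0 = beta_{k+1} = 0 and alpha_{k+1} = beta_0 = 1.
    a = [Fraction(0)] + [Fraction(x) for x in alpha] + [Fraction(1)]
    b = [Fraction(1)] + [Fraction(x) for x in beta] + [Fraction(0)]
    # w[i] = alpha_{i+1} * beta_i - alpha_i * beta_{i+1}, for i = 0, ..., k
    w = [a[i + 1] * b[i] - a[i] * b[i + 1] for i in range(k + 1)]
    diag = [(a[i + 1] * b[i - 1] - a[i - 1] * b[i + 1]) / (w[i - 1] * w[i])
            for i in range(1, k + 1)]
    off = [-1 / w[i] for i in range(1, k)]
    return diag, off


def times_tridiagonal(dense, diag, off):
    """Product dense @ T, where T is the symmetric tri-diagonal matrix (diag, off)."""
    k = len(diag)
    out = [[dense[i][j] * diag[j] for j in range(k)] for i in range(k)]
    for i in range(k):
        for j in range(k - 1):
            out[i][j] += dense[i][j + 1] * off[j]      # uses T[j+1][j]
            out[i][j + 1] += dense[i][j] * off[j]      # uses T[j][j+1]
    return out


# Illustrative data (an assumption): alpha_i = 1, beta_i = 1 - b_i, b = (1/4, 1/2, 3/4).
alpha = [Fraction(1)] * 3
beta = [Fraction(3, 4), Fraction(1, 2), Fraction(1, 4)]
diag, off = tridiagonal_inverse(alpha, beta)
product = times_tridiagonal(structured_matrix(alpha, beta), diag, off)
identity = [[Fraction(int(i == j)) for j in range(3)] for i in range(3)]
```

Only the $2k-1$ numbers in `diag` and `off` are stored, which is what makes the later ${\cal O}(n)$ solves possible.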

The coefficient matrix $A_{r}({\bf b})$ has the same structure as ${\cal M}$ with $k=n$ , $\alpha_{i}=1$ , and $\beta_{i}=\int_{b_{i}}^{1}r(x)dx$ . In the case of the mass matrix $M_{r}({\bf b})$ , it remains dense due to the global support of neurons, and its condition number is very large (see Section 2.1). This section derives inverse formulas of the mass matrix, whose application needs $\mathcal{O}(n)$ operations. Derivation is given both algebraically in Section 2.2 and geometrically in Section 2.3.

2.1 Condition Number

Let $h_{i}=b_{i+1}-b_{i}$ for $i=0,\ldots,n$ , and set

h_{\text{max}}=\max\limits_{1\leq i\leq n}h_{i}\quad\mbox{and}\quad h_{\text{% min}}=\min\limits_{1\leq i\leq n}h_{i}.

It was shown in [4] that the condition number of $A_{r}({\bf b})$ is bounded by $\mathcal{O}(nh_{min}^{-1})$ for $r(x)=1$ . The next lemma provides an upper bound for the condition number of the mass matrix.

Lemma 2.3.

Let $r(x)=1$ , then the condition number of the mass matrix $M_{r}({\bf b})$ is bounded by $\mathcal{O}\left(n/h_{min}^{3}\right)$ .

Proof 2.4.

For any vector $\mbox{\boldmath$\xi$}=(\xi_{1},\ldots,\xi_{n})^{T}\in\mathbb{R}^{n}$ , denote its magnitude by $\big{|}\mbox{\boldmath$\xi$}\big{|}=\left(\sum\limits_{i=1}^{n}\xi^{2}_{i}% \right)^{1/2}$ . By the Cauchy-Schwarz inequality and the fact that $\sigma(x-b_{j})=0$ for $x\leq b_{j}$ , we have

(3)

\displaystyle\begin{split}\mbox{\boldmath$\xi$}^{T}M_{r}({\bf b})\mbox{% \boldmath$\xi$}&=\int_{0}^{1}\left(\sum_{i=1}^{n}\xi_{i}\sigma(x-b_{i})\right)% ^{2}\,dx\leq|\mbox{\boldmath$\xi$}|^{2}\int_{0}^{1}\left(\sum_{i=1}^{n}\sigma(% x-b_{i})^{2}\right)\,dx\\ &=|\mbox{\boldmath$\xi$}|^{2}\sum_{j=1}^{n}\int_{b_{j}}^{b_{j+1}}\sum_{i=1}^{j% }\sigma(x-b_{i})^{2}\,dx=\frac{|\mbox{\boldmath$\xi$}|^{2}}{3}\sum_{j=1}^{n}% \sum_{i=1}^{j}\left\{(b_{j+1}-b_{i})^{3}-(b_{j}-b_{i})^{3}\right\}\\ &=\frac{|\mbox{\boldmath$\xi$}|^{2}}{3}\sum_{i=1}^{n}(b_{n+1}-b_{i})^{3}=\frac% {|\mbox{\boldmath$\xi$}|^{2}}{3}\sum_{i=1}^{n}(1-b_{i})^{3}\leq\frac{n}{3}\,|% \mbox{\boldmath$\xi$}|^{2}.\end{split}

To estimate the lower bound of $\mbox{\boldmath$\xi$}^{T}M_{r}({\bf b})\mbox{\boldmath$\xi$}$ , let

\tau_{i}(x)=\sum\limits_{j=1}^{i}\xi_{j}\sigma(x-b_{j})\quad\mbox{and}\quad a_% {i-1}=\tau_{i}(b_{i})=\sum\limits_{j=1}^{i}\xi_{j}(b_{i}-b_{j}),

for $i=1,\dots,n+1$ . Then $\tau_{i}\left(\frac{b_{i+1}+b_{i}}{2}\right)=\dfrac{a_{i-1}+a_{i}}{2}$ . Since $\tau_{i}^{2}(x)$ is a quadratic function in each sub-interval $[b_{i},b_{i+1}]$ , Simpson’s Rule implies

\displaystyle\begin{split}\mbox{\boldmath$\xi$}^{T}M_{r}({\bf b})\mbox{% \boldmath$\xi$}&=\sum_{i=1}^{n}\int_{b_{i}}^{b_{i+1}}\tau^{2}_{i}(x)\,dx=\frac% {1}{6}\sum_{i=1}^{n}h_{i}\left[\tau_{i}^{2}(b_{i})+4\tau_{i}^{2}\left(\frac{b_% {i+1}+b_{i}}{2}\right)+\tau_{i}^{2}(b_{i+1})\right]\\ &=\frac{1}{6}\sum_{i=1}^{n}h_{i}\left[a_{i-1}^{2}+(a_{i-1}+a_{i})^{2}+a_{i}^{2% }\right]\geq\frac{1}{6}h_{min}\lvert{\bf a}\rvert^{2},\end{split}

where ${\bf a}=(a_{1},\ldots,a_{n})^{T}$. It is easy to see that $\mbox{\boldmath$\xi$}=Q{\bf a}$, where $Q$ is an $n\times n$ lower-triangular matrix with three non-zero diagonals given by

Q=\begin{pmatrix}~{}\frac{1}{h_{1}}&0&0&\dots&0&0\\ -\left(\frac{1}{h_{1}}+\frac{1}{h_{2}}\right)&\frac{1}{h_{2}}&0&\dots&0&0\\[5.% 69054pt] \frac{1}{h_{2}}&-\left(\frac{1}{h_{2}}+\frac{1}{h_{3}}\right)&\frac{1}{h_{3}}&% \dots&0&0\\ \vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\ 0&0&0&\ldots&\frac{1}{h_{n-1}}&0\\ 0&0&0&\ldots&-\left(\frac{1}{h_{n-1}}+\frac{1}{h_{n}}\right)&\frac{1}{h_{n}}\\ \end{pmatrix}.

It is easy to verify that $Q$ has spectral norm bounded by

\lVert Q\rVert_{2}\leq\sqrt{\lVert Q\rVert_{1}\lVert Q\rVert_{\infty}}\leq 4h_% {min}^{-1}.

Hence,

\mbox{\boldmath$\xi$}^{T}M_{r}({\bf b})\mbox{\boldmath$\xi$}\geq\frac{1}{6}h_{% min}\frac{\lvert{\mbox{\boldmath$\xi$}}\rvert^{2}}{\lVert Q\rVert_{2}^{2}}\geq% \frac{1}{96}h^{3}_{min}\lvert{\mbox{\boldmath$\xi$}}\rvert^{2},

which, together with the upper bound in Eq. 3, implies the validity of the lemma.
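The two bounds from this proof can be spot-checked on a small non-uniform example. The sketch below assumes $r(x)=1$ and uses the closed form $\int_{b_{j}}^{1}(x-b_{i})(x-b_{j})\,dx=L^{3}/3+dL^{2}/2$ with $L=1-b_{j}$ and $d=b_{j}-b_{i}$ for $b_{i}\leq b_{j}$; the breakpoints, the trial vectors, and the function names are illustrative assumptions.

```python
from fractions import Fraction


def relu_mass_matrix(b):
    """Mass matrix of Eq. 1 for r(x) = 1, using the closed-form integral."""
    n = len(b)
    M = [[Fraction(0)] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            hi, lo = max(b[i], b[j]), min(b[i], b[j])
            L, d = 1 - hi, hi - lo
            # int_{hi}^{1} (x - lo)(x - hi) dx = L^3/3 + d L^2/2
            M[i][j] = L**3 / 3 + d * L**2 / 2
    return M


def rayleigh_quotient(M, xi):
    """Return xi^T M xi / |xi|^2."""
    n = len(xi)
    num = sum(xi[i] * M[i][j] * xi[j] for i in range(n) for j in range(n))
    return num / sum(x * x for x in xi)


# Illustrative breakpoints (an assumption); b_{n+1} = 1 closes the last interval.
b = [Fraction(1, 10), Fraction(3, 10), Fraction(2, 5), Fraction(4, 5)]
n = len(b)
h = [(b[i + 1] if i < n - 1 else 1) - b[i] for i in range(n)]  # h_1, ..., h_n
h_min = min(h)
lower, upper = h_min**3 / 96, Fraction(n, 3)

M = relu_mass_matrix(b)
trial_vectors = [[1, 1, 1, 1], [1, -2, 1, 0], [0, 1, -2, 1], [1, -3, 3, -1]]
quotients = [rayleigh_quotient(M, [Fraction(x) for x in xi]) for xi in trial_vectors]
```

Each entry of `quotients` should lie between `lower` and `upper`. Oscillating vectors such as the last one are the ones that approach the lower end.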

Lemma 2.5.

Under the assumption on the weight function $r$ , the condition number of the mass matrix $M_{r}({\bf b})$ is bounded by $\mathcal{O}\left(n\,r^{-1}_{0}h_{min}^{-3}\right)$ .

Proof 2.6.

Since $r\in L^{\infty}(\Omega)$ and $r(x)\geq r_{0}$ almost everywhere, in a similar fashion as the proof of Lemma 2.3, we have

\dfrac{1}{96}r_{0}h_{min}^{3}\lvert{\mbox{\boldmath$\xi$}}\rvert^{2}\leq\mbox{\boldmath$\xi$}^{T}M_{r}({\bf b})\mbox{\boldmath$\xi$}\leq C\,n\lvert{\mbox{\boldmath$\xi$}}\rvert^{2},

which implies the validity of the lemma.

Whereas the mass matrix associated with the ReLU neural network is very ill-conditioned, it is well known that the mass matrix for the finite element (FE) method is much better conditioned (see [8] for example). The following Lemma 2.7 reiterates the result in [8] but with an alternate proof in a similar fashion as that of Lemma 2.3.

Assume that $b_{0}<b_{1}$ , and set

\tilde{h}_{\text{max}}=\max\limits_{0\leq i\leq n}h_{i}\quad\mbox{and}\quad% \tilde{h}_{\text{min}}=\min\limits_{0\leq i\leq n}h_{i}.

For the partition $\{b_{i}\}_{i=0}^{n+1}$ , denote the hat basis functions for $i=1,\dots,n$ by

\varphi_{i}(x)=\left\{\begin{array}[]{ll}(x-b_{i-1})/h_{i-1},&x\in(b_{i-1},b_{i}),\\[5.69054pt] (b_{i+1}-x)/h_{i},&x\in(b_{i},b_{i+1}),\\[5.69054pt] 0,&\text{otherwise}.\end{array}\right.

Next let $\mbox{\boldmath${\varphi}$}=\left(\varphi_{1},\dots,\varphi_{n}\right)^{T}$ . Then the corresponding FE mass matrix for this partition is denoted by

\tilde{M}({\bf b})=\int_{0}^{1}\mbox{\boldmath${\varphi}$}\mbox{\boldmath${% \varphi}$}^{T}dx.

Lemma 2.7.

The condition number of the finite element mass matrix $\tilde{M}({\bf b})$ is bounded by
$\mathcal{O}(\tilde{h}_{max}/\tilde{h}_{min})$ .

Proof 2.8.

For any vector $\mbox{\boldmath$\xi$}=(\xi_{1},\ldots,\xi_{n})^{T}\in\mathbb{R}^{n}$ , in a similar fashion as that of Lemma 2.3, we get the equality

\mbox{\boldmath$\xi$}^{T}\tilde{M}({\bf b})\mbox{\boldmath$\xi$}=\sum_{j=0}^{n}\int_{b_{j}}^{b_{j+1}}\left(\xi_{j}\varphi_{j}+\xi_{j+1}\varphi_{j+1}\right)^{2}\,dx=\sum_{j=0}^{n}\frac{h_{j}}{6}\left[\xi_{j}^{2}+(\xi_{j}+\xi_{j+1})^{2}+\xi_{j+1}^{2}\right]

with $\varphi_{0}(x)=\varphi_{n+1}(x)=\xi_{0}=\xi_{n+1}=0$ , which leads to the inequalities

\frac{1}{6}\tilde{h}_{min}|\mbox{\boldmath$\xi$}|^{2}\leq\mbox{\boldmath$\xi$}^{T}\tilde{M}({\bf b})\mbox{\boldmath$\xi$}\leq\tilde{h}_{max}|\mbox{\boldmath$\xi$}|^{2}.

This completes the proof of the lemma.

2.2 Algebraic Approach

This section derives an inverse formula of the mass matrix through a decomposition into two matrices. The decomposition is based on the fact that matrices with the structure of ${\cal M}$ in Eq. 2 have tri-diagonal inverses.

For $1\leq i\leq j\leq n$ , let $m_{ij}$ be the $(i,j)$ -element of the mass matrix $M_{r}({\bf b})$ , then

\displaystyle\begin{split}m_{ij}=m_{ji}&=\int_{0}^{1}r(x)\sigma(x-b_{i})\sigma(x-b_{j})dx=\int_{b_{j}}^{1}r(x)(x-b_{i})(x-b_{j})dx\\ &=\int_{b_{j}}^{1}r(x)\left(x-1\right)\left(x-b_{j}\right)dx+(1-b_{i})\int_{b_{j}}^{1}r(x)\left(x-b_{j}\right)dx\equiv m^{1}_{ij}+m^{2}_{ij},\end{split}

which implies the following decomposition

M_{r}({\bf b})=M_{1}({\bf b})+M_{2}({\bf b})\equiv\left(m_{ij}^{1}\right)_{n% \times n}+\left(m_{ij}^{2}\right)_{n\times n}.

Both $M_{1}({\bf b})$ and $M_{2}({\bf b})$ have the same structure as ${\cal M}$ in Eq. 2 with

m_{ij}^{1}=\beta^{1}_{\max\{i,j\}}\quad\mbox{and}\quad m_{ij}^{2}=\alpha_{\min\{i,j\}}^{2}\beta^{2}_{\max\{i,j\}},

where

\beta^{1}_{k}=\int_{b_{k}}^{1}r(x)\left(x-1\right)\left(x-b_{k}\right)dx,\quad% \alpha^{2}_{k}=1-b_{k},\quad\mbox{and}\quad\beta^{2}_{k}=\int_{b_{k}}^{1}r(x)% \left(x-b_{k}\right)dx.

Proposition 2.9.

The inverse of the mass matrix $M_{r}({\bf b})$ is given by

(4)

M_{r}({\bf b})^{-1}=M_{2}({\bf b})^{-1}(M_{2}({\bf b})^{-1}+M_{1}({\bf b})^{-1% })^{-1}M_{1}({\bf b})^{-1}.

Proof 2.10.

Eq. 4 is a direct consequence of the fact that

M_{r}({\bf b})=M_{1}({\bf b})\left(M_{2}({\bf b})^{-1}+M_{1}({\bf b})^{-1}% \right)M_{2}({\bf b}).

Remark 1.

Since $M_{1}({\bf b})^{-1}$ and $M_{2}({\bf b})^{-1}$ are tri-diagonal, so is $M_{1}({\bf b})^{-1}\!+\!M_{2}({\bf b})^{-1}$ . Hence, $M_{r}({\bf b})^{-1}$ in Eq. 4 applied to any vector can be computed in ${\cal O}(n)$ operations.
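The three-step application of Eq. 4 (a tri-diagonal product, a tri-diagonal solve, and another tri-diagonal product) can be sketched as follows. The sketch assumes $r(x)=1$, for which $\beta^{1}_{k}=-(1-b_{k})^{3}/6$, $\alpha^{2}_{k}=1-b_{k}$, and $\beta^{2}_{k}=(1-b_{k})^{2}/2$. The function names and sample data are illustrative assumptions, and the unpivoted Thomas elimination assumes that no zero pivot occurs, which holds for this example.

```python
from fractions import Fraction


def tridiagonal_inverse(alpha, beta):
    """Diagonal and off-diagonal of the tri-diagonal inverse from Lemma 2.1."""
    k = len(alpha)
    a = [Fraction(0)] + [Fraction(x) for x in alpha] + [Fraction(1)]
    b = [Fraction(1)] + [Fraction(x) for x in beta] + [Fraction(0)]
    w = [a[i + 1] * b[i] - a[i] * b[i + 1] for i in range(k + 1)]
    diag = [(a[i + 1] * b[i - 1] - a[i - 1] * b[i + 1]) / (w[i - 1] * w[i])
            for i in range(1, k + 1)]
    off = [-1 / w[i] for i in range(1, k)]
    return diag, off


def tridiag_matvec(diag, off, x):
    """Multiply a symmetric tri-diagonal matrix by a vector in O(n)."""
    y = [d * xi for d, xi in zip(diag, x)]
    for i, o in enumerate(off):
        y[i] += o * x[i + 1]
        y[i + 1] += o * x[i]
    return y


def thomas_solve(diag, off, rhs):
    """Solve a symmetric tri-diagonal system in O(n); assumes nonzero pivots."""
    n = len(rhs)
    p, y = list(diag), list(rhs)
    for i in range(1, n):
        m = off[i - 1] / p[i - 1]
        p[i] -= m * off[i - 1]
        y[i] -= m * y[i - 1]
    x = [Fraction(0)] * n
    x[-1] = y[-1] / p[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (y[i] - off[i] * x[i + 1]) / p[i]
    return x


def solve_mass_algebraic(b, f):
    """Apply Eq. 4 for r = 1: c = M_2^{-1} (M_2^{-1} + M_1^{-1})^{-1} M_1^{-1} f."""
    L = [1 - bk for bk in b]
    ones = [Fraction(1)] * len(b)
    d1, o1 = tridiagonal_inverse(ones, [-Lk**3 / 6 for Lk in L])   # M_1^{-1}
    d2, o2 = tridiagonal_inverse(L, [Lk**2 / 2 for Lk in L])       # M_2^{-1}
    y = tridiag_matvec(d1, o1, f)
    z = thomas_solve([u + v for u, v in zip(d1, d2)],
                     [u + v for u, v in zip(o1, o2)], y)
    return tridiag_matvec(d2, o2, z)


def relu_mass_matrix(b):
    """Dense mass matrix for r = 1, used here only to check the result."""
    n = len(b)
    return [[(1 - max(b[i], b[j]))**3 / 3
             + abs(b[i] - b[j]) * (1 - max(b[i], b[j]))**2 / 2
             for j in range(n)] for i in range(n)]


# Illustrative data (an assumption).
b = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]
f = [Fraction(1), Fraction(2), Fraction(3)]
c = solve_mass_algebraic(b, f)
M = relu_mass_matrix(b)
residual = [sum(M[i][j] * c[j] for j in range(3)) - f[i] for i in range(3)]
```

The dense matrix `M` is formed only to confirm that the residual vanishes; the solver itself never touches more than the diagonals. In floating-point arithmetic a pivoted tri-diagonal solver may be preferable, since $M_{2}^{-1}+M_{1}^{-1}$ is not guaranteed to be definite.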

2.3 Geometric Approach

This section presents another way to invert the mass matrix, based on a factorization of $M_{r}({\bf b})$ into the product of three tri-diagonal matrices. The factorization arises from expressing the global ReLU basis functions in terms of local discontinuous basis functions.

To this end, for $k=0,\ldots,n$ , let $I_{k}=[b_{k},b_{k+1})$ and define the local basis functions

\displaystyle\varphi_{k}^{0}(x)=\left\{\begin{array}[]{cl}1,&x\in I_{k},\\ 0,&\text{otherwise}\end{array}\right.\quad\mbox{and}\quad\varphi_{k}^{1}(x)=% \left\{\begin{array}[]{cl}h_{k}^{-1}(x-b_{k}),&x\in I_{k},\\ 0,&\text{otherwise}\end{array}\right..

Since $\sum\limits_{k=0}^{n}\varphi_{k}^{0}(x)\equiv 1$ in $\Omega$, we have

(5)

\mbox{span}\left\{1,\sigma(x-b_{1}),\ldots,\sigma(x-b_{n})\right\}\subset\mbox% {span}\left\{\varphi^{0}_{k}(x)\right\}_{k=1}^{n}\bigcup\mbox{span}\left\{% \varphi^{1}_{k}(x)\right\}_{k=1}^{n}.

Set

(6)

\mbox{\boldmath${\psi}$}(x)=(\psi_{1}(x),\ldots,\psi_{n}(x))^{T}\quad\mbox{and% }\quad\mbox{\boldmath${\varphi}$}_{i}(x)=(\varphi_{1}^{i}(x),\ldots,\varphi^{i% }_{n}(x))^{T},

where $\psi_{k}(x)=\sigma(x-b_{k})$ ; and let $D(\mathbf{h})=\mbox{diag}(h_{1},\ldots,h_{n})$ ,

\displaystyle G=\left(\begin{array}[]{cccc}~{}1&&&\\ \!-1&~{}1&&\\ &\ddots&\ddots&\\ &&\!-1&~{}1\end{array}\right)_{\!\!n\times n},\quad\mbox{and}\quad G^{-1}=% \left(\begin{array}[]{cccc}1&&&\\ 1&~{}1&&\\ \vdots&\vdots&\ddots&\\ 1&1&\ldots&~{}1\end{array}\right)_{\!\!n\times n}.

Lemma 2.11.

There exist mappings $B_{0}:\mathbb{R}^{n}\to\mathbb{R}^{n}$ and $B_{1}:\mathbb{R}^{n}\to\mathbb{R}^{n}$ such that

(7)

\mbox{\boldmath${\psi}$}=B_{0}\mbox{\boldmath${\varphi}$}_{0}+B_{1}\mbox{% \boldmath${\varphi}$}_{1}.

Moreover, we have

B_{0}=G^{-T}D(\mathbf{h})\left(G^{-T}-I\right)\quad\mbox{and}\quad B_{1}=G^{-T% }D(\mathbf{h}),

where $I$ is the $n$ -order identity matrix.

Proof 2.12.

Eq. 5 implies that there exist $B_{0}$ and $B_{1}$ such that Eq. 7 is valid. To determine $B_{0}$ and $B_{1}$ , for any ${\bf c}=(c_{1},\ldots,c_{n})^{T}\in\mathbb{R}^{n}$ , let $v(x)={\bf c}^{T}\mbox{\boldmath${\psi}$}(x)$ , then

v(x)={\bf c}^{T}\mbox{\boldmath${\psi}$}(x)={\bf c}^{T}B_{0}\mbox{\boldmath${% \varphi}$}_{0}(x)+{\bf c}^{T}B_{1}\mbox{\boldmath${\varphi}$}_{1}(x).

On each $I_{k}$ , using the facts that $v^{\prime}(x)$ and ${\bf c}^{T}B_{0}\mbox{\boldmath${\varphi}$}_{0}(x)$ are constants, we have

\left({\bf c}^{T}B_{1}\right)_{k}=v(b_{k+1})-v(b_{k})=\sum_{i=1}^{k}c_{i}\left% (\sum_{j=i}^{k}h_{j}\right)-\sum_{i=1}^{k-1}c_{i}\left(\sum_{j=i}^{k-1}h_{j}% \right)=\sum_{i=1}^{k}c_{i}h_{k}=\left(D(\mathbf{h})G^{-1}\mathbf{c}\right)_{k}

which, together with arbitrariness of ${\bf c}$ , implies that $B_{1}=G^{-T}D(\mathbf{h})$ .

By the definitions of $\mbox{\boldmath${\varphi}$}_{0}(x)$ and $\mbox{\boldmath${\varphi}$}_{1}(x)$ and the fact that $v(b_{1})=0$ , we have

\displaystyle\begin{split}\left({\bf c}^{T}B_{0}\right)_{k}&=v(b_{k})=\left(v(b_{k})-v(b_{k-1})\right)+\left(v(b_{k-1})-v(b_{k-2})\right)+\cdots+\left(v(b_{2})-v(b_{1})\right)\\ &=\left(D(\mathbf{h})G^{-1}\mathbf{c}\right)_{k-1}+\left(D(\mathbf{h})G^{-1}\mathbf{c}\right)_{k-2}+\cdots+\left(D(\mathbf{h})G^{-1}\mathbf{c}\right)_{1}=\left((G^{-1}-I)D(\mathbf{h})G^{-1}{\bf c}\right)_{k},\end{split}

which, together with arbitrariness of ${\bf c}$ , implies that $B_{0}=G^{-T}D(\mathbf{h})\left(G^{-T}-I\right)$ . This completes the proof of the lemma.

For $i,j=0,1$ , let

D_{ij}(r)=\displaystyle\int_{0}^{1}r(x)\mbox{\boldmath${\varphi}$}_{i}\mbox{% \boldmath${\varphi}$}_{j}^{T}dx.

For $k=0,1,2$ , let

(8)

D_{r}({\bf s}^{k})=\text{diag}(s_{1}^{k}(r),\dots,s_{n}^{k}(r))\quad\mbox{with}\quad s_{i}^{k}(r)=\int_{b_{i}}^{b_{i+1}}r(x)(x-b_{i})^{k}\,dx.

Then, together with $D(\mathbf{h})=\mbox{diag}(h_{1},\ldots,h_{n})$ , it is easy to see that

D_{00}(r)=D_{r}({\bf s}^{0}),\quad D_{01}(r)=D_{10}(r)=D(\mathbf{h})^{-1}D_{r}% ({\bf s}^{1}),\quad\mbox{ and }\,\,D_{11}(r)=D(\mathbf{h})^{-2}D_{r}({\bf s}^{% 2}).

Theorem 2.

Let $Q=GD({\bf h})^{-1}G$ and let

T_{M_{r}}=(I-G^{T})D_{00}(r)(I-G)+(I-G^{T})D_{01}(r)G+G^{T}D_{10}(r)(I-G)+G^{T% }D_{11}(r)G,

then the mass matrix $M_{r}({\bf b})$ defined in Eq. 1 has the following factorization

(9)

M_{r}({\bf b})=Q^{-T}T_{M_{r}}Q^{-1}.

Proof 2.13.

By Eq. 7 and the fact that $B_{0}=B_{1}(G^{-T}-I)$ , we have

\displaystyle\begin{split}M_{r}({\bf b})&=\int_{0}^{1}r(x)\mbox{\boldmath${\psi}$}\mbox{\boldmath${\psi}$}^{T}\,dx=B_{0}D_{00}B_{0}^{T}+B_{0}D_{01}B_{1}^{T}+B_{1}D_{10}B_{0}^{T}+B_{1}D_{11}B_{1}^{T}\\ &=B_{1}\left\{(G^{-T}-I)D_{00}(G^{-1}-I)+(G^{-T}-I)D_{01}+D_{10}(G^{-1}-I)+D_{11}\right\}B_{1}^{T}\\ &=B_{1}G^{-T}\left\{(I-G^{T})D_{00}(I-G)+(I-G^{T})D_{01}G+G^{T}D_{10}(I-G)+G^{T}D_{11}G\right\}G^{-1}B_{1}^{T},\end{split}

which, together with the fact that $Q^{-1}=G^{-1}B_{1}^{T}$ , implies Eq. 9.

Remark 3.

Clearly, $T_{M_{r}}$ is tri-diagonal. Hence, $T_{M_{r}}^{-1}$ and hence $M_{r}({\bf b})^{-1}$ applied to any vector can be computed in ${\cal O}(n)$ operations.
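As an illustration of Eq. 9, the sketch below applies $M_{r}({\bf b})^{-1}=Q\,T_{M_{r}}^{-1}Q^{T}$ for the special case $r(x)=1$. In that case $s^{0}_{k}=h_{k}$, $s^{1}_{k}=h_{k}^{2}/2$, and $s^{2}_{k}=h_{k}^{3}/3$, and expanding the definition of $T_{M_{r}}$ gives diagonal entries $(h_{i}+h_{i+1})/3$ for $i<n$, $h_{n}/3$ for $i=n$, and off-diagonal entries $h_{i+1}/6$. The function names and the sample data are illustrative assumptions.

```python
from fractions import Fraction


def apply_G(x):
    """(G x)_i = x_i - x_{i-1}, with x_0 = 0."""
    return [x[i] - (x[i - 1] if i > 0 else 0) for i in range(len(x))]


def apply_GT(x):
    """(G^T x)_i = x_i - x_{i+1}, with x_{n+1} = 0."""
    n = len(x)
    return [x[i] - (x[i + 1] if i < n - 1 else 0) for i in range(n)]


def thomas_solve(diag, off, rhs):
    """Solve a symmetric tri-diagonal system in O(n); assumes nonzero pivots."""
    n = len(rhs)
    p, y = list(diag), list(rhs)
    for i in range(1, n):
        m = off[i - 1] / p[i - 1]
        p[i] -= m * off[i - 1]
        y[i] -= m * y[i - 1]
    x = [Fraction(0)] * n
    x[-1] = y[-1] / p[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (y[i] - off[i] * x[i + 1]) / p[i]
    return x


def solve_mass_geometric(b, f):
    """Apply M^{-1} = Q T^{-1} Q^T for r = 1, with Q = G D(h)^{-1} G."""
    n = len(b)
    h = [(b[i + 1] if i < n - 1 else 1) - b[i] for i in range(n)]  # h_1, ..., h_n
    # t = Q^T f = G^T D(h)^{-1} G^T f
    t = apply_GT([ti / hi for ti, hi in zip(apply_GT(f), h)])
    # Tri-diagonal T for r = 1 (see the lead-in for the entries).
    diag = [(h[i] + (h[i + 1] if i < n - 1 else 0)) / 3 for i in range(n)]
    off = [h[i + 1] / 6 for i in range(n - 1)]
    z = thomas_solve(diag, off, t)
    # c = Q z = G D(h)^{-1} G z
    return apply_G([zi / hi for zi, hi in zip(apply_G(z), h)])


def relu_mass_matrix(b):
    """Dense mass matrix for r = 1, used here only to check the result."""
    n = len(b)
    return [[(1 - max(b[i], b[j]))**3 / 3
             + abs(b[i] - b[j]) * (1 - max(b[i], b[j]))**2 / 2
             for j in range(n)] for i in range(n)]


# Illustrative non-uniform breakpoints and right-hand side (assumptions).
b = [Fraction(1, 10), Fraction(3, 10), Fraction(2, 5), Fraction(4, 5)]
f = [Fraction(1), Fraction(-1), Fraction(2), Fraction(0)]
c = solve_mass_geometric(b, f)
M = relu_mass_matrix(b)
residual = [sum(M[i][j] * c[j] for j in range(4)) - f[i] for i in range(4)]
```

For $r(x)=1$ the matrix $T_{M_{r}}$ is symmetric and strictly diagonally dominant with positive diagonal, so the unpivoted elimination is safe here. For a general weight $r$ the entries $s^{k}_{i}(r)$ would have to be obtained by quadrature, which this sketch does not cover.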

Remark 4.

The transformation in Eq. 7 leads to a similar factorization of the coefficient matrix as

(10)

A_{a}({\bf b})=Q^{-T}T_{A_{a}}Q^{-1}

with $\displaystyle T_{A_{a}}=G^{T}D(\mathbf{h})^{-2}D_{a}({\bf s}^{0})G$ being tri-diagonal, where $D_{a}({\bf s}^{0})$ is defined similarly as in Eq. 8.

3 Applications

This section considers two applications: the least-squares data fitting and the diffusion-reaction equation in one dimension. When using the shallow ReLU neural network, the resulting discretization requires inversion of the corresponding mass matrix.

3.1 Least-Squares Approximation

The first problem type in which the mass matrix arises is least-squares data fitting. Given a function $f(x)\in L^{2}(\Omega)$, the best least-squares approximation to $f$ in ${\cal M}_{n}(\Omega)$ is to find $u_{n}\in{\cal M}_{n}(\Omega)$ with $u_{n}(0)=f(0)$ such that

(11)

J(u_{n})=\min_{v\in{\cal M}_{n}(\Omega)\bigcap\{v(0)=f(0)\}}J(v),

where $J(v)$ is the weighted continuous least-squares loss functional given by

\displaystyle J(v)=\dfrac{1}{2}\int_{0}^{1}r(x)\left(v(x)-f(x)\right)^{2}dx.

Let $u_{n}(x)\in{\cal M}_{n}(\Omega)$ be a solution of Eq. 11 having the form of

u_{n}(x)=f(0)+\sum_{i=1}^{n}c_{i}\sigma(x-b_{i}).

Clearly, the optimality condition on the linear parameter ${\bf c}=\left(c_{1},\ldots,c_{n}\right)^{T}$ gives

(12)

M_{r}({\bf b})\,{\bf c}={\bf f}({\bf b}),

where $M_{r}({\bf b})$ is the mass matrix defined in Eq. 1 and ${\bf f}({\bf b})$ is given by

{\bf f}({\bf b})=\int_{0}^{1}r(x)\left(f(x)-f(0)\right)\mbox{\boldmath${\psi}$% }(x)dx,

where $\mbox{\boldmath${\psi}$}(x)$ is defined in Eq. 6.

Let $D({\bf c})=\text{diag}(c_{1},\dots,c_{n})$ be a diagonal matrix with the linear parameter, then the optimality condition on the non-linear parameter leads to

(13)

{\bf 0}=\nabla_{{\bf b}}J\left(u_{n}\right)=-D({\bf c})\left(\int_{b_{1}}^{1}r% (u_{n}-f)dx,\ldots,\int_{b_{n}}^{1}r(u_{n}-f)dx\right)^{T}.

Eq. 13 is a system of non-linear algebraic equations and will be solved by Newton’s method. Let $w_{i}=r(b_{i})\bigl{(}u_{n}(b_{i})-f(b_{i})\bigr{)}$ for $i=1,\dots,n$ . In one dimension, Lemma 4.1 in [3] implies that the corresponding Hessian matrix is of the form

(14)

\nabla_{{\bf b}}^{2}J(u_{n})\equiv{\bf H}({\bf c},{\bf b})=D({\bf w})D({\bf c}% )+D({\bf c})A_{r}({\bf b})D({\bf c}),

where $D({\bf w})$ is a diagonal matrix given by

D({\bf w})=\int_{0}^{1}r(u_{n}-f)\,\text{diag}(\delta(x-b_{1}),\ldots,\delta(x-b_{n}))\,dx=\text{diag}(w_{1},\dots,w_{n}).

3.2 Diffusion-Reaction Problem

The second application that we consider is the following diffusion-reaction equation in one dimension:

(15)

\left\{\begin{array}[]{lr}-(a(x)u^{\prime}(x))^{\prime}+r(x)u(x)=f(x),&\mbox{% in }\,\Omega=(0,1),\\[5.69054pt] u(0)=\alpha,\quad u(1)=\beta&\end{array}\right.

where the diffusion coefficient $a(x)$ , the reaction coefficient $r(x)$ , and $f(x)$ are given real-valued functions defined on $\Omega$ . Assume that $a(x)\in L^{\infty}(\Omega)$ and $r(x)\in L^{\infty}(\Omega)$ are bounded below by the respective positive constant $a_{0}>0$ and non-negative constant $r_{0}\geq 0$ almost everywhere on $\Omega$ .

As in [4], the modified Ritz formulation of problem (15) is to find $u\in H^{1}(\Omega)\bigcap\{u(0)=\alpha\}$ such that

(16)

J(u)=\min_{v\in H^{1}(\Omega)\cap\{v(0)=\alpha\}}J(v),

where the modified energy functional is given by

(17)

J(v)=\frac{1}{2}\int_{0}^{1}a(x)(v^{\prime}(x))^{2}dx+\frac{1}{2}\int_{0}^{1}r% (x)(v(x))^{2}dx-\int_{0}^{1}f(x)v(x)dx+\frac{\gamma}{2}(v(1)-\beta)^{2}.

Here, $\gamma>0$ is a penalization constant. Then the Ritz neural network approximation is to find $u_{n}\in{\cal M}_{n}(\Omega)\cap\{u_{n}(0)=\alpha\}$ such that

(18)

J(u_{n})=\min_{v\in{\cal M}_{n}(\Omega)\cap\{v(0)=\alpha\}}J(v).

The corresponding bilinear form of the modified energy functional is given by

a(u,v):=\int_{0}^{1}a(x)u^{\prime}(x)v^{\prime}(x)dx+\int_{0}^{1}r(x)u(x)v(x)% dx+\gamma u(1)v(1)

for any $u,\,v\in H^{1}(\Omega)$ . Denote by $\|\cdot\|_{a}$ the induced norm of the bilinear form.

Proposition 3.1.

Let $u$ and $u_{n}$ be the solutions of problems Eq. 16 and Eq. 18, respectively. Then

(19)

\|u-u_{n}\|_{a}\leq\sqrt{3}\inf_{v\in{\cal M}_{n}(\Omega)\cap\{v(0)=\alpha\}}% \|u-v\|_{a}+\sqrt{2}\,\big{|}a(1)u^{\prime}(1)\big{|}\,\gamma^{-1/2}.

Moreover, if ${\cal M}_{n}(\Omega)$ has the following approximation property

(20)

\inf_{v\in{\cal M}_{n}(\Omega)}\|u-v\|_{H^{1}(\Omega)}\leq C(u)\,n^{-1},

then there exists a constant $C$ depending on $u$ such that

(21)

\|u-u_{n}\|_{a}\leq C\left(n^{-1}+\gamma^{-1/2}\right).

Proof 3.2.

Eq. 19 may be proved in a similar fashion as that of Lemma 2.1 in [4], and Eq. 21 is a direct consequence of Eq. 19 and Eq. 20.

3.2.1 System of Algebraic Equations

Let $u_{n}(x)$ be the solution of problem Eq. 18, then the linear parameter ${\bf c}=\left(c_{1},\ldots,c_{n}\right)^{T}$ and non-linear parameter ${\bf b}=\left(b_{1},\ldots,b_{n}\right)^{T}$ satisfy the following optimality conditions

(22)

\nabla_{{\bf c}}J\left(u_{n}\right)={\bf 0}\quad\mbox{and}\quad\nabla_{{\bf b}% }J\left(u_{n}\right)={\bf 0},

where $\nabla_{{\bf c}}$ and $\nabla_{{\bf b}}$ denote the gradients with respect to ${\bf c}$ and ${\bf b}$ , respectively.

Denote the right-hand side vector by

{\bf f}({\bf b})=\int_{0}^{1}\left(f(x)-\alpha\,r(x)\right)\nabla_{{\bf c}}u_{n}(x)dx,

and let ${\bf d}=\nabla_{{\bf c}}u_{n}(1)$ . By the same derivation in [4], the first equation in Eq. 22 becomes

(23)

\left(A_{a}({\bf b})+M_{r}({\bf b})+\gamma{\bf d}{\bf d}^{T}\right){\bf c}={% \bf f}({\bf b})+\gamma(\beta-\alpha){\bf d}.

Compared to (3.2) in [4], the additional term $M_{r}({\bf b}){\bf c}$ in Eq. 23 results from the reaction term.

For $j=1,\ldots,n$ , let

g_{j}=r(b_{j})u_{n}(b_{j})-f(b_{j})-a^{\prime}(b_{j})\left(\sum_{k=1}^{j-1}c_{% k}+\frac{c_{j}}{2}\right).

Let $D({\bf g})=\text{diag}(g_{1},\dots,g_{n})$ be the diagonal matrix whose $i$-th diagonal entry is $g_{i}$.

Lemma 3.3.

The Hessian matrix $\nabla^{2}_{{\bf b}}J(u_{n})$ has the form

(24)

\mathbf{H}({\bf c},{\bf b})=D({\bf g})D({\bf c})+D({\bf c})A_{r}({\bf b})D({% \bf c})+\gamma{\bf c}{\bf c}^{T}.

Proof 3.4.

Eq. 24 can be derived in a similar fashion to Lemma 3.2 in [4]. The only difference here is the additional reaction term in Eq. 17. For that term, the computations shown in Lemma 4.1 from [3] can be used to obtain the second-order derivatives with respect to ${\bf b}$ .

4 Damped Block Newton and Gauss-Newton Methods

Optimality conditions of the minimization problems in Eq. 11 and Eq. 18 lead to systems of non-linear algebraic equations of the form

(25)

{\cal A}({\bf b})\,{\bf c}={\cal F}({\bf b})\quad\mbox{and}\quad\nabla_{{\bf b% }}J(u_{n})={\bf 0},

for the linear and non-linear parameters, respectively, where the first equation is given in Eq. 12 for the least-squares (LS) approximation and in Eq. 23 for the diffusion-reaction (DR) equation with

{\cal A}({\bf b})=\left\{\begin{array}[]{ll}M_{r}({\bf b}),&\mbox{LS},\\[5.690% 54pt] A_{a}({\bf b})+M_{r}({\bf b})+\gamma{\bf d}{\bf d}^{T},&\mbox{DR}.\end{array}\right.

The respective Hessian matrix ${\bf H}({\bf c},{\bf b})=\nabla^{2}_{{\bf b}}J(u_{n})$ is given in Eq. 14 and Eq. 24 with

(26)

{\bf H}({\bf c},{\bf b})=\left\{\begin{array}[]{ll}D({\bf w})D({\bf c})+D({\bf c% })A_{r}({\bf b})D({\bf c}),&\mbox{LS},\\[5.69054pt] D({\bf g})D({\bf c})+D({\bf c})A_{r}({\bf b})D({\bf c})+\gamma{\bf c}{\bf c}^{% T},&\mbox{DR}.\end{array}\right.

In a similar fashion as in [3], the Gauss-Newton matrix is given by

(27)

{\bf H}_{GN}({\bf c},{\bf b})=\left\{\begin{array}[]{ll}D({\bf c})A_{r}({\bf b% })D({\bf c}),&\mbox{LS},\\[5.69054pt] D({\bf c})A_{r}({\bf b})D({\bf c})+\gamma{\bf c}{\bf c}^{T},&\mbox{DR}.\end{% array}\right.

In the case that ${\bf H}({\bf c},{\bf b})$ in Eq. 26 is invertible, the non-linear system in Eq. 25 can be solved by the damped block Newton (dBN) method described in Algorithm 4.1 of [4]. The method employs the block Gauss-Seidel method as an outer iteration for the linear and non-linear parameters. In each outer iteration, the linear and the non-linear parameters are updated by exact inversion and one step of a damped Newton method, respectively.

To efficiently invert ${\cal A}({\bf b})$ , we use the factorizations of $M_{r}({\bf b})$ and $A_{a}({\bf b})$ given in Eq. 9 and Eq. 10, respectively. That is,

(28)

M_{r}({\bf b})^{-1}=Q\,T_{M_{r}}^{-1}\,Q^{T}\quad\mbox{and}\quad\left(A_{a}(% \textbf{b})+M_{r}({\bf b})\right)^{-1}=Q\,(T_{A_{a}}+T_{M_{r}})^{-1}\,Q^{T}.

Since $T_{M_{r}}$ and $T_{A_{a}}+T_{M_{r}}$ are tri-diagonal, the action of their inverses on any vector can be computed in ${\cal O}(n)$ operations, and so can the action of ${\cal A}({\bf b})^{-1}$. For the diffusion-reaction problem, the Sherman-Morrison formula is needed to account for the rank-one term $\gamma{\bf d}{\bf d}^{T}$.
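The rank-one correction can be sketched as follows. Writing $B=A_{a}({\bf b})+M_{r}({\bf b})$, the Sherman-Morrison formula gives $(B+\gamma{\bf d}{\bf d}^{T})^{-1}{\bf g}={\bf y}-\frac{\gamma\,{\bf d}^{T}{\bf y}}{1+\gamma\,{\bf d}^{T}{\bf z}}\,{\bf z}$ with ${\bf y}=B^{-1}{\bf g}$ and ${\bf z}=B^{-1}{\bf d}$, so only two solves with $B$ are required. In the sketch, `solve_B` is a hypothetical callable standing in for the ${\cal O}(n)$ solver of Eq. 28; a diagonal matrix is used as a placeholder for $B$ purely to keep the example short.

```python
from fractions import Fraction


def dot(u, v):
    """Euclidean inner product of two vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def sherman_morrison_solve(solve_B, d, gamma, g):
    """Solve (B + gamma d d^T) c = g using two solves with B."""
    y = solve_B(g)
    z = solve_B(d)
    coef = gamma * dot(d, y) / (1 + gamma * dot(d, z))
    return [yi - coef * zi for yi, zi in zip(y, z)]


# Placeholder data (assumptions): a diagonal B stands in for A_a(b) + M_r(b).
B_diag = [Fraction(2), Fraction(4), Fraction(5)]


def solve_B(rhs):
    """Hypothetical O(n) solver for B; here B is diagonal."""
    return [r / p for r, p in zip(rhs, B_diag)]


d = [Fraction(3, 4), Fraction(1, 2), Fraction(1, 4)]
gamma = Fraction(10)
g = [Fraction(1), Fraction(0), Fraction(-1)]
c = sherman_morrison_solve(solve_B, d, gamma, g)
residual = [B_diag[i] * c[i] + gamma * d[i] * dot(d, c) - g[i] for i in range(3)]
```

The formula requires $1+\gamma\,{\bf d}^{T}B^{-1}{\bf d}\neq 0$, which holds automatically when $B$ is symmetric positive-definite and $\gamma>0$.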

In the case that ${\bf H}({\bf c},{\bf b})$ in Eq. 26 is singular, the non-linear system in Eq. 25 can be solved by the structure-guided Gauss-Newton (SgGN) method described in Algorithm 4.1 of [3]. This is possible because the layer Gauss-Newton matrix $A_{r}({\bf b})$ is always symmetric positive-definite and its inverse is tri-diagonal (see [4]). The SgGN method is essentially a damped block Gauss-Newton (dBGN) method, obtained from the dBN method by replacing ${\bf H}({\bf c},{\bf b})^{-1}$ with ${\bf H}_{GN}({\bf c},{\bf b})^{-1}$.

Lemma 4.1.

Assume that $c_{i}\neq 0$ for all $i=1,\dots,n$ . Then $D({\bf s})D({\bf c})+D({\bf c})A_{r}({\bf b})D({\bf c})$ is invertible if and only if $I+D({\bf s})A_{r}({\bf b})^{-1}D({\bf c})^{-1}$ is invertible. Moreover, we have

(29)

\left(D({\bf s})D({\bf c})+D({\bf c})A_{r}({\bf b})D({\bf c})\right)^{-1}=\left(I+D({\bf c})^{-1}A_{r}({\bf b})^{-1}D({\bf s})\right)^{-1}D({\bf c})^{-1}A_{r}({\bf b})^{-1}D({\bf c})^{-1}.

Proof 4.2.

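Under the assumption, $D({\bf c})$ and $D({\bf c})A_{r}({\bf b})D({\bf c})$ are invertible, and the equivalence follows from the factorization

D({\bf s})D({\bf c})+D({\bf c})A_{r}({\bf b})D({\bf c})=\left(I+D({\bf s})A_{r}({\bf b})^{-1}D({\bf c})^{-1}\right)D({\bf c})A_{r}({\bf b})D({\bf c}).

Factoring $D({\bf c})A_{r}({\bf b})D({\bf c})$ out on the left instead gives $D({\bf c})A_{r}({\bf b})D({\bf c})\left(I+D({\bf c})^{-1}A_{r}({\bf b})^{-1}D({\bf s})\right)$, whose second factor is the transpose of $I+D({\bf s})A_{r}({\bf b})^{-1}D({\bf c})^{-1}$. Inverting this product yields Eq. 29, which proves the lemma.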

Lemma 4.1, together with the fact that $I+D({\bf s})A_{r}({\bf b})^{-1}D({\bf c})^{-1}$ is tri-diagonal and the Sherman-Morrison formula, implies that the action of ${\bf H}({\bf c},{\bf b})^{-1}$ on any vector can be computed in ${\cal O}(n)$ operations.
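The identity in Eq. 29 can be checked numerically on a small example. The following Python sketch is only a sanity check with dense matrices and assumed data: the matrix `K` is a hypothetical symmetric positive-definite tri-diagonal stand-in for $A_{r}({\bf b})^{-1}$, and the signs of ${\bf s}$ are chosen to match those of ${\bf c}$ so that the matrix being inverted is guaranteed to be non-singular in this demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6

# Assumed stand-in for A_r(b)^{-1}: an SPD tri-diagonal matrix.
K = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
A = np.linalg.inv(K)  # dense SPD matrix playing the role of A_r(b)

# Nonzero c_i with mixed signs; s_i is given the sign of c_i so that
# D(s)D(c) + D(c) A D(c) is SPD and hence invertible in this demo.
c = rng.uniform(0.5, 2.0, size=n) * rng.choice([-1.0, 1.0], size=n)
s = rng.uniform(0.1, 1.0, size=n) * np.sign(c)

Dc_inv = np.diag(1.0 / c)

# Left-hand side of Eq. 29, formed by a dense inverse.
lhs = np.linalg.inv(np.diag(s * c) + np.diag(c) @ A @ np.diag(c))

# Right-hand side of Eq. 29; the matrix M is tri-diagonal.
M = np.eye(n) + Dc_inv @ K @ np.diag(s)
rhs = np.linalg.solve(M, Dc_inv @ K @ Dc_inv)

identity_holds = np.allclose(lhs, rhs)
M_is_tridiagonal = np.allclose(M, np.triu(np.tril(M, 1), -1))
print(identity_holds, M_is_tridiagonal)
```

In an ${\cal O}(n)$ implementation, no inverse would be formed explicitly: the products with the tri-diagonal matrix and the diagonal scalings are applied to a vector directly, and the system with `M` is solved by a tri-diagonal solver.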

4.1 An Adaptivity Scheme

For a fixed number of neurons, the dBN method for the diffusion-reaction equation moves the initial uniformly distributed breakpoints very efficiently to nearly optimal locations, as shown in Section 5. However, it was shown in [4] that introducing adaptivity improves the convergence rate.

In [4], the adaptive network enhancement (ANE) method [9, 10] was employed. The ANE method starts with a relatively small neural network and adaptively adds new neurons based on the previous approximation. The newly added neurons are initialized where the previous approximation is not accurate. At each adaptive step, we use the dBN method to numerically solve the minimization problem in Eq. 18. Section 5 in [4] describes how to introduce adaptivity, and Algorithm 5.1 in [4] describes the adaptive block Newton (AdBN) method.

Here, the only modification is the local indicator. Letting $\mathcal{K}=[c,d]\subseteq[0,1]$ be a subinterval, a modified local indicator of the ZZ type on $\mathcal{K}$ (see, e.g., [5]) is defined by

\xi_{\mathcal{K}}^{2}=\lVert a^{-1/2}\left(G(au_{n}^{\prime})-au_{n}^{\prime}\right)\rVert_{L^{2}(\mathcal{K})}^{2}+(d-c)^{2}\lVert-G^{\prime}(a^{2}u_{n}^{\prime})+u_{n}-f\rVert_{L^{2}(\mathcal{K})}^{2},

where $G(v)$ is the projection of $v$ onto the space of the continuous piecewise linear functions.

5 Numerical Experiments

This section first presents numerical results of the dBN and dBGN methods for solving Eq. 11. Afterwards, results of the dBN, dBGN and AdBN methods for solving Eq. 15 are shown in Section 5.2 and Section 5.3. For diffusion-reaction problems, the penalization parameter $\gamma$ was set to $10^{4}$ . For the AdBN method, a refinement occurred when the difference of the total estimators for two consecutive iterates was less than $10^{-7}$ .

For each test problem of the diffusion-reaction equation, let $u$ and $u_{n}$ be the exact solution and its approximation in $\mathcal{M}_{n}(\Omega)$ , respectively. Denote the relative error by

e_{n}=\frac{|u-u_{n}|_{H^{1}(\Omega)}}{|u|_{H^{1}(\Omega)}}.

5.1 Least-Squares Problem

The first test problem uses the function

(30)

u(x)=\sqrt{x}

as the target function for problem Eq. 11, with $r(x)=1$. We aim to test the performance of dBN and dBGN for least-squares data fitting problems. LABEL:example3BFGSdBN presents a comparison between dBN, dBGN and BFGS. In this comparison, we used the Python BFGS implementation from ‘scipy.optimize’. The initial network parameters for the three algorithms were set to a uniform mesh for ${\bf b}^{(0)}$, with ${\bf c}^{(0)}$ given by solving Eq. 12. Recall that the computational cost per iteration of dBN and dBGN is ${\cal O}(n)$, while each iteration of BFGS costs ${\cal O}(n^{2})$. In this example, our solvers outperform BFGS, achieving smaller losses in fewer and cheaper iterations.

LABEL:example2DF (a) illustrates the neural network approximation of the function in Eq. 30, obtained using uniform breakpoints and determining the linear parameter through the solution of Eq. 12. Clearly, it is preferable to concentrate more mesh points on the left side, where the curve is steeper. The dBN method is capable of making this adjustment, as illustrated in LABEL:example2DF (b). The loss functions confirm that the approximation improves substantially when the breakpoints are allocated according to the steepness of the function.
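The benefit of concentrating breakpoints near $x=0$ can be illustrated without any optimization. The following Python sketch compares the $L^{2}$ error of the continuous piecewise linear interpolant of $\sqrt{x}$ on a uniform mesh with that on a mesh graded toward the origin. This is only an illustration under stated assumptions: it uses interpolation rather than the least-squares solution of Eq. 12, the quadratic grading is a hypothetical choice rather than the breakpoints produced by dBN, and the $L^{2}$ norm is approximated on a fine uniform grid.

```python
import numpy as np

def l2_interp_error(breakpoints, n_fine=20001):
    """Approximate L2(0,1) error of the piecewise linear interpolant of sqrt(x)."""
    x = np.linspace(0.0, 1.0, n_fine)
    u = np.sqrt(x)
    u_n = np.interp(x, breakpoints, np.sqrt(breakpoints))
    # Mean over a fine uniform grid approximates the integral over (0, 1).
    return np.sqrt(np.mean((u - u_n) ** 2))

n = 16
uniform = np.linspace(0.0, 1.0, n + 1)
graded = uniform ** 2  # hypothetical grading that clusters points near x = 0

err_uniform = l2_interp_error(uniform)
err_graded = l2_interp_error(graded)
print(f"uniform: {err_uniform:.2e}, graded: {err_graded:.2e}")
```

With the same number of breakpoints, the graded mesh gives a markedly smaller error because the interval adjacent to the singularity of $u^{\prime}$ at $x=0$ is much shorter.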

5.2 Exponential Solution

The second test problem involves the function

(31)

u(x)=x\left(\exp\left(-\frac{{(x-\frac{1}{3})^{2}}}{{0.01}}\right)-\exp\left(-\frac{{4}}{{9\times 0.01}}\right)\right),

serving as a solution of Eq. 15 for $a(x)=r(x)=1$ and $\alpha=\beta=0$ .

As in LABEL:example3BFGSdBN, we start by comparing our two solvers with BFGS. The initial network parameters for all algorithms were set to a uniform mesh for ${\bf b}^{(0)}$, with ${\bf c}^{(0)}$ given by the exact solution of Eq. 23. We observe in LABEL:example1BFGSdB that in about 25 iterations, both dBN and dBGN achieve an accuracy that BFGS cannot attain.

LABEL:ex1Figure (a) shows the initial neural network approximation of the function in Eq. 31, obtained by using uniform breakpoints and determining the linear parameter through the solution of Eq. 23. The approximation generated by dBN is shown in LABEL:ex1Figure (b), while LABEL:ex1Figure (c) illustrates the approximation obtained by employing dBN with adaptivity. Notably, in both cases, the breakpoints are moved, and the resulting approximation improves on the initial one.

Theoretically, from Eq. 21, $\frac{1}{n}$ is the order of convergence for approximating the solution in Eq. 31 by functions in ${\cal M}_{n}$. However, since Eq. 18 is a non-convex optimization problem, the existence of local minima makes it challenging to achieve this order. Therefore, given the neural network approximation $u_{n}$ to $u$ provided by the dBN method, assume that

e_{n}=\left(\frac{1}{n}\right)^{r},

for some $r>0$ . As in [4], we can use the AdBN method to improve the order of convergence of the dBN method (achieve an $r$ closer to 1).

Table 1 reports the adaptive dBN (AdBN) method starting with 20 neurons, refining 8 times, and reaching a final count of 194 neurons. The stopping tolerance was set to $\epsilon=0.05$. The recorded data in Table 1 include the relative seminorm error and the relative error estimator for each iteration of the adaptive process. Additionally, Table 1 provides the results for dBN with a fixed number of 144 and 194 neurons. Comparing these results to the adaptive run with the same number of neurons, we observe a significant improvement in rate, error estimator, and seminorm error within the adaptive run.

NN ( $n$ neurons)	$e_{n}$	$\xi_{n}$	$r$
Adaptive (20)	$1.01\times 10^{-1}$	0.545	0.766
Adaptive (27)	$6.40\times 10^{-2}$	0.342	0.834
Adaptive (32)	$5.64\times 10^{-2}$	0.259	0.830
Adaptive (46)	$3.81\times 10^{-2}$	0.193	0.854
Adaptive (52)	$3.47\times 10^{-2}$	0.146	0.851
Adaptive (71)	$2.64\times 10^{-2}$	0.107	0.853
Adaptive (99)	$1.90\times 10^{-2}$	0.079	0.862
Adaptive (144)	$1.31\times 10^{-2}$	0.052	0.872
Adaptive (194)	$8.83\times 10^{-3}$	0.037	0.898
Fixed (144)	$1.90\times 10^{-2}$	0.075	0.798
Fixed (194)	$1.50\times 10^{-2}$	0.057	0.797

Table 1: Comparison of an adaptive network with fixed networks for relative error $e_{n}$, relative error estimators $\xi_{n}$, and powers $r$.
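The powers $r$ in Table 1 follow from solving $e_{n}=(1/n)^{r}$ for $r$, that is, $r=\log(e_{n})/\log(1/n)$. The following Python sketch recomputes them from the tabulated values of $n$ and $e_{n}$. Because $e_{n}$ is reported to three significant digits only, the recomputed powers can differ from the tabulated ones in the last digit. The helper name `rate` is hypothetical.

```python
import math

# (label, n, e_n) copied from Table 1.
table1 = [
    ("Adaptive", 20, 1.01e-1),
    ("Adaptive", 27, 6.40e-2),
    ("Adaptive", 32, 5.64e-2),
    ("Adaptive", 46, 3.81e-2),
    ("Adaptive", 52, 3.47e-2),
    ("Adaptive", 71, 2.64e-2),
    ("Adaptive", 99, 1.90e-2),
    ("Adaptive", 144, 1.31e-2),
    ("Adaptive", 194, 8.83e-3),
    ("Fixed", 144, 1.90e-2),
    ("Fixed", 194, 1.50e-2),
]

def rate(n, e_n):
    """Solve e_n = (1/n)^r for the power r."""
    return math.log(e_n) / math.log(1.0 / n)

rates = [rate(n, e_n) for _, n, e_n in table1]
for (label, n, _), r in zip(table1, rates):
    print(f"{label:8s} n={n:3d}  r={r:.3f}")
```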

5.3 Singularly Perturbed Reaction-Diffusion Equation

The third test problem is a singularly perturbed reaction-diffusion equation:

(32)

\left\{\begin{array}[]{lr}-\varepsilon^{2}u^{\prime\prime}(x)+u(x)=f(x),&x\in\Omega=(-1,1),\\[5.69054pt] u(-1)=u(1)=0.\end{array}\right.

For $f(x)=-2\left(\varepsilon-4x^{2}\tanh{\left(\frac{1}{\varepsilon}(x^{2}-\frac{1}{4})\right)}\right)\left(1/\cosh{\left(\frac{1}{\varepsilon}(x^{2}-\frac{1}{4})\right)}\right)^{2}+\tanh{\left(\frac{1}{\varepsilon}(x^{2}-\frac{1}{4})\right)}-\tanh{\left(\frac{3}{4\varepsilon}\right)}$, problem Eq. 32 has the following exact solution

(33)

u(x)=\tanh{\left(\frac{1}{\varepsilon}(x^{2}-\frac{1}{4})\right)}-\tanh{\left(\frac{3}{4\varepsilon}\right)}.
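As a quick consistency check of the pair $(f,u)$, the following Python sketch evaluates the residual $-\varepsilon^{2}u^{\prime\prime}+u-f$ with a central finite difference for $u^{\prime\prime}$ and confirms the boundary values. The value $\varepsilon=0.1$, the step size `h`, and the sample points are assumptions made only for this check; for much smaller $\varepsilon$ the finite-difference step would have to be reduced accordingly.

```python
import numpy as np

def u_exact(x, eps):
    """Exact solution in Eq. 33."""
    return np.tanh((x**2 - 0.25) / eps) - np.tanh(0.75 / eps)

def f_rhs(x, eps):
    """Right-hand side f(x) stated for problem Eq. 32."""
    g = (x**2 - 0.25) / eps
    return (-2.0 * (eps - 4.0 * x**2 * np.tanh(g)) / np.cosh(g) ** 2
            + np.tanh(g) - np.tanh(0.75 / eps))

def residual(x, eps, h=1e-4):
    """Residual -eps^2 u'' + u - f, with u'' from a central difference."""
    upp = (u_exact(x + h, eps) - 2.0 * u_exact(x, eps)
           + u_exact(x - h, eps)) / h**2
    return -eps**2 * upp + u_exact(x, eps) - f_rhs(x, eps)

eps = 0.1  # assumed value, chosen so the finite difference is accurate
x = np.linspace(-0.99, 0.99, 199)
max_residual = np.max(np.abs(residual(x, eps)))
boundary_values = (u_exact(-1.0, eps), u_exact(1.0, eps))
print(max_residual, boundary_values)
```

The residual is at the level of the finite-difference truncation error, and both boundary values vanish exactly since $x^{2}-\frac{1}{4}=\frac{3}{4}$ at $x=\pm 1$.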

For small $\nu=\varepsilon^{2}$, these problems exhibit interior layers that make them challenging for mesh-based methods such as finite element and finite difference methods, which can produce overshooting and oscillations. For $\nu=10^{-4}$, LABEL:example2DR illustrates the neural network approximation of the function described in Eq. 33, using uniform breakpoints (a) and employing dBN to adjust the breakpoints (b). An interesting observation is that the resulting approximation from dBN does not exhibit overshooting or oscillations. This confirms that dBN is capable of successfully adjusting the breakpoints and may have the potential to accurately approximate solutions with boundary and/or interior layers.

It is worth mentioning that the relative $L^{2}$ -norm error of the approximation depicted in LABEL:example2DR (b) is $8.85\times 10^{-4}$ . In [2], similar errors were obtained using deep neural networks with $2962$ parameters. In our case, the number of parameters is only $65$ .

The resulting relative errors obtained after using dBN for various values of $\nu$ are shown in Table 2. For each value of $\nu$ , dBN considerably improves the initial approximation, and the error does not vary significantly with different values of $\nu$ .

$\nu$	$e_{n}$ (initial)	$e_{n}$ (dBN)
$10^{-2}$	$1.63\times 10^{-1}$	$6.72\times 10^{-2}$
$10^{-3}$	$5.53\times 10^{-1}$	$8.08\times 10^{-2}$
$10^{-4}$	$8.89\times 10^{-1}$	$7.65\times 10^{-2}$
$10^{-5}$	$9.69\times 10^{-1}$	$8.58\times 10^{-2}$
$10^{-6}$	$9.90\times 10^{-1}$	$8.09\times 10^{-2}$

Table 2: Relative errors $e_{n}$ obtained by using ReLU networks to approximate the function in Eq. 33 for different $\nu=\varepsilon^{2}$. Initial: NN model with 32 uniform breakpoints. dBN: optimized NN model with 32 breakpoints after 200 iterations.

We also present the results of using adaptive mesh refinement. LABEL:example22DR shows the neural network approximation obtained by starting with 12 uniform breakpoints. Refinements are performed using the average marking strategy (see equation (5.2) in [4]) to achieve an error similar to that of the approximation in LABEL:example2DR (b). After each refinement, the linear parameter was computed by solving Eq. 23. In LABEL:example22DR (a), the breakpoints were not moved, whereas LABEL:example22DR (b) illustrates the AdBN method, where the breakpoints were moved after each refinement.

6 Discussion and Conclusion

The mass matrix $M_{r}({\bf b})$ associated with the shallow ReLU neural network arises in applications such as diffusion-reaction equations and least-squares data fitting. Unlike the finite element mass matrix, the NN mass matrix is dense and very ill-conditioned (see Lemma 2.3). These features hinder the efficiency of commonly used numerical methods for solving the resulting systems of linear equations.

This difficulty is overcome in one dimension through a special factorization of the mass matrix, which was derived using both algebraic and geometrical approaches. This factorization enables inversion of the mass matrix at a computational cost of ${\cal O}(n)$. Combining this with the fact that the inverse of the coefficient matrix $A_{r}({\bf b})$ is tri-diagonal, the resulting damped block Newton (dBN) method is implemented with a computational cost of just ${\cal O}(n)$ per iteration, provided that the corresponding Hessian matrix is invertible. The quadratic form of the objective functions for certain problems allows the construction of damped block Gauss-Newton (dBGN) methods, which benefit from having symmetric positive-definite Gauss-Newton matrices. For diffusion-reaction problems in particular, the addition of adaptive network enhancement (ANE) improves the rate of convergence.

Overall, the numerical results demonstrate the efficiency of the various methods in terms of not only the number of iterations but also the cost per iteration, making a compelling case to pursue the construction of similar solvers for higher-dimensional problems. Of particular interest is the application of dBN methods to the singularly perturbed reaction-diffusion problem. For a fixed number of mesh points $n$, dBN appears to achieve an accuracy independent of the diffusion coefficient $\varepsilon^{2}$. Furthermore, when adaptivity is added, AdBN seems to be comparable to finite element methods using mesh refinement.

References

[1] J. Berg and K. Nyström. A unified deep artificial neural network approach to partial differential equations in complex geometries. Neurocomputing, 317:28–41, 2018.
[2] Z. Cai, J. Chen, M. Liu, and X. Liu. Deep least-squares methods: An unsupervised learning-based numerical method for solving elliptic PDEs. Journal of Computational Physics, 420:109707, 2020.
[3] Z. Cai, T. Ding, M. Liu, X. Liu, and J. Xia. A structure-guided Gauss-Newton method for shallow ReLU neural network. arXiv:2404.05064v1 [cs.LG], 2024.
[4] Z. Cai, A. Doktorova, R. D. Falgout, and C. Herrera. Fast iterative solver for neural network method: I. 1D diffusion problems. arXiv:2404.17750 [math.NA], 2024.
[5] Z. Cai and S. Zhang. Recovery-based error estimators for interface problems: conforming linear elements. SIAM Journal on Numerical Analysis, 47(3):2132–2156, 2009.
[6] T. Dockhorn. A discussion on solving partial differential equations using neural networks. arXiv:1904.07200 [cs.LG], 2019.
[7] W. E and B. Yu. The deep Ritz method: A deep learning-based numerical algorithm for solving variational problems. Communications in Mathematics and Statistics, 6(1):1–12, March 2018.
[8] I. Fried. The $l_{2}$ and $l_{\infty}$ condition numbers of the finite element stiffness and mass matrices, and the pointwise convergence of the method. In J.R. Whiteman, editor, The Mathematics of Finite Elements and Applications, pages 163–174. Academic Press, 1973.
[9] M. Liu and Z. Cai. Adaptive two-layer ReLU neural network: II. Ritz approximation to elliptic PDEs. Computers & Mathematics with Applications, 113:103–116, May 2022.
[10] M. Liu, Z. Cai, and J. Chen. Adaptive two-layer ReLU neural network: I. Best least-squares approximation. Computers & Mathematics with Applications, 113:34–44, May 2022.
[11] M. Raissi, P. Perdikaris, and G.E. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019.
[12] J. Sirignano and K. Spiliopoulos. DGM: A deep learning algorithm for solving partial differential equations. Journal of Computational Physics, 375:1339–1364, 2018.