Structured and Balanced Multi-component
and Multi-layer Neural Networks

Shijun Zhang, Department of Mathematics, Duke University, Durham, NC 27708; [email protected]
Hongkai Zhao, Department of Mathematics, Duke University, Durham, NC 27708; [email protected]
Yimin Zhong, Department of Mathematics and Statistics, Auburn University, Auburn, AL 36830; [email protected]
Haomin Zhou, School of Mathematics, Georgia Institute of Technology, Atlanta, GA 30332; [email protected]

Abstract

In this work, we propose a balanced multi-component and multi-layer neural network (MMNN) structure that approximates functions with complex features accurately and efficiently in terms of degrees of freedom and computational cost. The main idea is a “divide-and-conquer” strategy for handling a complex function: a multi-component and multi-layer decomposition in which each component can be approximated effectively by a single-layer network. Although MMNNs are only a simple modification of fully connected neural networks (FCNNs), or multi-layer perceptrons (MLPs), obtained by introducing balanced multi-component structures into the network, they achieve a significant reduction in the number of training parameters, a much more efficient training process, and much improved accuracy compared to FCNNs or MLPs. Extensive numerical experiments are presented to illustrate the effectiveness of MMNNs in approximating highly oscillatory functions and their automatic adaptivity in capturing localized features. Our code and implementation details are available online.

Key words. deep neural networks, rectified linear unit, function compositions, Fourier series

1 Introduction

The key use of neural networks is to approximate an input-to-output relation, i.e., a mapping, or a function in mathematical terms. In this work, we continue our numerical study of neural network approximation of functions, from representation to learning dynamics. In our earlier study [38], we demonstrated that a one-hidden-layer (also known as a two-layer or shallow) network is essentially a “low-pass filter” when approximating a function in practice. Due to the strong correlation among the family of activation functions (parameterized by the weight and bias), such as $\mathtt{ReLU}$ (rectified linear unit), the Gram matrix, whose entries are the pairwise correlations (inner products) of the activation functions, has a fast spectral decay. If the network is initialized randomly, the eigenvectors of the Gram matrix correspond to generalized Fourier modes, ordered from low frequency to high frequency as the eigenvalues decrease. Due to the ill-conditioning of the representation, no matter how wide a one-hidden-layer network is, it can only learn and approximate smooth functions or sample low-frequency modes effectively and stably (with respect to noise or machine round-off errors).

In this work, we propose a balanced multi-component and multi-layer neural network (MMNN) structure based on our previous understanding of a one-hidden-layer network. First, we show that a multi-layer network with a multi-component structure, each component of which can be approximated well and effectively by a one-hidden-layer network, can overcome the limitation of a shallow network by smooth decomposition and transformation. Compared to a fully connected neural network of a similar structure, our proposed MMNN is much more effective in terms of representation, training, and accuracy in approximating functions, especially for functions containing complex features, e.g., high-frequency modes. The key idea of MMNNs is to view a linear combination of activation functions, treated as randomly parameterized basis functions, as a whole, called a component, that represents a smooth function. Each layer has multiple components, all sharing the common basis functions with different linear combinations. The number of components, called the rank, is typically much smaller than the layer’s width and is increased to enhance the flexibility of decomposition when dealing with more complex functions. These components are combined and composed (through layers) in a structured and balanced way in terms of network width, rank, and depth to approximate a complicated function effectively. Another important feature we use in practice is that the weights and biases inside each activation function are randomly assigned and fixed during the optimization, while the linear combination weights of the activation functions in each component are trained. This leads to more efficient training processes, motivated by our finding that a one-hidden-layer neural network can be trained effectively to approximate a smooth function well using random basis functions. We also demonstrate interesting learning dynamics based on the Adam optimizer [14], which is crucial for the successful and efficient training of MMNNs. 
An important remark is that a balanced and holistic approach needs to consider both representation and optimization as well as their interplay altogether.

The outline of this paper is summarized as follows. In Section 2, the design of MMNNs is proposed and explained. Then, in Section 3, a mathematical framework for smooth decomposition and transformation using the MMNN architecture is presented, demonstrating that each component can be effectively approximated by a one-hidden-layer network. Extensive numerical experiments are presented in Section 4 to verify the analysis and demonstrate the effectiveness of MMNNs in the approximation of more complicated functions. Further discussion is presented in Section 5, where more insights and implementation guidelines of MMNNs are provided. Finally, remarks and conclusions are provided in Section 6.

2 Multi-component and multi-layer neural network (MMNN)

This section begins with an overview of the main notations used in this paper, as detailed in Section 2.1. Subsequently, we introduce a novel network architecture, the Multi-component and Multi-layer Neural Network (MMNN), which is developed based on the balanced decomposition principle discussed in Section 2.2. Following this, in Section 2.3, we outline the learning strategy of MMNN and highlight its advantages over other methods. Finally, in Section 2.4, we compare the numerical performance of MMNNs and FCNNs.

2.1 Notations

The following is an overview of the basic notations used in this paper.

  • The symbols $\mathbb{N}$, $\mathbb{Z}$, $\mathbb{Q}$, and $\mathbb{R}$ are used to denote the sets of natural numbers (including $0$), integers, rational numbers, and real numbers, respectively. The set of positive natural numbers is denoted as $\mathbb{N}^{+}=\mathbb{N}\backslash\{0\}$.

  • The indicator (or characteristic) function of a set $A$, denoted by $\mathds{1}_{A}$, is a function that takes the value $1$ for elements of $A$ and $0$ for elements not in $A$.

  • The floor and ceiling functions of a real number $x$ can be represented as $\lfloor x\rfloor=\max\{n:n\leq x,\ n\in\mathbb{Z}\}$ and $\lceil x\rceil=\min\{n:n\geq x,\ n\in\mathbb{Z}\}$.

  • Vectors are denoted by bold lowercase letters, such as $\bm{a}=(a_{1},\cdots,a_{n})\in\mathbb{R}^{n}$. On the other hand, matrices are represented by bold uppercase letters. For example, $\bm{A}\in\mathbb{R}^{m\times n}$ refers to a real matrix of size $m\times n$, and $\bm{A}^{\textsf{T}}$ denotes the transpose of matrix $\bm{A}$.

  • Slicing notation is used for a vector $\bm{x}=(x_{1},\cdots,x_{d})\in\mathbb{R}^{d}$, where $\bm{x}[n:m]$ denotes a slice of $\bm{x}$ from its $n$-th to its $m$-th entry for any $n,m\in\{1,2,\cdots,d\}$ with $n\leq m$, and $\bm{x}[n]$ denotes the $n$-th entry of $\bm{x}$. For example, if $\bm{x}=(x_{1},x_{2},x_{3})\in\mathbb{R}^{3}$, then $(5\bm{x})[2:3]=(5x_{2},5x_{3})$ and $(6\bm{x}+1)[3]=6x_{3}+1$. A similar notation is employed for matrices. For instance, $\bm{A}[:,i]$ refers to the $i$-th column of $\bm{A}$, whereas $\bm{A}[i,:]$ indicates the $i$-th row of $\bm{A}$. Additionally, $\bm{A}[i,n:m]$ corresponds to $(\bm{A}[i,:])[n:m]$, which means it extracts the entries from the $n$-th to the $m$-th in the $i$-th row.
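When translating this notation into code, it may help to note that the slices above are 1-based and include both endpoints, whereas Python slices are 0-based and exclude the end index. The following is a minimal sketch of the correspondence; the helper name `slice_1based` and the sample values are hypothetical and chosen only for illustration.

```python
def slice_1based(v, n, m=None):
    """Return v[n:m] in the 1-based, inclusive notation used in the text.

    With m omitted, return the single entry v[n].
    (Hypothetical helper, for illustration only.)
    """
    if m is None:
        return v[n - 1]
    if not 1 <= n <= m <= len(v):
        raise IndexError("need 1 <= n <= m <= len(v)")
    return v[n - 1:m]


x = [2.0, 3.0, 4.0]                     # stands for (x1, x2, x3)
five_x = [5 * t for t in x]             # the vector 5x
six_x_plus_1 = [6 * t + 1 for t in x]   # the vector 6x + 1

print(slice_1based(five_x, 2, 3))       # (5x)[2:3] = (5*x2, 5*x3)
print(slice_1based(six_x_plus_1, 3))    # (6x+1)[3] = 6*x3 + 1
```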

2.2 Architecture of MMNNs

In this section, we introduce our Multi-component and Multi-layer Neural Network (MMNN). Each layer of MMNN is a (shallow) neural network of the form

$$\bm{h}(\bm{x})=\bm{A}\sigma(\bm{W}\bm{x}+\bm{b})+\bm{c}$$

to approximate a vector-valued function $\bm{f}\in C([0,1]^{d_{\textnormal{in}}};\mathbb{R}^{d_{\textnormal{out}}})$, where $\bm{W}\in\mathbb{R}^{n\times d_{\textnormal{in}}}$, $\bm{A}\in\mathbb{R}^{d_{\textnormal{out}}\times n}$, and $n$ is the width of this network. Here, $\sigma:\mathbb{R}\to\mathbb{R}$ represents the activation function that can be applied elementwise to vector inputs. Throughout this paper, the activation function is $\mathtt{ReLU}$, unless otherwise specified. One can also write it in a more compact form,

$$\bm{h}=\bm{A}\sigma(\bm{W}\bm{x}+\bm{b})+\bm{c}=\widetilde{\bm{A}}\left[\begin{matrix}\sigma(\widetilde{\bm{W}}\widetilde{\bm{x}})\\ 1\end{matrix}\right]=\widetilde{\bm{A}}\sigma(\widetilde{\bm{W}}\widetilde{\bm{x}}),\tag{1}$$

where

$$\widetilde{\bm{W}}=\left[\begin{matrix}\bm{W},\bm{b}\end{matrix}\right],\quad\widetilde{\bm{A}}=\left[\begin{matrix}\bm{A},\bm{c}\end{matrix}\right],\quad\widetilde{\bm{x}}=\left[\begin{matrix}\bm{x}\\ 1\end{matrix}\right].$$

We call each element of $\bm{h}$, i.e., $\bm{h}[i]=\widetilde{\bm{A}}[i,:]\cdot\left[\begin{matrix}\sigma(\widetilde{\bm{W}}\widetilde{\bm{x}})\\ 1\end{matrix}\right]$ for $i=1,2,\cdots,d_{\textnormal{out}}$, a component. Here are a few key features of $\bm{h}$:

  1. Each component is viewed as a whole as a linear combination of the basis functions $\sigma(\bm{W}[i,:]\cdot\bm{x}+\bm{b}[i])$, $i=1,2,\cdots,n$, each of which is a function of $\bm{x}$.

  2. Different components of $\bm{h}$ share the same set of basis functions with different coefficients $\widetilde{\bm{A}}[i,:]$.

  3. Only $\widetilde{\bm{A}}$ is trained, while $\widetilde{\bm{W}}$ is randomly assigned and fixed.

  4. The output dimension $d_{\textnormal{out}}$ and network width $n$ can be tuned according to the intrinsic dimension and complexity of the problem.
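As a concrete illustration of a single layer and its compact form (1), the following NumPy sketch evaluates both expressions and checks that they agree. It is a sketch under stated assumptions rather than the authors' implementation: the sizes `d_in`, `n`, and `d_out` are arbitrary, the fixed parameters follow the uniform initialization described in Section 2.3, and random values stand in for trained entries of $\bm{A}$ and $\bm{c}$.

```python
import numpy as np

rng = np.random.default_rng(0)


def relu(z):
    return np.maximum(z, 0.0)


d_in, n, d_out = 2, 8, 3   # illustrative sizes (assumed)

# Fixed random basis parameters (not trained): U(-sqrt(k), sqrt(k)), k = 1/d_in.
bound = 1.0 / np.sqrt(d_in)
W = rng.uniform(-bound, bound, size=(n, d_in))
b = rng.uniform(-bound, bound, size=n)

# Trainable linear-combination weights; random values stand in for trained ones.
A = rng.standard_normal((d_out, n))
c = rng.standard_normal(d_out)


def layer(x):
    """h(x) = A relu(W x + b) + c for a single input x of shape (d_in,)."""
    return A @ relu(W @ x + b) + c


def layer_compact(x):
    """Same map written with A~ = [A, c], W~ = [W, b], and x~ = [x; 1]."""
    A_t = np.hstack([A, c[:, None]])              # shape (d_out, n + 1)
    W_t = np.hstack([W, b[:, None]])              # shape (n, d_in + 1)
    x_t = np.append(x, 1.0)                       # shape (d_in + 1,)
    features = np.append(relu(W_t @ x_t), 1.0)    # [relu(W~ x~); 1]
    return A_t @ features


x = rng.uniform(0.0, 1.0, size=d_in)
h = layer(x)
print(h.shape)                            # one entry per component
print(np.allclose(h, layer_compact(x)))   # the two forms agree
```

Every component `h[i]` reuses the same `n` basis values `relu(W @ x + b)` and differs only in the row `A[i, :]` and the offset `c[i]`.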

In comparison, each layer in a typical deep FCNN takes the form $\sigma(\widetilde{\bm{W}}\widetilde{\bm{x}})$, and each hidden neuron is individually a function of the input $\bm{x}$; equivalently, each point $\bm{x}\in\mathbb{R}^{d_{\textnormal{in}}}$ is mapped to $\mathbb{R}^{n}$, where $n$ is the layer width. All weights $\widetilde{\bm{W}}$ are training parameters. In an MMNN, each layer is composed of multiple components $\widetilde{\bm{A}}\sigma(\widetilde{\bm{W}}\widetilde{\bm{x}})$. Each component is a linear combination of randomly parameterized hidden neurons $\sigma(\widetilde{\bm{W}}\widetilde{\bm{x}})$, which can be more effectively and stably trained through $\widetilde{\bm{A}}$ as a smooth decomposition/transformation. Typically, the number of components $d_{\textnormal{out}}$ is (much) smaller than the layer width $n$ in our experiments.

An MMNN is a multi-layer composition of $\bm{h}_{i}$, i.e., $\bm{h}:\mathbb{R}^{d_{\textnormal{in}}}\mapsto\mathbb{R}^{d_{\textnormal{out}}}$,

$$\bm{h}=\bm{h}_{m}\circ\cdots\circ\bm{h}_{2}\circ\bm{h}_{1},$$

where each $\bm{h}_{i}:\mathbb{R}^{d_{i-1}}\mapsto\mathbb{R}^{d_{i}}$ is a multi-component shallow network defined in (1) of width $n_{i}$, where

$$d_{0}=d_{\textnormal{in}},\qquad d_{1},\cdots,d_{m-1}\ll n_{i},\qquad d_{m}=d_{\textnormal{out}}.$$

The width of this MMNN is defined as $\max\{n_{i}:i=1,2,\cdots,m-1\}$, the rank as $\max\{d_{i}:i=1,2,\cdots,m-1\}$, and the depth as $m$. To simplify, we denote a network with width $w$, rank $r$, and depth $l$ using the compact notation $(w,r,l)$. See Figure 1(a) for an illustration of an MMNN of size $(4,2,2)$. In contrast, an FCNN $\bm{\phi}$ can be expressed in the following composition form

$$\bm{\phi}=\bm{\mathcal{L}}_{L}\circ\sigma\circ\bm{\mathcal{L}}_{L-1}\circ\cdots\circ\sigma\circ\bm{\mathcal{L}}_{1}\circ\sigma\circ\bm{\mathcal{L}}_{0},$$

where $\bm{\mathcal{L}}_{i}$ is an affine linear map given by $\bm{\mathcal{L}}_{i}(\bm{y})=\bm{W}_{i}\cdot\bm{y}+\bm{b}_{i}$. Readers are referred to Figure 1(b) for an illustration and also a comparison with the MMNN.
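The composition $\bm{h}_{m}\circ\cdots\circ\bm{h}_{1}$ of an MMNN of size $(w,r,l)$ can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the function names `init_mmnn` and `mmnn_forward` are hypothetical, all layers are assumed to share the same width and rank, and the uniform initialization of $\bm{A}_{i}$ and $\bm{c}_{i}$ is an assumption made only so that the example runs.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def init_mmnn(d_in, d_out, width, rank, depth, rng):
    """Return a list of (W, b, A, c) tuples, one per layer h_1, ..., h_m."""
    dims = [d_in] + [rank] * (depth - 1) + [d_out]    # d_0, d_1, ..., d_m
    layers = []
    for d_prev, d_next in zip(dims[:-1], dims[1:]):
        bound_w = 1.0 / np.sqrt(d_prev)
        W = rng.uniform(-bound_w, bound_w, size=(width, d_prev))   # fixed
        b = rng.uniform(-bound_w, bound_w, size=width)             # fixed
        bound_a = 1.0 / np.sqrt(width)                             # assumed
        A = rng.uniform(-bound_a, bound_a, size=(d_next, width))   # trainable
        c = rng.uniform(-bound_a, bound_a, size=d_next)            # trainable
        layers.append((W, b, A, c))
    return layers


def mmnn_forward(layers, x):
    """Evaluate h_m o ... o h_1 at a single input x of shape (d_in,)."""
    y = x
    for W, b, A, c in layers:
        y = A @ relu(W @ y + b) + c    # one multi-component shallow layer
    return y


rng = np.random.default_rng(0)
layers = init_mmnn(d_in=1, d_out=1, width=4, rank=2, depth=2, rng=rng)
y = mmnn_forward(layers, np.array([0.3]))
print([(W.shape, A.shape) for W, b, A, c in layers], y.shape)
```

Since no activation separates $\bm{A}_{i}$ from $\bm{W}_{i+1}$, consecutive layers interact through the product $\bm{W}_{i+1}\bm{A}_{i}$, whose rank is at most $d_{i}$; this is one way to read the role of the rank in the size triple $(w,r,l)$.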

For very deep MMNNs, one can borrow ideas from ResNets [8] to address the vanishing-gradient issue, making training more efficient. Incorporating this idea, we propose a new architecture given by a multi-layer composition of $\bm{I}+\bm{h}_{i}$, i.e., $\bm{h}:\mathbb{R}^{d_{\textnormal{in}}}\mapsto\mathbb{R}^{d_{\textnormal{out}}}$,

$$\bm{h}=\bm{h}_{m}\circ(\bm{I}+\bm{h}_{m-1})\circ\cdots\circ(\bm{I}+\bm{h}_{3})\circ(\bm{I}+\bm{h}_{2})\circ\bm{h}_{1},$$

where each $\bm{h}_{i}:\mathbb{R}^{d_{i-1}}\mapsto\mathbb{R}^{d_{i}}$ is a multi-component shallow network defined in (1) with width $n_{i}$,

$$d_{0}=d_{\textnormal{in}},\qquad d_{1}=\cdots=d_{m-1}=r\ll n_{i},\qquad d_{m}=d_{\textnormal{out}},$$

and $\bm{I}$ is the identity map. We call this architecture ResMMNN. See Figure 1(c) for an illustration of a ResMMNN of size $(4,2,3)$.
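The residual composition above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the names `shallow` and `resmmnn_forward` are hypothetical, and standard normal values are used as placeholders for all parameters.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def shallow(params, y):
    """One multi-component shallow layer: A relu(W y + b) + c."""
    W, b, A, c = params
    return A @ relu(W @ y + b) + c


def resmmnn_forward(layers, x):
    """Evaluate h_m o (I + h_{m-1}) o ... o (I + h_2) o h_1 at x."""
    if len(layers) < 2:
        raise ValueError("this sketch expects at least two layers")
    y = shallow(layers[0], x)          # h_1: R^{d_in} -> R^r
    for params in layers[1:-1]:        # h_2, ..., h_{m-1}: R^r -> R^r
        y = y + shallow(params, y)     # identity skip connection
    return shallow(layers[-1], y)      # h_m: R^r -> R^{d_out}


rng = np.random.default_rng(0)
d_in, d_out, width, rank, depth = 1, 1, 4, 2, 3     # size (4, 2, 3)
dims = [d_in] + [rank] * (depth - 1) + [d_out]
layers = [
    (rng.standard_normal((width, p)), rng.standard_normal(width),
     rng.standard_normal((q, width)), rng.standard_normal(q))
    for p, q in zip(dims[:-1], dims[1:])            # placeholder parameters
]
out = resmmnn_forward(layers, np.array([0.5]))
print(out.shape)
```

The skip connection is applied only to the inner layers, where input and output dimensions both equal $r$, which is why this form needs $d_{1}=\cdots=d_{m-1}=r$.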

The above definition of ResMMNNs requires $d_{1}=\cdots=d_{m-1}=r$. If this condition does not hold, we can alternatively define ResMMNN via

$$\bm{h}=(\bm{I}\oplus\bm{h}_{m})\circ(\bm{I}\oplus\bm{h}_{m-1})\circ\cdots\circ(\bm{I}\oplus\bm{h}_{3})\circ(\bm{I}\oplus\bm{h}_{2})\circ(\bm{I}\oplus\bm{h}_{1}),$$

where $\oplus$ is an operation defined as follows. For any functions $\bm{f}:\mathbb{R}^{d}\mapsto\mathbb{R}^{d_{\bm{f}}}$ and $\bm{g}:\mathbb{R}^{d}\mapsto\mathbb{R}^{d_{\bm{g}}}$, the $\oplus$ operation is given by

$$\bm{f}\oplus\bm{g}\coloneqq(\widetilde{\bm{f}}+\widetilde{\bm{g}})[1:d_{\bm{g}}],\quad\textnormal{where}\quad\widetilde{\bm{f}}=\begin{bmatrix}\bm{f}\\ \bm{0}\end{bmatrix}\in\mathbb{R}^{\max\{d_{\bm{f}},d_{\bm{g}}\}}\quad\textnormal{and}\quad\widetilde{\bm{g}}=\begin{bmatrix}\bm{g}\\ \bm{0}\end{bmatrix}\in\mathbb{R}^{\max\{d_{\bm{f}},d_{\bm{g}}\}}.$$

It is noteworthy that such an operation is both straightforward and cost-effective to implement. For example, in Python, one can use the following code:

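    y = f(x)                  # value of f at x
    z = g(x)                  # value of g at x
    n = min(len(y), len(z))   # number of overlapping entries
    z[:n] = y[:n] + z[:n]     # elementwise sum; y and z are assumed to be arrays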

After executing this code, z will be the result of the map $\textsf{f}\oplus\textsf{g}$ at x. We remark that the definition of ResMMNN can be generalized to only adding identity maps to certain specific layers, which we still refer to as ResMMNN.
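For readers who prefer a version that does not overwrite its inputs, the $\oplus$ operation can also be wrapped in a small function. The sketch below assumes NumPy arrays; the name `oplus` is hypothetical and not taken from the authors' code.

```python
import numpy as np


def oplus(y, z):
    """Return (f oplus g)(x) given y = f(x) and z = g(x), leaving inputs intact.

    The result has the length of z: overlapping entries are summed, and any
    remaining entries of z are kept unchanged.
    """
    y = np.asarray(y, dtype=float)
    z = np.asarray(z, dtype=float)
    out = z.copy()                  # copy so that z itself is not modified
    n = min(y.size, z.size)         # number of overlapping entries
    out[:n] += y[:n]
    return out


print(oplus([1.0, 2.0, 3.0], [10.0, 20.0]))    # d_f > d_g: last entry of f is dropped
print(oplus([1.0, 2.0], [10.0, 20.0, 30.0]))   # d_f < d_g: f is zero-padded
```

In both cases the output dimension equals $d_{\bm{g}}$, in line with the slice $[1:d_{\bm{g}}]$ in the definition.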

Figure 1: Illustrations of $\sigma$-activated MMNN, FCNN, and ResMMNN.
(a) MMNN of size $(4,2,2)$, i.e., width $4$, rank $2$, and depth $2$.
(b) FCNN of size $(4,2,3)$, i.e., width $4$ and depth $2$.
(c) ResMMNN of size $(4,2,3)$, i.e., width $4$, rank $2$, and depth $3$.

2.3 Learning strategy of MMNNs

Our learning strategy is motivated by the following basic principle: a function can be decomposed into a multi-component and multi-layer structure, each component of which can be approximated and trained effectively using a one-hidden-layer network, i.e., a linear combination of random basis functions (e.g., of the form $\sigma(\bm{W}_{i}\cdot\bm{x}+\bm{b}_{i})$; see Section 3). Hence, optimizing the linear combination weights of the random basis functions, i.e., $\bm{A}_{i}$ and $\bm{c}_{i}$, is both efficient and adequate. On the other hand, optimizing the weights (orientations of the basis functions) $\bm{W}_{i}$ and biases $\bm{b}_{i}$ to make the basis functions more adaptive to fine features of the target function, which would require a single-layer network to capture high-frequency information, leads not only to significantly more parameters to optimize but also to difficulties in training, as shown in [38]. Specifically, for each layer of an MMNN, we initialize the activation function parameters ($\bm{W}_{i}$ and $\bm{b}_{i}$) as per PyTorch’s default setting and keep them fixed during the training process. This entails initializing both weights and biases uniformly from the distribution $\mathcal{U}(-\sqrt{k},\sqrt{k})$, where $k=\frac{1}{\text{in\_features}}$. (Footnote 1: It is noteworthy that this initialization approach is similar to the widely used Xavier initialization [4], which draws weights from the distribution $\mathcal{U}(-\sqrt{k},\sqrt{k})$ with $k=\frac{6}{\text{in\_features}+\text{out\_features}}$ and sets the bias to $\bm{0}$.) The whole training process optimizes all $\bm{A}_{i}$ and $\bm{c}_{i}$ simultaneously using Adam. Note that it is important to have a uniform sampling of orientations $\bm{W}_{i}$ and biases $\bm{b}_{i}$ for the random basis functions to be able to approximate an arbitrary smooth function well. Unless stated otherwise, parameter initialization adheres to the default settings provided by PyTorch in our experiments.
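To make the saving in trainable parameters concrete, the sketch below counts the frozen parameters ($\bm{W}_{i}$, $\bm{b}_{i}$) and the trained ones ($\bm{A}_{i}$, $\bm{c}_{i}$) for an MMNN of size (width, rank, depth). It assumes that every layer has the same width, that all intermediate dimensions equal the rank, and that the input and output are scalar; the function name `mmnn_param_counts` is hypothetical.

```python
def mmnn_param_counts(width, rank, depth, d_in=1, d_out=1):
    """Return (trained, total) parameter counts for an MMNN of size
    (width, rank, depth) when only A_i and c_i are trained."""
    dims = [d_in] + [rank] * (depth - 1) + [d_out]    # d_0, ..., d_m
    fixed = 0
    trained = 0
    for d_prev, d_next in zip(dims[:-1], dims[1:]):
        fixed += width * d_prev + width       # W_i and b_i (frozen)
        trained += d_next * width + d_next    # A_i and c_i (optimized)
    return trained, fixed + trained


print(mmnn_param_counts(400, 20, 6))   # (40501, 83301)
print(mmnn_param_counts(590, 28, 6))   # (83331, 170061)
```

Under these assumptions, the two results coincide with the "trained / all" parameter counts reported in Table 1 for the networks of size (400, 20, 6) and (590, 28, 6).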

To demonstrate the advantages of our training approach (labeled S1), we conduct a comparison with the typical strategy in deep neural networks, denoted as Strategy S2, which uses the default PyTorch initialization and optimizes all parameters during training. In our tests, we select an oscillatory target function $f(x)=\cos(36\pi x^{2})-0.6\cos(12\pi x^{2})$ and use fairly compact networks. The tests are performed on a total of $1000$ uniform samples in $[-1,1]$ with a mini-batch size of $100$ and a learning rate for epoch $k$ set at $0.001\times 0.9^{\lfloor k/400\rfloor}$ for $k=1,2,\dots,20000$, where $\lfloor\cdot\rfloor$ denotes the floor operation. The Adam optimizer [14] is applied throughout the training process.
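The experimental setup above can be sketched in a few lines of plain Python. The snippet below only reproduces the target function, the sample points, and the step-decay learning rate; it does not train a network. Placing the samples on an equally spaced grid that includes both endpoints is an assumption made for illustration, as are all function names.

```python
import math


def target(x: float) -> float:
    """Oscillatory target f(x) = cos(36*pi*x^2) - 0.6*cos(12*pi*x^2)."""
    return math.cos(36 * math.pi * x * x) - 0.6 * math.cos(12 * math.pi * x * x)


def learning_rate(epoch: int) -> float:
    """Step decay 0.001 * 0.9^floor(epoch / 400) for epoch = 1, ..., 20000."""
    return 0.001 * 0.9 ** (epoch // 400)


# 1000 equally spaced samples on [-1, 1] (endpoint handling is an assumption).
num_samples = 1000
xs = [-1.0 + 2.0 * i / (num_samples - 1) for i in range(num_samples)]
ys = [target(x) for x in xs]

batches_per_epoch = num_samples // 100   # mini-batch size 100
print(batches_per_epoch)                 # 10
print(learning_rate(399), learning_rate(400))
```

With this schedule the learning rate is multiplied by $0.9$ fifty times over the full run, so it ends at about $5.2\times 10^{-6}$.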

Table 1: Comparison of test errors averaged over the last 100 epochs.
| network | (width, rank, depth) | #parameters (trained / all) | test error (MSE) | test error (MAX) | training time |
| --- | --- | --- | --- | --- | --- |
| MMNN1 (S1) | (400, 20, 6) | 40501 / 83301 | $2.01\times 10^{-5}$ | $4.36\times 10^{-2}$ | 23.9s / 1000 epochs |
| MMNN1 (S2) | (400, 20, 6) | 83301 / 83301 | $4.26\times 10^{-5}$ | $4.71\times 10^{-2}$ | 30.2s / 1000 epochs |
| MMNN2 (S1) | (590, 28, 6) | 83331 / 170061 | $1.39\times 10^{-5}$ | $2.80\times 10^{-2}$ | 25.2s / 1000 epochs |
Figure 2: Left: target function $f(x)=\cos(36\pi x^{2})-0.6\cos(12\pi x^{2})$. Middle: logarithm of test errors vs. epoch. Right: logarithm of “test-error-aver” vs. epoch, where “test-error-aver” for epoch $k$ is calculated by averaging the errors in epochs $\max\{1,k-100\}$ to $\min\{k+100,\#\text{epochs}\}$.

As illustrated in Table 1 and Figure 2, our learning strategy S1 is significantly more efficient than strategy S2 at comparable accuracy. There are two main advantages of S1. First, S1 requires training only about half the number of parameters compared to S2, which results in time savings. Second, S1 converges more quickly and performs significantly better when the training is not sufficient. We would like to note that in certain specific cases, S2 may outperform S1, particularly when the network size is relatively small and S2 is well-trained. This is expected since S2 trains all parameters, whereas S1 only trains a subset. Based on our experience, S1 is more effective in practice, particularly for sufficiently large networks. Alternatively, one might consider a hybrid learning strategy.
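The "about half" figure can be checked by counting parameters directly. The sketch below assumes that every layer has the form ${\bm{x}}\mapsto{\bm{A}}\,\sigma({\bm{W}}{\bm{x}}+{\bm{b}})+{\bm{c}}$, that hidden layers output `rank` components, and that the network has scalar input and output; the helper name is hypothetical. Under these assumptions the counts agree with the entries of Table 1.

```python
def mmnn_parameter_counts(width: int, rank: int, depth: int,
                          in_dim: int = 1, out_dim: int = 1):
    """Return (trained under S1, all) parameter counts for an MMNN.

    Assumes each layer maps x -> A * sigma(W x + b) + c, with W of shape
    (width, d_in), b of length width, A of shape (d_out, width), and c of
    length d_out; hidden layers output `rank` components.
    """
    dims = [in_dim] + [rank] * (depth - 1) + [out_dim]
    fixed = 0     # W and b: frozen under S1
    trained = 0   # A and c: optimized under S1
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        fixed += width * d_in + width
        trained += d_out * width + d_out
    return trained, trained + fixed


print(mmnn_parameter_counts(400, 20, 6))   # (40501, 83301)
print(mmnn_parameter_counts(590, 28, 6))   # (83331, 170061)
```

For the (400, 20, 6) network, S1 therefore optimizes roughly 49% of all parameters.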

2.4 MMNNs versus FCNNs

Previously in Section 1, we discussed the theoretical differences between MMNNs and FCNNs. Now, let’s explore and compare their numerical performance. To ensure a fair comparison, we will use networks with a similar number of parameters, ensuring that all networks have sufficient parameters to learn the target function effectively. Typically, when training an FCNN, all parameters are optimized. For a thorough comparison, we will employ two learning strategies for MMNNs as detailed in Section 2.3: S1 and S2. S1 involves training approximately half the number of parameters of the MMNN, while S2 involves training all parameters.

We choose a 1D function $f_{1}(x)=\cos(20\pi|x|^{1.4})+0.5\cos(12\pi|x|^{1.6})$ and a 2D function

$$
f_{2}(x_{1},x_{2})=\sum_{i=1}^{2}\sum_{j=1}^{2}a_{ij}\sin(sb_{i}x_{i}+sc_{ij}x_{i}x_{j})\cos(sb_{j}x_{j}+sd_{ij}x_{i}^{2}),
$$

where $s=2$ and

$$
(a_{i,j})=\begin{bmatrix}0.3&0.2\\ 0.2&0.3\end{bmatrix},\qquad(b_{i})=\begin{bmatrix}2\pi\\ 4\pi\end{bmatrix},\qquad(c_{i,j})=\begin{bmatrix}2\pi&4\pi\\ 8\pi&4\pi\end{bmatrix},\quad\textnormal{and}\quad(d_{i,j})=\begin{bmatrix}4\pi&6\pi\\ 8\pi&6\pi\end{bmatrix}.
$$

Refer to Figures 3 and 4 for illustrations of $f_{1}$ and $f_{2}$, respectively.
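For readers who wish to reproduce the two targets, the definitions above translate directly into the following plain-Python sketch. The constant and function names are illustrative choices; the indices $i,j\in\{1,2\}$ of the formula are shifted to $0,1$ in the code.

```python
import math

S = 2.0
A_COEF = [[0.3, 0.2], [0.2, 0.3]]
B_COEF = [2 * math.pi, 4 * math.pi]
C_COEF = [[2 * math.pi, 4 * math.pi], [8 * math.pi, 4 * math.pi]]
D_COEF = [[4 * math.pi, 6 * math.pi], [8 * math.pi, 6 * math.pi]]


def f1(x: float) -> float:
    """1D target cos(20*pi*|x|^1.4) + 0.5*cos(12*pi*|x|^1.6)."""
    return (math.cos(20 * math.pi * abs(x) ** 1.4)
            + 0.5 * math.cos(12 * math.pi * abs(x) ** 1.6))


def f2(x1: float, x2: float) -> float:
    """2D target: double sum over i, j of a_ij * sin(...) * cos(...)."""
    x = (x1, x2)
    total = 0.0
    for i in range(2):
        for j in range(2):
            total += (A_COEF[i][j]
                      * math.sin(S * B_COEF[i] * x[i] + S * C_COEF[i][j] * x[i] * x[j])
                      * math.cos(S * B_COEF[j] * x[j] + S * D_COEF[i][j] * x[i] ** 2))
    return total


print(f1(0.0))        # 1.5
print(f2(0.0, 0.0))   # 0.0
```

Since the coefficients $a_{ij}$ sum to $1$, the values of $f_{2}$ lie in $[-1,1]$.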

Figure 3: $f_{1}$.
Figure 4: $f_{2}$.

We select large network sizes (see Table 2) to ensure that all networks possess sufficient parameters to learn the target functions. (Footnote 2: FCNNs perform poorly if the network size is small. For a fair comparison, we choose relatively large network sizes for FCNNs and MMNNs, where both perform reasonably well.) For training the 1D function, we sample a total of 1000 data points on a uniform grid within $[-1,1]$, using a mini-batch size of 100 and a learning rate of $0.001\times 0.9^{\lfloor k/400\rfloor}$ for epochs $k=1,2,\dots,20000$. For training the 2D function, we sample a total of $600^{2}$ data points on a uniform grid within $[-1,1]^{2}$, using a mini-batch size of 1000 and a learning rate of $0.001\times 0.9^{\lfloor k/16\rfloor}$ for epochs $k=1,2,\dots,800$.

Table 2: Comparison of test errors averaged over the last 100 epochs.
| target function | network | (width, rank, depth) | #parameters (trained / all) | test error (MSE) | test error (MAX) | training time |
| --- | --- | --- | --- | --- | --- | --- |
| $f_{1}$ | MMNN1 (S1) | (388, 18, 6) | 35399 / 73035 | $2.49\times 10^{-6}$ | $9.93\times 10^{-3}$ | 23.3s / 1000 epochs |
| $f_{1}$ | FCNN1-1 | (83, –, 6) | 35110 / 35110 | $2.43\times 10^{-4}$ | $1.87\times 10^{-1}$ | 19.5s / 1000 epochs |
| $f_{1}$ | MMNN1 (S2) | (388, 18, 6) | 73035 / 73035 | $2.05\times 10^{-6}$ | $1.88\times 10^{-2}$ | 27.4s / 1000 epochs |
| $f_{1}$ | FCNN1-2 | (120, –, 6) | 72961 / 72961 | $1.73\times 10^{-4}$ | $1.14\times 10^{-1}$ | 22.3s / 1000 epochs |
| $f_{2}$ | MMNN2 (S1) | (789, 36, 12) | 313630 / 637120 | $4.61\times 10^{-6}$ | $1.55\times 10^{-2}$ | 30.3s / 10 epochs |
| $f_{2}$ | FCNN2-1 | (168, –, 12) | 312985 / 312985 | $2.42\times 10^{-4}$ | $2.75\times 10^{-1}$ | 26.7s / 10 epochs |
| $f_{2}$ | MMNN2 (S2) | (789, 36, 12) | 637120 / 637120 | $6.17\times 10^{-6}$ | $6.05\times 10^{-2}$ | 35.8s / 10 epochs |
| $f_{2}$ | FCNN2-2 | (240, –, 12) | 637201 / 637201 | $3.28\times 10^{-5}$ | $1.39\times 10^{-1}$ | 29.3s / 10 epochs |
Figure 5: Panels: (a) $f_{1}$; (b) $f_{2}$; (c) MMNN1 (S1); (d) FCNN1-1; (e) MMNN1 (S2); (f) FCNN1-2; (g) MMNN2 (S1); (h) FCNN2-1; (i) MMNN2 (S2); (j) FCNN2-2. First row: logarithm of “test-error-aver” vs. epoch, where “test-error-aver” for epoch $k$ is calculated by averaging the errors in epochs $\max\{1,k-100\}$ to $\min\{k+100,\#\text{epochs}\}$. Second row: differences between learned networks and $f_{1}$. Third row: differences between learned networks and $f_{2}$.

As illustrated in Table 2 and Figure 5, MMNNs outperform FCNNs when both have the same depth and a comparable number of parameters, particularly for relatively oscillatory target functions. Moreover, as indicated in Table 2, the training time for MMNN (S1) is similar to that of FCNN, while MMNN (S2) takes a bit more time. We remark that the primary advantage of MMNNs lies in capturing high-frequency components. As we can see from Figure 5, the differences between network approximations and the corresponding target functions show that FCNNs approximate high-frequency parts of the target functions poorly. In contrast, the approximation errors for MMNNs, especially with the S1 learning strategy, are more evenly distributed across the entire domain, indicating their effectiveness in capturing high-frequency components. The Adam optimizer [14] is applied throughout the training process.
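The phrase "comparable number of parameters" can be made explicit with a short count for the FCNNs. The sketch below assumes that "depth" denotes the number of hidden layers of equal width and that every layer carries a bias; this convention and the helper name are our reading of the table rather than a stated definition, and under it the counts agree with the FCNN rows of Table 2.

```python
def fcnn_parameter_count(width: int, depth: int,
                         in_dim: int = 1, out_dim: int = 1) -> int:
    """Weights plus biases of an FCNN with `depth` hidden layers of equal width."""
    dims = [in_dim] + [width] * depth + [out_dim]
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims[:-1], dims[1:]))


print(fcnn_parameter_count(83, 6))              # 35110
print(fcnn_parameter_count(120, 6))             # 72961
print(fcnn_parameter_count(168, 12, in_dim=2))  # 312985
print(fcnn_parameter_count(240, 12, in_dim=2))  # 637201
```

Each FCNN width is thus chosen so that its total count is close to either the trained (S1) or the full (S2) parameter count of the MMNN it is compared with.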

3 Multi-component and multi-layer decomposition

Although a one-hidden-layer neural network is a low-pass filter that cannot represent and learn high-frequency features effectively [38], we use a mathematical construction to show that MMNNs, which are composed of one-hidden-layer neural networks, can overcome this difficulty by decomposing the complexity through components and/or depth. We emphasize that the decomposition is highly non-unique. Our construction is “man-made” and can differ from the one found by a computer through an optimization (learning) process. Our discussion begins with a one-dimensional construction in Section 3.1 and later extends to higher dimensions in Section 3.2.

3.1 One dimensional construction

We begin with a two-component decomposition in 1D as both an illustration and an example in Section 3.1.1. Later in Section 3.1.2, we introduce the general multi-component decomposition. Finally in Section 3.1.3, we use concrete examples for demonstration.

3.1.1 Two-component decomposition

We show a simple “divide and conquer” strategy for an example target function $f(x)=\cos(2n\pi x)$, a high-frequency Fourier mode when $n$ is large. Define

$$
{\bm{f}}_{1}:[-1,1]\mapsto[-1,1]^{2},\qquad{\bm{f}}_{1}=\begin{bmatrix}f_{1,1}\\ f_{1,2}\end{bmatrix},
$$
$$
f_{1,1}(x)={\mathtt{ReLU}}(2x)-1=\begin{cases}-1&\textnormal{for}\ x\in[-1,0),\\ 2x-1&\textnormal{for}\ x\in[0,1],\end{cases}
$$
$$
f_{1,2}(x)=-{\mathtt{ReLU}}(-2x)+1=\begin{cases}2x+1&\textnormal{for}\ x\in[-1,0),\\ 1&\textnormal{for}\ x\in[0,1],\end{cases}
$$

and

$$
f_{2}:(u,v)\in[-1,1]^{2}\mapsto\cos\big(n\pi(u+1)\big)+\cos\big(n\pi(v-1)\big)-1\in{\mathbb{R}}.
$$

Here the constant $-1=-f(0)$ compensates for the inactive component: for every $x$, one of the two $\mathtt{ReLU}$ terms vanishes and the corresponding cosine equals $\cos(0)=1$. Then for any $x\in[-1,1]$ we have

$$
\begin{split}f(x)&=\cos\Big(n\pi\cdot{\mathtt{ReLU}}(2x)\Big)+\cos\Big(-n\pi\cdot{\mathtt{ReLU}}(-2x)\Big)-1\\ &=\cos\Big(n\pi\big(f_{1,1}(x)+1\big)\Big)+\cos\Big(n\pi\big(f_{1,2}(x)-1\big)\Big)-1=f_{2}\circ{\bm{f}}_{1}(x).\end{split}
$$

Through this decomposition and piecewise linear transformation, which can be approximated easily by a single layer of a $\mathtt{ReLU}$ network, one only needs to approximate a function that is smoother than the original $f$: ${\bm{f}}_{1}$ is simple, while $f_{2}$ has half the frequency of the original target function $f$.

We observe that this decomposition approach is universally applicable for any function $f:[-1,1]\mapsto\mathbb{R}$. Specifically, the decomposition is defined as

$$
{\bm{f}}_{1}:[-1,1]\mapsto[-1,1]^{2},\qquad{\bm{f}}_{1}=\begin{bmatrix}f_{1,1}\\ f_{1,2}\end{bmatrix},
$$

where the component functions $f_{1,1}$ and $f_{1,2}$ are defined by

$$
f_{1,1}(x)={\mathtt{ReLU}}(2x)-1=\begin{cases}-1&\textnormal{for}\ x\in[-1,0),\\ 2x-1&\textnormal{for}\ x\in[0,1],\end{cases}
$$

and

$$
f_{1,2}(x)=-{\mathtt{ReLU}}(-2x)+1=\begin{cases}2x+1&\textnormal{for}\ x\in[-1,0),\\ 1&\textnormal{for}\ x\in[0,1].\end{cases}
$$

Moreover,

$$
f_{2}:(u,v)\in[-1,1]^{2}\mapsto f\big(\tfrac{u+1}{2}\big)+f\big(\tfrac{v-1}{2}\big)-f(0)\in{\mathbb{R}}.
$$

Hence, for any $x\in[-1,1]$, we achieve the following reconstruction of $f(x)$:

$$
\begin{split}f(x)&=f\Big(\tfrac{{\mathtt{ReLU}}(2x)}{2}\Big)+f\Big(\tfrac{-{\mathtt{ReLU}}(-2x)}{2}\Big)-f(0)\\ &=f\Big(\tfrac{f_{1,1}(x)+1}{2}\Big)+f\Big(\tfrac{f_{1,2}(x)-1}{2}\Big)-f(0)=f_{2}\circ{\bm{f}}_{1}(x),\end{split}
$$

demonstrating a structured decomposition that allows the function to be expressed through the composition of a smoother function with a piecewise (component-wise) transformation and rescaling.
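The general two-component identity $f=f_{2}\circ{\bm{f}}_{1}$ can be checked numerically. The sketch below builds $f_{2}$ from an arbitrary $f$ exactly as defined above and measures the reconstruction error on a grid; the test function $\cos(2n\pi x)$ with $n=8$ and the grid size are illustrative assumptions.

```python
import math


def relu(t: float) -> float:
    return max(t, 0.0)


def f11(x: float) -> float:
    """First component: ReLU(2x) - 1."""
    return relu(2 * x) - 1


def f12(x: float) -> float:
    """Second component: -ReLU(-2x) + 1."""
    return -relu(-2 * x) + 1


def make_f2(f):
    """Outer function f2(u, v) = f((u + 1)/2) + f((v - 1)/2) - f(0)."""
    return lambda u, v: f((u + 1) / 2) + f((v - 1) / 2) - f(0.0)


n = 8  # hypothetical frequency
f = lambda x: math.cos(2 * n * math.pi * x)
f2 = make_f2(f)

max_err = max(abs(f2(f11(x), f12(x)) - f(x))
              for x in [k / 500 - 1 for k in range(1001)])
print(max_err)  # expected to be at round-off level
```

Each argument of $f$ inside $f_{2}$ ranges over an interval of length $1$ instead of $2$, which is why $f_{2}$ oscillates half as fast in $u$ and $v$ as $f$ does in $x$.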

3.1.2 General multi-component decomposition

Now we propose a general multi-component adaptive decomposition, a “divide and conquer” strategy, that can distribute the complexity of a target function evenly to multiple components.

Given a sequence $x_{0}<x_{1}<\dots<x_{n}$ where the target function is defined on the interval $[x_{0},x_{n}]$, we will demonstrate how our new architecture allows us to partition the complexities of the function $f$ into smaller intervals $[x_{i-1},x_{i}]$. By rescaling each subinterval, one only needs to deal with a much smoother function in each interval. This approach enables us to effectively approximate the target function over the entire interval $[x_{0},x_{n}]$.

Let ${\mathcal{L}}_{i}:[a_{i},b_{i}]\to[x_{i-1},x_{i}]$ be the linear map with

$$
{\mathcal{L}}_{i}(a_{i})=x_{i-1}\quad\textnormal{and}\quad{\mathcal{L}}_{i}(b_{i})=x_{i}. \tag{2}
$$

Define

$$
f_{i}=f\circ{\mathcal{L}}_{i}:[a_{i},b_{i}]\to{\mathbb{R}}. \tag{3}
$$

To decompose the target function into smoother pieces, we define a piecewise linear transformation $\psi_{i}$ using a linear combination of two ReLU functions (or a simple single layer network),

$$
\psi_{i}(x)=s_{i}\cdot{\mathtt{ReLU}}\left(x-x_{i-1}\right)-s_{i}\cdot{\mathtt{ReLU}}\left(x-x_{i}\right)+a_{i}. \tag{4}
$$

Here $s_{i}=\frac{b_{i}-a_{i}}{x_{i}-x_{i-1}}$ is the “slope” of ${\mathcal{L}}_{i}^{-1}$, which is a local rescaling. For example, $f_{i}$ becomes a smoother function than $f$ after stretching $[x_{i-1},x_{i}]$ to a larger domain $[a_{i},b_{i}]$. See an illustration of $\psi_{i}(x)$ in Figure 6.
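As a small worked example of Equation (4), the following sketch implements $\psi_{i}$ with two ReLU terms and evaluates it to the left of, inside, and to the right of the subinterval. The concrete choice $[x_{i-1},x_{i}]=[0.25,0.5]$ and $[a_{i},b_{i}]=[-1,1]$, which gives $s_{i}=8$, is hypothetical.

```python
def relu(t: float) -> float:
    return max(t, 0.0)


def make_psi(x_left: float, x_right: float, a: float, b: float):
    """psi(x) = s*ReLU(x - x_left) - s*ReLU(x - x_right) + a,
    with slope s = (b - a) / (x_right - x_left)."""
    s = (b - a) / (x_right - x_left)
    return lambda x: s * relu(x - x_left) - s * relu(x - x_right) + a


# Hypothetical choice: stretch [0.25, 0.5] onto [-1, 1], so s = 8.
psi = make_psi(0.25, 0.5, -1.0, 1.0)
print(psi(0.0), psi(0.25), psi(0.375), psi(0.5), psi(1.0))
# -1.0 -1.0 0.0 1.0 1.0
```

The output shows the clamping behaviour: $\psi_{i}$ equals $a_{i}$ to the left of the subinterval, $b_{i}$ to the right, and the linear rescaling ${\mathcal{L}}_{i}^{-1}$ in between.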

Figure 6: An illustration of $\psi_{i}(x)$.
Figure 7: (a) Decomposition of target function $f=c+\sum_{i=1}^{n}f_{i}\circ\psi_{i}$: oscillatory $f$ to smooth $f_{i}$'s. (b) Neural network architecture of $h=c+\sum_{i=1}^{n}h_{i}\circ\psi_{i}$ by using $h_{i}\approx f_{i}$. Visual representations of the decompositions of $f$ and $h$ are provided with $c=-\sum_{i=1}^{n-1}f(x_{i})$ being a constant and $s_{i}$ being the slope. Here, the function $f$ is dissected into several simpler functions, labeled as $f_{i}$. Each $f_{i}$ represents a simplified and more manageable segment of $f$, allowing for the straightforward application of subnetwork $h_{i}$ to closely approximate $f_{i}$, even with the use of shallow networks.
Theorem 3.1.

Given x0<x1<<xnsubscript𝑥0subscript𝑥1subscript𝑥𝑛x_{0}<x_{1}<\mathinner{\mkern-0.1mu\cdotp\mkern-0.3mu\cdotp\mkern-0.3mu\cdotp% \mkern-0.1mu}<x_{n}italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT < italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT < start_ATOM ⋅ ⋅ ⋅ end_ATOM < italic_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT, suppose isubscript𝑖{\mathcal{L}}_{i}caligraphic_L start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and ψisubscript𝜓𝑖\psi_{i}italic_ψ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT are given in Equations (2) and (4), respectively. Then the target function f:[x0,xn]:𝑓subscript𝑥0subscript𝑥𝑛f:[x_{0},x_{n}]\to{\mathbb{R}}italic_f : [ italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ] → blackboard_R has the following (smoother) decomposition (fisubscript𝑓𝑖f_{i}italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT) with a piecewise linear transformation (ψisubscript𝜓𝑖\psi_{i}italic_ψ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT),

f(x)=i=1nfiψi(x)i=1n1f(xi)constantfor any x[x0,xn],𝑓𝑥superscriptsubscript𝑖1𝑛subscript𝑓𝑖subscript𝜓𝑖𝑥subscriptsuperscriptsubscript𝑖1𝑛1𝑓subscript𝑥𝑖constantfor any x[x0,xn],\begin{split}f(x)=\sum_{i=1}^{n}f_{i}\circ\psi_{i}(x)-\underbrace{\sum_{i=1}^{% n-1}f(x_{i})}_{\textnormal{constant}}\quad\textnormal{for any $x\in[x_{0},x_{n% }]$,}\end{split}start_ROW start_CELL italic_f ( italic_x ) = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∘ italic_ψ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_x ) - under⏟ start_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n - 1 end_POSTSUPERSCRIPT italic_f ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_ARG start_POSTSUBSCRIPT constant end_POSTSUBSCRIPT for any italic_x ∈ [ italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ] , end_CELL end_ROW

where $f_i$ is given in Equation (3).

Proof of Theorem 3.1.

By the definition of $\psi_i$ in Equation (4), it is easy to check that

\[
\psi_i(x)=\begin{cases}
b_i & \text{if } x>x_i,\\
\mathcal{L}_i^{-1}(x) & \text{if } x\in[x_{i-1},x_i],\\
a_i & \text{if } x<x_{i-1},
\end{cases}
\quad\Longrightarrow\quad
\psi_i(x)=\begin{cases}
b_i & \text{if } i\le j-1,\\
\mathcal{L}_j^{-1}(x) & \text{if } i=j,\\
a_i & \text{if } i\ge j+1,
\end{cases}
\]

for a fixed $j\in\{1,2,\cdots,n\}$ and any $x\in[x_{j-1},x_j]$. It follows that

\[
\begin{split}
\sum_{i=1}^{n}f_i\circ\psi_i(x)
&=\sum_{i=1}^{n}f\circ\mathcal{L}_i\circ\psi_i(x)
=\sum_{i=1}^{j-1}f\circ\mathcal{L}_i\circ\psi_i(x)+f\circ\mathcal{L}_j\circ\psi_j(x)+\sum_{i=j+1}^{n}f\circ\mathcal{L}_i\circ\psi_i(x)\\
&=\sum_{i=1}^{j-1}f\circ\mathcal{L}_i(b_i)+f\circ\mathcal{L}_j\circ\mathcal{L}_j^{-1}(x)+\sum_{i=j+1}^{n}f\circ\mathcal{L}_i(a_i)\\
&=\sum_{i=1}^{j-1}f(x_i)+f(x)+\sum_{i=j+1}^{n}f(x_{i-1})
=f(x)+\underbrace{\sum_{i=1}^{n-1}f(x_i)}_{\text{constant}}.
\end{split}
\]

Rearranging, we obtain

\[
f(x)=\sum_{i=1}^{n}f_i\circ\psi_i(x)-\underbrace{\sum_{i=1}^{n-1}f(x_i)}_{\text{constant}}\quad\text{for any } x\in[x_{j-1},x_j].
\]

Since $j$ is arbitrary, the above equation holds for all $x\in\cup_{j=1}^{n}[x_{j-1},x_j]=[x_0,x_n]$. ∎
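The identity in Theorem 3.1 can be checked numerically. The following is a minimal sketch, assuming $\mathcal{L}_i$ is the linear map from $[a_i,b_i]$ onto $[x_{i-1},x_i]$, $f_i=f\circ\mathcal{L}_i$, and $a_i=-1$, $b_i=1$ for all $i$; the helper names `make_maps` and `decomposition`, the test function, and the breakpoints are illustrative choices, not part of the original text.

```python
import numpy as np

def make_maps(knots, a, b):
    """Return lists of callables (L_i, psi_i) for i = 1..n.

    L_i is the linear map [a, b] -> [x_{i-1}, x_i]; psi_i is its inverse on
    [x_{i-1}, x_i], extended by the constant a on the left and b on the right.
    """
    Ls, psis = [], []
    for lo, hi in zip(knots[:-1], knots[1:]):
        scale = (hi - lo) / (b - a)
        # Default arguments freeze lo, hi, scale for each closure.
        Ls.append(lambda t, lo=lo, scale=scale: lo + scale * (t - a))
        psis.append(lambda x, lo=lo, hi=hi, scale=scale:
                    a + (np.clip(x, lo, hi) - lo) / scale)
    return Ls, psis

def decomposition(f, knots, a=-1.0, b=1.0):
    """Return x -> sum_i f_i(psi_i(x)) - sum_{i=1}^{n-1} f(x_i), with f_i = f o L_i."""
    Ls, psis = make_maps(knots, a, b)
    const = sum(f(x_i) for x_i in knots[1:-1])  # interior breakpoints only
    def g(x):
        total = sum(f(L(psi(x))) for L, psi in zip(Ls, psis))
        return total - const
    return g

# Illustrative target and breakpoints (assumptions made for this check).
f = lambda x: 1.0 / (1000.0 * x**2 + 1.0)
knots = np.array([-1.0, -0.2, 0.0, 0.2, 1.0])
x = np.linspace(-1.0, 1.0, 2001)
max_gap = np.max(np.abs(decomposition(f, knots)(x) - f(x)))
print(f"max |decomposition - f| = {max_gap:.2e}")
```

The reported gap should be at the level of floating-point rounding, since the decomposition is exact rather than approximate.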

For each smoother component $f_i$, one can use a shallow network component $\phi_i$, i.e., a linear combination of random basis functions, to approximate $f_i$ well on $[a_i,b_i]$. Then

\[
f(x)=\sum_{i=1}^{n}f_i\circ\psi_i(x)-\underbrace{\sum_{i=1}^{n-1}f(x_i)}_{\text{constant}}
\approx\sum_{i=1}^{n}\phi_i\circ\psi_i(x)-\underbrace{\sum_{i=1}^{n-1}f(x_i)}_{\text{constant}}\eqqcolon h(x),
\]

where $h(x)$ is a one-hidden-layer neural network approximation of the target function $f(x)$ that can approximate a complex function better than a single layer. See Figure 7 for an illustration. In practice, one can apply the decomposition repeatedly using a multi-component and multi-layer network structure, which is the motivation for MMNN. It is well known that neural networks can approximate smooth functions well. For a localized rapid change or oscillation, our construction shows that a small network, in terms of the width as well as the number of components and layers, can achieve an adaptive decomposition and handle it rather easily. Hence MMNN is effective in approximating a function with localized fine features. This is an important advantage in dealing with low-dimensional structures embedded in high dimensions. The most difficult situation is approximating globally highly oscillatory functions, especially those with diverse frequency modes, for which wider networks with more components and layers are needed to deal with both the complexity and the curse of dimensionality.
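The construction of $h$ can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: each $\phi_i$ is a least-squares fit over random ReLU features (least squares is swapped in for gradient-based training), $\psi_i$ is written as a difference of two ReLUs to show that it is itself a tiny network, and the target, breakpoints, width, and all function names are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def f(x):
    # Hypothetical target with a sharp localized peak.
    return 1.0 / (1000.0 * x**2 + 1.0)

knots = np.array([-1.0, -0.2, 0.0, 0.2, 1.0])   # x_0 < ... < x_n (assumed)
a, b = -1.0, 1.0                                # a_i = a, b_i = b for all i
n = len(knots) - 1

def L(i, t):
    """Linear map [a, b] -> [x_{i-1}, x_i] (i is 1-based)."""
    return knots[i - 1] + (knots[i] - knots[i - 1]) * (t - a) / (b - a)

def psi(i, x):
    """psi_i as a difference of two ReLUs: a on the left, b on the right."""
    slope = (b - a) / (knots[i] - knots[i - 1])
    return a + slope * (relu(x - knots[i - 1]) - relu(x - knots[i]))

def features(t, signs, kinks):
    """Constant column plus random ReLU features relu(+-(t - kink))."""
    return np.column_stack([np.ones_like(t), relu(signs * (t[:, None] - kinks))])

def fit_component(i, width=100, samples=400):
    """Least-squares fit phi_i of f_i = f o L_i on [a, b]."""
    signs = rng.choice([-1.0, 1.0], size=width)
    kinks = rng.uniform(a, b, size=width)
    t = np.linspace(a, b, samples)
    coef, *_ = np.linalg.lstsq(features(t, signs, kinks), f(L(i, t)), rcond=None)
    return lambda s: features(s, signs, kinks) @ coef

phis = [fit_component(i) for i in range(1, n + 1)]
x = np.linspace(-1.0, 1.0, 2001)
const = sum(f(knots[i]) for i in range(1, n))
h = sum(phi(psi(i, x)) for i, phi in enumerate(phis, start=1)) - const

err_h = np.max(np.abs(h - f(x)))
# Error of each phi_i against f_i, measured at the points psi_i(x) actually visited.
comp_errs = [np.max(np.abs(phi(psi(i, x)) - f(L(i, psi(i, x)))))
             for i, phi in enumerate(phis, start=1)]
print(f"max |h - f| = {err_h:.3e}, sum of component errors = {sum(comp_errs):.3e}")
```

Because the decomposition of $f$ is exact, the error of $h$ is bounded by the sum of the component errors $|\phi_i-f_i|$ on $[a_i,b_i]$, so the approximation quality is controlled entirely by how well each smoother component is fitted.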

3.1.3 Examples

Here we use two examples to demonstrate the complexity decomposition strategy presented in the previous section. We start with the Runge function $f(x)=\frac{1}{25x^2+1}$ and modify it to $f(x)=\frac{1}{1000x^2+1}$, which has a localized rapid change near $0$. As an example, we use four components ($n=4$), choose the points $x_0,x_1,x_2,x_3,x_4$ to be $-1,-0.2,0,0.2,1$, and let $a_i=-1$ and $b_i=1$ for all $i$. In practice, each component is approximated by a single-layer network, i.e., a linear combination of basis functions, and trained by an optimization method, e.g., Adam. Our examples here are just a proof of concept for the decomposition of a target function into smoother components using the MMNN structure in the form

\[
f(x)=\sum_{i=1}^{4}f_i\circ\psi_i(x)-\underbrace{\sum_{i=1}^{3}f(x_i)}_{\text{constant}},
\]

where $f_i$ and $\psi_i$ (piecewise transformation/rescaling) are defined as in (3) and (4), respectively. These components are illustrated in Figure 8. Each component is relatively smooth, making it easier to approximate and learn with shallow networks. This approach essentially follows a divide-and-conquer principle.

Figure 8: Illustrations of $f(x)=\frac{1}{1000x^2+1}$ and its multi-component decomposition through $f_i$ and $\psi_i$, $i=1,2,3,4$, where $f(x)=\sum_{i=1}^{4}f_i\circ\psi_i(x)-\sum_{i=1}^{3}f(x_i)$.
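To make "relatively smooth" concrete for this example, one can compare the steepest slope of $f$ with that of each component $f_i=f\circ\mathcal{L}_i$. The sketch below is only illustrative: it assumes $\mathcal{L}_i$ maps $[-1,1]$ linearly onto $[x_{i-1},x_i]$, and it uses finite-difference slopes on a uniform grid as a rough smoothness measure; the helper `max_slope` is a hypothetical name.

```python
import numpy as np

f = lambda x: 1.0 / (1000.0 * x**2 + 1.0)
knots = [-1.0, -0.2, 0.0, 0.2, 1.0]
a, b = -1.0, 1.0

def max_slope(g, lo, hi, num=20001):
    """Largest finite-difference slope of g on a uniform grid over [lo, hi]."""
    t = np.linspace(lo, hi, num)
    return np.max(np.abs(np.gradient(g(t), t)))

slope_f = max_slope(f, knots[0], knots[-1])
slope_components = []
for lo, hi in zip(knots[:-1], knots[1:]):
    # f_i = f o L_i, where L_i maps [a, b] linearly onto [lo, hi].
    f_i = lambda t, lo=lo, hi=hi: f(lo + (hi - lo) * (t - a) / (b - a))
    slope_components.append(max_slope(f_i, a, b))

print(f"max slope of f on [-1, 1]: {slope_f:.3f}")
for i, s in enumerate(slope_components, start=1):
    print(f"max slope of f_{i} on [-1, 1]: {s:.3f}")
```

By the chain rule, the derivative of $f_i$ is that of $f$ scaled by $(x_i-x_{i-1})/(b_i-a_i)$, which is $0.4$ for the two outer pieces and $0.1$ for the two inner pieces here, so the sharp peak near $0$ is stretched out in the components that contain it.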

The second example is a globally oscillatory function of the form

\[
f(x)=\cos^{2}(6\pi x)+\sin(10\pi x^{2}).
\]

Again we illustrate using four components ($n=4$), selecting the points $x_0,x_1,x_2,x_3,x_4$ to be $-1,-0.7,0,0.7,1$ and setting $a_i=-1$ and $b_i=1$ for all $i$. As shown in Figure 9, the target function $f(x)$ is decomposed into components that are less oscillatory, again facilitating their approximation and learning through shallow networks.

Figure 9: Illustrations of $f(x)=\cos^{2}(6\pi x)+\sin(10\pi x^{2})$ and its decomposition components $f_i$ and $\psi_i$ such that $f(x)=\sum_{i=1}^{4}f_i\circ\psi_i(x)-\sum_{i=1}^{3}f(x_i)$.
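One simple way to quantify "less oscillatory" for this example is to count turning points (sign changes of the discrete derivative) of $f$ and of each component $f_i=f\circ\mathcal{L}_i$ on $[-1,1]$. The sketch below assumes $\mathcal{L}_i$ maps $[-1,1]$ linearly onto $[x_{i-1},x_i]$; the counting helper and the grid resolution are illustrative choices.

```python
import numpy as np

f = lambda x: np.cos(6 * np.pi * x) ** 2 + np.sin(10 * np.pi * x**2)
knots = [-1.0, -0.7, 0.0, 0.7, 1.0]
a, b = -1.0, 1.0

def count_turning_points(g, lo, hi, num=20001):
    """Count sign changes of the discrete derivative of g on [lo, hi]."""
    d = np.sign(np.diff(g(np.linspace(lo, hi, num))))
    d = d[d != 0]                       # ignore exactly flat steps
    return int(np.sum(d[1:] != d[:-1]))

total = count_turning_points(f, knots[0], knots[-1])
per_component = []
for lo, hi in zip(knots[:-1], knots[1:]):
    # f_i = f o L_i, where L_i maps [a, b] linearly onto [lo, hi].
    f_i = lambda t, lo=lo, hi=hi: f(lo + (hi - lo) * (t - a) / (b - a))
    per_component.append(count_turning_points(f_i, a, b))

print("turning points of f on [-1, 1]:", total)
print("turning points of f_1, ..., f_4 on [-1, 1]:", per_component)
```

Each component only sees the oscillations of $f$ inside its own subinterval, spread over the full reference interval $[-1,1]$, which is why every $f_i$ has fewer turning points than $f$.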

3.2 High dimensional cases

Let us now consider the extension to multiple dimensions, using two dimensions as an example, since the simple dimension-by-dimension strategy applies to any dimension.

Given $x_0<x_1<\cdots<x_n$ and $y_0<y_1<\cdots<y_m$, we divide the domain of the function $f(x,y)$ into small Cartesian rectangles $[x_{i-1},x_i]\times[y_{j-1},y_j]$. Let $\mathcal{L}_{1,i}:[a_i,b_i]\to[x_{i-1},x_i]$ and $\mathcal{L}_{2,j}:[c_j,d_j]\to[y_{j-1},y_j]$ be the linear maps with

\[
\begin{cases}
\mathcal{L}_{1,i}(a_i)=x_{i-1},\\
\mathcal{L}_{1,i}(b_i)=x_i
\end{cases}
\quad\text{and}\quad
\begin{cases}
\mathcal{L}_{2,j}(c_j)=y_{j-1},\\
\mathcal{L}_{2,j}(d_j)=y_j.
\end{cases}
\tag{5}
\]

For $i=1,2,\cdots,n$ and $j=1,2,\cdots,m$, we define

\[
\left\{\begin{array}{l}
f_{i,0}(x,y)\coloneqq f\Big(\mathcal{L}_{1,i}(x),\,y\Big),\\
f_{0,j}(x,y)\coloneqq f\Big(x,\,\mathcal{L}_{2,j}(y)\Big),\\
f_{i,j}(x,y)\coloneqq f\Big(\mathcal{L}_{1,i}(x),\,\mathcal{L}_{2,j}(y)\Big)=f_{0,j}\Big(\mathcal{L}_{1,i}(x),\,y\Big)=f_{i,0}\Big(x,\,\mathcal{L}_{2,j}(y)\Big).
\end{array}\right.
\tag{6}
\]

It is evident that, with appropriate transformation and rescaling, $f_{i,0}(x,y)$ is smooth in $x$ when $y$ is held constant, $f_{0,j}(x,y)$ is smooth in $y$ when $x$ is fixed, and $f_{i,j}(x,y)$ is smooth in both $x$ and $y$. Define

\[
\psi_i(x)=\begin{cases}
b_i & \text{if } x>x_i,\\
\mathcal{L}_{1,i}^{-1}(x) & \text{if } x\in[x_{i-1},x_i],\\
a_i & \text{if } x<x_{i-1}
\end{cases}
\quad\text{and}\quad
\phi_j(y)=\begin{cases}
d_j & \text{if } y>y_j,\\
\mathcal{L}_{2,j}^{-1}(y) & \text{if } y\in[y_{j-1},y_j],\\
c_j & \text{if } y<y_{j-1}.
\end{cases}
\tag{7}
\]

The following result provides a decomposition of $f$ that fits into the structure of MMNN.

Theorem 3.2.

Given $x_0<x_1<\cdots<x_n$ and $y_0<y_1<\cdots<y_m$, suppose $\mathcal{L}_{1,i},\mathcal{L}_{2,j}$ and $\psi_i,\phi_j$ are given in Equations (5) and (7), respectively. Then the function $f:[x_0,x_n]\times[y_0,y_m]\to\mathbb{R}$ can be expressed as

\[
\begin{split}
f(x,\,y)&=\sum_{i=1}^{n}\sum_{j=1}^{m}f_{i,j}\Big(\psi_i(x),\,\phi_j(y)\Big)-\sum_{i=1}^{n}\sum_{j=1}^{m-1}f_{i,0}\Big(\psi_i(x),\,y_j\Big)\\
&\quad\ -\sum_{i=1}^{n-1}\sum_{j=1}^{m}f_{0,j}\Big(x_i,\,\phi_j(y)\Big)+\sum_{i=1}^{n-1}\sum_{j=1}^{m-1}f(x_i,\,y_j)
\end{split}
\tag{8}
\]

for all $(x,y)\in[x_0,x_n]\times[y_0,y_m]$, where $f_{i,j}$ are given in Equation (6).
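Before going through the proof, Equation (8) can be sanity-checked numerically. The sketch below assumes the reference intervals $[a_i,b_i]$ and $[c_j,d_j]$ are all $[-1,1]$ and uses a hypothetical test function and breakpoints; the helper names `axis_maps` and `rhs_of_eq8` are illustrative, not from the original text.

```python
import numpy as np

def axis_maps(knots, lo_ref, hi_ref):
    """Linear maps [lo_ref, hi_ref] -> [t_{k-1}, t_k] and their clamped inverses."""
    Ls, invs = [], []
    for lo, hi in zip(knots[:-1], knots[1:]):
        scale = (hi - lo) / (hi_ref - lo_ref)
        Ls.append(lambda s, lo=lo, scale=scale: lo + scale * (s - lo_ref))
        invs.append(lambda t, lo=lo, hi=hi, scale=scale:
                    lo_ref + (np.clip(t, lo, hi) - lo) / scale)
    return Ls, invs

def rhs_of_eq8(f, xk, yk, x, y, ref=(-1.0, 1.0)):
    """Evaluate the right-hand side of Equation (8) at the points (x, y)."""
    L1, psi = axis_maps(xk, *ref)   # x-direction: L_{1,i} and psi_i
    L2, phi = axis_maps(yk, *ref)   # y-direction: L_{2,j} and phi_j
    n, m = len(xk) - 1, len(yk) - 1
    X = [L1[i](psi[i](x)) for i in range(n)]   # L_{1,i}(psi_i(x))
    Y = [L2[j](phi[j](y)) for j in range(m)]   # L_{2,j}(phi_j(y))
    term1 = sum(f(Xi, Yj) for Xi in X for Yj in Y)              # f_{i,j}(psi_i(x), phi_j(y))
    term2 = sum(f(Xi, yj) for Xi in X for yj in yk[1:-1])       # f_{i,0}(psi_i(x), y_j)
    term3 = sum(f(xi, Yj) for xi in xk[1:-1] for Yj in Y)       # f_{0,j}(x_i, phi_j(y))
    term4 = sum(f(xi, yj) for xi in xk[1:-1] for yj in yk[1:-1])  # f(x_i, y_j)
    return term1 - term2 - term3 + term4

# Hypothetical test function and breakpoints, chosen only for illustration.
f = lambda x, y: np.sin(3 * x + y**2) + 1.0 / (1.0 + 50.0 * (x - y) ** 2)
xk = np.array([-1.0, -0.2, 0.0, 0.2, 1.0])
yk = np.array([-1.0, 0.0, 0.5, 1.0])
x, y = np.meshgrid(np.linspace(-1, 1, 101), np.linspace(-1, 1, 81))
gap = np.max(np.abs(rhs_of_eq8(f, xk, yk, x, y) - f(x, y)))
print(f"max |RHS of (8) - f| = {gap:.2e}")
```

The gap should again be at the level of floating-point rounding, and the inclusion-exclusion pattern of the four sums is the same one that the proof below makes explicit.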

Proof of Theorem 3.2.

Fixing $(k,\ell)$, for any $(x,y)\in[x_{k-1},x_k]\times[y_{\ell-1},y_\ell]$, we have

\[
\psi_i(x)=\begin{cases}
b_i & \text{if } i\le k-1,\\
\mathcal{L}_{1,k}^{-1}(x) & \text{if } i=k,\\
a_i & \text{if } i\ge k+1
\end{cases}
\quad\text{and}\quad
\phi_j(y)=\begin{cases}
d_j & \text{if } j\le \ell-1,\\
\mathcal{L}_{2,\ell}^{-1}(y) & \text{if } j=\ell,\\
c_j & \text{if } j\ge \ell+1.
\end{cases}
\]

It follows that

\begin{equation*}
\begin{split}
&\phantom{=\;}\sum_{i=1}^{n}f_{i,0}\Big(\psi_{i}(x),\,y\Big)=\sum_{i=1}^{n}f\Big({\mathcal{L}}_{1,i}\circ\psi_{i}(x),\,y\Big)\\
&=\sum_{i=1}^{k-1}f\Big({\mathcal{L}}_{1,i}\circ\psi_{i}(x),\,y\Big)+f\Big({\mathcal{L}}_{1,k}\circ\psi_{k}(x),\,y\Big)+\sum_{i=k+1}^{n}f\Big({\mathcal{L}}_{1,i}\circ\psi_{i}(x),\,y\Big)\\
&=\sum_{i=1}^{k-1}f\Big({\mathcal{L}}_{1,i}(b_{i}),\,y\Big)+f\Big({\mathcal{L}}_{1,k}\circ{\mathcal{L}}_{1,k}^{-1}(x),\,y\Big)+\sum_{i=k+1}^{n}f\Big({\mathcal{L}}_{1,i}(a_{i}),\,y\Big)\\
&=\sum_{i=1}^{k-1}f(x_{i},\,y)+f(x,\,y)+\sum_{i=k+1}^{n}f(x_{i-1},\,y)=f(x,\,y)+\sum_{i=1}^{n-1}f(x_{i},\,y),
\end{split}
\end{equation*}

implying

\[
f(x,\,y)=\sum_{i=1}^{n}f_{i,0}\Big(\psi_{i}(x),\,y\Big)-\sum_{i=1}^{n-1}f(x_{i},\,y).
\]

For each $i$, using the 1D decomposition technique described in Section 3.1, we find the decompositions for $f_{i,0}\big(\psi_{i}(x),\,y\big)$ and $f(x_{i},\,y)$. We have

\begin{equation*}
\begin{split}
&\phantom{=\;}\sum_{j=1}^{m}f_{i,j}\Big(\psi_{i}(x),\,\phi_{j}(y)\Big)=\sum_{j=1}^{m}f_{i,0}\Big(\psi_{i}(x),\,{\mathcal{L}}_{2,j}\circ\phi_{j}(y)\Big)\\
&=\sum_{j=1}^{\ell-1}f_{i,0}\Big(\psi_{i}(x),\,{\mathcal{L}}_{2,j}\circ\phi_{j}(y)\Big)+f_{i,0}\Big(\psi_{i}(x),\,{\mathcal{L}}_{2,\ell}\circ\phi_{\ell}(y)\Big)+\sum_{j=\ell+1}^{m}f_{i,0}\Big(\psi_{i}(x),\,{\mathcal{L}}_{2,j}\circ\phi_{j}(y)\Big)\\
&=\sum_{j=1}^{\ell-1}f_{i,0}\Big(\psi_{i}(x),\,{\mathcal{L}}_{2,j}(d_{j})\Big)+f_{i,0}\Big(\psi_{i}(x),\,{\mathcal{L}}_{2,\ell}\circ{\mathcal{L}}_{2,\ell}^{-1}(y)\Big)+\sum_{j=\ell+1}^{m}f_{i,0}\Big(\psi_{i}(x),\,{\mathcal{L}}_{2,j}(c_{j})\Big)\\
&=\sum_{j=1}^{\ell-1}f_{i,0}\Big(\psi_{i}(x),\,y_{j}\Big)+f_{i,0}\Big(\psi_{i}(x),\,y\Big)+\sum_{j=\ell+1}^{m}f_{i,0}\Big(\psi_{i}(x),\,y_{j-1}\Big)\\
&=f_{i,0}\Big(\psi_{i}(x),\,y\Big)+\sum_{j=1}^{m-1}f_{i,0}\Big(\psi_{i}(x),\,y_{j}\Big),
\end{split}
\end{equation*}

implying

\[
f_{i,0}\Big(\psi_{i}(x),\,y\Big)=\sum_{j=1}^{m}f_{i,j}\Big(\psi_{i}(x),\,\phi_{j}(y)\Big)-\sum_{j=1}^{m-1}f_{i,0}\Big(\psi_{i}(x),\,y_{j}\Big).
\]

Moreover,

\begin{equation*}
\begin{split}
&\phantom{=\;}\sum_{j=1}^{m}f_{0,j}\Big(x_{i},\,\phi_{j}(y)\Big)=\sum_{j=1}^{m}f\Big(x_{i},\,{\mathcal{L}}_{2,j}\circ\phi_{j}(y)\Big)\\
&=\sum_{j=1}^{\ell-1}f\Big(x_{i},\,{\mathcal{L}}_{2,j}\circ\phi_{j}(y)\Big)+f\Big(x_{i},\,{\mathcal{L}}_{2,\ell}\circ\phi_{\ell}(y)\Big)+\sum_{j=\ell+1}^{m}f\Big(x_{i},\,{\mathcal{L}}_{2,j}\circ\phi_{j}(y)\Big)\\
&=\sum_{j=1}^{\ell-1}f\Big(x_{i},\,{\mathcal{L}}_{2,j}(d_{j})\Big)+f\Big(x_{i},\,{\mathcal{L}}_{2,\ell}\circ{\mathcal{L}}_{2,\ell}^{-1}(y)\Big)+\sum_{j=\ell+1}^{m}f\Big(x_{i},\,{\mathcal{L}}_{2,j}(c_{j})\Big)\\
&=\sum_{j=1}^{\ell-1}f(x_{i},\,y_{j})+f(x_{i},\,y)+\sum_{j=\ell+1}^{m}f(x_{i},\,y_{j-1})=f(x_{i},\,y)+\sum_{j=1}^{m-1}f(x_{i},\,y_{j}),
\end{split}
\end{equation*}

implying

\[
f(x_{i},\,y)=\sum_{j=1}^{m}f_{0,j}\Big(x_{i},\,\phi_{j}(y)\Big)-\sum_{j=1}^{m-1}f(x_{i},\,y_{j}).
\]

Therefore, for any $(x,y)\in[x_{k-1},x_{k}]\times[y_{\ell-1},y_{\ell}]$,

\begin{equation*}
\begin{split}
&\phantom{=\;\;}f(x,\,y)=\sum_{i=1}^{n}f_{i,0}\Big(\psi_{i}(x),\,y\Big)-\sum_{i=1}^{n-1}f(x_{i},\,y)\\
&=\sum_{i=1}^{n}\Bigg(\sum_{j=1}^{m}f_{i,j}\Big(\psi_{i}(x),\,\phi_{j}(y)\Big)-\sum_{j=1}^{m-1}f_{i,0}\Big(\psi_{i}(x),\,y_{j}\Big)\Bigg)-\sum_{i=1}^{n-1}\Bigg(\sum_{j=1}^{m}f_{0,j}\Big(x_{i},\,\phi_{j}(y)\Big)-\sum_{j=1}^{m-1}f(x_{i},\,y_{j})\Bigg)\\
&=\sum_{i=1}^{n}\sum_{j=1}^{m}f_{i,j}\Big(\psi_{i}(x),\,\phi_{j}(y)\Big)-\sum_{i=1}^{n}\sum_{j=1}^{m-1}f_{i,0}\Big(\psi_{i}(x),\,y_{j}\Big)-\sum_{i=1}^{n-1}\sum_{j=1}^{m}f_{0,j}\Big(x_{i},\,\phi_{j}(y)\Big)+\sum_{i=1}^{n-1}\sum_{j=1}^{m-1}f(x_{i},\,y_{j}).
\end{split}
\end{equation*}

Since $k$ and $\ell$ are arbitrary, the above equation holds for all $(x,y)\in\cup_{k=1}^{n}\cup_{\ell=1}^{m}[x_{k-1},x_{k}]\times[y_{\ell-1},y_{\ell}]=[x_{0},x_{n}]\times[y_{0},y_{m}]$. ∎
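As a numerical sanity check of the identity just proved, the following sketch evaluates both sides on a small grid partition. It assumes, purely for illustration, that every reference interval is $[a_i,b_i]=[c_j,d_j]=[0,1]$, that ${\mathcal{L}}_{1,i}$ and ${\mathcal{L}}_{2,j}$ are the affine maps from $[0,1]$ onto $[x_{i-1},x_i]$ and $[y_{j-1},y_j]$, and that $f_{i,0}(s,y)=f({\mathcal{L}}_{1,i}(s),y)$, $f_{0,j}(x,t)=f(x,{\mathcal{L}}_{2,j}(t))$, and $f_{i,j}(s,t)=f({\mathcal{L}}_{1,i}(s),{\mathcal{L}}_{2,j}(t))$, consistent with the equalities used in the proof. The test function and the partition are hypothetical choices.

```python
import math
import random

# Hypothetical partition and test function, chosen only for illustration.
xs = [0.0, 0.3, 0.7, 1.0]          # x_0 < x_1 < ... < x_n  (n = 3)
ys = [0.0, 0.25, 0.5, 0.8, 1.0]    # y_0 < y_1 < ... < y_m  (m = 4)
n, m = len(xs) - 1, len(ys) - 1


def f(x, y):
    return math.sin(5.0 * x + 2.0 * y) + x * y * y


def affine(lo, hi):
    """Affine map from the assumed reference interval [0, 1] onto [lo, hi]."""
    return lambda s: lo + (hi - lo) * s


def clamp_inverse(lo, hi):
    """psi_i / phi_j: the inverse affine map inside [lo, hi], constant outside."""
    return lambda t: (min(max(t, lo), hi) - lo) / (hi - lo)


# Index 0 is unused so that indices match the 1-based i and j of the proof.
L1 = [None] + [affine(xs[i - 1], xs[i]) for i in range(1, n + 1)]
L2 = [None] + [affine(ys[j - 1], ys[j]) for j in range(1, m + 1)]
psi = [None] + [clamp_inverse(xs[i - 1], xs[i]) for i in range(1, n + 1)]
phi = [None] + [clamp_inverse(ys[j - 1], ys[j]) for j in range(1, m + 1)]


def decomposition(x, y):
    """Right-hand side of the final identity in the proof."""
    total = 0.0
    for i in range(1, n + 1):
        xi = L1[i](psi[i](x))                      # L_{1,i} o psi_i (x)
        for j in range(1, m + 1):
            total += f(xi, L2[j](phi[j](y)))       # f_{i,j}(psi_i(x), phi_j(y))
        for j in range(1, m):
            total -= f(xi, ys[j])                  # f_{i,0}(psi_i(x), y_j)
    for i in range(1, n):
        for j in range(1, m + 1):
            total -= f(xs[i], L2[j](phi[j](y)))    # f_{0,j}(x_i, phi_j(y))
        for j in range(1, m):
            total += f(xs[i], ys[j])               # f(x_i, y_j)
    return total


rng = random.Random(0)
max_err = max(
    abs(f(x, y) - decomposition(x, y))
    for x, y in ((rng.random(), rng.random()) for _ in range(1000))
)
print(f"max |f - decomposition| over 1000 random points: {max_err:.2e}")
```

The two sides should agree up to floating-point rounding, including at points that lie on the grid lines, where two adjacent cells both apply.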

3.3 Related work

Approximation

Extensive research has examined the approximation capabilities of neural networks, focusing on various architectures to approximate diverse target functions. Early studies concentrated on the universal approximation power of single-hidden-layer networks [3, 10, 11], demonstrating that sufficiently large neural networks can approximate specific functions to arbitrary precision, without quantifying the error relative to network size. Subsequent research, such as [36, 35, 1, 39, 2, 6, 7, 21, 30, 31, 19, 32, 37, 33], analyzed the approximation error for different networks in terms of size, characterized by width, depth, or the number of parameters. Those studies have primarily concentrated on the mathematical theory establishing the existence of such neural networks. However, there has been limited focus on how to determine the parameters of these networks computationally, or on the numerical errors involved, particularly those arising from finite precision in computer simulations. This gap motivated our current investigation, which considers practical training processes and numerical errors. Specifically, the balanced structure of MMNN, the choice of training parameters, and the associated learning strategy discussed here are intended to facilitate a smooth decomposition of the function, thereby promoting an efficient training process.

Low-rank methods

Low-rank structures in the weight matrix ${\bm{W}}$ of a fully connected neural network have been investigated by various groups. For example, the methods proposed in [26, 12, 28] focus on accelerating training and reducing memory requirements while maintaining final performance. The concept of low-rank structures is further extended to tensor train decomposition in [22]. The MMNN proposed here differs in two key aspects. First, each layer contains two matrices: ${\bm{A}}$ outside and ${\bm{W}}$ inside the activation functions. Each row of ${\bm{A}}$ represents the weights for a linear combination of a set of random basis functions, forming a component in each layer. The number of rows in ${\bm{A}}$, which equals the number of components, is selected based on the complexity of the function and is typically much smaller than the number of columns, which corresponds to the number of basis functions. Each row of $({\bm{W}},{\bm{b}})$ represents a random parameterization of a basis function, with the number of rows in ${\bm{W}}$ corresponding to the number of basis functions, usually much larger than the number of columns in ${\bm{W}}$, which is the input dimension. Second, in our MMNN, only ${\bm{A}}$ is trained while ${\bm{W}}$ remains fixed with randomly initialized values. Theoretical studies and numerical experiments demonstrate that the architecture of MMNN, combined with the learning strategy, is effective in approximating complex functions.
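To make the roles of the two matrices concrete, here is a minimal pure-Python sketch of a single layer of the form $\bm{x}\mapsto{\bm{A}}\,\sigma({\bm{W}}\bm{x}+{\bm{b}})+{\bm{c}}$ with ReLU activation, where $({\bm{W}},{\bm{b}})$ are drawn once and kept fixed while $({\bm{A}},{\bm{c}})$ would be the trainable parameters. The function names, the uniform sampling range, and the sizes are illustrative assumptions, not the authors' implementation.

```python
import random


def make_layer(in_dim, width, rank, rng):
    """Build one layer x -> A relu(W x + b) + c (illustrative sketch).

    W (width x in_dim) and b (width) parameterize the random basis
    functions and stay fixed; A (rank x width) and c (rank) are the
    parameters that would be trained.
    """
    def uniform():
        # Assumed sampling range, standing in for the actual initialization.
        return rng.uniform(-1.0, 1.0)

    return {
        "W": [[uniform() for _ in range(in_dim)] for _ in range(width)],
        "b": [uniform() for _ in range(width)],
        "A": [[uniform() for _ in range(width)] for _ in range(rank)],
        "c": [uniform() for _ in range(rank)],
    }


def forward(layer, x):
    """Evaluate A relu(W x + b) + c for one input vector x."""
    # Each row of (W, b) gives one random basis function relu(w . x + b).
    basis = [
        max(0.0, sum(w_k * x_k for w_k, x_k in zip(w_row, x)) + b_i)
        for w_row, b_i in zip(layer["W"], layer["b"])
    ]
    # Each row of A combines all basis functions into one component.
    return [
        sum(a_k * phi_k for a_k, phi_k in zip(a_row, basis)) + c_i
        for a_row, c_i in zip(layer["A"], layer["c"])
    ]


rng = random.Random(0)
layer = make_layer(in_dim=1, width=16, rank=4, rng=rng)
components = forward(layer, [0.5])
print(len(layer["W"]), "basis functions ->", len(components), "components")
```

In this sketch ${\bm{A}}$ has 4 rows and 16 columns, while ${\bm{W}}$ has 16 rows and a single column, matching the shape relations described above.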

Random features

Fixing $({\bm{W}},{\bm{b}})$ in each layer and using random basis functions in MMNNs is inspired by an earlier approach known as random features [24, 34, 16, 27, 23]. In typical random feature methods, only the linear combination parameters at the output layer are trained, which also leads to ill-conditioning of the representation. In MMNNs, by contrast, the matrix ${\bm{A}}$ and vector ${\bm{c}}$ of each layer are trained. Our MMNN employs a composition architecture and learning mechanism that enhances the approximation capabilities compared to random feature methods while achieving a more effective training process than a standard fully connected network of equivalent size. Extensive experiments demonstrate that our approach can strike a satisfactory balance between approximation accuracy and training cost.

Kolmogorov-Arnold (KA) representation

The KA representation theorem [15] states that any multivariate continuous function on a hypercube can be expressed as a finite composition of continuous univariate functions and the binary operation of addition. However, this elegant mathematical representation may result in non-smooth or even fractal univariate functions in general due to this very specific form of representation, a computational challenge one has to address in practice. KA representation has been explored in several studies [17, 20, 32, 13]. A recently proposed network known as the KA network (KAN) utilizes spline functions to approximate the univariate functions in the KA representation. The proposed MMNN is motivated by a multi-component and multi-layer smooth decomposition, or a “divide and conquer” approach, employing distinct network architectures, activation functions, and training strategies.
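For reference, one common statement of the KA representation for a continuous function $f:[0,1]^{d}\to\mathbb{R}$ is sketched below. The symbols $\Phi_{q}$ and $\varphi_{q,p}$ are notation introduced here for illustration and are not used elsewhere in this paper.

\[
f(x_{1},\dots,x_{d})=\sum_{q=0}^{2d}\Phi_{q}\Big(\sum_{p=1}^{d}\varphi_{q,p}(x_{p})\Big),
\]

where each outer function $\Phi_{q}:\mathbb{R}\to\mathbb{R}$ and each inner function $\varphi_{q,p}:[0,1]\to\mathbb{R}$ is a continuous univariate function. The theorem guarantees only continuity of these functions, which is why they may be non-smooth in general.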

4 Numerical experiments

In this section, we perform extensive experiments to validate our analysis and demonstrate the effectiveness of MMNNs through the multi-component and multi-layer decomposition studied in Section 3. In particular, our tests show their ability in 1) adaptively capturing localized high-frequency features in Section 4.1, 2) approximating highly oscillatory functions in Section 4.2, and 3) extending to higher dimensions in Section 4.3, as well as some interesting learning dynamics in Section 4.4. All our experiments involve target functions that include high-frequency components in various ways and are difficult to handle by shallow networks (no matter how wide), as shown in our previous work [38]. Moreover, our experience with these tests shows that a fully connected deep neural network would require many more parameters and is much harder (if at all possible) to train to a comparable result. This advantage is mainly due to the balanced and structured network design of MMNN in terms of 1) the network width $w$, which is the number of hidden neurons or random basis functions in each component, 2) the rank $r$, which is the number of components in each layer, and 3) the network depth $l$, which is the number of layers in the network. The use of a controllable number of collective components (through ${\bm{A}}$) in each layer instead of a large number of individual neurons, together with the use of fixed and randomly chosen weights $({\bm{W}},{\bm{b}})$, makes the training process more effective.

In all tests, 1) data are sampled densely enough to resolve fine features in the target function, 2) the Adam optimizer is used in training, 3) mean squared error (MSE) is the loss function, 4) all parameters are initialized according to the PyTorch default initialization (see Section 2.3) unless otherwise specified, 5) the ${\bm{W}}$'s and ${\bm{b}}$'s (the parameters inside the activation functions, see Section 2.2) are fixed and only the ${\bm{A}}$'s and ${\bm{c}}$'s (the parameters outside the activation functions) are trained, and 6) computations are conducted on an NVIDIA RTX 3500 Ada Generation Laptop GPU (power cap 130W), with most experiments taking from a few dozen to several thousand seconds. All our MMNN setups are specified by three parameters $(w,r,l)$, which depend on the function complexity. Another tuning parameter is the learning rate, whose choice is guided by the following criteria: 1) it should not be too large initially, for stability, and 2) it should decrease with iterations so that it becomes small near the equilibrium to achieve good accuracy, while not decreasing so fast (especially during a long training process for more difficult target functions) that the training stalls.
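To illustrate how the triple $(w,r,l)$ and the fixed/trained split determine the number of parameters, the following sketch counts them under the assumption that an MMNN of size $(w,r,l)$ composes $l$ layers of the form ${\bm{A}}\,\sigma({\bm{W}}\bm{x}+{\bm{b}})+{\bm{c}}$, with every hidden layer producing $r$ components and the last layer producing the output. The function name and this layer-size convention are assumptions made here for illustration.

```python
def count_mmnn_parameters(w, r, l, in_dim=1, out_dim=1):
    """Count fixed and trainable parameters of an MMNN of size (w, r, l).

    Assumed layer sizes: in_dim -> r -> ... -> r -> out_dim over l layers,
    each layer using w random basis functions.
    """
    dims = [in_dim] + [r] * (l - 1) + [out_dim]
    fixed = trainable = 0
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        fixed += w * d_in + w           # W (w x d_in) and b (w): not trained
        trainable += d_out * w + d_out  # A (d_out x w) and c (d_out): trained
    return fixed, trainable


fixed, trainable = count_mmnn_parameters(16, 4, 3)
print(f"(16, 4, 3): fixed = {fixed}, trainable = {trainable}")
```

Under these assumptions, a network of size $(16,4,3)$ with scalar input and output has 153 trainable and 192 fixed parameters, so fewer than half of the parameters are updated during training.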

4.1 Localized rapid changes

We begin with two examples in 1D. The first is $f(x)=\arctan(100x+20)$, which is smooth but features a rapid transition near $x=-0.2$, where the argument $100x+20$ changes sign. As demonstrated in our previous work [38], a shallow network struggles to capture even such a simple local fast transition, which contains high frequencies. We show that this function can be approximated easily by a composition of a smooth function on top of a (repeated) spatial decomposition and local rescaling using the MMNN structure in Section 2.2. Our test indeed verifies that the new architecture can capture a localized fast transition rather easily using a very small network of size $(16,4,3)$, as shown in Figure 10. For this test, a total of 1000 data points are uniformly sampled in the range $[-1,1]$, with a mini-batch size of 100, a learning rate of $10^{-3}$, and the number of epochs set to 2000. Figure 11 gives the error plot.

Figure 10: Illustrations of the training process at (a) epoch 100, (b) epoch 200, and (c) epoch 2000.
Figure 11: Training and test errors (MSE).

Next, we consider a more complicated target function, $f(x)=\mathds{1}_{\{|x+0.2|<0.02\}}\cdot\sin(50\pi x)$, which represents a localized fast oscillation. For this example, we conduct two tests. The first one shows the flexibility of MMNN to automatically adapt to local features. The network has the same small size $(16,4,3)$ as above, so each layer has a network width of 16. In other words, each component is a linear combination of 16 ReLU functions, which on its own cannot approximate such a target function well. However, with a multi-layer and multi-component decomposition whose parameters are appropriately trained by Adam, MMNN can adapt to the behavior of the target function, as shown in Figure 12. Figure 13 gives the error plot. The test also shows that this example is more difficult to train. For this test, there are a total of 1000 uniformly sampled points in $[-1,1]$ with a mini-batch size of 100 and a learning rate of $0.002\times 0.95^{\lfloor k/1000\rfloor}$, where $\lfloor\cdot\rfloor$ denotes the floor operation and $k=1,2,\dots,20000$ is the epoch number. It should be noted that in this test, we initialize the biases $\bm{b}$ to $\bm{0}$ and use the PyTorch default initialization method for the weights $\bm{W}$. This approach, inspired by Xavier initialization, is chosen because the target function is locally oscillatory and the MMNN size is quite small, necessitating a setup adapted to the target function to facilitate the training. For other experiments, both the biases and weights use the PyTorch default initialization.

In the second test, we compare with the least-squares approximation using a uniform finite element method (FEM) basis with the same number of degrees of freedom. As shown in Figure 14, MMNN renders a better approximation due to automatic adaptation through the training process. We remark that an extremely compact MMNN has little flexibility, which makes the training more subtle, so the training hyperparameters, such as the learning rate and mini-batch size, need to be tuned more carefully. When there is some redundancy, i.e., an MMNN with a slightly larger size, the network becomes more flexible and the training process becomes easier. On the other hand, when the network becomes too large, the large number of parameters and the over-redundancy can lead to difficulties for optimization. This shows that there is a trade-off between representation and optimization that one needs to balance in practice.
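The step-decay learning rate used in this test, $0.002\times 0.95^{\lfloor k/1000\rfloor}$, can be sketched with a small helper. This is only an illustration of the formula: the function name `step_decay_lr` and its argument names are hypothetical and chosen here for readability.

```python
def step_decay_lr(base_lr: float, gamma: float, period: int, epoch: int) -> float:
    """Return base_lr * gamma ** floor(epoch / period)."""
    # Integer floor division implements the floor operation for epoch >= 0.
    return base_lr * gamma ** (epoch // period)


# Schedule of the localized-oscillation test: 0.002 * 0.95 ** floor(k / 1000).
for k in (1, 999, 1000, 20000):
    print(k, step_decay_lr(0.002, 0.95, 1000, k))
```

The same helper covers the other schedules in this section by changing the base rate, the decay factor, and the period.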

Figure 12: Illustrations of the training process at (a) epoch 500, (b) epoch 1000, (c) epoch 2000, and (d) epoch 20000.
Figure 13: Training and test errors measured in MSE vs. epoch.
Figure 14: (a) Least squares using 153 equally spaced FEM basis functions. (b) MMNN with $(16+1)\times 4\times(3-1)+(16+1)=153$ free parameters.

Now we consider a 2D example, shown in Figure 15 and defined in polar coordinates by

$$f(r,\theta)=\begin{cases}0 & \text{if } 0.5+25\rho-25r\leq 0,\\ 1 & \text{if } 0.5+25\rho-25r\geq 1,\\ 0.5+25\rho-25r & \text{otherwise},\end{cases}\quad\text{where}\quad \rho=0.1+0.02\cos(8\pi\theta).$$

Again, a rather compact MMNN of size $(100,10,6)$ can produce a good approximation. Figure 17 shows the error during the training process and Figure 16 shows the $\log$ plot of training and testing errors in MSE. For this test, there are a total of $400^2$ uniformly sampled points in $[-1,1]^2$ with a mini-batch size of $1000$ and a learning rate of $10^{-3}\times 0.9^{\lfloor k/25\rfloor}$, where $k=1,2,\dots,1000$ is the epoch number. We compare the result with piecewise linear interpolation and least-squares approximation using an FEM basis on a uniform grid with the same number of degrees of freedom in Figure 18. As observed before, MMNN renders the best result due to its adaptivity through training.
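For readers who want to reproduce the target, the piecewise definition above is a linear ramp clamped to $[0,1]$. The following is a minimal sketch, assuming $\theta=\operatorname{atan2}(y,x)\in(-\pi,\pi]$ (the branch of the angle is an assumption, and it matters because $\cos(8\pi\theta)$ is not $2\pi$-periodic in $\theta$); the name `target_2d` is hypothetical.

```python
import math


def target_2d(x: float, y: float) -> float:
    """Clamped radial ramp f(r, theta) evaluated at Cartesian (x, y)."""
    r = math.hypot(x, y)
    theta = math.atan2(y, x)  # assumption: theta taken in (-pi, pi]
    rho = 0.1 + 0.02 * math.cos(8.0 * math.pi * theta)
    # Clamping to [0, 1] reproduces the three cases of the definition.
    return min(1.0, max(0.0, 0.5 + 25.0 * rho - 25.0 * r))


print(target_2d(0.0, 0.0))   # point well inside the shape
print(target_2d(1.0, 0.0))   # point well outside the shape
```

The transition layer has thickness $1/25=0.04$ in the radial direction, which is the localized rapid change the network must resolve.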

Figure 15: Target function.
Figure 16: Training and test errors measured in MSE.
Figure 17: The error during the training process at (a) epoch 10, (b) epoch 50, (c) epoch 100, and (d) epoch 1000.
Figure 18: Comparison among different approximations using MMNN, interpolation, and least-squares FEM: (a) network, (b) interpolation, (c) FEM, (d) network (difference), (e) interpolation (difference), (f) FEM (difference). The interpolation and FEM are both based on a $72\times 72=5184$ uniform grid. MMNN has $(100+1)\times 10\times(6-1)+(100+1)=5151$ free parameters. The maximum error is approximately $0.05$ for MMNN, $0.31$ for interpolation, and $0.38$ for FEM. The corresponding MSE errors are $0.85\times 10^{-6}$, $1.95\times 10^{-4}$, and $1.45\times 10^{-4}$, respectively.
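The free-parameter counts quoted in the captions of Figures 14 and 18 follow the pattern $(w+1)\times r\times(l-1)+(w+1)$. The sketch below evaluates it; the interpretation in the comments (each of the first $l-1$ layers trains an $r\times w$ matrix $\bm{A}$ and an $r$-vector $\bm{c}$, and the last layer has a scalar output) is our reading of the formula, and `mmnn_free_params` is a hypothetical helper name.

```python
def mmnn_free_params(width: int, rank: int, depth: int) -> int:
    """Number of trained parameters (A's and c's) for a scalar-output MMNN."""
    # Assumed breakdown: (width + 1) * rank per layer for the first depth - 1
    # layers, plus (width + 1) for the final layer with a single output.
    return (width + 1) * rank * (depth - 1) + (width + 1)


print(mmnn_free_params(16, 4, 3))    # 153
print(mmnn_free_params(100, 10, 6))  # 5151
```

Because the $\bm{W}$'s and $\bm{b}$'s are fixed, the count does not depend on the input dimension.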

4.2 Highly oscillatory functions

Globally oscillatory functions with significant high-frequency components cannot be approximated well by a shallow network when a global bounded activation function of the form $\sigma(\bm{W}\cdot\bm{x}-\bm{b})$, such as ReLU, is used. Due to the near orthogonality or high decorrelation (in terms of the inner product) between $\sigma(\bm{W}\cdot\bm{x}-\bm{b})$ and oscillatory functions, which holds with high likelihood over a random choice of $(\bm{W},\bm{b})$, the set of parameters that can render a good approximation, namely the Rashomon set [29], becomes smaller and smaller (in terms of relative measure), and hence harder and harder to find, as the target function becomes more oscillatory (see [38]). Although this difficulty can be alleviated by complexity decomposition using MMNN as shown in Section 2, it still requires a larger network in terms of width, rank, and depth, as well as more training. Here we limit our tests to oscillatory functions in 1D and 2D due to the dramatic increase of complexity with dimension, i.e., the curse of dimensionality.

We again start with a 1D example, $f(x)=\sin(50\pi x)$, $x\in[-1,1]$. An MMNN of size $(800,40,15)$ produces a good approximation of this highly oscillatory function, as illustrated by the error plot in Figure 20, with a smaller learning rate and a longer training process compared to the previous examples with localized fine features. Due to the significant depth, we use ResMMNN as discussed in Section 2.2. For this test, a total of $1000$ uniformly sampled points in $[-1,1]$ are used with a mini-batch size of $100$ and a learning rate of $10^{-4}\times 0.9^{\lfloor k/800\rfloor}$, where $k=1,2,\dots,40000$ is the epoch number. An interesting learning dynamics for Adam is also observed in Figure 19. In the beginning, nothing seems to happen until about epoch 3600, when learning starts from the boundary. Then more and more features are captured gradually from the boundary to the interior. Eventually, all features are captured and then fine-tuned together to improve the overall approximation.
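As a quick sanity check of the sampling density, one can count how many samples fall in each oscillation period. The sketch below is plain arithmetic under the assumption of evenly spaced samples; `samples_per_period` is a hypothetical helper name.

```python
def samples_per_period(n_samples: int, domain_length: float, period: float) -> float:
    """Average number of evenly spaced samples per oscillation period."""
    return n_samples * period / domain_length


# sin(50*pi*x) has period 2/50 = 0.04, i.e. 50 periods on [-1, 1].
print(samples_per_period(1000, 2.0, 2.0 / 50.0))
```

With 1000 points this gives about 20 samples per period, which is consistent with the requirement that the data resolve the fine features of the target function.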

Figure 19: Illustrations of the training process at (a) epoch 3600, (b) epoch 3800, (c) epoch 4200, (d) epoch 5000, (e) epoch 10000, and (f) epoch 40000.
Figure 20: Illustrations of training and test errors measured in MSE.

Next, we consider a two-dimensional target function of the following form:

$$f_s(x_1,x_2)=\sum_{i=1}^{2}\sum_{j=1}^{2}a_{ij}\sin(s b_i x_i+s c_{ij}x_i x_j)\cos(s b_j x_j+s d_{ij}x_i^2),$$

where

$$(a_{ij})=\begin{bmatrix}0.3&0.2\\ 0.2&0.3\end{bmatrix},\qquad (b_i)=\begin{bmatrix}2\pi\\ 4\pi\end{bmatrix},\qquad (c_{ij})=\begin{bmatrix}2\pi&4\pi\\ 8\pi&4\pi\end{bmatrix},\quad\text{and}\quad (d_{ij})=\begin{bmatrix}4\pi&6\pi\\ 8\pi&6\pi\end{bmatrix}.$$

In our test, we choose $s=3$ to ensure the function exhibits significant oscillations and contains diverse Fourier modes, as illustrated by Figure 21. Given the complexity of the function, we employ an MMNN of size $(600,30,15)$. Again, ResMMNN is used due to the depth. For this test, a total of $400^2$ data points are sampled on a uniform grid in $[-1,1]^2$ with a mini-batch size of $1000$ and a learning rate of $10^{-3}\times 0.9^{\lfloor k/40\rfloor}$, where $k=1,2,\dots,2000$ is the epoch number. The training process is illustrated by Figure 23. Figure 22 shows the $\log$-error plot.
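The target function $f_s$ can be evaluated directly from the double sum and the coefficient arrays above. The following is a minimal sketch with zero-based indices; the names `A`, `B`, `C`, `D`, and `f_s` are chosen here for illustration only.

```python
import math

PI = math.pi
A = [[0.3, 0.2], [0.2, 0.3]]                  # a_ij
B = [2 * PI, 4 * PI]                          # b_i
C = [[2 * PI, 4 * PI], [8 * PI, 4 * PI]]      # c_ij
D = [[4 * PI, 6 * PI], [8 * PI, 6 * PI]]      # d_ij


def f_s(x1: float, x2: float, s: float = 3.0) -> float:
    """Evaluate the double sum defining f_s at (x1, x2)."""
    x = (x1, x2)
    total = 0.0
    for i in range(2):
        for j in range(2):
            phase_sin = s * B[i] * x[i] + s * C[i][j] * x[i] * x[j]
            phase_cos = s * B[j] * x[j] + s * D[i][j] * x[i] ** 2
            total += A[i][j] * math.sin(phase_sin) * math.cos(phase_cos)
    return total


print(f_s(0.25, -0.5))
```

Since the entries of $(a_{ij})$ sum to $1$, the function values always lie in $[-1,1]$.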

Figure 21: Illustrations of the target function.
Figure 22: Training and test errors measured in MSE.
Figure 23: Top row, (a)-(d): the learned neural network at epochs 25, 50, 1000, and 2000. Bottom row, (e)-(h): the differences between the learned neural network and the target function at the same epochs.

We also approximate the same function with identical network settings, except that the domain of interest is restricted to the unit disk. We sample $452^2$ data points uniformly distributed over $[-1,1]^2$ and retain only those that fall within the unit disk, totaling $159692$ $(\approx 400^2)$ samples. As illustrated in Figure 24, our network successfully learns the target function in the disk with no adjustments or modifications. This test highlights the network's flexibility with respect to domain geometry, an advantage over traditional mesh or grid-based methods, especially in higher dimensions.
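The sampling step for the disk can be sketched as a simple filter. This is only an illustration: it assumes an evenly spaced grid that includes the endpoints $\pm 1$, which may differ from the exact sampling convention used in our experiment, so the resulting count need not match the number reported above. The name `grid_points_in_unit_disk` is hypothetical.

```python
def grid_points_in_unit_disk(n: int) -> list:
    """Build an n x n uniform grid on [-1, 1]^2 and keep points with x^2 + y^2 <= 1."""
    # Assumption: evenly spaced coordinates including both endpoints.
    coords = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    return [(x, y) for x in coords for y in coords if x * x + y * y <= 1.0]


points = grid_points_in_unit_disk(452)
print(len(points), len(points) / 452 ** 2)  # retained count and retained fraction
```

The retained fraction is close to the area ratio $\pi/4$, which is why $452^2$ initial points leave roughly $400^2$ samples.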

Figure 24: Approximation in a unit disk. (a) Approximation by MMNN. (b) Difference. (c) Errors (in MSE) vs. epoch.

4.3 Tests in three dimensions and higher

In this section, we test a few examples in three and four dimensions. Even sampling an interesting function becomes challenging as the dimension increases. Although our examples are limited by the computational power of a laptop, our tests show that MMNN performs well and is more effective than a fully connected network.

The first example is a 3D function, a level set of which is shown in Figure 25. Using spherical coordinates $(r,\theta,\phi)$ with $\theta\in[0,\pi]$ and $\phi\in[0,2\pi)$, the target function $f(x,y,z)$ is defined as:

$$f(r,\theta,\phi)=\begin{cases}0 & \text{if } 0.5+5\rho-5r\leq 0,\\ 1 & \text{if } 0.5+5\rho-5r\geq 1,\\ 0.5+5\rho-5r & \text{otherwise},\end{cases}$$

where

$$\rho=\rho(\theta,\phi)=0.5+0.2\sin(6\theta)\cos(6\phi)\sin^{2}(\theta).$$

Figure 25: Surface plot of the level set $f(r,\theta,\phi)=0.5$.
Figure 26: Surface plot of the level set $h(r,\theta,\phi)=0.5$.
Figure 27: Training and test errors (MSE) vs. epoch.

Our MMNN is of a compact size $(600,20,8)$. For this test, a total of $111^3$ data points are sampled on a uniform grid in $[-1,1]^3$ with a mini-batch size of $999$ and a learning rate of $0.0005\times 0.9^{\lfloor k/6\rfloor}$ for epochs $k=1,2,\dots,300$. Figure 27 gives the error plot. As shown in Figures 25 and 26, the level sets corresponding to the target function $f$ and the learned MMNN approximation $h$ are nearly identical. To visually demonstrate the quality of the approximation and the complex structure of the 3D function, we present several slices of the target function and the MMNN approximation, obtained by fixing either $x$, $y$, or $z$, in Figure 28.

Figure 28: Slices of the true function $f(x,y,z)$ vs. those of the MMNN approximation $h(x,y,z)$ at (a) $z=0$, (b) $z=0.1$, (c) $y=0.2$, (d) $y=0.3$, (e) $x=0.4$, and (f) $x=0.5$.

Next, we consider the probability density function (PDF) of a Gaussian (normal) distribution in 4D,

$$f(\bm{x})=f(x_1,\dots,x_4)=\frac{\exp\left(-\frac{1}{2}(\bm{x}-\bm{\mu})^{\mathsf{T}}\bm{\Sigma}^{-1}(\bm{x}-\bm{\mu})\right)}{\sqrt{(2\pi)^{k}\det(\bm{\Sigma})}},$$

where $\bm{\Sigma}$ is the covariance matrix and $k=4$ is the dimension. We set $\bm{\mu}=\bm{0}$ and $\bm{\Sigma}^{-1}=20\left[\begin{smallmatrix}1.0&0.9&0.8&0.7\\ 0.9&2.0&1.9&1.8\\ 0.8&1.9&3.0&2.9\\ 0.7&1.8&2.9&4.0\end{smallmatrix}\right]$. We remark that the eigenvalues of $\bm{\Sigma}^{-1}$ are $6.82$, $9.93$, $25.28$, and $158.05$, which means that the distribution is quite anisotropic and concentrated near the center.
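Because the experiment specifies $\bm{\Sigma}^{-1}$ rather than $\bm{\Sigma}$, the density can be evaluated without inverting a matrix by using $\det(\bm{\Sigma})=1/\det(\bm{\Sigma}^{-1})$. The sketch below does this in plain Python; the names `PRECISION`, `determinant`, and `gaussian_pdf` are hypothetical and chosen for illustration.

```python
import math

# Precision matrix Sigma^{-1} = 20 * M, with M as given in the text.
PRECISION = [[20.0 * v for v in row] for row in (
    (1.0, 0.9, 0.8, 0.7),
    (0.9, 2.0, 1.9, 1.8),
    (0.8, 1.9, 3.0, 2.9),
    (0.7, 1.8, 2.9, 4.0),
)]


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [list(row) for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(a[row][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
    return det


def gaussian_pdf(x, precision=PRECISION):
    """Zero-mean Gaussian density written in terms of the precision matrix."""
    k = len(x)
    quad = sum(x[i] * precision[i][j] * x[j] for i in range(k) for j in range(k))
    # 1 / sqrt((2*pi)^k det(Sigma)) = sqrt(det(Sigma^{-1})) / (2*pi)^(k/2)
    norm = math.sqrt(determinant(precision)) / (2.0 * math.pi) ** (k / 2.0)
    return norm * math.exp(-0.5 * quad)


print(gaussian_pdf([0.0, 0.0, 0.0, 0.0]), gaussian_pdf([0.2, 0.2, 0.2, 0.2]))
```

Comparing the two printed values gives a feel for how quickly the density decays away from the center.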

A compact MMNN of size $(500,12,6)$ produces a good approximation, as shown in the error plot in Figure 30. Figure 29 compares the true function $f(x,y,z,u)$ and the MMNN approximation $h(x,y,z,u)$ with $z=u=0.2$. For this test, a total of $35^4$ data points are sampled on a uniform grid in $[-1,1]^4$ with a mini-batch size of $35^2$ and a learning rate of $10^{-3}\times 0.9^{\lfloor k/6\rfloor}$ for epochs $k=1,2,\dots,300$.

Figure 29: True function $f(x,y,z,u)$ versus the learned network $h(x,y,z,u)$ with $z=u=0.2$.
Figure 30: Training and test errors (MSE) vs. epoch.

4.4 Learning dynamics

In this section, we show some interesting learning dynamics observed during the training process. As the first example in Section 4.2 and the following examples show, the training process does not just learn low frequencies first but can also learn feature by feature, i.e., the learning can be localized in both the frequency domain and the spatial domain. We believe this is due to the combination of MMNN's "divide and conquer" ability and the Adam optimizer, which utilizes momentum. A better understanding is needed and will be pursued in our future research.

We again start with a 1D example, $f(x)=\sin\big(36\pi|x|^{1.5}\big)$, $x\in[-1,1]$. An MMNN of size $(600,30,8)$ produces a good approximation of this highly oscillatory function, as illustrated by the error plot in Figure 32. For this test, a total of $1000$ uniformly sampled points in $[-1,1]$ are used with a mini-batch size of $100$ and a learning rate of $10^{-3}\times 0.9^{\lfloor k/200\rfloor}$, where $k=1,2,\dots,10000$ is the epoch number. As illustrated in Figure 31, the function is less oscillatory near 0. Therefore, we might anticipate that the network will initially learn the part near 0 and then learn feature by feature from the middle to the boundary. The experimental results presented in Figure 33 agree with this expectation.
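The spatial variation of the oscillation can be made explicit by differentiating the target function. The short computation below is added for illustration; the symbol $\nu(x)$ for the local frequency (phase derivative divided by $2\pi$) is notation introduced here and is not used elsewhere in the paper.

$$f'(x)=54\pi\,\operatorname{sign}(x)\,|x|^{0.5}\cos\big(36\pi|x|^{1.5}\big),\qquad \nu(x)=\frac{1}{2\pi}\left|\frac{d}{dx}\big(36\pi|x|^{1.5}\big)\right|=27\,|x|^{0.5}.$$

Hence the local frequency vanishes at $x=0$ and grows like $\sqrt{|x|}$ to $27$ cycles per unit length at $|x|=1$, which matches the expectation that features near the center are easier to learn than those near the boundary.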

Figure 31: Derivative of the target function.
Figure 32: Errors (in MSE) vs. epoch.
Figure 33: Illustration of the training process at (a) epoch 400, (b) epoch 500, (c) epoch 600, (d) epoch 700, (e) epoch 1000, (f) epoch 1700, (g) epoch 2400, and (h) epoch 10000.

Now we show an example of a 2D function $f(r,\theta)$ (see Figure 34) defined in polar coordinates $(r,\theta)$ as

$$f(r,\theta)=\begin{cases}0 & \text{if } 0.5+5\rho-5r\leq 0,\\ 1 & \text{if } 0.5+5\rho-5r\geq 1,\\ 0.5+5\rho-5r & \text{otherwise},\end{cases}\quad\text{where}\quad \rho=0.5+0.1\cos(\pi^{2}\theta^{2}).$$

Our MMNN is of a compact size $(500,20,8)$. For this test, a total of $600^2$ data points are sampled on a uniform grid in $[-1,1]^2$ with a mini-batch size of $1000$ and a learning rate of $0.001\times 0.9^{\lfloor k/6\rfloor}$ for epochs $k=1,2,\dots,300$. Figure 35 gives the error plot. The training process shown in Figure 36 illustrates that an overall coarse-scale or low-frequency component of the shape is learned first, and then localized features are learned one by one from coarse to fine.

Figure 34: Illustration of the target function.
Figure 35: Training and test errors (in MSE) vs. epoch.
Figure 36: Illustration of the learning dynamics at (a) epoch 1, (b) epoch 3, (c) epoch 5, (d) epoch 7, (e) epoch 14, (f) epoch 22, (g) epoch 30, and (h) epoch 300.

5 Further discussion

In this section, we provide several general insights about MMNNs. First, in Section 5.1, we explore the advantages of MMNNs compared to fully connected networks (FCNNs) or multi-layer perceptrons (MLPs). Next, in Section 5.2, we offer practical guidelines for determining the appropriate MMNN size based on our theoretical understanding and extensive numerical experiments. Finally, in Section 5.3, we discuss the use of alternative activation functions beyond $\mathtt{ReLU}$ in MMNNs.

5.1 Advantages compared to FCNNs or MLPs

  • The two key differences between a standard FCNN or MLP and an MMNN are 1) the introduction of the weights $\bm{A},\bm{c}$ for different linear combinations of hidden neurons (or perceptrons) as the multi-components in each layer, and 2) the training strategy that fixes the randomly initialized $\bm{W},\bm{b}$ (random features) in the hidden neurons. Hence it is extremely easy to modify an FCNN or MLP into an MMNN.

  • MMNNs are much more effective than FCNNs in terms of representation, training, and accuracy, especially for complex functions. As shown in the experiments in Section 2.4, MMNNs 1) have far fewer training parameters, 2) converge much faster in training, and 3) achieve much better accuracy. Moreover, experiments show that the training process of MMNNs converges not only faster but also at a steady rate, while FCNNs saturate rather early at a quite low accuracy, as commonly observed in practice. These favorable behaviors of MMNNs are due to their balanced structure for smooth decomposition as well as the training strategy. In practice, the introduction of $\bm{A},\bm{c}$ in MMNNs provides an important balance between the network width, which is the number of hidden neurons (basis functions) and can be very large, and the dimension of the input space, which is the number of components from the previous layer and can be much smaller than the network width. In other words, a few linear combinations of the basis functions can capture smooth structures in the input space well. For FCNNs, on the other hand, the two are the same and no such balance is enforced.
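To make the structural difference concrete, the following minimal sketch implements the forward pass of a single MMNN layer, $\bm{x}\mapsto\bm{A}\,\sigma(\bm{W}\bm{x}+\bm{b})+\bm{c}$ with $\sigma=\mathtt{ReLU}$, using plain Python lists. It is an illustration under stated assumptions rather than our implementation: the helper names (`relu`, `affine`, `mmnn_layer`) are hypothetical, and the sign convention in front of $\bm{b}$ is an assumption.

```python
def relu(v):
    """Apply ReLU entrywise to a list of floats."""
    return [max(0.0, t) for t in v]


def affine(matrix, vector, offset):
    """Return matrix @ vector + offset for plain Python lists."""
    return [sum(m * x for m, x in zip(row, vector)) + o
            for row, o in zip(matrix, offset)]


def mmnn_layer(x, W, b, A, c):
    """One layer x -> A * relu(W x + b) + c.

    W, b: random features inside the activation (kept fixed in training).
    A, c: combination weights outside the activation (the trained part).
    """
    hidden = relu(affine(W, x, b))  # `width` ReLU features
    return affine(A, hidden, c)     # `rank` components


# Tiny hand-picked example: width 2, rank 1, representing |x|.
W = [[1.0], [-1.0]]
b = [0.0, 0.0]
A = [[1.0, 1.0]]
c = [0.0]
print(mmnn_layer([-2.0], W, b, A, c))  # [2.0]
```

Here the width (number of rows of `W`) and the rank (number of rows of `A`) are decoupled, which is the balance discussed above. In an FCNN the output of the activation is passed directly to the next layer, so the two coincide.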

5.2 Practical guidelines for MMNN

There are three hyperparameters for the configuration of MMNN sizes: the network width, the number of components (rank), and the number of layers (depth). Here are the general guidelines based on our mathematical construction and extensive experiments:

  1. The network width should provide enough resolution to capture fine details of the target function. This means that the width should be at least comparable to the size of an adaptive mesh that can approximate the target function well.

  2. The number of components (rank) is related to the overall complexity of the target function, which depends on its spatial domain and Fourier domain representation as well as the input dimension. As indicated by our mathematical multi-component construction, it is related to the "divide and conquer" strategy.

  3. The number of layers (depth) is also related to the overall complexity of the target function, as is the number of components. Rank and depth are complementary and work together effectively for a smooth decomposition of the target function. The rule of thumb for depth is similar to that for the rank.

Here we use more concrete examples to illustrate the guidelines. For simplicity we fix the input dimension and the domain of interest. As the domain size and dimension increase, the network size needs to increase correspondingly. For a smooth target function, an MMNN that is compact in width, rank, and depth is enough, and an easy training process can yield accurate results. Larger MMNNs are needed for target functions with localized rapid changes. Even with a relatively compact size, the training process can allocate resources adaptively to the target function and yield a good approximation. The most difficult situation is to approximate globally highly oscillatory functions with diverse Fourier modes, for which large MMNNs are needed. For instance, if the oscillation frequency doubles, the network width should increase by a factor of $2^{d}$, where $d$ is the dimension. In general, the network width has to deal with the curse of dimensionality just like a mesh-based method. However, the growth of the number of components and layers with increasing complexity seems to be relatively mild (possibly polylogarithmic, as suggested by our mathematical construction).
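The width heuristic above can be turned into a small calculation. The sketch below is only an illustration: the source states the rule for a doubling of the frequency, and extending it to an arbitrary frequency ratio, as well as the function name `scaled_width` and the base width of 64, are assumptions made here.

```python
import math


def scaled_width(base_width, freq_ratio, dim):
    """Suggested width after the oscillation frequency grows by `freq_ratio`.

    Mesh-like heuristic: the number of basis functions needed grows like
    freq_ratio ** dim, so doubling the frequency multiplies the width by 2 ** dim.
    """
    return math.ceil(base_width * freq_ratio ** dim)


if __name__ == "__main__":
    # Illustrative base width of 64 for some reference frequency.
    for dim in (1, 2, 3):
        print(dim, scaled_width(64, 2, dim))
```

This makes the curse of dimensionality in the width explicit: the same doubling of frequency that costs a factor of 2 in one dimension costs a factor of 8 in three.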

  • Overall, for a given target function, MMNNs can work well over quite a large range of configurations, with a trade-off between the network size and the training process. For example, training a network that is on the compact side relative to the complexity of the target function may become more subtle and challenging, e.g., in choosing an appropriate learning rate and batch size, due to the lack of flexibility (or redundancy) in the representation. On the other hand, a network that is too large (or too redundant) relative to the complexity of the target function incurs an unnecessarily expensive training cost. An interesting and important question for future research is how to develop an a posteriori strategy to adjust the network size automatically in practice.

  • The most advantageous situation for using MMNNs is approximating a function in relatively high dimension that is mostly smooth except for localized fine features, e.g., a distribution in high dimensions concentrated on a low-dimensional manifold. Through training, MMNNs can provide an automatic adaptive approximation of the underlying structure, which can be challenging for a traditional mesh-based method.

  • We would like to remark that the learning rate scheduler can be a subtle and important issue for any training process in practice. In all our training runs, the Step Learning Rate scheduler suffices. However, one could consider other learning rate schedulers, such as the Cosine Scheduler [18] or the gradual warm-up strategy [5]. Exploring and designing a more efficient learning rate scheduler with an automatic restart mechanism is a potentially interesting topic for future work.
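For reference, the schedules mentioned above can be written as plain functions of the epoch. This is a minimal sketch using the standard textbook formulas; the function names and all hyperparameter values (initial rate, step size, decay factor, warm-up length) are illustrative assumptions, not the settings used in our experiments.

```python
import math


def step_lr(lr0, epoch, step_size, gamma):
    """Step schedule: multiply the rate by `gamma` every `step_size` epochs."""
    return lr0 * gamma ** (epoch // step_size)


def cosine_lr(lr0, epoch, total_epochs, lr_min=0.0):
    """Cosine annealing from lr0 (epoch 0) down to lr_min (epoch total_epochs)."""
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr0 - lr_min) * cos_term


def warmup_factor(epoch, warmup_epochs):
    """Linear warm-up multiplier that ramps up to 1 over `warmup_epochs` epochs."""
    return min(1.0, (epoch + 1) / warmup_epochs)


if __name__ == "__main__":
    # Illustrative values only.
    for epoch in (0, 100, 500, 1000):
        s = step_lr(1e-3, epoch, step_size=100, gamma=0.9)
        c = cosine_lr(1e-3, epoch, total_epochs=1000) * warmup_factor(epoch, 5)
        print(epoch, s, c)
```

A restart mechanism of the kind discussed in [18] would reset `epoch` to zero at chosen points; that logic is not included in this sketch.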

5.3 Beyond ReLU to other activation functions

We also tried using different activation functions for MMNNs, e.g., $\mathtt{GELU}$ [9], $\mathtt{Swish}$ [25], $\mathtt{Sigmoid}$, and $\mathtt{Tanh}$. In general, $\mathtt{ReLU}$ provides the overall best results for various target functions. However, in situations where a smooth (e.g., $C^{s}$ or real analytic) approximation is needed, one might consider using smooth alternatives to $\mathtt{ReLU}$ such as $\mathtt{GELU}$ or $\mathtt{Swish}$, which generally yield results comparable to $\mathtt{ReLU}$:

  • For target functions that are $C^{s}$ or even real analytic (and can be highly oscillatory), such as $f(x)=\cos(36\pi x^{2})-0.6\cos(12\pi x^{2})$, $\mathtt{GELU}$ (or $\mathtt{Swish}$) tends to perform slightly better than $\mathtt{ReLU}$.

  • For target functions with non-differentiable points, such as $f(x)={\mathds{1}}_{\{|x|<0.02\}}\cdot\sin(50\pi x)$, $\mathtt{GELU}$ (or $\mathtt{Swish}$) generally performs slightly worse than $\mathtt{ReLU}$.

  • The use of $\mathtt{GELU}$ (or $\mathtt{Swish}$) typically results in slightly longer training time compared to $\mathtt{ReLU}$.
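The two example targets in the list above can be written down directly, which also makes clear where the second one fails to be differentiable. The function names `f_smooth` and `f_kink` and the sample points are illustrative assumptions.

```python
import math


def f_smooth(x):
    """Real analytic, oscillatory target: cos(36*pi*x^2) - 0.6*cos(12*pi*x^2)."""
    return math.cos(36 * math.pi * x**2) - 0.6 * math.cos(12 * math.pi * x**2)


def f_kink(x):
    """Indicator of |x| < 0.02 times sin(50*pi*x).

    The function is continuous, since sin(50*pi*0.02) = sin(pi) = 0, but its
    slope jumps between -50*pi and 0 at x = +/-0.02.
    """
    return math.sin(50 * math.pi * x) if abs(x) < 0.02 else 0.0


if __name__ == "__main__":
    # Illustrative sample points.
    for x in (0.0, 0.01, 0.02, 0.5):
        print(x, f_smooth(x), f_kink(x))
```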

Additionally, other popular S-shaped activation functions such as $\mathtt{Sigmoid}$ and $\mathtt{Tanh}$ performed poorly in our tests, possibly due to the vanishing gradient problem. For highly oscillatory target functions, the training errors did not even decrease during the training process when using $\mathtt{Sigmoid}$ or $\mathtt{Tanh}$.
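For concreteness, the activation functions compared in this section can be sketched in a few lines using their standard definitions (the exact $\mathtt{GELU}$ via the error function, and $\mathtt{Swish}$ with an assumed parameter $\beta=1$). The derivative helpers only illustrate how S-shaped activations saturate; they are not a diagnosis of the behavior observed in our tests.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def swish(x, beta=1.0):
    """Swish: x * sigmoid(beta * x); beta = 1 is an assumed default."""
    return x * sigmoid(beta * x)


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # never exceeds 0.25


def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2  # never exceeds 1, decays quickly for large |x|


if __name__ == "__main__":
    for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
        print(x, relu(x), gelu(x), swish(x), sigmoid_grad(x), tanh_grad(x))
```

Since the $\mathtt{Sigmoid}$ derivative is at most 0.25, a product of such derivatives across several layers shrinks geometrically (ignoring the weights), which is one common explanation for vanishing gradients.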

6 Conclusion

In this work, we introduced the Multi-component and Multi-layer Neural Network (MMNN) and demonstrated its effectiveness in approximating complex functions. By incorporating the principles of structured and balanced decomposition, the MMNN architecture addresses the limitations of shallow networks, particularly in capturing high-frequency components and localized fine features. As confirmed by extensive numerical experiments, our proposed network structure can approximate highly oscillatory functions and functions with rapid transitions efficiently and accurately. Additionally, we highlight the advantages of our training strategy, which optimizes only the linear combination weights of basis functions for each component while keeping the parameters within the activation (basis) functions fixed, leading to a more efficient and stable training process.

The theoretical underpinnings and practical implementations presented in this paper suggest that MMNNs offer a promising direction for constructing neural networks capable of handling complex tasks with fewer parameters and reduced computational overhead. Future research can explore further generalizations and applications of MMNNs, as well as investigate the interplay between representation and optimization in more depth.

Acknowledgments

H. Zhao was partially supported by NSF grants DMS-2309551 and DMS-2012860. Y. Zhong was partially supported by NSF grant DMS-2309530. H. Zhou was partially supported by NSF grant DMS-2307465.

References

  • [1] Helmut Bölcskei, Philipp Grohs, Gitta Kutyniok, and Philipp Petersen. Optimal approximation with sparsely connected deep neural networks. SIAM Journal on Mathematics of Data Science, 1(1):8–45, 2019.
  • [2] Charles K. Chui, Shao-Bo Lin, and Ding-Xuan Zhou. Construction of neural networks for realization of localized deep learning. Frontiers in Applied Mathematics and Statistics, 4:14, 2018.
  • [3] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2:303–314, 1989.
  • [4] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Yee Whye Teh and Mike Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249–256, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. PMLR.
  • [5] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv e-prints, page arXiv:1706.02677, June 2017.
  • [6] Rémi Gribonval, Gitta Kutyniok, Morten Nielsen, and Felix Voigtlaender. Approximation spaces of deep neural networks. Constructive Approximation, 55:259–367, 2022.
  • [7] Ingo Gühring, Gitta Kutyniok, and Philipp Petersen. Error bounds for approximations with deep ReLU neural networks in $W^{s,p}$ norms. Analysis and Applications, 18(05):803–859, 2020.
  • [8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, June 2016.
  • [9] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). arXiv e-prints, page arXiv:1606.08415, June 2016.
  • [10] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.
  • [11] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.
  • [12] Yerlan Idelbayev and Miguel Á. Carreira-Perpiñán. Low-rank compression of neural nets: Learning the rank of each layer. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8046–8056, 2020.
  • [13] Aysu Ismayilova and Vugar E. Ismailov. On the Kolmogorov neural networks. Neural Networks, 176:106333, 2024.
  • [14] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
  • [15] A. N. Kolmogorov. On the representation of continuous functions of several variables by superposition of continuous functions of one variable and addition. Dokl. Akad. Nauk SSSR, pages 953–956, 1957.
  • [16] Fanghui Liu, Xiaolin Huang, Yudong Chen, and Johan A. K. Suykens. Random features for kernel approximation: A survey on algorithms, theory, and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):7128–7148, 2022.
  • [17] Ziming Liu, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljačić, Thomas Y. Hou, and Max Tegmark. KAN: Kolmogorov-Arnold networks. arXiv e-prints, page arXiv:2404.19756, April 2024.
  • [18] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.
  • [19] Jianfeng Lu, Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation for smooth functions. SIAM Journal on Mathematical Analysis, 53(5):5465–5506, 2021.
  • [20] Vitaly Maiorov and Allan Pinkus. Lower bounds for approximation by MLP neural networks. Neurocomputing, 25(1):81–91, 1999.
  • [21] Hadrien Montanelli and Haizhao Yang. Error bounds for deep ReLU networks using the Kolmogorov-Arnold superposition theorem. Neural Networks, 129:1–6, 2020.
  • [22] Alexander Novikov, Dmitrii Podoprikhin, Anton Osokin, and Dmitry P Vetrov. Tensorizing neural networks. Advances in neural information processing systems, 28, 2015.
  • [23] Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah Smith, and Lingpeng Kong. Random feature attention. In International Conference on Learning Representations, 2021.
  • [24] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2007.
  • [25] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. arXiv e-prints, page arXiv:1710.05941, October 2017.
  • [26] Siddhartha Rao Kamalakara, Acyr Locatelli, Bharat Venkitesh, Jimmy Ba, Yarin Gal, and Aidan N. Gomez. Exploring low rank training of deep neural networks. arXiv e-prints, page arXiv:2209.13569, September 2022.
  • [27] Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
  • [28] Tara N. Sainath, Brian Kingsbury, Vikas Sindhwani, Ebru Arisoy, and Bhuvana Ramabhadran. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6655–6659, 2013.
  • [29] Lesia Semenova, Cynthia Rudin, and Ronald Parr. On the existence of simpler machine learning models. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 1827–1858, 2022.
  • [30] Zuowei Shen, Haizhao Yang, and Shijun Zhang. Nonlinear approximation via compositions. Neural Networks, 119:74–84, 2019.
  • [31] Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation characterized by number of neurons. Communications in Computational Physics, 28(5):1768–1811, 2020.
  • [32] Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation: Achieving arbitrary accuracy with fixed number of neurons. Journal of Machine Learning Research, 23(276):1–60, 2022.
  • [33] Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation in terms of intrinsic parameters. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 19909–19934. PMLR, 17–23 Jul 2022.
  • [34] Aman Sinha and John C Duchi. Learning kernels with random features. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.
  • [35] Dmitry Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017.
  • [36] Dmitry Yarotsky. Optimal approximation of continuous functions by very deep ReLU networks. In Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet, editors, Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 639–649. PMLR, 06–09 Jul 2018.
  • [37] Shijun Zhang. Deep neural network approximation via function compositions. PhD Thesis, National University of Singapore, 2020.
  • [38] Shijun Zhang, Hongkai Zhao, Yimin Zhong, and Haomin Zhou. Why shallow networks struggle with approximating and learning high frequency: A numerical study. arXiv e-prints, page arXiv:2306.17301, June 2023.
  • [39] Ding-Xuan Zhou. Universality of deep convolutional neural networks. Applied and Computational Harmonic Analysis, 48(2):787–794, 2020.