Expressivity of Neural Networks with Random Weights and Learned Biases

Ezekiel Williams
Mathematics and Statistics
Université de Montréal
Montréal, Québec, Canada
[email protected]
Avery Hee-Woon Ryoo
Computer Science
Université de Montréal
Montréal, Québec, Canada
[email protected]
Thomas Jiralerspong
Computer Science
Université de Montréal
Montréal, Québec, Canada
[email protected]
Alexandre Payeur
Mathematics and Statistics
Université de Montréal
Montréal, Québec, Canada
[email protected]
Matthew G. Perich
Neuroscience
Université de Montréal
Montréal, Québec, Canada
[email protected]
Luca Mazzucato
Biology, Physics, and Mathematics
University of Oregon
Eugene, Oregon, United States
[email protected]
Guillaume Lajoie
Mathematics and Statistics
Université de Montréal
Montréal, Québec, Canada
[email protected]
Corresponding author; \dagger denotes equal contribution; \star denotes co-senior authors; second line is department
Abstract

Landmark universal function approximation results for neural networks with trained weights and biases provided impetus for the ubiquitous use of neural networks as learning models in Artificial Intelligence (AI) and neuroscience. Recent work has pushed the bounds of universal approximation by showing that arbitrary functions can similarly be learned by tuning smaller subsets of parameters, for example the output weights, within randomly initialized networks. Motivated by the fact that biases can be interpreted as biologically plausible mechanisms for adjusting unit outputs in neural networks, such as tonic inputs or activation thresholds, we investigate the expressivity of neural networks with random weights where only biases are optimized. We provide theoretical and numerical evidence demonstrating that feedforward neural networks with fixed random weights can be trained to perform multiple tasks by learning biases only. We further show that an equivalent result holds for recurrent neural networks predicting dynamical system trajectories. Our results are relevant to neuroscience, where they demonstrate the potential for behaviourally relevant changes in dynamics without modifying synaptic weights, as well as for AI, where they shed light on multi-task methods such as bias fine-tuning and unit masking.

1 Introduction

1.1 Context and Motivation

The universal approximation theorems of the late 1980s and early 1990s Hornik et al. (1989); Funahashi (1989); Hornik (1991) highlighted the expressivity of neural network models, i.e. their ability to approximate or express a broad class of functions through tuning of weights and biases, heralding the central role neural networks play in Machine Learning (ML) and neuroscience today. Since these foundational studies, a rich literature has explored the limits of the expressivity of neural networks by finding smaller parameter subsets whose tuning still results in the approximation of wide classes of functions or dynamics. Prior work has explored the approximation capabilities of Feed-Forward Networks (FFNs) and Recurrent Neural Networks (RNNs) where only the output weights are trained Rosenblatt et al. (1962); Rahimi and Recht (2008); Ding et al. (2014); Neufeld and Schmocker (2023); Jaeger and Haas (2004); Sussillo and Abbott (2009); Gonon et al. (2023); Hart et al. (2021), and of deep FFNs where only subsets of parameters Rosenfeld and Tsotsos (2019), normalization parameters Burkholz (2023); Giannou et al. (2023), or binary masks, either over units or parameters, are trained Malach et al. (2020). Recently, a study has also explored the approximation abilities of transformers where only context is tuned Petrov et al. (2024). Here, motivated by recent insights from neuroscience, we initiate the investigation of expressivity in both FFNs and RNNs where only biases are trained, an avenue that remains largely unexplored.

Learned biases are of fundamental relevance to neuroscience and ML. In the latter domain, recent work has highlighted the optimization or the careful selection of bias vectors: fine-tuning of biases for multi-task learning Zaken et al. (2021) and the context tokens used for in-context learning Von Oswald et al. (2023) can both be viewed as methods of carefully setting the bias in a model where other connectivity parameters are fixed in order to perform a task. In neuroscience, single neurons employ diverse mechanisms to adjust their response to inputs, beyond synaptic plasticity. Importantly, some of these mechanisms can be construed as manipulating the firing onset of neurons which, in a firing rate model, is represented by the bias. Among these, we have shunting inhibition Holt and Koch (1997), threshold adaptation Azouz and Gray (2000), and a host of other mechanisms that participate in shaping the input-output transfer function of neurons (reviewed in Ferguson and Cardin (2020), see also Mazzucato et al. (2019)). Experimental evidence also suggests that bias-related signals play a role in learning Zhang and Linden (2003); Sehgal et al. (2013) but, despite this, most work modelling learning in neuroscience has focused on synapses.

If tuning only the biases of a neural network can reach only a reduced set of functions, or output dynamics, then this would solidify the role of synaptic plasticity as the critical component in biological and artificial learning. Conversely, if one can express arbitrary dynamics solely by changing biases, this would motivate deeper investigation of when and how non-synaptic plasticity mechanisms might shoulder some of the effort of learning. In this paper we address the question of the expressivity of bias learning. In a regime where all weight parameters are randomly initialized and frozen, and only hidden layer biases are optimized, we provide theoretical guarantees demonstrating that:

  1. feed-forward neural networks with wide hidden layers are universal function approximators with high probability;

  2. RNNs with wide hidden layers can arbitrarily approximate finite-time trajectories from smooth dynamical systems with high probability.

We provide empirical support for, and a deeper interrogation of, these theoretical findings with an array of numerical experiments.

1.2 Related Works

Neuroscience. Evidence from experimental and theoretical neuroscience shows that cortical circuits flexibly regulate their dynamical phases and toggle between different regimes solely by changing the distribution of input biases to a circuit. Bias-induced modulations can explain the context-dependent effects of expectation Mazzucato et al. (2019), movements Wyrick and Mazzucato (2021) and arousal Papadopoulos et al. (2024) on sensory processing across modalities and brain areas. Changes in the bias distribution modulate the gain of the input-output neuronal transfer function Mazzucato et al. (2019). Moreover, two bias-related biological mechanisms, neural firing threshold and network inputs, have also been noted for their ability to shape network dynamics: theoretical work shows that threshold heterogeneity can improve network capacity Gast et al. (2024, 2023), while a mix of theory and experimental evidence suggests a role for inputs from upstream brain areas in reconfiguring circuit dynamics on fast timescales Perich et al. (2018); Remington et al. (2018). Within the RNN reservoir computing approach to modelling in neuroscience, where recurrent weights are random and fixed, bias modulations can toggle between multiple phases (including fixed point, chaos, and multistable regimes) and, strikingly, enable RNN multi-tasking in the absence of any parameter optimization Ogawa et al. (2023). A repertoire of dynamical motifs can also be generated in RNN reservoirs with dynamic feedback loops Logiaco et al. (2021). However, in all these studies input biases or other parameters were either i.i.d. generated or set to specific values, but never learned. The role of plastic biases has not been investigated yet and will be the main focus of our efforts below.

Machine Learning. We mention primarily two bodies of literature in ML/AI that are related to this study. Many efforts have explored neural networks that are trained to quickly meta-learn new tasks via dynamics in activation space alone, without any adaptation of weights (see e.g. Feldkamp et al. (1997); Klos et al. (2020); Cotter and Conwell (1990, 1991)). Like our work, this thread of research proposes a mechanism by which a network might adapt to any new task without changing weights. However, prior work differs from the current study in that it requires an initial meta-training of all parameters in a network, weights included, before operating in the "fast learning" regime where network variables maintain context information that allows the networks to rapidly adapt to new tasks. In some cases these context variables can be thought of as biases Cotter and Conwell (1991). The closest ML contributions to our work are on the topic of masking, particularly the Strong Lottery Ticket Hypothesis (SLTH). This hypothesis conjectures that a given desired parameterization of a neural network can be found as a sub-network in a larger, appropriately initialized, network Ramanujan et al. (2020). SLTH is typically formulated with respect to weights, i.e. a subnetwork is defined by deleting weights from the full network. However, a few studies have investigated SLTH when subnetworks are constructed by deleting units Malach et al. (2020). While our study is different at face value for its focus on function approximation via bias optimization, rather than finding "lottery ticket" subnetworks, a key step in our analytic derivations relies on masking in a fashion analogous to proofs of the unit masking version of SLTH. Thus, our work also provides two results that may be of interest to the SLTH theory: first, an alternative proof (complementing the first proof in Malach et al. (2020)) of the SLTH for units in single-hidden layer neural networks; second, a first proof of the SLTH over units for RNNs (see Section 2 for more details). Flavours of the lottery ticket hypothesis for RNNs have been explored empirically Yu et al. (2019); García-Arias et al. (2021); Schlake et al. (2022) but we have not encountered its proof, over weights or units, in the literature.

2 Theory Results

2.1 Feed-forward neural networks

This section studies the single-layer FFN, whose output is given by

$$y_{n}(x,\theta)=\sum_{i=1}^{n}A_{:i}\phi(B_{i:}x+b), \tag{1}$$

with $A\in\mathbb{R}^{l\times n}$, $B\in\mathbb{R}^{n\times d}$, $b\in\mathbb{R}^{n}$, and $\theta=\{B,A,b\}$. We shall investigate the approximation properties of this neural network when all the weights in $B$ and $A$ are fixed and sampled uniformly from the $n(l+d)$ dimensional centered hypercube, with half-edges of length $\gamma$, and only $b$ is tuned. We begin by outlining the activation function assumptions necessary for our theoretical results.
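As a concrete reading of Eq. 1, the following NumPy sketch evaluates the network once as the explicit sum over hidden units and once in matrix form. The helper names (`ffn_sum`, `ffn_matrix`), the toy dimensions, and the choice of ReLU for the activation are illustrative assumptions, as is the reading that hidden unit $i$ uses the $i$-th entry of $b$.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def ffn_sum(x, A, B, b):
    """Eq. 1 as an explicit sum over the n hidden units, for one input x of shape (d,)."""
    y = np.zeros(A.shape[0])
    for i in range(B.shape[0]):
        # A[:, i] is the output-weight column of unit i; B[i] is its input-weight row.
        y += A[:, i] * relu(B[i] @ x + b[i])
    return y

def ffn_matrix(x, A, B, b):
    """The same map written with matrix products."""
    return A @ relu(B @ x + b)

rng = np.random.default_rng(0)
d, l, n, gamma = 4, 3, 16, 0.1           # toy sizes (assumed for illustration)
B = rng.uniform(-gamma, gamma, (n, d))   # frozen random input weights
A = rng.uniform(-gamma, gamma, (l, n))   # frozen random output weights
b = np.zeros(n)                          # the only parameters that would be trained
x = rng.uniform(-1.0, 1.0, d)

y_sum, y_mat = ffn_sum(x, A, B, b), ffn_matrix(x, A, B, b)
```

In a bias-learning setup, an optimizer would update only `b`, while `A` and `B` stay at their sampled values.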

Definition 1.

The function $\phi$ is a suitable activation if, when employed in the neural network of Eq. 1, it allows for universal approximation of the following kind: for any continuous $h:U\to\mathbb{R}^{l}$ and any $\epsilon>0$, $\exists n\in\mathbb{N}$ and parameters $\theta$ s.t. $\int_{U}||h(x)-y_{n}(x,\theta)||dx\leq\epsilon$, where $U\subset\mathbb{R}^{d}$ is compact and $||\cdot||$ denotes the $1$-norm here and throughout the paper.

From the universal approximation theorems of 1993 Leshno et al. (1993); Hornik (1993), a sufficient condition for $\phi$ to be a suitable activation is that it is non-polynomial. In this paper we conceptualize universal approximation as the approximation of continuous functions on compact sets with respect to an $L^{1}$ functional norm, but we remark that the literature has also studied other conditions on $h$ (for example, continuity and measurability) and other forms of convergence (for example, with respect to the $L^{\infty}$ norm). For a review of the literature, see Pinkus (1999).

Definition 2.

A suitable activation $\phi$ is referred to as a $\gamma$-parameter bounding activation if it allows for universal approximation even when each individual parameter, e.g. an element of a weight matrix or bias vector, is bounded, i.e. lies in $[-\gamma,\gamma]$.

Proposition 1.

The ReLU and the Heaviside step function are $\gamma$-parameter bounding activations for any $\gamma>0$.

The proof is in Appendix §2.1 for completeness. A key subtlety of parameter-bounding is that it is a bound on individual, scalar, parameters. Thus, as a network grows in width the bias vector and weight matrix norms will still grow accordingly. This may be important, as research suggests band-limited parameters cannot universally approximate, at least for certain activation types Li et al. (2023). We leave it to future work to determine which other activations are parameter bounding.

We make one final definition:

Definition 3.

If $\phi$ is a $\gamma$-parameter bounding activation, is continuous, and if $\exists\tau\in\mathbb{R}$ such that $\phi(x)=0$ for $x<\tau$, then we say that $\phi$ is a $\gamma$-bias-learning activation.

Obviously the ReLU is a bias-learning activation. We leave the study of discontinuous functions like the Heaviside to future work. We conclude with the main result of this section. Define $p_{R}$ to be a uniform distribution on $[-R,+R]$, where $\gamma<R<\infty$.

Theorem 1.

Assume that $\phi$ is $\gamma$-bias learning and, for compact $U\subset\mathbb{R}^{d}$, $h:U\to\mathbb{R}^{l}$ is continuous. Then for any degree of accuracy $\epsilon>0$ and probability of error $\delta\in(0,1)$, there exists a hidden-layer width $m\in\mathbb{N}$ and bias vector $b\in\mathbb{R}^{m}$ such that, with a probability of $1-\delta$, a neural network given by Eq. 1 with each individual weight sampled from $p_{R}$ approximates $h$ with error less than $\epsilon$.

Corollary 1.

Assume that $d=l$, i.e. the input and output spaces of $h$ have the same dimension. Then the results of Theorem 1 also hold for single-hidden-layer res-nets.

Proof Intuition: We provide intuition about the proof of Theorem 1 and its Corollary, whose details can be found in the Appendix. According to the universal approximation theorem, given a continuous function, we can find a one-hidden-layer network, $\mathcal{N}_{1}$, that is close to that function in the $L^{1}$ norm on the (compact) space of its inputs. If $\mathcal{N}_{1}$ has been constructed using $\gamma$-parameter bounding activation functions, then we know that each parameter will be on the interval $[-\gamma,\gamma]$. Next, we construct a second network, $\mathcal{N}_{2}$, to approximate $\mathcal{N}_{1}$ by randomly sampling each of its parameters, weight or bias, from $p_{R}$. For $\mathcal{N}_{2}$ to approximate $\mathcal{N}_{1}$, each parameter of $\mathcal{N}_{2}$ should fall within a tiny window of an analogous parameter in $\mathcal{N}_{1}$. This window must have half-length less than $\epsilon$ to yield the desired error bound. Without loss of generality we can assume $\epsilon<R-\gamma$. Then, if we sample parameters uniformly on $[-R,R]$ there will be a non-zero probability that a given parameter of $\mathcal{N}_{2}$ will end up within the tiny $\epsilon$-window centered at a corresponding parameter value in $\mathcal{N}_{1}$; because $\epsilon<R-\gamma$ we know that the $\epsilon$-window does not stretch outside the distribution support. If we randomly sample a very large number of units to construct the hidden layer of $\mathcal{N}_{2}$, the probability of finding a subnetwork of $\mathcal{N}_{2}$ corresponding to $\mathcal{N}_{1}$ can be made arbitrarily close to $1$. If the activation function is bias-learning we can use biases to pick out this subnetwork by setting them appropriately smaller than the threshold given in Definition 3. We stress that this proof requires exceedingly massive hidden layer widths.
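The last step of this argument, using biases to pick out a subnetwork, can be sketched numerically. The snippet below is a toy illustration and not the construction from the Appendix: it assumes ReLU (threshold $\tau=0$), inputs in the hypercube $[-1,1]^{d}$, and a hypothetical "off" bias value `b_off` chosen from the resulting bound on pre-activations.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def ffn(x, A, B, b):
    """Batched Eq. 1: x is (batch, d), B is (n, d), b is (n,), A is (l, n)."""
    return relu(x @ B.T + b) @ A.T

rng = np.random.default_rng(0)
d, l, n, R = 3, 2, 50, 1.0               # toy sizes (assumed for illustration)
B = rng.uniform(-R, R, (n, d))           # frozen random input weights
A = rng.uniform(-R, R, (l, n))           # frozen random output weights
b = rng.uniform(-R, R, n)
x = rng.uniform(-1.0, 1.0, (100, d))     # inputs from the compact set [-1, 1]^d

keep = np.zeros(n, dtype=bool)
keep[:10] = True                         # units forming the hypothetical subnetwork

# On [-1, 1]^d every pre-activation B_i x is at most R * d in magnitude,
# so a bias below -R * d keeps the unit under the ReLU threshold for all inputs.
b_off = -(R * d + 1.0)
b_masked = np.where(keep, b, b_off)

y_bias = ffn(x, A, B, b_masked)                  # wide network, subnetwork selected by biases
y_sub = ffn(x, A[:, keep], B[keep], b[keep])     # the subnetwork evaluated on its own
```

Here `y_bias` and `y_sub` agree up to floating-point rounding, so in this sketch changing only the bias vector has the same effect as deleting the unused units.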

2.2 Recurrent neural networks

This section studies a discrete-time RNN given by

$$r_{t}=-\alpha r_{t-1}+\beta\phi(Wr_{t-1}+Bx_{t-1}+b),\quad y_{t}=Cr_{t}, \tag{2}$$

where $r_{t}\in\mathbb{R}^{m}$ for all $0\leq t\leq T$ for some $T\in\mathbb{N}$, $\alpha$ and $\beta$ control the time scale of the dynamics, $W\in\mathbb{R}^{m\times m}$, $C\in\mathbb{R}^{l\times m}$, and $B$ and $b$ are as in the previous section. The parameters are now $\theta=\{W,B,C,b\}$. The time-dependent input $x_{t}$ belongs to a compact subset $U_{x}\subset\mathbb{R}^{d}$ for all $t$. Note that when $\alpha=0$, $\beta=1$ one gets the standard vanilla RNN formulation; alternatively, $\alpha$ and $\beta$ can be set to approximate continuous-time dynamics using Euler's method.
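A minimal sketch of iterating Eq. 2 is given below, assuming a ReLU activation and toy dimensions; the function name `rnn_rollout` and all numerical values are illustrative and not taken from the paper. With the defaults $\alpha=0$, $\beta=1$ it reduces to the vanilla RNN update.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def rnn_rollout(r0, xs, W, B, C, b, alpha=0.0, beta=1.0):
    """Iterate Eq. 2: r_t = -alpha r_{t-1} + beta phi(W r_{t-1} + B x_{t-1} + b), y_t = C r_t."""
    r, ys = r0, []
    for x in xs:                          # xs[t-1] drives the update to step t
        r = -alpha * r + beta * relu(W @ r + B @ x + b)
        ys.append(C @ r)
    return np.array(ys), r

rng = np.random.default_rng(0)
m, d, l, T, R = 8, 2, 1, 5, 0.3           # toy sizes (assumed for illustration)
W = rng.uniform(-R, R, (m, m))            # frozen random recurrent weights
B = rng.uniform(-R, R, (m, d))            # frozen random input weights
C = rng.uniform(-R, R, (l, m))            # frozen random output weights
b = rng.uniform(-R, R, m)                 # biases: the only trained parameters
xs = rng.uniform(-1.0, 1.0, (T, d))
r0 = np.zeros(m)

ys, r_T = rnn_rollout(r0, xs, W, B, C, b)  # alpha=0, beta=1: vanilla RNN
```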

We will approximate the following class of dynamical systems by learning only biases:

$$z_{t+1}=F(z_{t},x_{t}),\quad y_{t}=Qz_{t},\quad z_{0}\in U_{z}, \tag{3}$$

where $t$ and $x_{t}$ are as defined for the RNN, $F:U_{z}\times U_{x}\to\mathbb{R}^{s}$ is continuous, and $Q\in\mathbb{R}^{l\times s}$. Because we build from the classic universal approximation results, we must be working with functions operating on compact sets. To guarantee that this will be the case we must make several more assumptions about the dynamical system. First, $U_{z}\subset\mathbb{R}^{s}$ is assumed to be a compact invariant set of the dynamical system: if the system is in $U_{z}$ it remains there for all $t$ for all inputs in $U_{x}$. Second, we assume that the dynamical system is well-defined on a slightly larger compact set $\tilde{U}_{z}\times U_{x}$, where $\tilde{U}_{z}=\{z_{0}+c_{0}:z_{0}\in U_{z},\>||c_{0}||<c\}$ for some $c>0$, with $\tilde{U}_{z}\supset U_{z}$. Letting $p_{R}$ be defined as in Section 2.1, the main result of this section is:

Theorem 2.

Consider the RNN in Eq. 2 with $\phi$ a $\gamma$-bias learning activation, and input, output, and recurrent weight parameters for each hidden unit sampled from $p_{R}$. We can find a hidden-layer width, a bias vector, and a hidden-state initial condition for the RNN such that, with a probability that is arbitrarily close to $1$, the RNN approximates finite trajectories from the dynamical system defined in Eq. 3 to below any positive, non-zero, error.

This result is weaker than the result given in the previous section in that it shows point-wise convergence rather than $L^{1}$ convergence. We conjecture that it can be extended straightforwardly but we leave this extension to future work.

Proof Intuition: The complete proof is in the Appendix; here we provide intuition. As in Theorem 1 the proof proceeds in two steps. First, the dynamical system is approximated by an RNN, $\mathcal{R}_{1}$, using universal approximation theory for RNNs (see e.g., Schäfer and Zimmermann (2006)). $\mathcal{R}_{1}$ is then approximated by a much wider, random RNN, $\mathcal{R}_{2}$, with parameters sampled from $p_{R}$. Analogous to Theorem 1, we show that one can find a sub-network of hidden units in $\mathcal{R}_{2}$ that approximates $\mathcal{R}_{1}$ for very large hidden widths of $\mathcal{R}_{2}$.
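The sub-network selection step can be illustrated for the recurrent case with a small NumPy sketch. It is a toy example under stated assumptions and not the Appendix construction: ReLU activation, a hypothetical "off" bias of $-100$ that is large enough for these particular toy values, and silenced units initialized at zero so that the $-\alpha r_{t-1}$ term does not carry their initial state forward.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def rollout(r0, xs, W, B, C, b, alpha, beta):
    """Iterate Eq. 2 and return the outputs y_1..y_T and hidden states r_1..r_T."""
    r, ys, rs = r0, [], []
    for x in xs:
        r = -alpha * r + beta * relu(W @ r + B @ x + b)
        ys.append(C @ r)
        rs.append(r)
    return np.array(ys), np.array(rs)

rng = np.random.default_rng(1)
m, k, d, l, T, R = 40, 5, 2, 1, 20, 0.2   # wide width m, sub-network size k (toy values)
alpha, beta = 0.1, 1.0
W = rng.uniform(-R, R, (m, m))
B = rng.uniform(-R, R, (m, d))
C = rng.uniform(-R, R, (l, m))
b = rng.uniform(-R, R, m)
xs = rng.uniform(-1.0, 1.0, (T, d))

keep = np.zeros(m, dtype=bool)
keep[:k] = True

# Hypothetical "off" bias, sufficient for this toy example only.
b_masked = np.where(keep, b, -100.0)
# Silenced units start at zero; kept units get an arbitrary initial state.
r0 = np.where(keep, rng.uniform(0.0, 1.0, m), 0.0)

ys_wide, rs_wide = rollout(r0, xs, W, B, C, b_masked, alpha, beta)
ys_sub, _ = rollout(r0[keep], xs, W[np.ix_(keep, keep)], B[keep],
                    C[:, keep], b[keep], alpha, beta)
```

In this example the silenced units stay exactly at zero for the whole trajectory, so the wide RNN and the extracted sub-RNN produce the same outputs up to rounding.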

3 Numerical Results

3.1 Multi-task Learning with Bias-learning Feed-forward Networks

We first validated the theory by checking whether a single-hidden-layer, bias-learning FFN could perform digit recognition on the MNIST dataset Deng (2012) increasingly well as its hidden layer was widened. Initial weights were sampled from a uniform distribution on $[-0.1,0.1]$ and then frozen. As expected, the network learned the task and validation accuracy increased with layer width. The largest gains in performance occurred before 5000 units (Fig. 1A).

Figure 1: A. Validation accuracy on MNIST vs layer width. B. Validation accuracy on multiple image classification tasks for bias-only (blue) and fully-trained (orange) networks. In panels A and B, training ran for 20 epochs and error bars over 5 runs are omitted because the standard errors are of order $10^{-3}$. C. Top: K-means clustering of Task Variance (TV) reveals task-specific clusters. Bottom: Spearman correlation between TV and bias vectors (mean across neurons in each cluster). In this and all other figures Adam (Kingma and Ba, 2014) was used for optimization.

Intuitively, bias learning should allow a single random set of weights to be used to learn multiple tasks by simply optimizing task-specific bias vectors. We confirmed this intuition by training a single-hidden-layer FFN with 32000 hidden units on 7 different tasks: MNIST Deng (2012), KMNIST Clanuwat et al. (2018), Fashion MNIST Xiao et al. (2017), Ethiopic-MNIST, Vai-MNIST, and Osmanya-MNIST from Afro-MNIST Wu et al. (2020), and Kannada-MNIST Prabhu (2019). All tasks involved classifying $28\times 28$ grayscale images into 10 classes. The random weights were fixed across tasks while different biases were learned. We compared bias learning against a fully-trained neural network with the same size and architecture (Fig. 1B). We found that the bias-only network achieved similar performance to the fully-trained network on most tasks (only significantly worse on KMNIST). A crucial difference here is that the networks had matched size and architecture, so that the number of trainable parameters in the bias-only network (32 000 parameters) was several orders of magnitude smaller than in the fully-trained case (25 440 010 parameters). Notably, a different set of biases was learned for each task. We conclude that bias-only learning in FFNs is a viable avenue to perform multi-tasking with randomly initialized and fixed weights, but that it requires a much wider hidden layer than fully trained networks.
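The two parameter counts quoted above can be reproduced with a short calculation. The breakdown below is an assumption about how the fully-trained count is composed (input weights, hidden biases, output weights, and output biases); it is consistent with the reported total but is not spelled out in the text.

```python
d_in, n_hidden, n_out = 28 * 28, 32_000, 10

bias_only = n_hidden                                   # hidden-layer biases only
fully_trained = (d_in * n_hidden + n_hidden            # input weights + hidden biases
                 + n_hidden * n_out + n_out)           # output weights + output biases
ratio = fully_trained / bias_only                      # trainable-parameter ratio
```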

Next, we investigated the neural mechanism underlying bias learning of multiple tasks. We investigated the task-specific functional organization of the hidden units by estimating single-unit Task Variance (TV) Yang et al. (2019), defined as the variance of a hidden unit activation across the test set for each task. The TV provides a measure of the extent to which a given hidden unit contributes to the given task. A unit with high TV in a particular task indicates that its responses vary across stimuli, suggesting that the unit is recruited for solving that task. A unit with high TV in one task and low TV for all other tasks is specialized to one particular task. We clustered the hidden units using K-means clustering ($K$ chosen by cross-validation) on the vectors of TVs for each unit and found that distinct functional clusters of hidden units emerged (Fig. 1C). Most units reflected strong task specialization, i.e., they were only used for specific tasks (e.g., cluster 3 for KMNIST and cluster 10 for Osmanya). Others were used for many or all tasks (e.g., clusters 1 and 8), although a smaller fraction of clusters exhibited such non-selective activation patterns. Overall, we conclude that multi-task bias-learning leads to the emergence of task-specific functional organization.
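The Task Variance defined above can be computed as in the following sketch, which uses synthetic activations rather than data from the experiments. The function name `task_variance` and the array layout are illustrative assumptions, and the K-means clustering step is omitted.

```python
import numpy as np

def task_variance(activations_by_task):
    """Return an array of shape (n_units, n_tasks): the variance of each hidden
    unit across the samples of each task.

    activations_by_task: list with one array per task, each (n_samples, n_units).
    """
    return np.stack([acts.var(axis=0) for acts in activations_by_task], axis=1)

rng = np.random.default_rng(0)
n_units = 6
# Synthetic stand-ins for hidden activations on two tasks.
task_a = rng.normal(size=(200, n_units))
task_b = rng.normal(size=(200, n_units))
task_b[:, 0] = 0.0            # unit 0 is silent on task B, so its TV there is zero

tv = task_variance([task_a, task_b])
specialised_to_a = (tv[:, 0] > 0) & (tv[:, 1] == 0)
```

Each row of `tv` is the per-unit TV vector on which a clustering method could then operate.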

We finally explored the relationship between the bias of a hidden unit and its task variance (TV). If the neural networks are using biases to shut off units, analogous to the intuition in our theory work (Section 2.1), then the network units that do not actively participate in a task should be quieter due to a low bias value learned during training on that particular task. In other words, this intuition would suggest that units should exhibit a positive correlation between bias and TV, especially in task-specific clusters. In our experiments, all clusters did exhibit the statistical trend of a positive correlation between bias and TV, although to a varying degree across clusters (see numbers at the bottom of Fig. 1C).
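For reference, a Spearman correlation between biases and TVs can be computed as the Pearson correlation of their ranks. The sketch below uses synthetic values and a simplified rank computation that assumes no ties; the helper name `spearman` is illustrative.

```python
import numpy as np

def spearman(u, v):
    """Spearman correlation as the Pearson correlation of ranks (assumes no ties)."""
    rank_u = np.argsort(np.argsort(u))
    rank_v = np.argsort(np.argsort(v))
    return np.corrcoef(rank_u, rank_v)[0, 1]

# Synthetic example (not data from the paper): TV grows monotonically with bias.
bias = np.array([-2.0, -1.5, -0.3, 0.1, 0.4, 0.9])
tv = np.exp(bias)             # any increasing function preserves the ranking
rho_up = spearman(bias, tv)
rho_down = spearman(bias, -tv)
```

A perfectly monotone increasing relationship gives a coefficient of 1 and a decreasing one gives -1; real data with ties would need a tie-aware ranking.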

3.2 Relationship Between Bias Learning and Masked Learning in FFNs

Figure 2: Comparing bias and masked learning on same weights. A. Learning curves for bias (orange) and mask (black) learning. Inset: we observed a trend where bias-learning achieved roughly $1$ percentage point higher accuracy on MNIST over mask-learning ($0.915\pm 0.0028$ SD bias vs. $0.905\pm 0.0028$ SD mask). B. Histograms of hidden unit variances, calculated over 10,000 MNIST samples, for bias-trained (orange) and mask-trained (black). Histogram counts are log-scaled. C. Scatter plot of hidden unit variances from B but taking only units that are non-zero/not approximately-zero in mask/bias-trained networks; bias-trained on x-axis and mask-trained on y-axis. Plot A is mean $\pm 1$ SD over 4 mask-trained/bias-trained nets. Plots B and C both show one representative network trained with biases and with masks. Learned parameters were initialized to uniform on $[-0.1,0.1]$, weights to uniform on $[-\frac{1}{\sqrt{d}},\frac{1}{\sqrt{d}}]$; all other hyper-parameters can be found in the codebase.

As our theory shows that bias-learning networks can universally approximate simply by turning units off, we wished to test whether bias learning performs similarly to learning masks, and to what extent the solutions learned by these approaches differ from each other. We compared mask learning to bias learning on networks with the same random input/output weight matrices. For mask training, we approximated binary masks with ‘soft’ sigmoid masks by optimizing the sigmoid parameters. The sigmoid was steepened over the course of training to approximately learn a mask, and at test time the slope of the sigmoid was made steep enough to effectively binarize the learned soft mask. We compared masks learned in this fashion with learned biases on single-hidden-layer ReLU networks with 10,000 hidden units. We observed a trend of bias training slightly improving upon mask training (Fig. 2A), an expected increase given the added flexibility of biases over masks. Further research is needed to determine whether this trend is reliable across datasets and network parametrizations, and whether there are scenarios where one style of learning works better or worse.
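
The soft-mask idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the training code used for Fig. 2: the names `soft_mask`, `masked_unit`, and `biased_unit` are hypothetical, and the slope values are arbitrary. It shows that a sigmoid gate approaches a binary mask as its slope grows, and that a sufficiently negative bias silences a ReLU unit in much the same way as a zero mask entry.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def soft_mask(theta, slope):
    """Soft gate in (0, 1); `theta` plays the role of the learned mask parameter."""
    return sigmoid(slope * theta)


def relu(z):
    return max(0.0, z)


def masked_unit(pre_activation, theta, slope):
    """Mask learning: the gate multiplies the unit's output."""
    return soft_mask(theta, slope) * relu(pre_activation)


def biased_unit(pre_activation, bias):
    """Bias learning: the bias shifts the unit's input."""
    return relu(pre_activation + bias)


# Hypothetical mask parameters; steeper slopes push the gates towards {0, 1}.
thetas = [-0.8, -0.1, 0.3, 1.2]
for slope in (1.0, 10.0, 1000.0):
    print(slope, [round(soft_mask(t, slope), 3) for t in thetas])
```

Unlike a mask, a bias can also shift a unit's operating point without silencing it, which is one way to read the "added flexibility" mentioned above.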

Next, we compared the solutions found via bias and mask learning. We calculated the variance of each hidden unit across 10,000 MNIST images, in both the bias-trained and mask-trained paradigms, as a measure of the hidden-layer representation of MNIST. The histogram over hidden units of these variances is plotted for trained masks/biases (black/orange) for the same network weights (Fig. 2B). The mask-trained networks found slightly sparser solutions (more 0s, far-left histogram bin) and performed the task with lower unit variance values (see middle/right of histogram) compared to the bias-trained networks. We also plotted the hidden unit variances for units in mask-trained and bias-trained networks and computed their correlation (Fig. 2C). The scatter plot and correlation were calculated after removing unit variances that were zero in the mask-trained network and sufficiently close to zero (variance values less than the mean variance divided by 100) in the bias-trained network. Despite the differences observed in the histogram, the correlation coefficient was $0.461 \pm 0.022$ (mean $\pm$ SD across $n = 4$ networks), suggesting that there is some overlap in the way that these two methods solve MNIST.
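
The variance comparison described above can be sketched in a few lines. This is an illustrative sketch, not the analysis code behind Fig. 2: the function names are hypothetical, and it assumes that a unit is kept only if it passes both filters (non-zero variance in the mask-trained network and variance of at least one hundredth of the mean in the bias-trained network), which is one reading of the description.

```python
import math


def unit_variances(activations):
    """Per-unit variance; `activations` is a list of samples, each a list of unit values."""
    n_samples = len(activations)
    n_units = len(activations[0])
    variances = []
    for j in range(n_units):
        column = [sample[j] for sample in activations]
        mean = sum(column) / n_samples
        variances.append(sum((v - mean) ** 2 for v in column) / n_samples)
    return variances


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs)
    sy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(sx * sy)


def filtered_correlation(var_bias, var_mask, ratio=100.0):
    """Correlate variances over units that are active in both networks (assumed filter)."""
    threshold = (sum(var_bias) / len(var_bias)) / ratio
    kept = [(vb, vm) for vb, vm in zip(var_bias, var_mask)
            if vm > 0.0 and vb >= threshold]
    return pearson([vb for vb, _ in kept], [vm for _, vm in kept])
```

For example, `filtered_correlation(unit_variances(h_bias), unit_variances(h_mask))` would compare two hypothetical activation arrays `h_bias` and `h_mask` recorded on the same inputs.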

3.3 Bias-learning of Dynamical Systems with Recurrent Neural Networks

3.3.1 Autonomous dynamical systems

Figure 3: Learning autonomous dynamical systems. A. Cosine generated by a bias-learning recurrent network with 200 hidden units (dashed orange) and its target (solid black). B. Eigenvalue spectra for the recurrent weight matrix (left) and the Jacobian at the start of training (right, grey squares) and mid-training (right, orange circles), when the network produced a decaying oscillation with period 23.75, close to the target period of 25. Units’ activity approached a fixed point, with respect to which the Jacobian was computed. Later in training, the rightmost eigenvalues approached 1 in magnitude and their phases were such that the oscillation had period $\sim 25$. C. Van der Pol oscillator (target in solid black; with dynamics $\ddot{x} = \mu(1 - x^2)\dot{x} - x$, for $\mu = 2$) generated by the bias-learning recurrent network with 675 hidden units for a recurrent gain of 1 (dashed orange; see panel D) and a gain of 0.9 (dashed dark orange). Output represents the position of the oscillator, rescaled to $[-1, 1]$. D. (Left) Sensitivity to distribution of recurrent weights. The fully-trained and bias-learning networks had the same number of learnable parameters (size of hidden layer for fully-trained network was 25). Initial recurrent weight matrix had elements sampled from $(g/\sqrt{m})\,\mathcal{N}(0,1)$, where $g$ is the gain (Gain recurrent init.). Error bars denote SEM for $n = 10$. (Right) Schematics of the fully-trained (top) and bias-learning (bottom) autonomous RNNs. Colored links denote trained weights. In all panels, output matrix elements were $\sim \mathcal{N}(0, 1/m^2)$ and biases $b \sim \mathcal{U}(0,1)$.
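
As a rough consistency check on the parameter matching mentioned in the caption, the arithmetic below shows one accounting under which a fully-trained autonomous RNN with 25 hidden units has as many learnable parameters as a bias-learning RNN with 675 hidden units. The accounting is an assumption made for illustration (trainable recurrent weights, biases, and a scalar readout, with no input weights because the network is autonomous); the caption does not spell it out.

```python
def fully_trained_params(n_hidden, n_out=1):
    """Recurrent weights + biases + readout weights (assumed all trainable)."""
    return n_hidden * n_hidden + n_hidden + n_out * n_hidden


def bias_only_params(n_hidden):
    """Only the hidden-unit biases are trained."""
    return n_hidden


# 25 * 25 + 25 + 25 = 675, matching the bias-learning width quoted for panel C.
print(fully_trained_params(25), bias_only_params(675))
```

The quadratic term is what makes the two widths so different: matching parameter counts requires the bias-learning network to be far wider than its fully-trained counterpart.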

We studied the expressivity of bias learning in RNNs trained to generate linear and nonlinear dynamical systems autonomously (i.e., with $x_t \equiv 0$ in Eq. 2). We found that RNNs with fixed, random Gaussian weights and trained biases were able to generate a simple cosine function (Fig. 3A). We then elucidated the mechanism underlying RNN bias learning by comparing the Jacobian matrix after learning with the random recurrent weight matrix (which was held fixed during learning). We found that although the random weight matrix maintained a fixed and circular eigenvalue distribution (Fig. 3B, left), learning the biases shaped the Jacobian matrix to develop complex conjugate pairs of large eigenvalues underlying the oscillations (Fig. 3B, right). Therefore, bias learning strongly relies on the ability to shape the “effective connectivity matrix”, i.e. the Jacobian, which involves the derivative of the activation and the recurrent weight matrix.
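
The link between Jacobian eigenvalues and oscillation period can be made concrete with a small sketch. It assumes a simplified discrete-time map $r_{t+1} = \mathrm{ReLU}(W r_t + b)$, which is a stand-in chosen for illustration rather than the network equation of the main text; the function names and the two-unit weight matrix are hypothetical. Under this assumption the Jacobian is $W$ with rows gated by the ReLU derivative, and a complex eigenvalue $\rho e^{i\theta}$ corresponds to an oscillation of period $2\pi/\theta$ steps that decays when $\rho < 1$.

```python
import cmath
import math


def relu_jacobian(W, pre_activation):
    """Jacobian of r -> relu(W r + b): row i of W is kept only if unit i is active."""
    return [[w if z > 0.0 else 0.0 for w in row]
            for row, z in zip(W, pre_activation)]


def eigenvalues_2x2(J):
    """Eigenvalues of a 2x2 matrix from its trace and determinant."""
    trace = J[0][0] + J[1][1]
    det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
    disc = cmath.sqrt(trace * trace - 4.0 * det)
    return (trace + disc) / 2.0, (trace - disc) / 2.0


def oscillation_period(eig):
    """Period (in time steps) of the mode associated with a complex eigenvalue."""
    angle = abs(cmath.phase(eig))
    return math.inf if angle == 0.0 else 2.0 * math.pi / angle


# Hypothetical two-unit weights: a rotation by 2*pi/25 scaled by rho = 0.98.
theta, rho = 2.0 * math.pi / 25.0, 0.98
W = [[rho * math.cos(theta), -rho * math.sin(theta)],
     [rho * math.sin(theta), rho * math.cos(theta)]]

both_active = relu_jacobian(W, [1.0, 1.0])   # biases keep both units on
one_silenced = relu_jacobian(W, [1.0, -1.0])  # a low bias switches unit 2 off
print([oscillation_period(e) for e in eigenvalues_2x2(both_active)])
print([oscillation_period(e) for e in eigenvalues_2x2(one_silenced)])
```

With both units active the eigenvalues form a complex conjugate pair with period 25; silencing one unit through its pre-activation removes the oscillatory mode, even though $W$ itself is unchanged.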

We next investigated whether bias learning relied on the statistics of the fixed recurrent weights. In light of Fig. 3B, we hypothesized that bias learning would be affected by changes in the weight distribution, because bias learning can only control the derivative of the activation and not the weights themselves. We initialized an i.i.d. Gaussian weight matrix $W_{ij} \sim \mathcal{N}(0, g^2/N)$, where $g$ is referred to as its ‘gain’. We then trained bias-learning networks to generate a van der Pol oscillator (Fig. 3C). We found that bias learning required a large enough gain (at least $g = 1$) and failed for $g < 1$ (Fig. 3D). This was not purely due to a restricted dynamic range for the network activity, since the network was able to reproduce the first peak of the oscillator before flatlining (Fig. 3C). In contrast, fully-trained networks with the same number of trainable parameters (Fig. 3D) were not sensitive to the value of the gain at initialization. According to our theory, because ReLU is $\gamma$-parameter bounding, a bias-learning RNN should be able to produce the correct output even at low gain, but this might require a prohibitively large hidden layer. This result thus highlights that, when the hidden-layer size is fixed, the initial distribution of weights limits the capability of bias-learning networks.
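
The weight initialization described above can be sketched as follows; the function names and the network size are hypothetical choices for illustration. The sketch draws $W_{ij} \sim \mathcal{N}(0, g^2/N)$ and recovers the gain from the sampled entries. For large $N$, the eigenvalues of such a matrix approximately fill a disk of radius $g$ (the circular law), which is consistent with the circular spectrum in Fig. 3B and indicates why $g$ sets the scale of the dynamics that biases can then reshape.

```python
import math
import random


def sample_recurrent_weights(n, gain, seed=0):
    """Draw an n x n matrix with i.i.d. entries N(0, gain^2 / n)."""
    rng = random.Random(seed)
    std = gain / math.sqrt(n)
    return [[rng.gauss(0.0, std) for _ in range(n)] for _ in range(n)]


def empirical_gain(W):
    """Estimate the gain as sqrt(n * mean(W_ij^2))."""
    n = len(W)
    mean_sq = sum(w * w for row in W for w in row) / (n * n)
    return math.sqrt(n * mean_sq)


for g in (0.9, 1.0):
    W = sample_recurrent_weights(100, g)
    print(g, round(empirical_gain(W), 3))
```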

3.3.2 Non-autonomous dynamical systems

Finally, we explored the capabilities of an RNN trained on a non-autonomous dynamical system, namely the $x$-dimension of the Lorenz system where the other dimensions are unobserved (see Fig. 4 caption for more details). As in the autonomous network, only the biases of the input layer were trained and the weights were initialized randomly. However, here the networks also received an external input. The objective was formulated as follows: given a recurrent state and the value of a dynamical system at some time-point $t$, predict the future value at $t + \tau$, where this offset ($\tau = 27$) was chosen to be the half-width at half-max of the $x$ auto-correlation function. The numerical integration time step was 0.01; each model had a hidden-layer width of 1024 units and was trained on different windows of 1080 (i.e., $40\tau$) time points.
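
The construction of the prediction targets can be sketched as below. This is an illustrative sketch rather than the experimental code: the function names are hypothetical, the auto-correlation estimator is one common choice, and the cosine test signal merely stands in for the Lorenz $x$ coordinate. The offset is taken as the first lag at which the normalized auto-correlation falls to one half, and each input value is paired with the value $\tau$ steps later.

```python
import math


def autocorrelation(series, lag):
    """Normalized auto-correlation of a mean-subtracted series at a given lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [v - mean for v in series]
    denom = sum(c * c for c in centered)
    num = sum(centered[t] * centered[t + lag] for t in range(n - lag))
    return num / denom


def half_width_at_half_max(series, max_lag):
    """First lag at which the auto-correlation drops to 0.5 or below (None if never)."""
    for lag in range(1, max_lag + 1):
        if autocorrelation(series, lag) <= 0.5:
            return lag
    return None


def make_pairs(series, tau):
    """Input/target pairs (x_t, x_{t + tau}) for tau-step-ahead prediction."""
    return [(series[t], series[t + tau]) for t in range(len(series) - tau)]


# Stand-in signal: a cosine of period 40 steps (not the Lorenz data).
signal = [math.cos(2.0 * math.pi * t / 40.0) for t in range(4000)]
tau = half_width_at_half_max(signal, max_lag=20)
pairs = make_pairs(signal, tau)
print(tau, len(pairs))
```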

Figure 4: Learning non-autonomous dynamical systems. A. The outputs of the fully-trained (top) and bias-only (bottom) networks on a window of a Lorenz system unseen during training ($\dot{x} = \sigma(y - x)$, $\dot{y} = x(\rho - z) - y$, $\dot{z} = xy - \beta z$). The system was generated using Euler’s method with a step size $\Delta t$ of 0.01 from an initialization at $(0, 1, 0)$, with $\sigma = 10$, $\rho = 28$, and $\beta = \frac{8}{3}$. Standard deviation error bars were computed over 5 seeds, but are not visible. B. Generalization to sequences longer (4320 time-points) than those trained on. C. Output of the bias-only network diverges from the ground truth signal when using its own outputs as context (starting from the grey line).
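
The data generation described in the caption can be sketched as a forward-Euler integration of the Lorenz equations with the stated step size, initial condition, and parameters. The function name and the trajectory length used in the example are illustrative choices, not taken from the experimental code.

```python
def lorenz_euler(n_steps, dt=0.01, state=(0.0, 1.0, 0.0),
                 sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Integrate the Lorenz system with forward Euler; returns a list of (x, y, z)."""
    x, y, z = state
    trajectory = [(x, y, z)]
    for _ in range(n_steps):
        dx = sigma * (y - x)
        dy = x * (rho - z) - y
        dz = x * y - beta * z
        x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
        trajectory.append((x, y, z))
    return trajectory


# Only the x coordinate is observed by the network in the task described above.
trajectory = lorenz_euler(1080)
x_series = [point[0] for point in trajectory]
print(len(x_series), trajectory[1])
```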

We found that both the fully-trained and bias-only networks accurately predicted future points of the system, as evidenced by a consistent $R^2$ of $> 0.99$ ($n = 5$) on a window of the generated Lorenz attractor held out from training (Fig. 4A). Furthermore, the models showed remarkable stability when predicting windows that were several times larger than the ones used during training, continuing to reconstruct the system without a notable change in accuracy (Fig. 4B). However, when the networks were fed their own previous predictions as input, their prediction accuracy decreased, demonstrating the devastating effect of small compounding deviations propagated through time (Fig. 4C).
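
The contrast between prediction from ground-truth inputs and prediction from the model's own outputs can be illustrated with a toy example. Everything here is a stand-in chosen for illustration: the logistic map replaces the Lorenz system, and the "model" is simply the true map with a slightly perturbed parameter rather than a trained network. The sketch computes $R^2$ in the usual way and shows that a one-step error that is negligible in open loop compounds once predictions are fed back.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def logistic_step(x, r):
    """One step of the logistic map, used here as a toy chaotic system."""
    return r * x * (1.0 - x)


def simulate(step, x0, n):
    """Iterate `step` n times from x0; returns n + 1 values."""
    xs = [x0]
    for _ in range(n):
        xs.append(step(xs[-1]))
    return xs


truth = simulate(lambda x: logistic_step(x, 3.9), 0.2, 200)


def model(x):
    """Slightly imperfect one-step predictor (hypothetical)."""
    return logistic_step(x, 3.901)


# Open loop: the model always receives the true previous value.
open_loop = [model(x) for x in truth[:-1]]
# Closed loop: the model receives its own previous prediction.
closed_loop = simulate(model, truth[0], 200)[1:]

r2_open = r_squared(truth[1:], open_loop)
r2_closed = r_squared(truth[1:], closed_loop)
print(r2_open, r2_closed)
```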

4 Discussion

In this paper, we presented theoretical results demonstrating that feed-forward and recurrent neural networks with fixed random weights but learnable biases can approximate arbitrary functions with high probability. We showcased the expressivity of bias-learned networks in auto-regressive modelling, multi-task learning, dynamic pattern generation, and dynamical system forecasting. Finally, we interrogated the representations learned by bias learning via analysis of task specialization, comparison with mask-learning, and an eigenvalue analysis in recurrent networks. In what follows we discuss key insights, limitations, and future directions.

Our results highlight three key insights. First, certain activation functions enable our universal approximation results when weights are drawn from a uniform distribution on a hyper-cube with any strictly positive edge length. Two questions seem worthy of future work: characterizing the functions that do or do not support this property (we speculate that non-differentiable points might be an important component), and the link between hyper-cube edge length and network scaling (Remark 1 on Lemma B.4 in Appendix). Second, in the context of multi-tasking, we showed that bias learning finds solutions that rely on task-selective clusters (Fig. 1), similar to the fully trained case Yang et al. (2019). Third, we believe our finding that bias learning yields solutions that are different from, though related to, those of mask learning (Fig. 2) suggests that further investigation of our method might shed light on this and other non-synaptic learning approaches.

Our three main study limitations inspire directions for future research. First, the mathematical convergence results for dynamical systems are only point-wise over initial conditions, and are for finite-time trajectories. We believe that one might strengthen these results at least to $L^p$ convergence over trajectories, and that one might be able to break the finite-time limitations by studying convergence in stationary distribution. Second, our theory massively overestimates how wide the hidden layer of a bias-learned network must be to reach a desired level of error (Remark 2 on Lemma B.4 in Appendix). This limitation might be addressed by drawing on scaling arguments from the SLTH literature (see e.g. Malach et al. (2020)). Third, a potential confounding factor in our comparison of bias and mask learning is that our mask learning approach used a learning schedule in the steepness parameter for the soft masks. It is possible that the altered learning dynamics due to this scheduling contributed to mask and bias learning finding different solutions. Addressing this confound is an important direction for future work.

Future directions should focus on capturing greater biological detail and achieving better hidden-layer scaling. Experimental Ferguson and Cardin (2020) and theoretical work Wyrick and Mazzucato (2021); Ogawa et al. (2023) showed that neural pathways that modulate biases, like firing threshold or tonic inputs, may affect other neuronal properties, like neuron input-output gain. As our proofs rely on masking, they demonstrate universal approximation not just for bias learning but for any learned mechanism that can mask neurons, possibly including gain modulation. Exploring the flexibility of paradigms where gain (Stroud et al. (2018)) and biases are learned in concert is an interesting direction to exploit this. Moreover, the observed distribution of synaptic weights in the brain is not uniform but long-tailed Song et al. (2005), and searching for weight distributions that improve expressivity and hidden-layer scaling is an exciting future direction. If hidden-layer scaling can be improved, bias learning might open up options for temporal credit assignment that perform poorly due to the $N^2$ scaling of synapses. Alternatively, it would be fascinating if, regardless of whether learned parameters are weights or biases, one needs roughly the same number of parameters to achieve a given degree of task performance.

Combined with decades-old results on adaptive behaviour in networks without weight changes Cotter and Conwell (1990), and related findings in ML (see in-context learning Brown et al. (2020)), we hope that this study will inspire more research on learning phenomena that transcend synaptic adaptation.

Code Availability

Code will be released upon publication.

Acknowledgements

E.W. was supported by an NSERC CGS D scholarship, and wishes to thank the other members of the Lajoie lab for support and for helpful discussions. T.J. was supported by an NSERC CGS M scholarship. M.G.P. was supported by a grant from the Fonds de recherche du Québec – Santé (chercheurs-boursiers en intelligence artificielle). L.M. was partially supported by National Institutes of Health grants R01NS118461, R01MH127375 and R01DA055439 and National Science Foundation CAREER Award 2238247. G.L. acknowledges CIFAR and Canada chair programs.

References

  • Azouz and Gray [2000] Rony Azouz and Charles M Gray. Dynamic spike threshold reveals a mechanism for synaptic coincidence detection in cortical neurons in vivo. Proceedings of the National Academy of Sciences, 97(14):8110–8115, 2000.
  • Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Burkholz [2023] Rebekka Burkholz. Batch normalization is sufficient for universal function approximation in cnns. In The Twelfth International Conference on Learning Representations, 2023.
  • Clanuwat et al. [2018] Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and David Ha. Deep learning for classical japanese literature. arXiv preprint arXiv:1812.01718, 2018.
  • Cotter and Conwell [1990] N.E. Cotter and P.R. Conwell. Fixed-weight networks can learn. In 1990 IJCNN International Joint Conference on Neural Networks, pages 553–559 vol.3, June 1990. doi: 10.1109/IJCNN.1990.137898.
  • Cotter and Conwell [1991] Neil E Cotter and Peter R Conwell. Learning algorithms and fixed dynamics. In IJCNN-91-Seattle International Joint Conference on Neural Networks, volume 1, pages 799–801. IEEE, 1991.
  • Deng [2012] Li Deng. The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012.
  • Ding et al. [2014] Shifei Ding, Xinzheng Xu, and Ru Nie. Extreme learning machine and its applications. Neural Computing and Applications, 25:549–556, 2014.
  • Feldkamp et al. [1997] L. A. Feldkamp, G. V. Puskorius, and P. C. Moore. Adaptive behavior from fixed weight networks. Information Sciences, 98(1):217–235, May 1997. ISSN 0020-0255. doi: 10.1016/S0020-0255(96)00216-2. URL https://www.sciencedirect.com/science/article/pii/S0020025596002162.
  • Ferguson and Cardin [2020] Katie A Ferguson and Jessica A Cardin. Mechanisms underlying gain modulation in the cortex. Nature Reviews Neuroscience, 21(2):80–92, 2020.
  • Funahashi [1989] Ken-Ichi Funahashi. On the approximate realization of continuous mappings by neural networks. Neural networks, 2(3):183–192, 1989.
  • García-Arias et al. [2021] Ángel López García-Arias, Masanori Hashimoto, Masato Motomura, and Jaehoon Yu. Hidden-fold networks: Random recurrent residuals using sparse supermasks. arXiv preprint arXiv:2111.12330, 2021.
  • Gast et al. [2023] Richard Gast, Sara A Solla, and Ann Kennedy. Macroscopic dynamics of neural networks with heterogeneous spiking thresholds. Physical Review E, 107(2):024306, 2023.
  • Gast et al. [2024] Richard Gast, Sara A Solla, and Ann Kennedy. Neural heterogeneity controls computations in spiking neural networks. Proceedings of the National Academy of Sciences, 121(3):e2311885121, 2024.
  • Giannou et al. [2023] Angeliki Giannou, Shashank Rajput, and Dimitris Papailiopoulos. The expressive power of tuning only the normalization layers. arXiv preprint arXiv:2302.07937, 2023.
  • Gonon et al. [2023] Lukas Gonon, Lyudmila Grigoryeva, and Juan-Pablo Ortega. Approximation bounds for random neural networks and reservoir systems. The Annals of Applied Probability, 33(1):28–69, 2023.
  • Hart et al. [2021] Allen G Hart, James L Hook, and Jonathan HP Dawes. Echo state networks trained by tikhonov least squares are l2 ($\mu$) approximators of ergodic dynamical systems. Physica D: Nonlinear Phenomena, 421:132882, 2021.
  • Holt and Koch [1997] Gary R Holt and Christof Koch. Shunting inhibition does not have a divisive effect on firing rates. Neural computation, 9(5):1001–1013, 1997.
  • Hornik [1991] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural networks, 4(2):251–257, 1991.
  • Hornik [1993] Kurt Hornik. Some new results on neural network approximation. Neural networks, 6(8):1069–1072, 1993.
  • Hornik et al. [1989] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.
  • Jaeger and Haas [2004] Herbert Jaeger and Harald Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. science, 304(5667):78–80, 2004.
  • Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Klos et al. [2020] Christian Klos, Yaroslav Felipe Kalle Kossio, Sven Goedeke, Aditya Gilra, and Raoul-Martin Memmesheimer. Dynamical Learning of Dynamics. Physical Review Letters, 125(8), August 2020. ISSN 0031-9007, 1079-7114. doi: 10.1103/PhysRevLett.125.088103. URL https://link.aps.org/doi/10.1103/PhysRevLett.125.088103.
  • Leshno et al. [1993] Moshe Leshno, Vladimir Ya Lin, Allan Pinkus, and Shimon Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural networks, 6(6):861–867, 1993.
  • Li et al. [2023] Ming Li, Sho Sonoda, Feilong Cao, Yu Guang Wang, and Jiye Liang. How powerful are shallow neural networks with bandlimited random weights? In International Conference on Machine Learning, pages 19960–19981. PMLR, 2023.
  • Logiaco et al. [2021] Laureline Logiaco, LF Abbott, and Sean Escola. Thalamic control of cortical dynamics in a model of flexible motor sequencing. Cell reports, 35(9), 2021.
  • Malach et al. [2020] Eran Malach, Gilad Yehudai, Shai Shalev-Schwartz, and Ohad Shamir. Proving the lottery ticket hypothesis: Pruning is all you need. In International Conference on Machine Learning, pages 6682–6691. PMLR, 2020.
  • Mazzucato et al. [2019] Luca Mazzucato, Giancarlo La Camera, and Alfredo Fontanini. Expectation-induced modulation of metastable activity underlies faster coding of sensory stimuli. Nature neuroscience, 22(5):787–796, 2019.
  • Neufeld and Schmocker [2023] Ariel Neufeld and Philipp Schmocker. Universal approximation property of random neural networks. arXiv preprint arXiv:2312.08410, 2023.
  • Ogawa et al. [2023] Shun Ogawa, Francesco Fumarola, and Luca Mazzucato. Multitasking via baseline control in recurrent neural networks. Proceedings of the National Academy of Sciences, 120(33):e2304394120, 2023.
  • Papadopoulos et al. [2024] Lia Papadopoulos, Suhyun Jo, Kevin Zumwalt, Michael Wehr, David A McCormick, and Luca Mazzucato. Modulation of metastable ensemble dynamics explains optimal coding at moderate arousal in auditory cortex. arXiv preprint arXiv:2404.03902, 2024.
  • Perich et al. [2018] Matthew G. Perich, Juan A. Gallego, and Lee E. Miller. A neural population mechanism for rapid learning. Neuron, 100(4):964–976.e7, 2018. ISSN 0896-6273. doi: https://doi.org/10.1016/j.neuron.2018.09.030.
  • Petrov et al. [2024] Aleksandar Petrov, Philip HS Torr, and Adel Bibi. Prompting a pretrained transformer can be a universal approximator. arXiv preprint arXiv:2402.14753, 2024.
  • Pinkus [1999] Allan Pinkus. Approximation theory of the mlp model in neural networks. Acta numerica, 8:143–195, 1999.
  • Prabhu [2019] Vinay Uday Prabhu. Kannada-mnist: A new handwritten digits dataset for the kannada language. arXiv preprint arXiv:1908.01242, 2019.
  • Rahimi and Recht [2008] Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. Advances in neural information processing systems, 21, 2008.
  • Ramanujan et al. [2020] Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mohammad Rastegari. What’s hidden in a randomly weighted neural network? In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11893–11902, 2020.
  • Remington et al. [2018] Evan D. Remington, Devika Narain, Eghbal A. Hosseini, and Mehrdad Jazayeri. Flexible sensorimotor computations through rapid reconfiguration of cortical dynamics. Neuron, 98(5):1005–1019.e5, 2018. ISSN 0896-6273. doi: https://doi.org/10.1016/j.neuron.2018.05.020.
  • Rosenblatt et al. [1962] Frank Rosenblatt et al. Principles of neurodynamics: Perceptrons and the theory of brain mechanisms, volume 55. Spartan books Washington, DC, 1962.
  • Rosenfeld and Tsotsos [2019] Amir Rosenfeld and John K Tsotsos. Intriguing properties of randomly weighted networks: Generalizing while learning next to nothing. In 2019 16th conference on computer and robot vision (CRV), pages 9–16. IEEE, 2019.
  • Schäfer and Zimmermann [2006] Anton Maximilian Schäfer and Hans Georg Zimmermann. Recurrent neural networks are universal approximators. In Artificial Neural Networks–ICANN 2006: 16th International Conference, Athens, Greece, September 10-14, 2006. Proceedings, Part I 16, pages 632–640. Springer, 2006.
  • Schlake et al. [2022] Georg Stefan Schlake, Jan David Hüwel, Fabian Berns, and Christian Beecks. Evaluating the lottery ticket hypothesis to sparsify neural networks for time series classification. In 2022 IEEE 38th International Conference on Data Engineering Workshops (ICDEW), pages 70–73. IEEE, 2022.
  • Sehgal et al. [2013] Megha Sehgal, Chenghui Song, Vanessa L Ehlers, and James R Moyer Jr. Learning to learn–intrinsic plasticity as a metaplasticity mechanism for memory formation. Neurobiology of learning and memory, 105:186–199, 2013.
  • Song et al. [2005] Sen Song, Per Jesper Sjöström, Markus Reigl, Sacha Nelson, and Dmitri B Chklovskii. Highly nonrandom features of synaptic connectivity in local cortical circuits. PLoS biology, 3(3):e68, 2005.
  • Stroud et al. [2018] Jake P Stroud, Mason A Porter, Guillaume Hennequin, and Tim P Vogels. Motor primitives in space and time via targeted gain modulation in cortical networks. Nature neuroscience, 21(12):1774–1783, 2018.
  • Sussillo and Abbott [2009] David Sussillo and Larry F Abbott. Generating coherent patterns of activity from chaotic neural networks. Neuron, 63(4):544–557, 2009.
  • Von Oswald et al. [2023] Johannes Von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent. In International Conference on Machine Learning, pages 35151–35174. PMLR, 2023.
  • Wu et al. [2020] Daniel J Wu, Andrew C Yang, and Vinay U Prabhu. Afro-mnist: Synthetic generation of mnist-style datasets for low-resource languages, 2020.
  • Wyrick and Mazzucato [2021] David Wyrick and Luca Mazzucato. State-dependent regulation of cortical processing speed via gain modulation. Journal of Neuroscience, 41(18):3988–4005, 2021.
  • Xiao et al. [2017] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
  • Yang et al. [2019] Guangyu Robert Yang, Madhura R Joglekar, H Francis Song, William T Newsome, and Xiao-Jing Wang. Task representations in neural networks trained to perform many cognitive tasks. Nature neuroscience, 22(2):297–306, 2019.
  • Yu et al. [2019] Haonan Yu, Sergey Edunov, Yuandong Tian, and Ari S Morcos. Playing the lottery with rewards and multiple languages: lottery tickets in rl and nlp. arXiv preprint arXiv:1906.02768, 2019.
  • Zaken et al. [2021] Elad Ben Zaken, Shauli Ravfogel, and Yoav Goldberg. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. arXiv preprint arXiv:2106.10199, 2021.
  • Zhang and Linden [2003] Wei Zhang and David J Linden. The other side of the engram: experience-driven changes in neuronal intrinsic excitability. Nature Reviews Neuroscience, 4(11):885–900, 2003.

Throughout the appendix the proofs are restated for ease of reference. We will always take $\|\cdot\|$ to be the 1-norm unless stated otherwise.

Appendix A Random Neural Network Formulation

The proofs of this section revolve around masked, random, neural networks:

\[ \tilde{r}_m^{\mathcal{M}} = -\alpha r_0 + \mathcal{M} \odot \beta\phi(W r_0 + B x + b), \quad \tilde{y}_m^{\mathcal{M}} = A \tilde{r}_m^{\mathcal{M}}, \tag{A.1} \]

where $0 \leq \alpha < \infty$, $0 < \beta < \infty$, $m \in \mathbb{N}$, $r_0, \tilde{r}_m^{\mathcal{M}} \in \mathbb{R}^m$, $x \in \mathbb{R}^d$, $\tilde{y}_m^{\mathcal{M}} \in \mathbb{R}^l$, $\mathcal{M} \in \{0,1\}^m$, and all other matrices and vectors have real elements with the dimensions required by the above definitions. We assume that $\phi$ is $\gamma$-parameter bounding and that each individual (scalar) parameter, be it weight or bias, is sampled randomly, before masking, from a uniform distribution on $[-\bar{\gamma}, \bar{\gamma}]$ (note that here we are using $\bar{\gamma}$ where we used $R$ in the main text). In this way the parameters are random variables with compact support. If $\mathcal{M} = \mathbf{1}$ then we drop the superscript. To account for feed-forward neural networks we simply assume that $W$ is the zero matrix.

W.l.o.g. assume there are $n$ non-zero elements in $\mathcal{M}$. We construct $W^{\mathcal{M}} \in \mathbb{R}^{n \times n}$, the recurrent matrix restricted to participating (non-masked) hidden units, by beginning with $W$ and deleting the $i^{th}$ row and $i^{th}$ column of the matrix if $\mathcal{M}_i = 0$. We construct $B^{\mathcal{M}} \in \mathbb{R}^{n \times d}$, $A^{\mathcal{M}} \in \mathbb{R}^{l \times n}$, and $b^{\mathcal{M}} \in \mathbb{R}^n$ by deleting the $i^{th}$ row of $B$, the $i^{th}$ column of $A$, and the $i^{th}$ element of $b$ if $\mathcal{M}_i = 0$.

Consider the case where the $i^{th}$ element of $r_0$ is $0$ whenever $\mathcal{M}_i = 0$. Then, regardless of whether Eq. A.1 represents a feed-forward network or the transition function for an RNN, the masked units will always be zero. We can thus simply track the $n$ units that correspond with $1$’s in $\mathcal{M}$, as the outputs, $y^{\mathcal{M}}$, will depend solely on these. We observe that the behaviour of these units can be described by the following network:

$$r_m^{\mathcal{M}}=-\alpha r_0+\beta\phi(W^{\mathcal{M}}r_0+B^{\mathcal{M}}x+b^{\mathcal{M}}),\quad y_m^{\mathcal{M}}=A^{\mathcal{M}}r_m^{\mathcal{M}}. \tag{A.2}$$

Networks of the form of Eq. A.2 will be the primary subject of study in what follows. Note that the tilde, '$\sim$', over the $r$ is dropped to denote the fact that $r$ is a different vector on account of dropping the zero units. In the feed-forward case we use subscripts, as we have done above, to denote hidden layer width. Whenever we discuss RNNs or dynamical systems we will instead use the subscript to denote time.
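The reduction to Eq. A.2 can be checked numerically. The following is a minimal sketch, not the authors' code. Because Eq. A.1 is not reproduced in this part of the appendix, the sketch assumes that it applies the mask elementwise to the full update, $\tilde r_m=\mathcal{M}\odot[-\alpha\tilde r_0+\beta\phi(W\tilde r_0+Bx+b)]$ with $y=A\tilde r_m$. The ReLU activation, the sizes, and the function names `masked_step` and `reduced_step` are illustrative assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def masked_step(mask, W, B, A, b, r0, x, alpha, beta):
    # ASSUMED form of Eq. A.1: the mask multiplies the full update elementwise.
    r = mask * (-alpha * r0 + beta * relu(W @ r0 + B @ x + b))
    return r, A @ r

def reduced_step(mask, W, B, A, b, r0, x, alpha, beta):
    # Eq. A.2: keep only the units with mask value 1.
    keep = mask.astype(bool)
    W_M = W[np.ix_(keep, keep)]  # delete masked rows and columns of W
    B_M = B[keep, :]             # delete masked rows of B
    A_M = A[:, keep]             # delete masked columns of A
    b_M = b[keep]                # delete masked elements of b
    r0_M = r0[keep]
    r = -alpha * r0_M + beta * relu(W_M @ r0_M + B_M @ x + b_M)
    return r, A_M @ r

rng = np.random.default_rng(0)
m, d, l = 8, 3, 2                # illustrative sizes
mask = np.array([1, 0, 1, 1, 0, 0, 1, 0])
W = rng.uniform(-1, 1, (m, m))
B = rng.uniform(-1, 1, (m, d))
A = rng.uniform(-1, 1, (l, m))
b = rng.uniform(-1, 1, m)
x = rng.uniform(-1, 1, d)
r0 = mask * rng.uniform(-1, 1, m)  # masked units start at zero

r_full, y_full = masked_step(mask, W, B, A, b, r0, x, alpha=0.5, beta=1.0)
r_red, y_red = reduced_step(mask, W, B, A, b, r0, x, alpha=0.5, beta=1.0)

outputs_match = np.allclose(y_full, y_red)
states_match = np.allclose(r_full[mask.astype(bool)], r_red)
```

Under these assumptions the two computations agree because the masked entries of the state are zero, so the deleted columns of $W$ and $A$ never contribute to any sum.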

Appendix B Proofs from Section 2.1

See 1

Proof.

We prove this solely for the ReLU, as the logic for the Heaviside is effectively the same. Let $\phi$ thus be a ReLU. First, observe the following useful property: for all $\alpha>0$ we have $\alpha\phi(x)=\phi(\alpha x)$. From this, consider the neural network of hidden layer width $n$ with ReLU activations, $y_n(\theta)$, and observe:

$$y_n(\theta)=\alpha^2\sum_{i=1}^{n}\frac{A_{:i}}{\alpha}\phi\bigg(\frac{B_{i:}}{\alpha}x+\frac{b_i}{\alpha}\bigg)=\alpha^2 y_n\Big(\frac{\theta}{\alpha}\Big). \tag{B.1}$$

Moreover, if $\alpha\in\mathbb{N}$ we have

$$y_n(\theta)=\sum_{i=1}^{\alpha^2 n}\tilde{A}_{:i}\phi\big(\tilde{B}_{i:}x+\tilde{b}_i\big)=y_{\alpha^2 n}(\tilde{\theta}), \tag{B.2}$$

where $\tilde{\theta}=[\frac{\theta_1}{\alpha},\dots,\frac{\theta_n}{\alpha},\dots,\frac{\theta_1}{\alpha},\dots,\frac{\theta_n}{\alpha}]$, so that each element is simply a re-scaled and repeated version of the original parameters; we have $\alpha^2$ repeats for each term to replace the $\alpha^2$ factor on the right-hand side of Eq. B.1.
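The rescale-and-repeat identity of Eq. B.2 is easy to verify numerically. The sketch below is illustrative only: the network sizes, the parameter range, and the helper names `network` and `rescale_and_repeat` are assumptions made for the example.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def network(A, B, b, x):
    """One-hidden-layer ReLU network y = A @ relu(B @ x + b)."""
    return A @ relu(B @ x + b)

def rescale_and_repeat(A, B, b, alpha):
    """Divide every parameter by alpha and repeat each hidden unit alpha**2 times."""
    reps = alpha ** 2
    A_t = np.tile(A / alpha, (1, reps))  # shape l x (alpha^2 n)
    B_t = np.tile(B / alpha, (reps, 1))  # shape (alpha^2 n) x d
    b_t = np.tile(b / alpha, reps)       # length alpha^2 n
    return A_t, B_t, b_t

rng = np.random.default_rng(1)
n, d, l, alpha = 5, 3, 2, 3              # illustrative sizes; alpha is a natural number
M = 4.0                                  # bound on the original parameters
A = rng.uniform(-M, M, (l, n))
B = rng.uniform(-M, M, (n, d))
b = rng.uniform(-M, M, n)
x = rng.uniform(-1, 1, d)

A_t, B_t, b_t = rescale_and_repeat(A, B, b, alpha)
y = network(A, B, b, x)
y_tilde = network(A_t, B_t, b_t, x)

same_output = np.allclose(y, y_tilde)
largest_new_param = max(np.abs(A_t).max(), np.abs(B_t).max(), np.abs(b_t).max())
```

Each copy of a unit contributes $1/\alpha^2$ of the original unit's output (one factor of $1/\alpha$ from the output weight and one from the positive homogeneity of the ReLU), so $\alpha^2$ copies recover the original function while every parameter shrinks by a factor of $\alpha$.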

Now, given an arbitrary compact set $U\subset\mathbb{R}^d$, continuous function $h:\mathbb{R}^d\to\mathbb{R}^l$, and $\epsilon>0$, by universal approximation theory (see e.g. Leshno et al. [1993], Hornik [1993]) we can find $n$ such that

$$\int_U\|h(x)-y_n(x,\theta)\|\,dx\leq\epsilon \tag{B.3}$$

holds. Because $n$ is finite we can bound every individual (scalar) parameter by $M$, for some sufficiently large $M$. Suppose we want the parameters to be bounded instead by $\gamma$ with $M>\gamma>0$. If we select $\alpha\in\mathbb{N}$ s.t. $\alpha>\frac{M}{\gamma}$ then, by Eq. B.2, we can find $y_{\alpha^2 n}(x,\tilde{\theta})$ such that $y_{\alpha^2 n}(x,\tilde{\theta})=y_n(x,\theta)$, where every element of $\tilde{\theta}$ is bounded in magnitude by $\frac{M}{\alpha}<\gamma$. Thus we have found a ReLU neural network with bounded parameters satisfying Eq. B.3, completing the proof.

Remark: The intuition behind this result, for the ReLU, is credited to a reply to the Universal Approximation Theorem with Bounded Parameters question on Mathematics Stack Exchange.

The following lemma constitutes the core of Theorem 1. It shows that one can achieve universal approximation, in the sense needed for the theorem, using masking. The theorem then follows by manipulating biases to achieve masking. As mentioned in the main text, the following lemma is very closely related to past results on the SLTH over units for MLPs with one hidden layer. Specifically, we believe that one should be able to prove an analogue to our theorem by combining Theorem 3.2 in Malach et al. [2020] with Theorem 1 of Rahimi and Recht [2008] (or a similar result on learning with random networks). We leave the details of this to future studies.

Lemma 1.

Let $h:U\to\mathbb{R}^l$ be a continuous function on compact support $U\subset\mathbb{R}^d$. Then for any $\epsilon>0$, $\delta\in(0,1)$, we can find a layer width $m\in\mathbb{N}$ such that with probability at least $1-\delta$ there exists $\mathcal{M}\in\{0,1\}^m$ satisfying the following:

$$\int_U\|h(x)-y_m^{\mathcal{M}}(x)\|\,dx\leq\epsilon. \tag{B.4}$$
Proof.

First, we find a neural network with parameters that approximate the desired function $h$. Given the assumptions on $\phi$, we can use Proposition D.1 to find $n$ and parameters $\theta^*=\{A^*,B^*,b^*\}$ such that

$$\int_U\|h(x)-y_n(x,\theta^*)\|\,dx\leq\frac{\epsilon}{2}, \tag{B.5}$$

because $U$ is compact and $h$ is continuous.

We make a brief comment about the domain of a given activation function in $y_n$. $\phi$ will be operating on a compact domain $\{ux+b:u\in[-\bar{\gamma},\bar{\gamma}]^d,\ x\in U,\ b\in[-\bar{\gamma},\bar{\gamma}]\}$, as a consequence of the compactness of the support of the parameters and of the assumed compactness of $U$. By its continuity, $\phi$ is Lipschitz and bounded on this domain. We label the Lipschitz constant and the bound $K_\phi$ and $M_\phi$ respectively. We further define $M_x$ to be the bound for $x$ on $U$, and $|U|$ to be the value $\int_U dx$, which is finite by the boundedness of the domain.

Next, we construct a masked random network that approximates $y_n$ with high probability. By Lemma 3, we can find a random feed-forward neural network of hidden layer width $m$ such that a mask, $\mathcal{M}$, exists satisfying $|\theta^*_i-\theta_i^{\mathcal{M}}|<\varepsilon$ for arbitrary $\varepsilon>0$. In particular, we can choose $\varepsilon$ as:

$$|\theta^*_i-\theta_i^{\mathcal{M}}|<\varepsilon=\frac{\epsilon}{2|U|\max\big(nl(K_\phi\bar{\gamma}[1+M_x]+M_\phi),1\big)} \tag{B.6}$$

for all $i$ with probability at least $1-\delta$. On the event, of probability at least $1-\delta$, that a mask satisfying the above error bound exists, we get

$$\|y_m^{\mathcal{M}}(x)-y_n(x)\|\leq nl[K_\phi\bar{\gamma}(1+M_x)+M_\phi]\varepsilon\leq\frac{\epsilon}{2|U|}, \tag{B.7}$$

where, in addition to Eq. B.6, we use the assumptions on $\phi$ and $U$ stated before the start of the proof and repeated application of the triangle inequality (see Lemma 4 for the derivation of the above bound).

Integrating the error over the domain and using Eq.B.5 and Eq.B.7 gives

$$\int_U\|h(x)-y_m^{\mathcal{M}}(x)\|\,dx\leq\int_U\|h(x)-y_n(x)\|\,dx+\int_U\|y_m^{\mathcal{M}}(x)-y_n(x)\|\,dx\leq\epsilon, \tag{B.8}$$

with probability at least $1-\delta$.

See 1

Proof.

Observe that, once we have chosen an $m$ satisfying the desiderata of Lemma B.4, because $\phi$ is assumed to be $\gamma$-bias-learning, $m$ is some finite value, and all variables that make up the input of $\phi$ are bounded, we can implement the mask by setting $b_i$ to be very negative for every $i$ such that $\mathcal{M}_i=0$. For every $b_i$ such that $\mathcal{M}_i=1$ we simply leave $b_i$ at its original randomly chosen value. ∎
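The bias construction in this proof can be illustrated for the ReLU case. The sketch below is an assumption-laden example rather than the paper's implementation: it takes $\phi$ to be a ReLU, assumes inputs satisfy $|x_j|\leq$ `x_bound`, and uses a hypothetical helper `biases_implementing_mask` that pushes the bias of each masked unit below the most negative value needed to keep its pre-activation negative.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def biases_implementing_mask(mask, B, b, x_bound):
    """Keep b_i where mask_i = 1; otherwise pick a bias negative enough that
    B[i] @ x + b_i < 0 for every x with |x_j| <= x_bound."""
    worst_case = np.abs(B).sum(axis=1) * x_bound  # upper bound on |B[i] @ x|
    return np.where(mask == 1, b, -(worst_case + 1.0))

rng = np.random.default_rng(2)
m, d, l, x_bound = 10, 3, 2, 2.0                  # illustrative sizes and input bound
B = rng.uniform(-1, 1, (m, d))
A = rng.uniform(-1, 1, (l, m))
b = rng.uniform(-1, 1, m)
mask = rng.integers(0, 2, m)                      # random 0/1 mask

b_learned = biases_implementing_mask(mask, B, b, x_bound)

X = rng.uniform(-x_bound, x_bound, (1000, d))     # sample inputs from the bounded domain
y_masked = (mask * relu(X @ B.T + b)) @ A.T       # explicit masking of hidden units
y_bias = relu(X @ B.T + b_learned) @ A.T          # masking implemented through biases only

same = np.allclose(y_masked, y_bias)
```

Only the biases differ between the two networks; the random weights `A` and `B` are untouched, which is the sense in which bias learning subsumes unit masking here.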

Corollary 2.

Assume $d=l$, that is, the output and input spaces are the same. Then the results of Lemma B.4 and Theorem 1 also hold for res-nets; that is, networks whose output is of the form $x+y_m^{\mathcal{M}}(x)$.

Proof.

This follows by observing that $h(x)-x$ is also a continuous function and then replacing $h(x)$ with $h(x)-x$ in Eq. B.4 and rearranging, since $\|(h(x)-x)-y_m^{\mathcal{M}}(x)\|=\|h(x)-(x+y_m^{\mathcal{M}}(x))\|$. ∎

Remark: While the error can be made arbitrarily small, the limit of zero error itself is undefined. This is because our proof relies on first approximating the given continuous function with a neural network with all parameters tuned and then approximating this second network using bias-learning to pick out a matching sub-network from a large random reservoir; the probability of perfectly matching the fully tuned network with the bias-learned network is zero. This could be addressed by using an integral representation for continuous functions instead of directly using a finite-width neural network to approximate the given function (see e.g. Rahimi and Recht [2008], Li et al. [2023]). As one will see below, this remark also applies to the recurrent neural network result.

Appendix C Proof from Section 2.2

Analogous to the section containing the feed-forward proofs, we first state and prove a lemma which comprises the core of the proof for recurrent neural networks. This lemma shows that one can achieve universal approximation with high probability using masking in a randomly initialized RNN, and in this way provides a proof of the SLTH over units for RNNs. The proof of the main theorem in this section then follows quite straightforwardly.

Lemma 2.

Consider a discrete time, partially observed dynamical system of the form of, and satisfying the same conditions as, the one in Eq. 3. Let $0<T<\infty$, initial condition $z_0\in U_z$, input $x_t\in U_x$ $\forall t$, $\epsilon>0$ and $\delta\in(0,1)$. Then we can find an RNN initial condition $r_0$ and a layer width $m\in\mathbb{N}$ such that with probability at least $1-\delta$ there exists $\mathcal{M}\in\{0,1\}^m$ satisfying the following:

$$\sum_{t=1}^{T}\|y_t-y_t^{\mathcal{M}}\|<\epsilon. \tag{C.1}$$
Proof.

In what follows, w.l.o.g., we assume $\epsilon<1$. It is well known that we can approximate this dynamical system arbitrarily well with an RNN (Schäfer and Zimmermann [2006]); we provide a simple proof of this in Proposition 3. In particular, for arbitrary $1>\epsilon>0$ we can find an RNN of the form in Eq. 2, with hidden layer width $n\in\mathbb{N}$ and output $\hat{y}$, satisfying:

$$\sum_{t=1}^{T}\|\hat{y}_t-y_t\|<\frac{\epsilon}{2}, \tag{C.2}$$

for any initial condition of the dynamical system selected within the invariant set $U_z$. We note that this is a point-wise convergence. It can be shown (see Prop. 3) that the hidden states of this RNN remain on a compact set, $U_r$, when approximating finite time trajectories of the original dynamical system with initial conditions in $U_z$. Given this compactness, we can then show that the following set

$$\begin{align}\tilde{U}=\{\,&u(r+r')+vx+b:\ u\in[-\bar{\gamma},\bar{\gamma}]^n,\ v\in[-\bar{\gamma},\bar{\gamma}]^d, \tag{C.3}\\ &r\in U_r,\ x\in U,\ b\in[-\bar{\gamma},\bar{\gamma}],\ r'\in\mathbb{R}^n,\ \|r'\|\leq 1\,\}, \tag{C.4}\end{align}$$

is itself compact by the compactness of the sets from which it is formed. It will turn out that this set contains the arguments of $\phi$ that appear in the proof. By the compactness of $\tilde{U}$ and the assumptions on $\phi$, we observe that $\phi$ has Lipschitz constant $K_\phi$ on $\tilde{U}$. The size of $U_r$ (and thus of $\tilde{U}$) and $K_\phi$ will, in general, depend upon the first approximating neural network. By the compactness of $U$ and $U_r$, we can bound $x$ and $r$ on these sets to get bounds $R_x$ and $R_r$ respectively.

Let the parameters of the above-defined approximating RNN be given by $\theta^*=\{A^*,W^*,B^*,b^*\}$. Then by Lemma 3 we can find a random RNN of hidden width $m$ and with parameters $\theta$ such that a mask, $\mathcal{M}$, exists satisfying

$$|\theta^*_i-\theta_i^{\mathcal{M}}|<\varepsilon=\frac{\epsilon}{2lT\big[\max(\bar{\gamma},1)\max\big(n\beta K_\phi\tilde{R}\sum_{t=0}^{T-1}(\alpha+\bar{\gamma}\beta K_\phi n)^t,1\big)+R_r\big]}, \tag{C.5}$$

for all $i$ with probability at least $1-\delta$, where $\tilde{R}=R_x+R_r+1$. One can show using induction (see Lemma 5 for the derivation) that if $|\theta^*_i-\theta_i^{\mathcal{M}}|<\varepsilon$ for all $i$ we get

$$\sum_{t=1}^{T}\|y_t^{\mathcal{M}}-\hat{y}_t\|<lT\big[\bar{\gamma}n\beta K_\phi\tilde{R}\sum_{t=0}^{T-1}(\alpha+\bar{\gamma}\beta K_\phi n)^t+R_r\big]\varepsilon. \tag{C.6}$$

The triangle inequality on $\|y_t^{\mathcal{M}}-\hat{y}_t+\hat{y}_t-y_t\|$, along with Equations C.2, C.5, and C.6, completes the proof.
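The final step relies on the tolerance of Eq. C.5 being small enough that the right-hand side of Eq. C.6 is at most $\epsilon/2$. The sketch below evaluates both expressions as written above and checks this for one set of constants. The numerical values and the function names are purely illustrative assumptions; they are not taken from the paper.

```python
def geometric_sum(alpha, beta, gamma_bar, K_phi, n, T):
    """Sum over t = 0..T-1 of (alpha + gamma_bar * beta * K_phi * n)**t."""
    ratio = alpha + gamma_bar * beta * K_phi * n
    return sum(ratio ** t for t in range(T))

def growth_term(n, alpha, beta, gamma_bar, K_phi, R_x, R_r, T):
    """n * beta * K_phi * R_tilde * geometric sum, with R_tilde = R_x + R_r + 1."""
    R_tilde = R_x + R_r + 1.0
    return n * beta * K_phi * R_tilde * geometric_sum(alpha, beta, gamma_bar, K_phi, n, T)

def tolerance(eps, l, T, n, alpha, beta, gamma_bar, K_phi, R_x, R_r):
    """Per-parameter tolerance (varepsilon) as written in Eq. C.5."""
    g = growth_term(n, alpha, beta, gamma_bar, K_phi, R_x, R_r, T)
    return eps / (2 * l * T * (max(gamma_bar, 1.0) * max(g, 1.0) + R_r))

def masked_error_bound(tol, l, T, n, alpha, beta, gamma_bar, K_phi, R_x, R_r):
    """Right-hand side of Eq. C.6 evaluated at the tolerance tol."""
    g = growth_term(n, alpha, beta, gamma_bar, K_phi, R_x, R_r, T)
    return l * T * (gamma_bar * g + R_r) * tol

# Illustrative constants only.
consts = dict(l=2, T=5, n=4, alpha=0.5, beta=1.0, gamma_bar=1.5,
              K_phi=1.0, R_x=1.0, R_r=2.0)
eps = 0.1
tol = tolerance(eps, **consts)
bound = masked_error_bound(tol, **consts)

# Small relative slack guards against floating-point rounding.
within_budget = bound <= 0.5 * eps * (1 + 1e-12)
```

The two $\max(\cdot,1)$ terms in Eq. C.5 only make the tolerance smaller, so the check also passes when $\bar{\gamma}<1$ or when the growth term is below one.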

See 2

Proof.

This follows directly from Lemma 2, by observing that one can replace the mask by simply setting biases to some sufficiently low value. ∎

Appendix D Supplementary Lemmas

The following result is well known in the literature; see e.g. Proposition 1 of Leshno et al. [1993].

Proposition 2.

For any $\epsilon>0$, $\exists n\in\mathbb{N}$ s.t.

$$\int_U\|h(x)-y_n(x,\theta^*)\|\,dx\leq\epsilon \tag{D.1}$$
Corollary 3.

The above holds if we restrict the output weight matrix of the neural network to have rank equal to the output dimension.

Proof.

This is because the set of full rank matrices is dense in $\mathbb{R}^{m\times n}$ for $m,n\in\mathbb{N}$. ∎

Consider matrices $W^*\in\mathbb{R}^{n\times n}$, $B^*\in\mathbb{R}^{n\times d}$, $A^*\in\mathbb{R}^{l\times n}$, and vector $b^*\in\mathbb{R}^{n}$. We can vectorize and concatenate their elements into the single long vector $\theta^*\in\mathbb{R}^{\pi}$, where $\pi=n(n+d+l+1)$. Assume that $|\theta_i^*|<\gamma$ for all $i$.

Next, construct $W\in\mathbb{R}^{m\times m}$, $B\in\mathbb{R}^{m\times d}$, $A\in\mathbb{R}^{l\times m}$, and vector $b\in\mathbb{R}^{m}$, by sampling each element randomly from a uniform distribution on $[-\bar{\gamma},\bar{\gamma}]$, where $\bar{\gamma}=\gamma+\Delta\gamma$ for $\Delta\gamma>0$. We analogously group these into a single vector, $\theta\in\mathbb{R}^{m(m+d+l+1)}$. Observe that for each $\mathcal{M}\in\{0,1\}^m$, we can construct sub-matrices of $W$, $B$, $A$, and a sub-vector of $b$ by deleting column and row pairs in $W$, rows in $B$, columns in $A$, and elements of $b$ whose indices correspond to $i\in\{1,\dots,m\}$ such that $\mathcal{M}_i=0$. For a given $\mathcal{M}$, we define $\theta^{\mathcal{M}}$ to be the vector constructed by flattening and concatenating these sub-matrices and sub-vector. We then have the following lemma:

Lemma 3.

For $\theta^{*}$, defined above, and arbitrary $\varepsilon>0$, $\delta\in(0,1)$, we can find $m>n$ such that with probability at least $1-\delta$ $\exists\mathcal{M}\in\{0,1\}^{m}$ with only $n$ non-zero elements such that $|\theta_{i}^{*}-\theta_{i}^{\mathcal{M}}|<\varepsilon$ for all $i\in\{1,\dots,\pi\}$. In particular, any $m\geq\frac{n\log\delta}{\log[1-(\frac{\epsilon}{\bar{\gamma}})^{\pi}]}$ will satisfy the result, where $\epsilon=\min(\varepsilon,\Delta\gamma)$.
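The mask-to-subnetwork construction used in the lemma can be made concrete with a small sketch. The helper names `apply_mask` and `flatten`, the toy values, and the row-major flattening order are our own illustrative assumptions and are not taken from the text; any fixed ordering of the entries would serve equally well.

```python
def apply_mask(W, B, A, b, mask):
    """Return (W_M, B_M, A_M, b_M), keeping only hidden units i with mask[i] == 1."""
    keep = [i for i, bit in enumerate(mask) if bit == 1]
    W_M = [[W[i][j] for j in keep] for i in keep]   # delete row/column pairs of W
    B_M = [list(B[i]) for i in keep]                 # delete rows of B
    A_M = [[row[j] for j in keep] for row in A]      # delete columns of A
    b_M = [b[i] for i in keep]                       # delete elements of b
    return W_M, B_M, A_M, b_M


def flatten(W, B, A, b):
    """Concatenate all entries into one vector (row-major order is an assumed convention)."""
    return ([w for row in W for w in row] + [v for row in B for v in row]
            + [a for row in A for a in row] + list(b))


# Toy network with m = 3 hidden units, d = 1 input and l = 1 output (hypothetical values).
W = [[11, 12, 13], [21, 22, 23], [31, 32, 33]]
B = [[1], [2], [3]]
A = [[7, 8, 9]]
b = [0.1, 0.2, 0.3]
mask = [1, 0, 1]                      # keep the first and third units, so n = 2

theta_M = flatten(*apply_mask(W, B, A, b, mask))
print(theta_M)
```

With $n=2$ surviving units the flattened vector has $n(n+d+l+1)=10$ entries, matching the definition of $\pi$.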

Proof.

In what follows we set $\epsilon=\min(\varepsilon,\Delta\gamma)$. This simplifies the probability bound derived below because it means the probability of falling within an $\epsilon$ window of a desired parameter will not change, even if the desired parameter is very close to its bound, $\pm\gamma$. We will refer to the event that the desiderata of the lemma are satisfied for $\epsilon$, rather than $\varepsilon$, as $A_{1}$; that is: $\exists\mathcal{M}\in\{0,1\}^{m}$ with only $n$ non-zero elements such that $|\theta_{i}^{*}-\theta_{i}^{\mathcal{M}}|<\epsilon$ for all $i\in\{1,\dots,\pi\}$. The event that the desiderata are not satisfied is $A_{1}^{c}$.

Assume that $m^{\star}=kn$ for some $k\in\mathbb{N}^{+}$. Consider the 'block' mask $\mathcal{M}^{k_{1}}$ s.t. $\mathcal{M}_{i}^{k_{1}}=1$ only for $i\in\{(k_{1}-1)n+1,\dots,k_{1}n\}$, with $0<k_{1}\leq k$. Note that the $n$ elements selected by these block masks are non-overlapping for two different $k_{1}$. Let $A_{2}$ be the event that there is a block mask satisfying the desiderata of the lemma with error $\epsilon$.
Clearly $A_{2}\subset A_{1}\implies A_{1}^{c}\subset A_{2}^{c}\implies P(A_{1}^{c})\leq P(A_{2}^{c})$. $A_{2}^{c}$ is the event that there is no block mask satisfying the desiderata. Observe that

$$
\begin{aligned}
P(A_{2}^{c}) &= P\bigg[\bigcap_{k_{1}=1}^{k}\{k_{1}^{\text{th}}\text{ block mask doesn't work}\}\bigg]=\prod_{k_{1}=1}^{k}P(\{k_{1}^{\text{th}}\text{ block mask doesn't work}\})\\
&=\prod_{k_{1}=1}^{k}\big[1-P(\{k_{1}^{\text{th}}\text{ block mask works}\})\big]=\bigg[1-\Big(\frac{\epsilon}{\bar{\gamma}}\Big)^{\pi}\bigg]^{\frac{m^{\star}}{n}},
\end{aligned}
\tag{D.2}
$$

which follows from the fact that the elements of the matrices are independently sampled and the elements corresponding to sub-matrices selected by a given block mask are independent of those associated with another block mask. By making $m^{\star}$ very large we can make $P(A_{2}^{c})$ arbitrarily small. Because $P(A_{1}^{c})\leq P(A_{2}^{c})$, and the desiderata of the lemma with error $\epsilon$ fail only on $A_{1}^{c}$, the result follows by selecting $m^{\star}=m$ such that $P(A_{2}^{c})\leq\delta$. Requiring $[1-(\frac{\epsilon}{\bar{\gamma}})^{\pi}]^{\frac{m}{n}}\leq\delta$ and taking logarithms (noting that $\log[1-(\frac{\epsilon}{\bar{\gamma}})^{\pi}]<0$, which reverses the inequality) gives the bound on $m$ stated in the lemma. We thus see that a sufficient mask exists with probability at least $1-\delta$. Lastly, because we have found a mask that satisfies per-parameter error $\epsilon$, and because $\epsilon\leq\varepsilon$, we have proved the lemma. ∎
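The closed form for $P(A_{2}^{c})$ can be sanity-checked by simulation. The following is a minimal Monte Carlo sketch under assumed toy values ($n=d=l=1$, $k=8$, $\gamma=\Delta\gamma=\varepsilon=1$, and a hypothetical target $\theta^{*}$); the function name `failure_rate` and all numerical choices are ours and do not come from the paper's experiments.

```python
import random


def failure_rate(theta_star, n, d, l, k, gamma_bar, eps, trials, seed=0):
    """Monte Carlo estimate of P(A_2^c): no block mask matches theta_star within eps."""
    W_s, B_s, A_s, b_s = theta_star
    rng = random.Random(seed)
    m = k * n

    def draw(rows, cols):
        return [[rng.uniform(-gamma_bar, gamma_bar) for _ in range(cols)]
                for _ in range(rows)]

    failures = 0
    for _ in range(trials):
        W, B, A = draw(m, m), draw(m, d), draw(l, m)
        b = draw(1, m)[0]

        def block_works(k1):
            idx = range(k1 * n, (k1 + 1) * n)   # units selected by the k1-th block mask
            return (
                all(abs(W[i][j] - W_s[p][q]) < eps
                    for p, i in enumerate(idx) for q, j in enumerate(idx))
                and all(abs(B[i][c] - B_s[p][c]) < eps
                        for p, i in enumerate(idx) for c in range(d))
                and all(abs(A[r][j] - A_s[r][q]) < eps
                        for r in range(l) for q, j in enumerate(idx))
                and all(abs(b[i] - b_s[p]) < eps for p, i in enumerate(idx))
            )

        if not any(block_works(k1) for k1 in range(k)):
            failures += 1
    return failures / trials


n, d, l, k = 1, 1, 1, 8
gamma, delta_gamma, eps = 1.0, 1.0, 1.0       # eps <= delta_gamma, so eps = min(eps, delta_gamma)
gamma_bar = gamma + delta_gamma
pi = n * (n + d + l + 1)
theta_star = ([[0.3]], [[-0.9]], [[0.5]], [0.99])   # hypothetical target, every |entry| < gamma

empirical = failure_rate(theta_star, n, d, l, k, gamma_bar, eps, trials=4000)
predicted = (1 - (eps / gamma_bar) ** pi) ** k      # right-hand side of Eq. D.2
print(f"empirical {empirical:.3f} vs predicted {predicted:.3f}")
```

Because $\epsilon\leq\Delta\gamma$, every $\epsilon$-window around a target entry lies inside $[-\bar{\gamma},\bar{\gamma}]$, which is why each entry matches with probability exactly $\epsilon/\bar{\gamma}$ regardless of the target value.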

Remark 1: We note that, in Eq. D.2, $n$ will likely also depend implicitly on $\gamma$. If $\gamma$ is very small then we will need to stack many ReLUs on top of each other to attain a large enough dynamic range to approximate the desired function (see §2.1), leading to a larger number of units. Conversely, if $\gamma$ is very large we will need to sample a large number of units before we get parameters appropriately close to the desired subnetwork configuration. This suggests the existence of some sweet spot in the value of $\gamma$, which we leave for future work to explore.

Remark 2: We observe that this bound appears to be very weak. For example, if one wished to use it to find a masked network to match an MLP with input, hidden, and output dimensions of only $1$, $3$, and $1$ respectively, with a per-parameter error of $\epsilon=0.05$, an error probability of $\delta=0.1$, and with $\gamma=0.1$, this bound would suggest we need a hidden layer of $m\geq 8.34\times 10^{12}$ neurons in the bias learning network. In light of the numerical experiments, it is clear that while the math here provides proofs of existence for bias learning, it does not say anything useful about the hidden layer width scaling required.
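To see how quickly the bound of Lemma 3 grows, it can be evaluated directly. The sketch below is illustrative only: the remark does not state $\Delta\gamma$, so we assume $\Delta\gamma=1$ (hence $\bar{\gamma}=1.1$), and we assume that for an MLP, which has no recurrent matrix $W$, the parameter count is $\pi=n(d+l+1)=9$. Under these assumptions the bound evaluates to roughly $8.3\times10^{12}$, in line with the figure quoted above; other choices of $\bar{\gamma}$ change the value by orders of magnitude.

```python
import math


def width_bound(n, pi, eps, gamma_bar, delta):
    """Evaluate n*log(delta) / log(1 - (eps/gamma_bar)**pi), the width bound of Lemma 3."""
    p_block = (eps / gamma_bar) ** pi           # chance that one block of n units matches
    # log1p(-p) computes log(1 - p) accurately when p is tiny.
    return n * math.log(delta) / math.log1p(-p_block)


# Assumed setup (not stated in the remark): pi = n*(d + l + 1) for an MLP, gamma_bar = 1.1.
n, d, l = 3, 1, 1
m_min = width_bound(n=n, pi=n * (d + l + 1), eps=0.05, gamma_bar=1.1, delta=0.1)
print(f"m >= {m_min:.3g}")
```

The exponent $\pi$ is what drives the blow-up: each additional parameter multiplies the per-block success probability by another factor of $\epsilon/\bar{\gamma}$.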

For the following proposition we consider the discrete time dynamical system that we wish to approximate to be as in Eq.3.

Proposition 3.

For finite $0<T<\infty$, any initial condition $z_{0}\in U_{z}$, input $x_{t}\in U_{x}$ $\forall t$, $\epsilon>0$, and any $|\alpha|<\infty,\beta>0$ we can find an RNN of the style of Eq. 2 of hidden width $n\in\mathbb{N}$ and an initial value $r_{0}$ for the RNN such that:

$$\sum_{t=1}^{T}||\hat{y}_{t}-y_{t}||<\epsilon \tag{D.3}$$

where $\hat{y}_{t}$ is the output of the RNN and $y_{t}$ is that of the dynamical system. Moreover, when approximating the dynamical system in this way the RNN hidden states will remain in a compact set which we denote $U_{r}$.

The main portion of this result is well known, see e.g. Schäfer and Zimmermann [2006]. For completeness, we provide an example proof below.

Proof.

In what follows, without loss of generality, we will assume that the error is smaller than $c$. We want to approximate the dynamical system:

$$z_{t+1}=F(z_{t},x_{t}),\quad y_{t}=Cz_{t},\quad z_{0}\in U_{z}, \tag{D.4}$$

defined on set $\tilde{U}_{z}=\{z_{0}+c_{0}:z_{0}\in U_{z},\;||c_{0}||<c\}$, where $U_{z}$ is an invariant set (see §2.2).

We define the set:

$$U_{zx}=\{[z+c_{0}\;\;x]:z\in U_{z},\,x\in U,\,||c_{0}||<c\}. \tag{D.5}$$

Importantly, this set is compact given the compactness assumptions on $U$ and $U_{z}$. Also note that, since $F$ is assumed continuous, it will be $K_{F}$-Lipschitz on this compact set for some constant $K_{F}$. We can thus use the corollary to Proposition D.1 to find a neural network of hidden dimension $n\in\mathbb{N}$ that approximates $F$ with a maximum-rank output matrix, $A$. We write this neural network:

$$\hat{z}=-\alpha z+\beta A\phi(Wz+Bx+b)=\hat{F}(z,x), \tag{D.6}$$

assuming $z\in U_{z}$ and $x\in U$, with $A\in\mathbb{R}^{s\times n}$, $W\in\mathbb{R}^{n\times s}$, $B\in\mathbb{R}^{n\times d}$, and $b\in\mathbb{R}^{n}$. In particular, for arbitrary $\epsilon$ with $0<\epsilon<c$, we can choose $n$ such that:

$$||\hat{F}(z,x)-F(z,x)||<\varepsilon=\frac{\epsilon}{T\max(R_{C}\sum_{t=0}^{T-1}K^{t}_{F},1)}, \tag{D.7}$$

where $R_{C}=||C||$. Fix $T\geq 1$. To prove that we can approximate the underlying dynamical system, we use induction starting at time $t=1$. The base case will be

$$||\hat{z}_{1}-z_{1}||=||\hat{F}(z_{0},x_{0})-F(z_{0},x_{0})||\leq\varepsilon, \tag{D.8}$$

by our choice of $n$ and initial condition, and because $[z_{0}\;\;x_{0}]^{\top}\in U_{zx}$. Importantly, this implies also that $||\hat{z}_{1}-z_{1}||<\varepsilon$. Because $\varepsilon<c$ this means that $[\hat{z}_{1}\;\;x_{1}]^{\top}\in U_{zx}$.

For $t=1$, $\varepsilon=\sum_{t^{\prime}=0}^{t-1}K_{F}^{t^{\prime}}\varepsilon$. We thus make the induction hypothesis that $||\hat{z}_{t}-z_{t}||<\sum_{t^{\prime}=0}^{t-1}K_{F}^{t^{\prime}}\varepsilon$ and that $[\hat{z}_{t}\;\;x_{t}]^{\top}\in U_{zx}$. If $T=1$ we are finished. If $T>1$ we assume $1\leq t<T$ and use this hypothesis to prove the induction step:

\begin{align}
||\hat{z}_{t+1}-z_{t+1}|| &\leq||\hat{F}(\hat{z}_{t},x_{t})-F(\hat{z}_{t},x_{t})||+||F(\hat{z}_{t},x_{t})-F(z_{t},x_{t})|| \tag{D.9}\\
&\leq\varepsilon+K_{F}||\hat{z}_{t}-z_{t}||=\varepsilon\sum_{t^{\prime}=0}^{t}K_{F}^{t^{\prime}}<\frac{c}{T}. \tag{D.10}
\end{align}

Because $\frac{c}{T}\leq c$, $[\hat{z}_{t+1}\;\;x_{t+1}]^{\top}\in U_{zx}$. Then

$$\sum_{t=1}^{T}||\hat{y}_{t}-y_{t}||\leq R_{C}T\varepsilon\sum_{t=0}^{T-1}K_{F}^{t}\leq\epsilon. \tag{D.11}$$
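The induction above can be illustrated numerically. The sketch below uses a hypothetical scalar system $F(z,x)=K_{F}\sin(z)+x$ and a hypothetical perturbed map standing in for $\hat{F}$; neither comes from the paper. It checks that the state error stays below $\varepsilon\sum_{t^{\prime}=0}^{t-1}K_{F}^{t^{\prime}}$ at every step and that the summed output error stays below $\epsilon$ when $\varepsilon$ is chosen as in Eq. D.7.

```python
import math

K_F, R_C, T, eps_total = 1.5, 2.0, 10, 0.1        # Lipschitz constant, |C|, horizon, target error
geom = sum(K_F ** t for t in range(T))             # sum_{t=0}^{T-1} K_F^t
eps_step = eps_total / (T * max(R_C * geom, 1.0))  # one-step tolerance from Eq. D.7


def F(z, x):
    """Toy dynamical system (hypothetical), K_F-Lipschitz in z."""
    return K_F * math.sin(z) + x


def F_hat(z, x):
    """Toy approximator (hypothetical) with sup-error 0.9 * eps_step < eps_step."""
    return F(z, x) + 0.9 * eps_step * math.cos(3.0 * z)


z = z_hat = 0.2                                    # shared initial condition
bounds_hold = True
total_output_error = 0.0
for t in range(1, T + 1):
    x = 0.1 * math.sin(t)                          # arbitrary bounded input
    z, z_hat = F(z, x), F_hat(z_hat, x)
    state_bound = eps_step * sum(K_F ** s for s in range(t))   # induction bound at time t
    bounds_hold = bounds_hold and abs(z_hat - z) < state_bound
    total_output_error += R_C * abs(z_hat - z)     # |y_hat_t - y_t| with y = C z

print(bounds_hold, total_output_error < eps_total)
```

The per-step tolerance shrinks geometrically with the horizon whenever $K_{F}>1$, which reflects the usual exponential error growth of iterated maps.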

While we have approximated the dynamical system it is not yet in the standard rate-style RNN form. However, we can obtain the rate form by changing from tracking $\hat{z}$ to a different dynamical variable: $r_{t}\in\mathbb{R}^{n}$. Because $A$ is assumed to be maximum rank, we can find a minimal norm $r_{0}\in\mathbb{R}^{n}$ such that $z_{0}=Ar_{0}$, and $||r_{0}||\leq||r^{\prime}||$ for any $r^{\prime}$ that also satisfies $z_{0}=Ar^{\prime}$. Take this as the initial value for an RNN with dynamics:

$$r_{t+1}=-\alpha r_{t}+\phi(WAr_{t}+Bx_{t}+b) \tag{D.12}$$

It is easy to see that $\hat{z}_{t}=Ar_{t}$ for all $0\leq t\leq T$. It follows that $\hat{y}_{t}=CAr_{t}\;\forall t$. Thus, this RNN approximates the original partially observed dynamical system.

Lastly, we explain how the hidden states of the RNN will remain in a compact set. Each $r_{t}=\phi(W\hat{z}_{t-1}+Bx_{t-1}+b)\equiv\Phi(z_{0},x_{0},\dots,x_{t-1})$ for all $1\leq t\leq T$ and any choice of $\hat{z}_{0}=z_{0}\in U_{z}$. $\Phi$ is continuous by the continuity of the activation functions and the boundedness of the parameters.
Thus, the set $U_{r}^{(1)}=\{r:r=\Phi(z_{0},x_{0},\dots,x_{t-1}),\,z_{0}\in U_{z},\,x_{s}\in U_{x}\;\forall\;0\leq s<t\}$ contains all $r_{t}$ and is compact, as the arguments of $\Phi$ are taken from compact sets.

It remains to treat the initial values of the RNN, $r_{0}$. These are defined implicitly as $Ar_{0}=z_{0},\,z_{0}\in U_{z}$. We will show that we can span all $z_{0}\in U_{z}$ by selecting $r_{0}$ from a compact set. By assumption $A$ is full rank so $\mathrm{col}(A)=\mathbb{R}^{s}$ and, thus, we can find a subset of the columns of $A$ that form a basis for $\mathbb{R}^{s}$. Let these columns be $\{A_{:i}\}_{i\in\nu}$. Construct a matrix $A_{\nu}$ by deleting the columns whose indices are not contained in $\nu$. Because the columns of $A_{\nu}$ are a basis the matrix is invertible.
Define $U_{r}^{(0)}=\{r:r_{i}=\tilde{r}_{i}\;\mathrm{if}\;i\in\nu,\,r_{i}=0\;\mathrm{if}\;i\not\in\nu,\,\tilde{r}=A_{\nu}^{-1}z_{0},\,z_{0}\in U_{z}\}$. For each $z_{0}$ in $U_{z}$ we have an initial condition $r_{0}\in U_{r}^{(0)}$ and, moreover, $r_{0}$ is a continuous mapping of $z_{0}$ (via $A_{\nu}^{-1}$).
Since $z_{0}$ is taken from a compact set, $U_{r}^{(0)}$ must also be compact.
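The change of variables $\hat{z}_{t}=Ar_{t}$ and the basis-column construction of $r_{0}$ can both be checked on a toy example. The sketch below uses hypothetical matrices with $s=2$, $n=4$, $d=1$ and a ReLU activation. Two assumptions should be flagged: the factor $\beta$ is included in the $r$ update so that it matches Eq. D.6 exactly (Eq. D.12 as printed omits it), and $r_{0}$ is built from $A_{\nu}^{-1}$ rather than as the minimal-norm solution, since only $Ar_{0}=z_{0}$ matters for the identity.

```python
import random


def matvec(M, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def relu(v):
    return [max(0.0, v_i) for v_i in v]


rng = random.Random(0)
s, n, d, T = 2, 4, 1, 10
alpha, beta = 0.5, 1.0
A = [[2.0, 1.0, 0.5, -0.3],       # s x n, full row rank; columns nu = {0, 1} form a basis
     [1.0, 1.0, -0.2, 0.4]]
W = [[rng.uniform(-0.2, 0.2) for _ in range(s)] for _ in range(n)]
B = [[rng.uniform(-0.2, 0.2) for _ in range(d)] for _ in range(n)]
b = [rng.uniform(-0.2, 0.2) for _ in range(n)]


def preact(z, x):
    """Compute W z + B x + b."""
    return [wz + bx + b_i for wz, bx, b_i in zip(matvec(W, z), matvec(B, x), b)]


z0 = [0.7, -0.4]
# r0 = A_nu^{-1} z0 on the basis columns and zero elsewhere;
# A_nu = [[2, 1], [1, 1]] has inverse [[1, -1], [-1, 2]].
r = [z0[0] - z0[1], -z0[0] + 2.0 * z0[1], 0.0, 0.0]
z = list(z0)
max_gap = max(abs(a - c) for a, c in zip(matvec(A, r), z))
for t in range(T):
    x = [0.3 * (-1) ** t]         # arbitrary bounded input
    z_next = [-alpha * z_i + beta * az
              for z_i, az in zip(z, matvec(A, relu(preact(z, x))))]
    r_next = [-alpha * r_i + beta * p
              for r_i, p in zip(r, relu(preact(matvec(A, r), x)))]
    z, r = z_next, r_next
    max_gap = max(max_gap, max(abs(a - c) for a, c in zip(matvec(A, r), z)))

print(f"largest |A r_t - z_t| over the trajectory: {max_gap:.2e}")
```

The two trajectories agree up to floating-point rounding because applying $A$ to the $r$ update reproduces the $z$ update term by term.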

Finally, define $U_r=U_r^{(0)}\cup U_r^{(1)}$. From the above, this set contains all hidden states of the RNN for the finite-time trajectories of interest from any initial condition. As a union of two compact sets it is also compact. We will use $U_r$ in Lemma 2. We note that the size of the set will likely depend, via the dependency on $A$, on the function being approximated and on the neural network hidden layer size $n$.

Lemma 4.

Eq. B.7 holds under the assumptions of Lemma 1.

Proof.

This follows by adding and subtracting $A_{ij}^{\mathcal{M}}\phi(B_{j:}x+b_j)$ inside the summations in the norm and the expanded neural network expression, and then successively bounding the resulting terms using the assumptions on the domain and the activation function:

\begin{align}
\|y_m^{\mathcal{M}}-y_n\| &\leq \sum_{i=1}^{l}\sum_{j=1}^{n}\Big|A_{ij}^{\mathcal{M}}\big[\phi(B_{j:}^{\mathcal{M}}x+b_j^{\mathcal{M}})-\phi(B_{j:}x+b_j)\big]-(A_{ij}-A_{ij}^{\mathcal{M}})\phi(B_{j:}x+b_j)\Big| \tag{D.13}\\
&\leq \sum_{i=1}^{l}\sum_{j=1}^{n}\Big[\bar{\gamma}K_{\phi}\big|(B_{j:}^{\mathcal{M}}-B_{j:})x+b_j^{\mathcal{M}}-b_j\big|+\varepsilon M_{\phi}\Big] \tag{D.14}\\
&\leq nl\big[\bar{\gamma}K_{\phi}(\varepsilon M_x+\varepsilon)+\varepsilon M_{\phi}\big]=nl\big[\bar{\gamma}K_{\phi}(M_x+1)+M_{\phi}\big]\varepsilon. \tag{D.15}
\end{align}
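The bound in (D.15) can be checked numerically on random instances. The following is a minimal sketch, not the authors' code, and it makes several assumptions that are not stated in this section: $\phi=\tanh$ (so $K_\phi=1$ and $M_\phi=1$), every entry of $A$, $B$, $b$ differs from its $\mathcal{M}$ counterpart by at most $\varepsilon$, $|A_{ij}^{\mathcal{M}}|\leq\bar{\gamma}$, $M_x$ bounds the 1-norm of $x$, and the output norm is the 1-norm. The names `lemma4_bound` and `check_once` are hypothetical.

```python
import numpy as np

def lemma4_bound(n, l, gamma_bar, K_phi, M_x, M_phi, eps):
    """Right-hand side of (D.15): n*l*[gamma_bar*K_phi*(M_x + 1) + M_phi]*eps."""
    return n * l * (gamma_bar * K_phi * (M_x + 1.0) + M_phi) * eps

def check_once(rng, n=50, l=3, d=4, eps=1e-3, gamma_bar=1.0, M_x=2.0):
    """Draw one random network pair and return (output gap, bound)."""
    # Reference network: output weights bounded entrywise by gamma_bar.
    A_M = rng.uniform(-gamma_bar, gamma_bar, size=(l, n))
    B_M = rng.normal(size=(n, d))
    b_M = rng.normal(size=n)
    # Perturbed network: each parameter moved by at most eps.
    A = A_M + rng.uniform(-eps, eps, size=(l, n))
    B = B_M + rng.uniform(-eps, eps, size=(n, d))
    b = b_M + rng.uniform(-eps, eps, size=n)
    # Input rescaled so that its 1-norm equals M_x.
    x = rng.normal(size=d)
    x *= M_x / np.abs(x).sum()
    y_M = A_M @ np.tanh(B_M @ x + b_M)
    y = A @ np.tanh(B @ x + b)
    gap = np.abs(y_M - y).sum()  # 1-norm of the output difference
    # tanh: Lipschitz constant K_phi = 1 and sup bound M_phi = 1.
    return gap, lemma4_bound(n, l, gamma_bar, 1.0, M_x, 1.0, eps)

rng = np.random.default_rng(0)
gap, bound = check_once(rng)
print(gap <= bound)
```

In such random draws the observed gap typically sits well below the bound, since (D.13)-(D.15) bound every term by its worst case and ignore cancellations between terms.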

Lemma 5.

Eq. C.6 holds under the assumptions of Lemma 2.

Proof.

We have:

\begin{align}
\|r_0^{\mathcal{M}}-r_0\| &= 0 \tag{D.16}\\
\|r_1^{\mathcal{M}}-r_1\| &\leq \alpha\|r_0^{\mathcal{M}}-r_0\|+\beta K_{\phi}\big[\bar{\gamma}n\|r_0^{\mathcal{M}}-r_0\|+\varepsilon n(R_r+R_x+1)\big] \tag{D.17}\\
&= \beta K_{\phi}\bar{\gamma}n\tilde{R}\varepsilon\leq 1, \tag{D.18}
\end{align}

where we used that $r_0=r_0^{\mathcal{M}}$, and thus both are contained in $U_r$, so that the arguments of $\phi$ are contained in $\tilde{U}$, allowing us to use the Lipschitz result and constant $K_{\phi}$. We note that the final inequality follows from the choice of $\varepsilon$ and means that, since $r_1\in U_r$, $r_1^{\mathcal{M}}=r_1+r^{\prime}$ with $\|r^{\prime}\|\leq 1$. This means that for the next step forward in time the arguments of $\phi$ will once again be in $\tilde{U}$, so that the Lipschitz conditions will continue to hold. Using $t=1$ as a base case, we make the induction hypothesis:

\begin{align}
\|r_t^{\mathcal{M}}-r_t\| &\leq n\beta K_{\phi}\tilde{R}\sum_{s=0}^{t-1}(\alpha+\beta K_{\phi}n\bar{\gamma})^{s}\varepsilon \tag{D.19}\\
\|r_t^{\mathcal{M}}-r_t\| &\leq 1. \tag{D.20}
\end{align}

Using the induction hypothesis it is straightforward to prove the induction step, and to show that $\|r_{t+1}^{\mathcal{M}}-r_{t+1}\|$ remains smaller than $1$ and thus that the arguments of $\phi$ remain in $\tilde{U}$. Proving the induction step and then bounding the output matrix elements by $\bar{\gamma}$ yields the result.
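
The induction step can be sketched as follows. This is a hedged reconstruction rather than the authors' written argument: it assumes that the one-step estimate used in (D.17) holds at every step $t$ for which the arguments of $\phi$ lie in $\tilde{U}$, and that the constant multiplying $\varepsilon$ in that estimate is at most $n\beta K_{\phi}\tilde{R}$, with $\tilde{R}$ as defined for Lemma 2. The shorthand $e_t$ and $c$ is introduced here for illustration only.

\begin{align*}
e_t &:= \|r_t^{\mathcal{M}}-r_t\|, \qquad c := \alpha+\beta K_{\phi}n\bar{\gamma},\\
e_{t+1} &\leq c\,e_t+n\beta K_{\phi}\tilde{R}\,\varepsilon\\
&\leq c\,n\beta K_{\phi}\tilde{R}\sum_{s=0}^{t-1}c^{s}\,\varepsilon+n\beta K_{\phi}\tilde{R}\,\varepsilon
= n\beta K_{\phi}\tilde{R}\sum_{s=0}^{t}c^{s}\,\varepsilon.
\end{align*}

The last equality uses $c\sum_{s=0}^{t-1}c^{s}+1=\sum_{s=0}^{t}c^{s}$. Under these assumptions, if $\varepsilon$ is small enough that the right-hand side is at most $1$ for every $t$ up to the trajectory length, then the analogue of (D.20) holds at $t+1$ and the arguments of $\phi$ stay in $\tilde{U}$.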