3 Problem setup
In this section, we formally define deep SSM models in a way that covers popular
architectures from the literature, together with the generalization error and
the concept of PAC bounds. As our results hold for both continuous-time (CT) and
discrete-time (DT) SSMs, in some cases we introduce parallel definitions
corresponding to each case.
Notation. We denote scalars with lowercase characters, vectors with
lowercase bold characters and matrices with uppercase characters. For a matrix
$A$ let $A_{i}$ denote its $i$th row. The symbol $\odot$ denotes the elementwise
product. We use $[n]$ to denote the set $\{1,2,\ldots,n\}$ for $n\in\mathbb{N}$.
For vector-valued time-dependent functions
originating from a finite set, the notation $\mathbf{u}_{i}^{(j)}(t)$ refers to the $j$th
coordinate of the $i$th function at time $t$. Similarly, in the discrete-time
case, the notation $\mathbf{u}_{i}^{(j)}[k]$ refers to the $j$th component of the $i$th
time series at time step $k$.
$\Sigma$ denotes a dynamical system specified in
the context. The constant $n_{\text{in}}$ refers to the dimension of the input sequence,
$T$ refers to its length in time, while $n_{\text{out}}$ is the dimension of the output
(not necessarily a sequence).
Denote by $\ell^{2,2}_{T}(\mathbb{R}^{n})$ and $\ell^{\infty,\infty}_{T}(\mathbb{R}^{n})$ the Banach spaces of all finite
sequences over $\mathbb{R}^{n}$ of length $T$, equipped with the norms
$\left\lVert\mathbf{u}\right\rVert^{2}_{\ell^{2,2}_{T}(\mathbb{R}^{n})}=\sum_{k=0}^{T-1}\left\lVert\mathbf{u}[k]\right\rVert_{2}^{2}$ and
$\left\lVert\mathbf{u}\right\rVert_{\ell^{\infty,\infty}_{T}(\mathbb{R}^{n})}=\sup_{k=0,\ldots,T-1}\left\lVert\mathbf{u}[k]\right\rVert_{\infty}$ respectively. More
generally, denote by $\ell^{2,2}(\mathbb{R}^{n})$ and $\ell^{\infty,\infty}(\mathbb{R}^{n})$ the Banach spaces of all infinite
sequences over $\mathbb{R}^{n}$ for which the quantities
$\left\lVert\mathbf{u}\right\rVert^{2}_{\ell^{2,2}(\mathbb{R}^{n})}=\sum_{k=0}^{\infty}\left\lVert\mathbf{u}[k]\right\rVert_{2}^{2}$ and
$\left\lVert\mathbf{u}\right\rVert_{\ell^{\infty,\infty}(\mathbb{R}^{n})}=\sup_{k}\left\lVert\mathbf{u}[k]\right\rVert_{\infty}$, respectively, are well defined and finite. If
$\mathbf{u}=\mathbf{u}[0],\ldots,\mathbf{u}[T-1]$ is a finite sequence of length $T$, then we can interpret it as an
infinite sequence $\mathbf{u}=\mathbf{u}[0],\ldots,\mathbf{u}[T-1],0,0,\ldots$ whose elements
are zero after the $T$th element. For a Banach space
$\mathcal{X}$, $B_{\mathcal{X}}(r)=\{x\in\mathcal{X}\mid\left\lVert x\right\rVert_{\mathcal{X}}\leq r\}$ denotes the ball of radius $r>0$ centered at
zero.
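As a small illustration of the two sequence norms and of the ball $B_{\mathcal{X}}(r)$, the following sketch evaluates them for a finite sequence stored as a list of $T$ vectors. The function names and the example sequence are hypothetical and chosen only for this illustration.

```python
import math

def l22_norm(u):
    """Norm of a finite sequence u (list of T vectors) in l^{2,2}_T:
    square root of the sum of squared Euclidean norms over all time steps."""
    return math.sqrt(sum(x * x for step in u for x in step))

def linf_norm(u):
    """Norm of a finite sequence u in l^{inf,inf}_T:
    largest absolute coordinate over all time steps."""
    return max(abs(x) for step in u for x in step)

def in_ball(u, r, norm):
    """True if u lies in the ball of radius r centered at zero w.r.t. `norm`."""
    return norm(u) <= r

# Hypothetical sequence of length T = 3 over R^2.
u = [[3.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
print(l22_norm(u))   # 5.0
print(linf_norm(u))  # 4.0
```

Appending zero vectors to `u` leaves both values unchanged, which matches the interpretation of a finite sequence as an infinite one padded with zeros.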
We use
the symbols $E_{(u,y)\sim\mathcal{D}}$, $P_{(u,y)\sim\mathcal{D}}$,
$E_{\mathbf{S}\sim\mathcal{D}^{N}}$ and $P_{\mathbf{S}\sim\mathcal{D}^{N}}$ to
denote expectations and probabilities w.r.t. a probability measure $\mathcal{D}$
and its $N$-fold product $\mathcal{D}^{N}$ respectively (see Section
3.2). The notation $\mathbf{S}\sim\mathcal{D}^{N}$
tacitly assumes that $\mathbf{S}\in(\mathcal{U}\times\mathcal{Y})^{N}$, i.e.
$\mathbf{S}$ is made of $N$ pairs of input and output trajectories.
3.1 Deep SSMs
A State-Space Model (SSM) is a discrete-time linear dynamical system of
the form
$$\Sigma\begin{cases}\mathbf{x}[k]=A\mathbf{x}[k-1]+B\mathbf{u}[k],\quad\mathbf{x}[0]=0\\ \mathbf{y}[k]=C\mathbf{x}[k]+D\mathbf{u}[k]\end{cases}\tag{1}$$
where $A\in\mathbb{R}^{n_{x}\times n_{x}}$, $B\in\mathbb{R}^{n_{x}\times n_{u}}$, $C\in\mathbb{R}^{n_{y}\times n_{x}}$ and $D\in\mathbb{R}^{n_{y}\times n_{u}}$ are matrices,
$\mathbf{u}[k]$, $\mathbf{x}[k]$ and $\mathbf{y}[k]$ are the input, state and output at time step $k$, and
$k=1,2,\ldots,T$, where $T$ is the number of time steps. We consider
the value of $T$ to be fixed. The reason behind this is rather technical, namely
to conveniently handle the pooling layer (see Definition 6).
We emphasize that the generalization bound in Theorem 18 is
independent of T 𝑇 T italic_T . For regression tasks the model does not contain any pooling
layer, thus this restriction is not needed.
Input-output maps of SSMs.
The SSM (1) can be run for a finite or infinite number of time steps,
and hence it realizes a sequence-to-sequence transformation: it maps any
input sequence $\mathbf{u}$ of any
length (including infinite length) to the unique output sequence $\mathbf{y}$ of the
same length. That is, an SSM $\Sigma$ induces the input-output map of $\Sigma$,
denoted by $\mathcal{S}_{\Sigma}$. Moreover, $\mathcal{S}_{\Sigma}$ is causal:
$\mathbf{y}[t]$ depends only on the first $t+1$ inputs. The outputs of SSMs are linear
combinations of past inputs; in fact, input-output maps can be expressed by a
convolution $\mathbf{y}[k]=\mathcal{S}_{\Sigma}(\mathbf{u})[k]=\sum_{j=0}^{k}H_{j}\mathbf{u}[k-j]$,
where $H_{0}=D$ and $H_{j}=CA^{j-1}B$, $j>0$. That is, SSMs mix inputs along the
time axis, while preserving causality.
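To make the sequence-to-sequence view concrete, the sketch below runs the recursion of (1) directly and checks causality numerically: changing only the last input leaves all earlier outputs untouched. The helper names and the matrices are hypothetical illustration values, not taken from any particular architecture.

```python
def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def vadd(a, b):
    """Elementwise sum of two vectors."""
    return [x + y for x, y in zip(a, b)]

def simulate_ssm(A, B, C, D, u):
    """Run x[k] = A x[k-1] + B u[k], y[k] = C x[k] + D u[k] with x[0] = 0.
    `u` is the list of inputs u[1..T]; the list of outputs y[1..T] is returned."""
    x = [0.0] * len(A)
    ys = []
    for uk in u:
        x = vadd(matvec(A, x), matvec(B, uk))
        ys.append(vadd(matvec(C, x), matvec(D, uk)))
    return ys

# Hypothetical stable SSM with n_x = 2, n_u = n_y = 1.
A = [[0.5, 0.0], [0.0, -0.25]]
B = [[1.0], [1.0]]
C = [[1.0, 2.0]]
D = [[0.0]]

u1 = [[1.0], [0.0], [0.0], [0.0]]
u2 = [[1.0], [0.0], [0.0], [7.0]]   # differs from u1 only at the last step
y1 = simulate_ssm(A, B, C, D, u1)
y2 = simulate_ssm(A, B, C, D, u2)
print(y1[:3] == y2[:3])  # True: earlier outputs do not see the later input
```

The same function accepts sequences of any finite length, which mirrors the statement that the input-output map is defined for inputs of arbitrary length.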
Stability and input-output maps of SSMs as operators on $\ell^{p,p}$,
$p=\infty,2$. For SSMs to be robust sequence-to-sequence transformations,
one needs stability. Stability is one of the most important concepts, if not the
most important one, in dynamical systems and control theory Antoulas (2005). Intuitively, the
solutions of a stable system are continuous in the initial state and input. In
particular, for stable systems, a small perturbation in past inputs will not
result in an increasing error in future outputs.
Definition 1 (Antoulas (2005))
An SSM of the form (1) is internally stable if the matrix $A$ is
Schur, meaning all the eigenvalues of $A$ lie strictly inside the complex unit disk.
In particular, a sufficient (but not necessary) condition for stability is that
$A$ is a contraction, i.e. $\|A\|_{2}<1$.
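The gap between the two conditions can be seen on a small example. The sketch below, restricted to $2\times 2$ real matrices so that eigenvalues follow from the trace and determinant, exhibits a matrix that is Schur but not a contraction. The function names and the matrix are hypothetical and serve only as an illustration.

```python
import cmath
import math

def eigenvalues_2x2(M):
    """Roots of the characteristic polynomial of a 2x2 matrix."""
    (a, b), (c, d) = M
    trace, det = a + d, a * d - b * c
    disc = cmath.sqrt(trace * trace - 4 * det)
    return (trace + disc) / 2, (trace - disc) / 2

def is_schur(A):
    """True if all eigenvalues of A have modulus strictly below 1."""
    return max(abs(lam) for lam in eigenvalues_2x2(A)) < 1

def spectral_norm(A):
    """||A||_2, the square root of the largest eigenvalue of A^T A."""
    (a, b), (c, d) = A
    gram = [[a * a + c * c, a * b + c * d], [a * b + c * d, b * b + d * d]]
    return math.sqrt(max(lam.real for lam in eigenvalues_2x2(gram)))

# Both eigenvalues equal 0.5, but the off-diagonal entry makes the norm large.
A = [[0.5, 2.0], [0.0, 0.5]]
print(is_schur(A))           # True
print(spectral_norm(A) < 1)  # False
```

For a diagonal $A$ with real entries the two conditions coincide, since then $\|A\|_2$ equals the largest absolute value of a diagonal entry.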
A stable SSM $\Sigma$ is not only robust to perturbations; it is also well known
Antoulas (2005) that its input-output map $\mathcal{S}_{\Sigma}$ acts as a
linear operator $\ell^{p,p}(\mathbb{R}^{n_{u}})\to\ell^{r,r}(\mathbb{R}^{n_{y}})$,
for any choice $p,r\in\{\infty,2\}$, as for any $\mathbf{u}\in\ell^{p,p}(\mathbb{R}^{n_{u}})$, $\mathcal{S}_{\Sigma}(\mathbf{u})\in\ell^{r,r}(\mathbb{R}^{n_{y}})$. In particular, $\mathcal{S}_{\Sigma}$ has a
well-defined induced norm as a linear operator, defined in the usual way,
$$\left\lVert\Sigma\right\rVert_{p,r}=\sup_{\mathbf{u}\in\ell^{p,p}(\mathbb{R}^{n_{u}}),\,\mathbf{u}\neq 0}\frac{\left\lVert\mathcal{S}_{\Sigma}(\mathbf{u})\right\rVert_{\ell^{r,r}(\mathbb{R}^{n_{y}})}}{\left\lVert\mathbf{u}\right\rVert_{\ell^{p,p}(\mathbb{R}^{n_{u}})}}.$$
In this paper, we will use the induced norms $\left\lVert\Sigma\right\rVert_{2,\infty}$ and
$\left\lVert\Sigma\right\rVert_{\infty,\infty}$, which can be upper bounded by the following two
norms defined on SSMs.
Definition 2 (Chellaboina et al. (1999))
For an SSM $\Sigma$ of the form (1), define the
$\ell_{1}$ and $H_{2}$ norms of $\Sigma$, denoted by $\left\lVert\Sigma\right\rVert_{1}$
and $\left\lVert\Sigma\right\rVert_{2}$ respectively, as
$$\left\lVert\Sigma\right\rVert_{1}:=\max_{1\leq i\leq n_{y}}\left[\left\lVert D_{i}\right\rVert_{1}+\sum_{k=0}^{\infty}\left\lVert C_{i}A^{k}B\right\rVert_{1}\right],\quad\left\lVert\Sigma\right\rVert_{2}:=\sqrt{\left\lVert D\right\rVert_{F}^{2}+\sum_{k=0}^{\infty}\left\lVert CA^{k}B\right\rVert_{F}^{2}}.$$
Lemma 3 (Chellaboina et al. (1999))
For a system of the form (1),
$\left\lVert\Sigma\right\rVert_{\infty,\infty}\leq\left\lVert\Sigma\right\rVert_{1}$ and
$\left\lVert\Sigma\right\rVert_{2,\infty}\leq\left\lVert\Sigma\right\rVert_{2}$.
The norms defined above will play a crucial role in the main result of the
paper, as they will allow us to bound the Rademacher complexity of the deep SSM
model.
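For intuition, both norms of Definition 2 can be approximated by truncating the infinite sums, which converge geometrically when $A$ is Schur. The sketch below does this for a diagonal $A$ (an assumption made here only to keep the code short; the number of retained terms `terms` and all function names are likewise hypothetical).

```python
import math

def markov_parameter(a_diag, B, C, k):
    """C A^k B for diagonal A = diag(a_diag), returned as a list of rows."""
    n_x, n_u = len(a_diag), len(B[0])
    return [[sum(row_c[m] * a_diag[m] ** k * B[m][j] for m in range(n_x))
             for j in range(n_u)] for row_c in C]

def l1_norm(a_diag, B, C, D, terms=500):
    """Truncated ||Sigma||_1: max over output rows i of
    ||D_i||_1 + sum_k ||C_i A^k B||_1."""
    totals = [sum(abs(d) for d in row) for row in D]
    for k in range(terms):
        H = markov_parameter(a_diag, B, C, k)
        for i, row in enumerate(H):
            totals[i] += sum(abs(h) for h in row)
    return max(totals)

def h2_norm(a_diag, B, C, D, terms=500):
    """Truncated ||Sigma||_2: sqrt(||D||_F^2 + sum_k ||C A^k B||_F^2)."""
    total = sum(d * d for row in D for d in row)
    for k in range(terms):
        H = markov_parameter(a_diag, B, C, k)
        total += sum(h * h for row in H for h in row)
    return math.sqrt(total)

# Scalar sanity check with a = 0.5, B = C = 1, D = 0, where the sums are
# geometric series: ||Sigma||_1 = 1 / (1 - 0.5) and ||Sigma||_2 = sqrt(1 / (1 - 0.25)).
print(l1_norm([0.5], [[1.0]], [[1.0]], [[0.0]]))
print(h2_norm([0.5], [[1.0]], [[1.0]], [[0.0]]))
```

In the scalar case the two printed values should agree with the closed forms $2$ and $\sqrt{4/3}$ up to floating-point error, which gives a quick check that the truncation length is adequate.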
Relationship with continuous-time models.
In the literature, SSMs are often derived by discretizing a continuous-time
linear differential equation in time (e.g. Gu and Dao (2023) and
references therein). If the discretization step $\Delta$ is a fixed constant,
then we obtain a time-invariant linear system of the form (1).
In some models, the discretization step can depend on the input and state of the
continuous-time state-space representation. In the latter case, one still
obtains a linear discrete-time state-space representation, but the matrices
describing the system equation depend on the current input and state, and hence
are time-dependent. In this paper we will stick to time-invariant
models (1), for the sake of simplicity.
Deep SSM models.
In this paper, we focus our attention on deep SSM models composed of multiple
linear SSMs. The main reason is that most of the state-of-the-art architectures
for long-range sequences are based on linear systems.
A key characteristic of deep SSM models is the combination of SSM layers with
nonlinear transformations, typically some kind of neural network that is
constant in time. The general definition we use is the following.
Definition 5
A DT-SSM block (or simply SSM block) is a function $f^{\text{DTB}}:\ell^{p,q}(\mathbb{R}^{n_{u}})\to\ell^{r,s}(\mathbb{R}^{n_{u}})$ that is
composed of a stable SSM followed by a nonlinear transformation that is
constant in time. That is, $f^{\text{DTB}}(\mathbf{u})[k]=g(\mathcal{S}_{\Sigma}(\mathbf{u})[k])+\alpha\mathbf{u}[k]$ for all $k\in[T]$, for some $g:\mathbb{R}^{n_{u}}\to\mathbb{R}^{n_{u}}$. The function $g$ is
represented as either an MLP or a GLU network (see Definitions 7 and
8).
We incorporate $\alpha$ so that the definition covers residual connections
(typically $\alpha$ is either 1 or 0). The previous definition is inspired by
the series of popular architectures mentioned in the introduction. Similarly,
the definition of deep SSM models is also based on these architectures. A deep
SSM model consists of SSM blocks along with an encoder, and a decoder
transformation preceded by a time-pooling layer. In practice, it is common to
apply some normalization techniques, such as batch or layer normalization.
As they are not essential for our result, we omit them from the
definition. Indeed, once training is finished, a normalization layer corresponds
to applying a neural network with linear activation function, i.e., it can be
integrated into one of the neural network layers. Since the objective of PAC bounds is to bound the generalization error of already trained models, normalization layers can, for this purpose, be viewed as an additional neural network layer.
Definition 6
A discrete-time deep SSM model for classification is a function $f:\ell^{p,q}(\mathbb{R}^{n_{\text{in}}})\to\mathbb{R}^{n_{\text{out}}}$ of the form $f=f^{\text{Dec}}\circ f^{\text{Pool}}\circ f^{\text{B}_{L}}\circ\ldots\circ f^{\text{B}_{1}}\circ f^{\text{Enc}}$. The functions $f^{\text{Enc}}$ and $f^{\text{Dec}}$ are linear transformations
which are constant in time, while $f^{\text{B}_{i}}$ is a DT-SSM block for all $i$.
By pooling we mean the operation $f^{\text{Pool}}(\mathbf{u})=\frac{1}{T}\sum_{k=1}^{T}\mathbf{u}[k]$, an average pooling over the time axis.
For regression tasks, we omit the pooling layer and define the Decoder as a
linear transformation that is identical for all time steps.
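The composition in Definitions 5 and 6 can be sketched as a forward pass. In the code below the nonlinearity $g$ is a plain elementwise ReLU, used as a stand-in for the MLP or GLU network, and all weights, dimensions and function names are hypothetical illustration values.

```python
def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def simulate_ssm(A, B, C, D, u):
    """Recursion (1) with x[0] = 0, applied to the list of inputs u[1..T]."""
    x = [0.0] * len(A)
    ys = []
    for uk in u:
        x = [p + q for p, q in zip(matvec(A, x), matvec(B, uk))]
        ys.append([p + q for p, q in zip(matvec(C, x), matvec(D, uk))])
    return ys

def relu(v):
    """Elementwise ReLU, a stand-in for the network g of Definition 5."""
    return [max(0.0, x) for x in v]

def ssm_block(ssm, g, alpha, u):
    """f(u)[k] = g(S_Sigma(u)[k]) + alpha * u[k] for every time step k."""
    y = simulate_ssm(*ssm, u)
    return [[gi + alpha * ui for gi, ui in zip(g(yk), uk)] for yk, uk in zip(y, u)]

def mean_pool(u):
    """Average pooling over the time axis."""
    return [sum(column) / len(u) for column in zip(*u)]

def deep_ssm(W_enc, blocks, W_dec, u):
    """Decoder o Pool o Block_L o ... o Block_1 o Encoder."""
    h = [matvec(W_enc, uk) for uk in u]   # same encoder matrix at every step
    for ssm, g, alpha in blocks:
        h = ssm_block(ssm, g, alpha, h)
    return matvec(W_dec, mean_pool(h))

# Hypothetical model: n_in = 1, n_u = 2, n_out = 1, one block, T = 2.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
ssm = ([[0.5, 0.0], [0.0, 0.5]], identity, identity, zeros)
out = deep_ssm([[1.0], [-1.0]], [(ssm, relu, 1.0)], [[1.0, 1.0]], [[1.0], [1.0]])
print(out)  # [1.25]
```

Setting `alpha` to 0 removes the residual connection, and dropping `mean_pool` while applying `W_dec` at every time step gives the regression variant described above.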
| Model | SSM | Block |
|---|---|---|
| S4 Gu et al. (2021) | LTI, $A=\Lambda-PQ^{*}$, block-diagonal | SSM + nonlinear activation |
| S4D Gu et al. (2022) | LTI, $A=-\exp(A_{Re})+i\cdot A_{Im}$, block-diagonal | SSM + nonlinear activation |
| S5 Smith et al. (2022) | LTI, diagonal $A$ | SSM + nonlinear activation |
| LRU Orvieto et al. (2023) | LTI, diagonal $A$, complex exponential parametrization | SSM + MLP skip connection |

Table 1: Summary of popular deep SSM models.
The Encoder and Decoder layers are given by the weight matrices $W^{\text{Enc}}$ and
$W^{\text{Dec}}$. Therefore $f^{\text{Enc}}(\mathbf{u})[k]=W^{\text{Enc}}\mathbf{u}[k]$ and $f^{\text{Dec}}(\mathbf{u})[k]=W^{\text{Dec}}\mathbf{u}[k]$ for all $k\in[T]$. We use the slightly abused notations
$f^{\text{Enc}}\equiv\langle W^{\text{Enc}},\cdot\rangle$ and $f^{\text{Dec}}\equiv\langle W^{\text{Dec}},\cdot\rangle$.
As for the Neural Network components of an SSM block, we consider the following
two variants.
Definition 7 (MLP layer)
An MLP layer is a function from ℓ ∞ , ∞ ( ℝ n u ) superscript ℓ
superscript ℝ subscript 𝑛 𝑢 \ell^{\infty,\infty}(\mathbb{R}^{n_{u}}) roman_ℓ start_POSTSUPERSCRIPT ∞ , ∞ end_POSTSUPERSCRIPT ( blackboard_R start_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT end_POSTSUPERSCRIPT )
to ℓ ∞ , ∞ ( ℝ n u ) superscript ℓ
$\ell^{\infty,\infty}(\mathbb{R}^{n_{u}})$ that is induced by applying a
deep neural network $f:\mathbb{R}^{n_{u}}\to\mathbb{R}^{n_{u}}$ at each
timestep. A neural network with $L$ layers is a function of the form
$f=f_{W_{1},\mathbf{b}_{1}}\circ\ldots\circ f_{W_{L},\mathbf{b}_{L}}\circ g_{W_{L+1},\mathbf{b}_{L+1}}$, where $f_{W,\mathbf{b}}(\mathbf{x})=\rho(g_{W,\mathbf{b}}(\mathbf{x}))$ is called a
hidden layer, $g_{W,\mathbf{b}}(\mathbf{x})=W\mathbf{x}+\mathbf{b}$ is called the preactivation and
$\rho$ is the activation function, which is identical for all layers of the
network and is either sigmoid or ReLU. The matrices are of size $W_{i}\in\mathbb{R}^{n_{i+1}\times n_{i}}$ and $\mathbf{b}\in\mathbb{R}^{n_{i}}$ such that
$n_{1}=n_{u}$ and $n_{L+1}=n_{u}$. By slight abuse of notation, for $\mathbf{u}\in\ell^{\infty,\infty}(\mathbb{R}^{n_{u}})$ let $f(\mathbf{u})\in\ell^{\infty,\infty}(\mathbb{R}^{n_{v}})$ such that $f(\mathbf{u})[k]=f(\mathbf{u}[k])$ for all $1\leq k\leq T$.
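The timestep-wise lifting $f(\mathbf{u})[k]=f(\mathbf{u}[k])$ can be illustrated with a minimal pure-Python sketch. The helper names (`affine`, `mlp`, `mlp_layer`), the example weights, and the order in which the list of layers is applied (hidden layers first, a final affine map last) are assumptions of this sketch, not notation from the paper.

```python
def relu(x):
    # Elementwise ReLU activation.
    return [max(0.0, v) for v in x]

def affine(W, b, x):
    # Preactivation g_{W,b}(x) = W x + b, with W given as a list of rows.
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]

def mlp(params, x, rho=relu):
    # params = [(W, b), ...] applied in list order (an assumption of this sketch);
    # the activation rho follows every affine map except the last one.
    for W, b in params[:-1]:
        x = rho(affine(W, b, x))
    W, b = params[-1]
    return affine(W, b, x)

def mlp_layer(params, u):
    # Lift the network to sequences: f(u)[k] = f(u[k]); no mixing across time.
    return [mlp(params, u_k) for u_k in u]

# Hypothetical example: one hidden layer, inputs and outputs of dimension 2, T = 2.
params = [([[1.0, -1.0], [0.0, 2.0]], [0.0, -1.0]),
          ([[1.0, 1.0], [1.0, -1.0]], [0.5, 0.0])]
u = [[1.0, 2.0], [3.0, 0.5]]
print(mlp_layer(params, u))
```

Because the same map is applied independently at every step, changing one entry of the sequence only changes the corresponding output entry.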
Definition 8 (GLU layer, Smith et al. (2022))
A GLU layer is a function of the form
$GLU:\ell^{\infty,\infty}(\mathbb{R}^{n_{u}})\to\ell^{\infty,\infty}(\mathbb{R}^{n_{v}})$ parametrized by a linear
operator $W$ such that $GLU(\mathbf{u})[k]=GELU(\mathbf{u}[k])\odot\sigma(W(GELU(\mathbf{u}[k])))$, where $\sigma$ is the sigmoid function and
GELU is the Gaussian Error Linear Unit Hendrycks and Gimpel (2016).
Note that this definition of the GLU layer differs from the original definition in
Dauphin et al. (2017), because in deep SSM models GLU is usually applied
individually at each time step, without any time-mixing operations. See
Appendix G.1 in Smith et al. (2022). The linear operation $W$ is usually represented by a
convolution operation.
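As an illustration of Definition 8, the following pure-Python sketch applies the gate at each time step separately. It assumes that $W$ is a square matrix (so that the elementwise product is well defined) and uses the exact erf-based form of GELU; the function names are hypothetical.

```python
import math

def gelu(x):
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def glu_step(W, x):
    # GLU(u)[k] = GELU(u[k]) * sigmoid(W GELU(u[k])), elementwise product.
    g = [gelu(v) for v in x]
    gate = [sigmoid(sum(w * v for w, v in zip(row, g))) for row in W]
    return [gi * si for gi, si in zip(g, gate)]

def glu_layer(W, u):
    # Applied individually at each time step, without time mixing.
    return [glu_step(W, u_k) for u_k in u]

# Hypothetical example: with W = 0 the gate equals 1/2 in every coordinate.
W_zero = [[0.0, 0.0], [0.0, 0.0]]
print(glu_layer(W_zero, [[0.0, 1.0], [2.0, -1.0]]))
```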
3.2 Learning problem
We consider the usual supervised learning framework for sequential input data.
The considered models, parametrized by $\theta$, are of the form $f_{\theta}:\mathcal{U}^{+}\to\mathcal{Y}$, where $\mathcal{U}\subseteq\mathbb{R}^{n_{\text{in}}}$ is the input space, $\mathcal{U}^{+}$ denotes the set of all
finite sequences of elements of $\mathcal{U}$, and $\mathcal{Y}$ is the output space, either a finite set for
classification or $\mathcal{Y}\subseteq\mathbb{R}^{n_{\text{out}}}$ for regression.
Without loss of generality we assume $n_{\text{out}}=1$. In practice, $\theta$ is
usually obtained by some learning algorithm, such as Gradient Flow or
(Stochastic) Gradient Descent. In this paper, we are agnostic regarding the
origin of $\theta$, as we prove a generalization bound that holds for a set of
models; therefore we omit the subscript $\theta$ from the notation of $f$.
For training we use input sequences of the same length $T$. More precisely,
let us denote by $\mathcal{U}^{T}$ the set of all sequences of elements of
$\mathcal{U}$ of length $T$. A dataset is an i.i.d. sample of the form $S=\{(\mathbf{u}_{i},\mathbf{y}_{i})\}_{i=1}^{N}$ where $\mathbf{u}_{i}\in\mathcal{U}^{T}$ and $\mathbf{y}_{i}\in\mathcal{Y}$ for all $i$, where the random sample is drawn with respect to some
probability measure $\mathcal{D}$ defined on
the $\sigma$-algebra generated by the Borel sets of $\mathcal{U}^{T}\times\mathcal{Y}$, where $\mathcal{U}^{T}\times\mathcal{Y}$ is considered with
respect to the standard topology as a subset of $(\mathbb{R}^{n_{\text{in}}})^{T}\times\mathbb{R}^{n_{\text{out}}}$.
An elementwise loss function is of the form $\ell:\mathcal{Y}\times\mathcal{Y}\to\mathbb{R}$ and is assumed to be
$K_{\ell}$-Lipschitz continuous, i.e. $|\ell(y_{1},y_{1}^{\prime})-\ell(y_{2},y_{2}^{\prime})|\leq K_{\ell}(|y_{1}-y_{2}|+|y_{1}^{\prime}-y_{2}^{\prime}|)$ for all $y_{1},y_{2},y_{1}^{\prime},y_{2}^{\prime}\in\mathbb{R}$, and to
satisfy $\ell(y,y)=0$ for all $y\in\mathbb{R}$. Let $\mathcal{L}_{emp}^{S}(f)=\frac{1}{N}\sum_{i=1}^{N}\ell(f(\mathbf{u}_{i}),y_{i})$ denote the empirical loss of a
model $f$ w.r.t. a dataset $S$. We denote the true error by $\mathcal{L}(f)=\mathbf{E}_{(\mathbf{u},y)\sim\mathcal{D}}[\ell(f(\mathbf{u}),y)]$. The generalization error
or gap of a model $f$ is defined as $|\mathcal{L}_{emp}^{S}(f)-\mathcal{L}(f)|$. In practice, we can approximate the generalization gap by the
empirical generalization gap, i.e. the loss difference on the training data and
some test data (see Devroye et al. (2013)).
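The empirical loss and the empirical generalization gap can be computed as in the following minimal sketch. The model (the mean of a scalar sequence), the absolute loss (which is 1-Lipschitz and vanishes on the diagonal), and the toy data are hypothetical choices made only for illustration.

```python
def empirical_loss(f, loss, data):
    # L_emp^S(f) = (1/N) * sum_i loss(f(u_i), y_i) over the sample S = data.
    return sum(loss(f(u), y) for u, y in data) / len(data)

def empirical_gap(f, loss, train, test):
    # Proxy for |L_emp^S(f) - L(f)|: the true error is replaced by a test-set average.
    return abs(empirical_loss(f, loss, train) - empirical_loss(f, loss, test))

def abs_loss(prediction, target):
    # Absolute loss: Lipschitz in both arguments and zero when prediction == target.
    return abs(prediction - target)

def mean_model(u):
    # Hypothetical model mapping a scalar sequence to its average.
    return sum(u) / len(u)

train = [([1.0, 3.0], 2.0), ([0.0, 2.0], 2.0)]
test = [([4.0, 4.0], 3.0)]
print(empirical_loss(mean_model, abs_loss, train))
print(empirical_gap(mean_model, abs_loss, train, test))
```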
3.3 Assumptions
Before moving on to the main result, we summarize the assumptions
we make in the paper for the sake of readability. Hereinafter we denote by
$\mathcal{F}$ a set of deep SSM models represented by the direct product of its
layerwise parameters. Furthermore, let $\mathcal{E}$ denote the set of all SSM
models $\Sigma$ for which there is a model $f\in\mathcal{F}$ such that
$\Sigma$ is an SSM layer of $f$.
Assumption 9
We assume the following properties hold.
1.
Scalar output. Let $n_{\text{out}}=1$.
2.
Lipschitz loss function. Let the elementwise loss $l$
be $L_{l}$-Lipschitz continuous.
3.
Bounded input. There exist $K_{\mathbf{u}}>0$
and $K_{y}>0$ such that for all input trajectories $\mathbf{u}$
we have $\left\lVert\mathbf{u}\right\rVert_{\ell^{2,2}(\mathbb{R}^{n_{\text{in}}})}\leq K_{\mathbf{u}}$
and for all labels we have $|y|\leq K_{y}$.
4.
Stability. All $\Sigma\in\mathcal{E}$ are internally
stable, implying $\left\lVert\Sigma\right\rVert_{p}<+\infty$ for $p=1,2$. Therefore we
assume there exists $K_{p}>0$ such that $\sup\limits_{\Sigma\in\mathcal{E}}\left\lVert\Sigma\right\rVert_{p}<K_{p}$ for $p=1,2$.
5.
Bounded Encoder and Decoder. We assume the Encoder and
Decoder have bounded operator norms, i.e. $\sup\limits_{W^{\text{Enc}}\in\mathcal{W}_{\text{Enc}}}\left\lVert W^{\text{Enc}}\right\rVert_{2,2}<K_{\text{Enc}}$ and
$\sup\limits_{W^{\text{Dec}}\in\mathcal{W}_{\text{Dec}}}\left\lVert W^{\text{Dec}}\right\rVert_{\infty,\beta}<K_{\text{Dec}}$ for $\beta\in\mathbb{N}\cup\{\infty\}$.
Assumption 1. is not restrictive as we consider classification.
The Lipschitzness in Assumption 2. holds for most of the loss functions used in
practice. We mention that even the square-loss is Lipschitz on a bounded domain.
From a practical perspective, the upper boundedness is also mild, as parameters
along the learning algorithm’s trajectory usually keep $l$ bounded. In the worst
case, $l$ is bounded on a bounded domain due to being Lipschitz.
Assumption 3. is yet again standard in the literature. Even in practical
applications the input is usually normalized or standardized as a preprocessing
step before learning.
Assumption 4. is the most important one as it plays a central role in our work.
The motivation behind this assumption is twofold. First, in practical
implementations of SSM-based architectures, it is very common to apply some
structured parametrization of the system matrices, which leads to
learning stable matrices. In many cases, the underlying intention is numerical
stability of the learning algorithm; however, we argue that the major advantage
of such parametrizations is to ensure a stable behavior of the system. Second,
similar stability assumptions are standard in control theory.
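To make the link between stability and finite system norms concrete, the sketch below considers a scalar discrete-time system $x[k+1]=ax[k]+bu[k]$, $y[k]=cx[k]+du[k]$ and computes the $\ell^1$ and $\ell^2$ norms of its impulse response. This assumes that the norms $\lVert\Sigma\rVert_{1}$ and $\lVert\Sigma\rVert_{2}$ are of impulse-response type (for a single-input single-output system the $H_2$ norm equals the $\ell^2$ norm of the impulse response); the definitions used in the paper may differ in detail, and all names below are hypothetical.

```python
import math

def impulse_response(a, b, c, d, steps):
    # Simulate x[k+1] = a x[k] + b u[k], y[k] = c x[k] + d u[k] from x[0] = 0
    # with a unit impulse at k = 0; returns y[0], ..., y[steps - 1].
    h, x = [], 0.0
    for k in range(steps):
        u = 1.0 if k == 0 else 0.0
        h.append(c * x + d * u)
        x = a * x + b * u
    return h

def l1_norm(h):
    return sum(abs(v) for v in h)

def l2_norm(h):
    return math.sqrt(sum(v * v for v in h))

# Stable case |a| < 1: truncated sums approach the closed forms
# |d| + |c b| / (1 - |a|)  and  sqrt(d^2 + (c b)^2 / (1 - a^2)).
h_stable = impulse_response(0.5, 1.0, 2.0, 0.0, 200)
print(l1_norm(h_stable), l2_norm(h_stable) ** 2)

# Unstable case |a| > 1: the truncated norms keep growing with the horizon.
h_unstable = impulse_response(1.1, 1.0, 2.0, 0.0, 100)
print(l1_norm(h_unstable))
```

For the stable example the two printed values are close to 4 and 16/3, independently of how long the horizon is taken beyond a few dozen steps, whereas no uniform bound exists in the unstable case.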
Assumption 5. is again fairly standard, as it requires that the weights of the
encoder and decoder are bounded.
4 Main results
Our main result is a Rademacher complexity based generalization bound for deep
SSM models that does not depend on the sequence length. The main challenges in
establishing such bounds are threefold:
1.
without any assumption on the SSMs, their Rademacher complexity itself is not
trivial to upper bound,
2.
even if one had a bound on the Rademacher complexity of SSMs, it needs
to be extended to a block of SSMs and deep neural networks,
3.
deep SSM structures usually contain several of these blocks,
therefore any estimation on the Rademacher complexity of the model class
has to address this situation.
As to the first point, we argue that the stability of the SSMs has a crucial
role in the generalization ability of this kind of model. While applying
stable parametrization is common in practical implementations, to the best of our
knowledge, the literature has been lacking any theoretical guarantees on the
effect of stability on the model’s performance. To this end, we show that the
Rademacher complexity of a set of SSMs can be upper bounded by a term which has
a tight dependence on the maximal $H_{2}$ norm of the SSMs.
The second and third points can be translated to the task of estimating the
Rademacher complexity of a deep structure, where each block of the composition
may have a different mathematical nature. We can overcome this obstacle by
introducing a property of functions, referred to as Rademacher Contraction, that
is universal enough to include functions represented by both SSMs and neural
networks.
Definition 10 ($(\mu,c)$-Rademacher Contraction)
Let $X_{1}$ and $X_{2}$ be subsets of Banach spaces
$\mathcal{X}_{1},\mathcal{X}_{2}$,
with norms $\|\cdot\|_{\mathcal{X}_{1}}$ and $\|\cdot\|_{\mathcal{X}_{2}}$, and let
$\mu\geq 0$ and $c\geq 0$.
A set of functions $\Phi=\{\varphi:X_{1}\to X_{2}\}$ is said to be a $(\mu,c)$-Rademacher Contraction,
or $(\mu,c)$-RC in short, w.r.t. $\|\cdot\|_{\mathcal{X}_{2}}$ and
$\|\cdot\|_{\mathcal{X}_{1}}$, if for all $N\in\mathbb{N}^{+}$ and $Z\subseteq X_{1}^{N}$ we have
$$\mathbb{E}_{\sigma}\left[\sup\limits_{\varphi\in\Phi}\sup\limits_{\{\mathbf{u}_{i}\}_{i=1}^{N}\in Z}\left\lVert\frac{1}{N}\sum\limits_{i=1}^{N}\sigma_{i}\varphi(\mathbf{u}_{i})\right\rVert_{\mathcal{X}_{2}}\right]\leq\mu\,\mathbb{E}_{\sigma}\left[\sup\limits_{\{\mathbf{u}_{i}\}_{i=1}^{N}\in Z}\left\lVert\frac{1}{N}\sum\limits_{i=1}^{N}\sigma_{i}\mathbf{u}_{i}\right\rVert_{\mathcal{X}_{1}}\right]+\frac{c}{\sqrt{N}},\tag{2}$$
where $\sigma_{i}$, $i\in[N]$, are i.i.d. Rademacher random variables, i.e.
$\mathbb{P}(\sigma_{i}=1)=\mathbb{P}(\sigma_{i}=-1)=1/2$.
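For a small finite example, both sides of inequality (2) can be evaluated exactly by enumerating all sign vectors. The sketch below does this for a finite set of linear maps $\mathbf{x}\mapsto W\mathbf{x}$ on $\mathbb{R}^{2}$ with Euclidean norms, taking $\mu$ to be the largest Frobenius norm (an upper bound on the operator norm) and $c=0$. The matrices, the set $Z$ and all function names are hypothetical and serve only as a sanity check, not as part of the paper's argument.

```python
import itertools
import math

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def norm2(x):
    return math.sqrt(sum(v * v for v in x))

def frobenius(W):
    return math.sqrt(sum(w * w for row in W for w in row))

def signed_mean(sigma, vectors):
    # (1/N) * sum_i sigma_i * v_i, computed coordinatewise.
    n = len(vectors)
    return [sum(s * v[j] for s, v in zip(sigma, vectors)) / n
            for j in range(len(vectors[0]))]

def rc_sides(Ws, Z):
    # Exact expectations over all 2^N sign vectors; Z is a list of N-tuples of inputs.
    N = len(Z[0])
    signs = list(itertools.product((-1.0, 1.0), repeat=N))
    lhs = rhs = 0.0
    for sigma in signs:
        lhs += max(norm2(signed_mean(sigma, [matvec(W, u) for u in us]))
                   for W in Ws for us in Z)
        rhs += max(norm2(signed_mean(sigma, us)) for us in Z)
    return lhs / len(signs), rhs / len(signs)

Ws = [[[1.0, 2.0], [0.0, 1.0]], [[0.5, 0.0], [0.0, -3.0]]]
Z = [([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]),
     ([2.0, -1.0], [0.5, 0.5], [-1.0, 0.0])]
mu = max(frobenius(W) for W in Ws)
lhs, rhs = rc_sides(Ws, Z)
print(lhs <= mu * rhs)
```

In this linear case the inequality already holds for each fixed sign vector, since the signed average commutes with $W$; the nonlinear layers are where the additive term $c/\sqrt{N}$ becomes relevant.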
Rademacher Contractions in the literature. Special cases of the
inequality in Definition 10 can be found in the literature. In
Golowich et al. (2018) the authors considered biasless ReLU networks and
proved a similar inequality using Talagrand’s contraction lemma
Ledoux and Talagrand (1991). In Truong (2022b), the author
considered neural networks with dense and convolutional layers and derived a
PAC bound via bounding the Rademacher complexity. One of the key technical
achievements there is Theorem 9, which is a more general version of the inequality
in Golowich et al. (2018). This was then applied to obtain generalization
bounds for the task of learning Markov chains in
Truong (2022a); however, the generalization error was measured
via the marginal cost and the $(\mu,c)$-RC type inequality was only applied
to time-invariant neural networks. In contrast, we prove that, along with
time-invariant models, stable SSMs, defined between certain Banach spaces, also
satisfy equation (2), and we apply it to deep structures.
In a recent work, Trauger and Tewari (2024) consider Transformers
and implicitly establish inequalities similar to (2) by bounding
different kinds of operator norms of the model, and extend them to a
stack of Transformer layers. Besides these similarities, some key differences
in our work are that Definition 10 provides an explicit way to
combine SSMs with neural networks, even in residual blocks; we do not assume
the SSM matrices to be bounded, instead we require the system norm to be
bounded via stability, which is a weaker condition; and we upper bound the
Rademacher complexity directly instead of bounding the covering number.
Interpretation of equation (2). Usually, the machine
learning literature deals with functions between vector spaces, whereas in the
above definition we consider functions between normed spaces. As a result, in
order to apply this definition to any component of a deep SSM model, which are
defined as maps between vector spaces in Section 3.1, we need
to equip the domain and range of these maps with some norms. The choice of
these norms is arbitrary, but the constants $\mu$ and $c$ depend on it, and
the componentwise analysis of a deep structure requires that the image of an
intermediate layer is equipped with the same norm as the domain of the succeeding
layer. Foreshadowing the next section, this is exactly what we would like to
do, namely to show that each component of a deep SSM model, parametrized by some
appropriate set of parameters, is $(\mu,c)$-RC for some $\mu$ and $c$.
In the case of affine layers and SSMs, this is proven in Lemma 13, and
those results hold for unbounded sets $X_{1}$ and $X_{2}$ with $c=0$. Lemma
16 gives similar results for sigmoid and ReLU MLPs with $c>0$,
which are also valid for unbounded input spaces. As for the GLU layers, it is
possible to prove a $(\mu,c)$-RC property, but the constants depend on the
size of $X_{1}$. Consequently, we restrict the input space of the nonlinear
layers to a ball of a fixed radius. As these layers are preceded and succeeded
by other components, we need to ensure that if the input of the deep SSM is
from a ball of some radius $r$, the output of each layer is also in a ball of
a radius that may depend on $r$ and the layer’s parameter set.
The following lemma provides a way to establish an inequality of type (2)
for deep structures whose components are each $(\mu,c)$-RC.
Lemma 11 (Composition lemma)
Let $\Phi_{1}=\{\varphi_{1}:X_{1}\to X_{2}\}$ be $(\mu_{1},c_{1})$-RC and $\Phi_{2}=\{\varphi_{2}:X_{2}\to X_{3}\}$ be $(\mu_{2},c_{2})$-RC. Then the set of
compositions $\Phi_{2}\circ\Phi_{1}:=\{\varphi_{2}\circ\varphi_{1}:X_{1}\to X_{3}\mid\varphi_{1}\in\Phi_{1},\varphi_{2}\in\Phi_{2}\}$ is $(\mu_{1}\mu_{2},\mu_{2}c_{1}+c_{2})$-RC.
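One way to see where the constants come from is the following sketch, which is not necessarily the argument used in the appendix. Fix $N$ and $Z\subseteq X_{1}^{N}$, and let $Z^{\prime}=\{\{\varphi_{1}(\mathbf{u}_{i})\}_{i=1}^{N}\mid\varphi_{1}\in\Phi_{1},\ \{\mathbf{u}_{i}\}_{i=1}^{N}\in Z\}\subseteq X_{2}^{N}$ (the symbol $Z^{\prime}$ is introduced here only for this sketch). Applying (2) first to $\Phi_{2}$ on $Z^{\prime}$ and then to $\Phi_{1}$ on $Z$ gives

$$\begin{aligned}
\mathbb{E}_{\sigma}\left[\sup_{\varphi_{2}\in\Phi_{2}}\sup_{\{\mathbf{v}_{i}\}_{i=1}^{N}\in Z^{\prime}}\left\lVert\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}\varphi_{2}(\mathbf{v}_{i})\right\rVert_{\mathcal{X}_{3}}\right]
&\leq\mu_{2}\,\mathbb{E}_{\sigma}\left[\sup_{\varphi_{1}\in\Phi_{1}}\sup_{\{\mathbf{u}_{i}\}_{i=1}^{N}\in Z}\left\lVert\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}\varphi_{1}(\mathbf{u}_{i})\right\rVert_{\mathcal{X}_{2}}\right]+\frac{c_{2}}{\sqrt{N}}\\
&\leq\mu_{1}\mu_{2}\,\mathbb{E}_{\sigma}\left[\sup_{\{\mathbf{u}_{i}\}_{i=1}^{N}\in Z}\left\lVert\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}\mathbf{u}_{i}\right\rVert_{\mathcal{X}_{1}}\right]+\frac{\mu_{2}c_{1}+c_{2}}{\sqrt{N}},
\end{aligned}$$

where the left-hand side coincides with the left-hand side of (2) for the class $\Phi_{2}\circ\Phi_{1}$ on $Z$, and $\mathcal{X}_{3}$ denotes the Banach space containing $X_{3}$.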
The proof is in Appendix B. The upcoming corollary
follows by induction, together with the fact that the pooling layer
$f^{\text{Pool}}$ is $(1,0)$-RC (see Lemma 13).
Corollary 12
Let $f$ be a deep SSM model, i.e. according to Definition
6, $f=f^{\text{Dec}}\circ f^{\text{Pool}}\circ f^{\text{B}_{L}}\circ\ldots\circ f^{\text{B}_{1}}\circ f^{\text{Enc}}$, such that $f^{\text{Enc}}$, $f^{\text{Dec}}$
and each $f^{\text{B}_{i}}$ are $(\mu_{0},c_{0})$-RC, $(\mu_{L+1},c_{L+1})$-RC and $(\mu_{i},c_{i})$-RC respectively for all $i$. Then $f$ is
$\left(\prod\limits_{i=0}^{L+1}\mu_{i},\sum\limits_{j=1}^{L}\left[\prod\limits_{i=j+1}^{L+1}\mu_{i}\right]c_{j}\right)$-RC.
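The induction behind the corollary amounts to folding the rule of Lemma 11 over the layers in order of application. The following sketch does this for hypothetical constants; the function names and the numerical values are assumptions chosen for illustration. In this example the encoder and decoder have $c_{0}=c_{L+1}=0$, in which case the folded constant agrees with the sum stated in the corollary.

```python
def compose(first, second):
    # Lemma 11: (mu1, c1) followed by (mu2, c2) gives (mu1 * mu2, mu2 * c1 + c2).
    mu1, c1 = first
    mu2, c2 = second
    return (mu1 * mu2, mu2 * c1 + c2)

def compose_chain(constants):
    # constants are listed in order of application (encoder first, decoder last);
    # (1, 0) is the neutral element of the composition rule.
    total = (1.0, 0.0)
    for layer in constants:
        total = compose(total, layer)
    return total

# Hypothetical constants: encoder, two blocks, pooling (1, 0), decoder.
encoder, blocks, pooling, decoder = (2.0, 0.0), [(1.5, 0.1), (0.5, 0.2)], (1.0, 0.0), (3.0, 0.0)
mu, c = compose_chain([encoder] + blocks + [pooling, decoder])
print(mu, c)
```

The multiplicative constant is the product of all $\mu_{i}$, while each $c_{j}$ is amplified only by the $\mu_{i}$ of the layers applied after it.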
Two tasks remain. First, in light of the previous
corollary, we need to show that each component of a deep SSM model is $(\mu,c)$-RC for some $\mu$ and $c$ w.r.t. compatible normed spaces. Second, we need
to show that the Rademacher complexity of a $(\mu,c)$-RC model set is
bounded in terms of $\mu$ and $c$. We start with the first one.
Lemma 13
Let $\mathcal{W}_{\text{Enc}}$, $\mathcal{W}_{\text{Dec}}$, and $\mathcal{E}$ denote some
sets of parameters of some fixed Encoder, Decoder and SSM layers,
respectively. The corresponding function sets are
• $\mathcal{F}_{\text{Enc}}=\{f^{\text{Enc}}=\langle W^{\text{Enc}},\cdot\rangle\mid W^{\text{Enc}}\in\mathcal{W}_{\text{Enc}}\}$,
• $\mathcal{F}_{\text{Dec}}=\{f^{\text{Dec}}=\langle W^{\text{Dec}},\cdot\rangle\mid W^{\text{Dec}}\in\mathcal{W}_{\text{Dec}}\}$,
• $\mathcal{F}_{\text{SSM}}=\{\mathcal{S}_{\Sigma}\mid\Sigma\in\mathcal{E}\}$,
where $\mathcal{S}_{\Sigma}$ denotes the input-output map of the dynamical
system $\Sigma$. Then all of these function sets are $(\mu,c)$-RC
according to the following table, where for any Banach space $\mathcal{X}$,
$B_{\mathcal{X}}(r)=\{x\in\mathcal{X}\mid\lVert x\rVert_{\mathcal{X}}\leq r\}$ denotes the ball of radius $r$ centered at zero for arbitrary $r$.
| | $\mu$ | $c$ | $X_{1}$ | $X_{2}$ |
|---|---|---|---|---|
| $\mathcal{F}_{\text{Enc}}$ | $\sup\limits_{W^{\text{Enc}}\in\mathcal{W}_{\text{Enc}}}\lVert W^{\text{Enc}}\rVert_{2,2}<K_{\text{Enc}}$ | $0$ | $B_{\ell^{2,2}(\mathbb{R}^{n_{\text{in}}})}(r)$ | $B_{\ell^{2,2}(\mathbb{R}^{n_{u}})}(K_{\text{Enc}}r)$ |
| $\mathcal{F}_{\text{Dec}}$ | $\sup\limits_{W^{\text{Dec}}\in\mathcal{W}_{\text{Dec}}}\lVert W^{\text{Dec}}\rVert_{\infty,\infty}<K_{\text{Dec}}$ | $0$ | $B_{\ell^{\infty,\infty}(\mathbb{R}^{n_{u}})}(r)$ | $B_{\ell^{\infty,\infty}(\mathbb{R}^{n_{\text{out}}})}(K_{\text{Dec}}r)$ |
| $\mathcal{F}_{\text{Dec}}$ | $\sup\limits_{W^{\text{Dec}}\in\mathcal{W}_{\text{Dec}}}\lVert W^{\text{Dec}}\rVert_{\infty,\beta}<K_{\text{Dec}}$ | $0$ | $B_{(\mathbb{R}^{n_{u}},\lVert\cdot\rVert_{\infty})}(r)$ | $B_{(\mathbb{R}^{n_{\text{out}}},\lVert\cdot\rVert_{\beta})}(K_{\text{Dec}}r)$ |
| $\mathcal{F}_{\text{SSM}}$ | $\sup\limits_{\Sigma\in\mathcal{E}}\lVert\Sigma\rVert_{2}<K_{2}$ | $0$ | $B_{\ell^{2,2}(\mathbb{R}^{n_{u}})}(r)$ | $B_{\ell^{\infty,\infty}(\mathbb{R}^{n_{y}})}(K_{2}r)$ |
| $\mathcal{F}_{\text{SSM}}$ | $\sup\limits_{\Sigma\in\mathcal{E}}\lVert\Sigma\rVert_{1}<K_{1}$ | $0$ | $B_{\ell^{\infty,\infty}(\mathbb{R}^{n_{u}})}(r)$ | $B_{\ell^{\infty,\infty}(\mathbb{R}^{n_{y}})}(K_{1}r)$ |
Furthermore, the operation of $f^{\text{Pool}}$, defined in Definition
6, is $(1,0)$-RC between $X_{1}=B_{\ell^{\infty,\infty}(\mathbb{R}^{n_{u}})}(r)$ and $X_{2}=B_{(\mathbb{R}^{n_{u}},\lVert\cdot\rVert_{\infty})}(r)$.
The proof is in Appendix B. We can see that the SSM layer can
increase the input's complexity by at most the factor $\lVert\Sigma\rVert_{p}$, $p=1,2$, a
quantity that gets smaller as the system gets more stable. This becomes even more
important when dealing with long-range sequences, because the Neural Network
layers are constant in time.
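To make the link between stability and the size of the system norm concrete, here is a small Python sketch. It does not reproduce the definition of $\lVert\Sigma\rVert_{p}$ given earlier in the paper. As a stand-in assumption it uses the $\ell^{1}$ norm of the impulse response of a scalar discrete-time system $x[k+1]=ax[k]+bu[k]$, $y[k]=cx[k]$, and the name `l1_impulse_norm` is hypothetical.

```python
def l1_impulse_norm(a, b, c, horizon=10_000):
    """Truncated l1 norm of the impulse response c * a**k * b, k = 0..horizon-1.

    Illustrative stand-in only: for |a| < 1 the untruncated sum equals
    |c * b| / (1 - |a|), which shrinks as |a| moves away from 1.
    """
    total = 0.0
    term = abs(c * b)  # |c * a**0 * b|
    for _ in range(horizon):
        total += term
        term *= abs(a)  # next term |c * a**(k+1) * b|
    return total
```

Under this assumption, a pole at $a=0.5$ gives a value close to $2$, while a pole at $a=0.9$ gives a value close to $10$: the closer the system is to the stability boundary, the larger the factor by which the layer can amplify the complexity of its input.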
The last factor that influences the complexity is the set of Neural Network layers
that follow the SSM layers in each SSM block (Definition 5).
There are several results in the literature establishing upper bounds on the
Rademacher complexity of MLPs and their variants. Additionally, in some cases the
proof techniques directly imply the $(\mu,c)$-RC property of these models. As a
result, we do not pay special attention to the MLP layers. We prove the
$(\mu,c)$-RC property of GLU layers as they are commonly used in deep SSM
structures, and deduce the values of $\mu$ and $c$ from previously known results
in the case of deep MLPs.
Lemma 15
Let $\mathcal{F}_{\text{GLU}}$ denote a set of GLU layers, i.e.
$\mathcal{F}_{\text{GLU}}=\{GLU_{W}(\mathbf{u})=GELU(\mathbf{u})\odot\sigma(W(GELU(\mathbf{u})))\mid W\in\mathcal{W}\}$. Under the Bounded Input
assumption in Assumption 9 we have that
$\mathcal{F}_{\text{GLU}}$ is $(\mu,0)$-RC w.r.t. the spaces
$X_{1}=B_{\ell^{\infty,\infty}(\mathbb{R}^{n_{u}})}(K_{\mathbf{u}})=X_{2}$, where
$\mu=16\left(K_{\mathbf{u}}\cdot\max\left\{\sup\limits_{W\in\mathcal{W}}\lVert W\rVert_{\infty,\infty},1\right\}+1\right)\left(\sup\limits_{W\in\mathcal{W}}\lVert W\rVert_{\infty,\infty}+1\right)$.
The proof is in Appendix B.
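The constant $\mu$ of the GLU lemma is simple to evaluate numerically. The sketch below is only an illustration: it assumes that $\lVert W\rVert_{\infty,\infty}$ denotes the operator norm induced by the max norm (the maximum absolute row sum), and the helper names `inf_operator_norm` and `glu_mu` are hypothetical.

```python
def inf_operator_norm(W):
    """Maximum absolute row sum of a matrix given as a list of rows.

    Assumed here to be what ||W||_{inf,inf} denotes.
    """
    return max(sum(abs(w) for w in row) for row in W)


def glu_mu(K_u, sup_W_norm):
    """mu = 16 (K_u * max{sup||W||, 1} + 1) (sup||W|| + 1), as in the lemma."""
    return 16.0 * (K_u * max(sup_W_norm, 1.0) + 1.0) * (sup_W_norm + 1.0)
```

With `W = [[0.5, -1.0], [0.25, 0.25]]` the norm is `1.5`, and `glu_mu(1.0, 1.5)` returns `100.0`. The constant grows quadratically in the weight norm once that norm exceeds $1$.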
Lemma 16
Let $\mathcal{F}^{\rho}$ denote a set of hidden layers, i.e. $\mathcal{F}^{\rho}=\{f_{W,\mathbf{b}}(\mathbf{u})=\rho(W\mathbf{u}+\mathbf{b})\mid(W,\mathbf{b})\in\mathcal{W}\times\mathcal{B}\}$, where $\rho$ is either the sigmoid or the ReLU activation
function. Let us assume that there exist $K_{W}>0$ and $K_{\mathbf{b}}>0$ such
that $\sup\limits_{W\in\mathcal{W}}\lVert W\rVert_{\infty,\infty}<K_{W}$ and
$\sup\limits_{\mathbf{b}\in\mathcal{B}}\lVert\mathbf{b}\rVert_{\infty}<K_{\mathbf{b}}$. Under
the Bounded Input assumption in Assumption 9 we have that
$\mathcal{F}^{ReLU}$ is $(\mu,c)$-RC w.r.t. the spaces $X_{1}=B_{\ell^{\infty,\infty}(\mathbb{R}^{n_{u}})}(r)$, $X_{2}=B_{\ell^{\infty,\infty}(\mathbb{R}^{n_{v}})}(K_{W}r+K_{\mathbf{b}})$, with $\mu=4K_{W}$ and $c=4K_{W}K_{\mathbf{b}}$.
Under the extra assumption that the parameter set is symmetric about the
origin, meaning that $(W,\mathbf{b})\in\mathcal{W}\times\mathcal{B}$ implies
$(-W,-\mathbf{b})\in\mathcal{W}\times\mathcal{B}$, we have that
$\mathcal{F}^{sigmoid}$ is $(\mu,c)$-RC w.r.t. the
spaces $X_{1}=B_{\ell^{\infty,\infty}(\mathbb{R}^{n_{u}})}(r)$, $X_{2}=B_{\ell^{\infty,\infty}(\mathbb{R}^{n_{v}})}(1)$, with $\mu=K_{W}$
and $c=K_{W}(K_{\mathbf{b}}+0.5)$.
The proof is in Appendix B. Combining the previous Lemma with
Lemma 11 and Lemma 21 we get the following for
deep networks.
Corollary 17
Let $g_{W,\mathbf{b}}(\mathbf{x})=W\mathbf{x}+\mathbf{b}$ represent an affine transformation and
consider a set of deep networks $\mathcal{F}^{\rho}=\{f=f^{\rho}_{W_{1},\mathbf{b}_{1}}\circ\ldots\circ f^{\rho}_{W_{L},\mathbf{b}_{L}}\circ g_{W_{L+1},\mathbf{b}_{L+1}}\mid(W_{i},\mathbf{b}_{i})\in\mathcal{W}_{i}\times\mathcal{B}_{i},\ 1\leq i\leq L+1\}$, where $f^{\rho}_{W_{i},\mathbf{b}_{i}}$ are
hidden layers with the same activation $\rho$, equal to either sigmoid or
ReLU. Let $K_{W}>0$ and $K_{\mathbf{b}}>0$ be such that $\sup\limits_{W\in\mathcal{W}_{i}}\lVert W\rVert_{\infty,\infty}<K_{W}$ and $\sup\limits_{\mathbf{b}\in\mathcal{B}_{i}}\lVert\mathbf{b}\rVert_{\infty}<K_{\mathbf{b}}$ hold for all $1\leq i\leq L+1$. Under the assumptions of Lemma 16 we have that
$\mathcal{F}^{ReLU}$ is $(\mu,c)$-RC with $\mu=4(L+1)K_{W}$ and
$c=4(L+1)K_{W}K_{\mathbf{b}}\frac{L(L+1)}{2}$, and $\mathcal{F}^{sigmoid}$ is $(\mu,c)$-RC with $\mu=(L+1)K_{W}$ and $c=(L+1)K_{W}(K_{\mathbf{b}}+0.5)\frac{L(L+1)}{2}$.
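For reference, the single-layer constants of Lemma 16 can be collected in a small helper. This is a sketch for illustration, not part of the original results: the function name `hidden_layer_rc` and the returned triple (including the radius of the output ball $X_{2}$) are choices made here, and the sigmoid branch presupposes the symmetric parameter set required by the lemma.

```python
def hidden_layer_rc(activation, K_W, K_b, r):
    """Return (mu, c, output_radius) for one hidden layer, following Lemma 16.

    activation: "relu" or "sigmoid" (hypothetical string labels)
    K_W, K_b:   bounds on the weight norm and on the bias norm
    r:          radius of the input ball X_1
    """
    if activation == "relu":
        # mu = 4 K_W, c = 4 K_W K_b, outputs lie in a ball of radius K_W r + K_b
        return 4.0 * K_W, 4.0 * K_W * K_b, K_W * r + K_b
    if activation == "sigmoid":
        # mu = K_W, c = K_W (K_b + 0.5), outputs lie in the unit ball
        return K_W, K_W * (K_b + 0.5), 1.0
    raise ValueError("activation must be 'relu' or 'sigmoid'")
```

For instance, `hidden_layer_rc("relu", 2.0, 0.5, 1.0)` returns `(8.0, 4.0, 2.5)`, while the sigmoid layer with the same bounds gives `(2.0, 2.0, 1.0)`. The output radius is what the next layer would use as its own input radius when layers are chained.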
So far we have shown that each component of a deep SSM model satisfies Definition
10. We also proved that the composition of such components
satisfies the definition. The main theorem summarizes these results and
exploits the fact that the Rademacher complexity of a $(\mu,c)$-RC set of
models is upper bounded by terms containing $\mu$ and $c$.
Theorem 18
Let ℱ ℱ \mathcal{F} caligraphic_F be a set of deep SSM models, namely let f ∈ ℱ 𝑓 ℱ f\in\mathcal{F} italic_f ∈ caligraphic_F
has the form f = f Dec ∘ f Pool ∘ f B L ∘ … ∘ f B 1 ∘ f Enc 𝑓 superscript 𝑓 Dec superscript 𝑓 Pool superscript 𝑓 subscript B 𝐿 … superscript 𝑓 subscript B 1 superscript 𝑓 Enc f=f^{\text{Dec}}\circ f^{\text{Pool}}\circ f^{\text{B}_{L}}\circ\ldots\circ f^%
{\text{B}_{1}}\circ f^{\text{Enc}} italic_f = italic_f start_POSTSUPERSCRIPT Dec end_POSTSUPERSCRIPT ∘ italic_f start_POSTSUPERSCRIPT Pool end_POSTSUPERSCRIPT ∘ italic_f start_POSTSUPERSCRIPT B start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∘ … ∘ italic_f start_POSTSUPERSCRIPT B start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∘ italic_f start_POSTSUPERSCRIPT Enc end_POSTSUPERSCRIPT with layer parameter sets
𝒲 Dec subscript 𝒲 Dec \mathcal{W}_{\text{Dec}} caligraphic_W start_POSTSUBSCRIPT Dec end_POSTSUBSCRIPT , 𝒲 B L , … , 𝒲 B 1 subscript 𝒲 subscript B 𝐿 … subscript 𝒲 subscript B 1
\mathcal{W}_{\text{B}_{L}},\ldots,\mathcal{W}_{\text{B}_{1}} caligraphic_W start_POSTSUBSCRIPT B start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT end_POSTSUBSCRIPT , … , caligraphic_W start_POSTSUBSCRIPT B start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT and
𝒲 Enc subscript 𝒲 Enc \mathcal{W}_{\text{Enc}} caligraphic_W start_POSTSUBSCRIPT Enc end_POSTSUBSCRIPT respectively, where f B i superscript 𝑓 subscript B 𝑖 f^{\text{B}_{i}} italic_f start_POSTSUPERSCRIPT B start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT is an SSM block for all
i 𝑖 i italic_i , i.e. f B i ( 𝐳 ) = g i ( 𝒮 Σ i ( 𝐳 ) [ k ] ) + α i 𝐳 [ k ] superscript 𝑓 subscript B 𝑖 𝐳 subscript 𝑔 𝑖 subscript 𝒮 subscript Σ 𝑖 𝐳 delimited-[] 𝑘 subscript 𝛼 𝑖 𝐳 delimited-[] 𝑘 f^{\text{B}_{i}}(\mathbf{z})=g_{i}(\mathcal{S}_{\Sigma_{i}}(\mathbf{z})[k])+%
\alpha_{i}\mathbf{z}[k] italic_f start_POSTSUPERSCRIPT B start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( bold_z ) = italic_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( caligraphic_S start_POSTSUBSCRIPT roman_Σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( bold_z ) [ italic_k ] ) + italic_α start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT bold_z [ italic_k ] and 𝒲 B i = ℰ i × 𝒲 g i subscript 𝒲 subscript B 𝑖 subscript ℰ 𝑖 subscript 𝒲 subscript 𝑔 𝑖 \mathcal{W}_{\text{B}_{i}}=\mathcal{E}_{i}\times\mathcal{W}_{g_{i}} caligraphic_W start_POSTSUBSCRIPT B start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT = caligraphic_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT × caligraphic_W start_POSTSUBSCRIPT italic_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT .
If Assumption 9 holds and each set of nonlinearities
g i : X i → X ^ i : subscript 𝑔 𝑖 → subscript 𝑋 𝑖 subscript ^ 𝑋 𝑖 g_{i}:X_{i}\rightarrow\hat{X}_{i} italic_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT : italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT → over^ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is ( μ g i , c g i ) subscript 𝜇 subscript 𝑔 𝑖 subscript 𝑐 subscript 𝑔 𝑖 (\mu_{g_{i}},c_{g_{i}}) ( italic_μ start_POSTSUBSCRIPT italic_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_c start_POSTSUBSCRIPT italic_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) -RC, with
X 1 = B ( ℝ n u , ∥ ⋅ ∥ 2 ) ( K 2 K 𝐮 K Enc ) subscript 𝑋 1 subscript 𝐵 superscript ℝ subscript 𝑛 𝑢 subscript delimited-∥∥ ⋅ 2 subscript 𝐾 2 subscript 𝐾 𝐮 subscript 𝐾 Enc X_{1}=B_{\left(\mathbb{R}^{n_{u}},\left\lVert\cdot\right\rVert_{2}\right)}(K_{%
2}K_{\mathbf{u}}K_{\text{Enc}}) italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_B start_POSTSUBSCRIPT ( blackboard_R start_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT end_POSTSUPERSCRIPT , ∥ ⋅ ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT ( italic_K start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_K start_POSTSUBSCRIPT bold_u end_POSTSUBSCRIPT italic_K start_POSTSUBSCRIPT Enc end_POSTSUBSCRIPT ) X ^ i = B ( ℝ n u , ∥ ⋅ ∥ ∞ ) ( r ^ i ) subscript ^ 𝑋 𝑖 subscript 𝐵 superscript ℝ subscript 𝑛 𝑢 subscript delimited-∥∥ ⋅ subscript ^ 𝑟 𝑖 \hat{X}_{i}=B_{\left(\mathbb{R}^{n_{u}},\left\lVert\cdot\right\rVert_{\infty}%
\right)}(\hat{r}_{i}) over^ start_ARG italic_X end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_B start_POSTSUBSCRIPT ( blackboard_R start_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT end_POSTSUPERSCRIPT , ∥ ⋅ ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT ( over^ start_ARG italic_r end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT )
X i + 1 = B ( ℝ n u , ∥ ⋅ ∥ ∞ ) ( r i + 1 ) subscript 𝑋 𝑖 1 subscript 𝐵 superscript ℝ subscript 𝑛 𝑢 subscript delimited-∥∥ ⋅ subscript 𝑟 𝑖 1 X_{i+1}=B_{\left(\mathbb{R}^{n_{u}},\left\lVert\cdot\right\rVert_{\infty}%
\right)}(r_{i+1}) italic_X start_POSTSUBSCRIPT italic_i + 1 end_POSTSUBSCRIPT = italic_B start_POSTSUBSCRIPT ( blackboard_R start_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT end_POSTSUPERSCRIPT , ∥ ⋅ ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT ( italic_r start_POSTSUBSCRIPT italic_i + 1 end_POSTSUBSCRIPT ) , r i + 1 ≥ K 1 ( r ^ i + | α i | r i ) subscript 𝑟 𝑖 1 subscript 𝐾 1 subscript ^ 𝑟 𝑖 subscript 𝛼 𝑖 subscript 𝑟 𝑖 r_{i+1}\geq K_{1}(\hat{r}_{i}+|\alpha_{i}|r_{i}) italic_r start_POSTSUBSCRIPT italic_i + 1 end_POSTSUBSCRIPT ≥ italic_K start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( over^ start_ARG italic_r end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + | italic_α start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) , where
$r_{1}=K_{\mathbf{u}}$, $i=1,\ldots,L$, then the following holds with probability at
least $1-\delta$.
\[
\mathbb{P}_{S\sim\mathcal{D}^{N}}\left[\forall f\in\mathcal{F}\quad\mathcal{L}(f)-\mathcal{L}_{emp}^{S}(f)\leq\frac{\mu K_{\mathbf{u}}L_{l}+cL_{l}}{\sqrt{N}}+K_{l}\sqrt{\frac{2\log(4/\delta)}{N}}\right], \tag{3}
\]
where
$\mu\leq K_{\text{Enc}}K_{\text{Dec}}\left(\mu_{g_{1}}K_{2}+\alpha_{1}\right)\prod\limits_{i=2}^{L}\left(\mu_{g_{i}}K_{1}+\alpha_{i}\right)$,
$c\leq K_{\text{Dec}}\sum\limits_{j=1}^{L}\left[\prod\limits_{i=j+1}^{L}\left(\mu_{g_{i}}K_{1}+\alpha_{i}\right)\right]c_{g_{j}}$ and
$K_{l}>0$ is such that $|l(\cdot,\cdot)|\leq K_{l}$.
In particular, we obtain $K_{l}\leq 2L_{l}\max\{K_{\text{Dec}}r_{L+2},K_{y}\}$.
The proof can be found in Appendix B.
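The constants in the bound (3) can be evaluated mechanically once the per-block constants are known. The following is a minimal sketch, not part of the original analysis: the function names `composite_constants` and `gap_bound` and all argument names are hypothetical, and the lists `mu_g`, `alpha`, `c_g` are assumed to hold $\mu_{g_i}$, $\alpha_i$, $c_{g_i}$ for $i=1,\ldots,L$ in order. It returns the upper bounds on $\mu$ and $c$, not the constants themselves.

```python
from math import log, prod, sqrt


def composite_constants(K_enc, K_dec, K1, K2, mu_g, alpha, c_g):
    """Evaluate the upper bounds on (mu, c) stated after inequality (3).

    mu_g, alpha, c_g are lists of length L (hypothetical argument names).
    """
    L = len(mu_g)
    # Block 1 is scaled by K2, blocks 2..L by K1, as in the bound on mu.
    mu = K_enc * K_dec * (mu_g[0] * K2 + alpha[0]) * prod(
        mu_g[i] * K1 + alpha[i] for i in range(1, L)
    )
    # For each block j, c_{g_j} is multiplied by the factors of all later blocks.
    # prod over an empty range is 1, which covers the last block j = L.
    c = K_dec * sum(
        prod(mu_g[i] * K1 + alpha[i] for i in range(j + 1, L)) * c_g[j]
        for j in range(L)
    )
    return mu, c


def gap_bound(mu, c, K_u, L_l, K_l, N, delta):
    """Right-hand side of inequality (3) for sample size N and confidence delta."""
    return (mu * K_u * L_l + c * L_l) / sqrt(N) + K_l * sqrt(2 * log(4 / delta) / N)


if __name__ == "__main__":
    mu, c = composite_constants(1, 1, 1, 2, [1, 1], [0, 1], [1, 1])
    print(mu, c, gap_bound(mu, c, K_u=1.0, L_l=1.0, K_l=1.0, N=100, delta=0.05))
```

Both terms of the right-hand side scale as $1/\sqrt{N}$, so quadrupling the sample size halves the value returned by `gap_bound`; the sequence length $T$ does not appear among the arguments.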
Appendix B Proofs
In this section we repeatedly need to prove the $(\mu,c)$-RC property for linear (or affine)
transformations that are constant in time. For better
readability, we carry out the calculation once and state it as a lemma.
Lemma 21
Let $\mathbf{u}\in\ell^{p,p}(\mathbb{R}^{n_{u}})=:X_{1}$ and let $f(\mathbf{u})=f_{W,\mathbf{b}}(\mathbf{u}):=W\mathbf{u}+\mathbf{b}\in\ell^{q,q}(\mathbb{R}^{n_{v}})=:X_{2}$,
where $W\in\mathbb{R}^{n_{v}\times n_{u}}$, $\mathbf{b}\in\mathbb{R}^{n_{v}}$, and by
definition $W\mathbf{u}\in X_{2}$ such that $(W\mathbf{u}+\mathbf{b})[k]=W\mathbf{u}[k]+\mathbf{b}$, i.e.
it is an affine transformation constant in time. We consider the cases
a) $p=q=2$,
b) $p=2,q=\infty$,
c) $p=q=\infty$.
Let us assume that there exist constants $K_{W}$ and $K_{\mathbf{b}}$ such that
$\sup\limits_{W\in\mathcal{W}}\left\lVert W\right\rVert_{p,q}\leq K_{W}$ and
$\sup\limits_{\mathbf{b}\in\mathcal{B}}\left\lVert\mathbf{b}\right\rVert_{q}\leq K_{\mathbf{b}}$. Then the
set of transformations $\mathcal{F}=\{f_{W,\mathbf{b}}\mid W\in\mathcal{W},\mathbf{b}\in\mathcal{B}\}$ is $(K_{W},K_{\mathbf{b}})$-RC w.r.t. $X_{1}$ and $X_{2}$.
Furthermore, the image of the ball $B_{X_{1}}(r)$ under $f\in\mathcal{F}$ is
contained in $B_{X_{2}}(K_{W}r+K_{\mathbf{b}})$, and therefore $\mathcal{F}|_{B_{X_{1}}(r)}$ is also $(K_{W},K_{\mathbf{b}})$-RC w.r.t. the spaces $B_{X_{1}}(r)$ and
$B_{X_{2}}(K_{W}r+K_{\mathbf{b}})$.
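The ball-image claim can be checked numerically in case c). The sketch below is illustrative only and rests on assumptions not stated in the lemma: it takes $\lVert W\rVert_{\infty,\infty}$ to be the induced operator norm (maximum absolute row sum), uses the norms of a single sampled pair $(W,\mathbf{b})$ in place of $K_{W}$ and $K_{\mathbf{b}}$, and all function names are hypothetical.

```python
import random


def op_norm_inf(W):
    # Induced infinity-to-infinity norm: maximum absolute row sum (assumed meaning
    # of the (inf, inf) matrix norm in the lemma).
    return max(sum(abs(w) for w in row) for row in W)


def affine(W, b, u):
    # Apply the same affine map u[k] -> W u[k] + b at every time step k.
    return [
        [sum(w * x for w, x in zip(row, u_k)) + b_j for row, b_j in zip(W, b)]
        for u_k in u
    ]


def seq_norm_inf(u):
    # l^{inf,inf} norm: largest absolute entry over all time steps and coordinates.
    return max(abs(x) for u_k in u for x in u_k)


def check_ball_image(n_u=3, n_v=2, T=20, r=1.5, trials=200, seed=0):
    """Return True if every sampled sequence in the ball of radius r is mapped
    into the ball of radius ||W|| * r + ||b||_inf."""
    rng = random.Random(seed)
    for _ in range(trials):
        W = [[rng.uniform(-1, 1) for _ in range(n_u)] for _ in range(n_v)]
        b = [rng.uniform(-1, 1) for _ in range(n_v)]
        u = [[rng.uniform(-r, r) for _ in range(n_u)] for _ in range(T)]
        bound = op_norm_inf(W) * r + max(abs(x) for x in b)
        if seq_norm_inf(affine(W, b, u)) > bound + 1e-12:
            return False
    return True


if __name__ == "__main__":
    print(check_ball_image())
```

Random sampling only probes the claim and proves nothing; the point is that the radius $K_{W}r+K_{\mathbf{b}}$ does not depend on the number of time steps `T`.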
Proof
First let us prove a simple fact about Rademacher random variables that we
will need, namely if $\sigma=\{\sigma_{i}\}_{i=1}^{N}$ is a sequence of
i.i.d. Rademacher variables, then
\[
\mathbb{E}_{\sigma}\left[\left|\sum\limits_{i=1}^{N}\sigma_{i}\right|\right]\leq\sqrt{N}. \tag{4}
\]
This is true, because
\begin{align*}
\mathbb{E}_{\sigma}\left[\left|\sum\limits_{i=1}^{N}\sigma_{i}\right|\right]&=\sqrt{\left(\mathbb{E}_{\sigma}\left[\left|\sum\limits_{i=1}^{N}\sigma_{i}\right|\right]\right)^{2}}\leq\sqrt{\mathbb{E}_{\sigma}\left[\left|\sum\limits_{i=1}^{N}\sigma_{i}\right|^{2}\right]}\\
&=\sqrt{\mathbb{E}_{\sigma}\left[\sum\limits_{i=1}^{N}\sigma^{2}_{i}+2\sum\limits_{1\leq i<j\leq N}\sigma_{i}\sigma_{j}\right]}=\sqrt{\sum\limits_{i=1}^{N}\mathbb{E}_{\sigma}\left[\sigma^{2}_{i}\right]+2\sum\limits_{1\leq i<j\leq N}\mathbb{E}_{\sigma}\left[\sigma_{i}\sigma_{j}\right]}=\sqrt{N},
\end{align*}
where the inequality follows from Jensen's inequality and the last
equality follows from the linearity of the expectation, and the facts that
the $\sigma_{i}$ are Rademacher variables and form an i.i.d. sample, so that $\mathbb{E}_{\sigma}[\sigma_{i}^{2}]=1$ and $\mathbb{E}_{\sigma}[\sigma_{i}\sigma_{j}]=0$ for $i\neq j$.
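Inequality (4) can also be verified exactly for small $N$ by enumerating the distribution of the sum. This is a minimal sketch under the standard assumption that each $\sigma_{i}$ takes the values $\pm 1$ with probability $1/2$; the function name is hypothetical.

```python
from math import comb, sqrt


def expected_abs_rademacher_sum(n):
    """Exact value of E|sigma_1 + ... + sigma_n| for i.i.d. uniform +/-1 signs."""
    # If k of the n signs are +1, the sum equals 2k - n; this happens for
    # comb(n, k) of the 2**n equally likely sign patterns.
    total = sum(comb(n, k) * abs(2 * k - n) for k in range(n + 1))
    return total / 2**n


if __name__ == "__main__":
    for n in (1, 2, 5, 10, 50):
        # Print n, the exact expectation, and the bound sqrt(n) from (4).
        print(n, expected_abs_rademacher_sum(n), sqrt(n))
```

For $N=1$ the bound holds with equality, since $|\sigma_{1}|=1$ always; for larger $N$ the Jensen step is strict.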
Let $Y\in\{X_{1},B_{X_{1}}(r)\}$.
a) Let $\underline{\mathbf{b}}\in Y$ denote the sequence for which
$\underline{\mathbf{b}}[k]=\mathbf{b}$ for all $k\in[T]$. For any $Z\subseteq Y$ we
have
\begin{align*}
&\mathbb{E}_{\sigma}\left[\sup\limits_{(W,\mathbf{b})\in\mathcal{W}\times\mathcal{B}}\sup\limits_{\{\mathbf{u}_{i}\}_{i=1}^{N}\in Z}\left\lVert\frac{1}{N}\sum\limits_{i=1}^{N}\sigma_{i}(W\mathbf{u}_{i}+\mathbf{b})\right\rVert_{Y}\right]\\
&\leq\mathbb{E}_{\sigma}\left[\sup\limits_{W\in\mathcal{W}}\sup\limits_{\{\mathbf{u}_{i}\}_{i=1}^{N}\in Z}\left\lVert\frac{1}{N}\sum\limits_{i=1}^{N}\sigma_{i}W\mathbf{u}_{i}\right\rVert_{Y}\right]+\mathbb{E}_{\sigma}\left[\sup\limits_{\mathbf{b}\in\mathcal{B}}\left\lVert\frac{1}{N}\sum\limits_{i=1}^{N}\sigma_{i}\underline{\mathbf{b}}\right\rVert_{Y}\right]\\
&=\mathbb{E}_{\sigma}\left[\sup\limits_{W\in\mathcal{W}}\sup\limits_{\{\mathbf{u}_{i}\}_{i=1}^{N}\in Z}\sqrt{\sum\limits_{k=1}^{T}\left\lVert\frac{1}{N}\sum\limits_{i=1}^{N}\sigma_{i}W\mathbf{u}_{i}[k]\right\rVert_{2}^{2}}\right]+\mathbb{E}_{\sigma}\left[\left|\frac{1}{N}\sum\limits_{i=1}^{N}\sigma_{i}\right|\sup\limits_{\mathbf{b}\in\mathcal{B}}\left\lVert\underline{\mathbf{b}}\right\rVert_{Y}\right]\\
&\leq\mathbb{E}_{\sigma}\left[\sup\limits_{W\in\mathcal{W}}\sup\limits_{\{\mathbf{u}_{i}\}_{i=1}^{N}\in Z}\sqrt{\sum\limits_{k=1}^{T}\left\lVert W\left(\frac{1}{N}\sum\limits_{i=1}^{N}\sigma_{i}\mathbf{u}_{i}[k]\right)\right\rVert_{2}^{2}}\right]+\frac{1}{\sqrt{N}}\sup\limits_{\mathbf{b}\in\mathcal{B}}\left\lVert\mathbf{b}\right\rVert_{2}\\
&\leq\mathbb{E}_{\sigma}\left[\sup\limits_{W\in\mathcal{W}}\sup\limits_{\{\mathbf{u}_{i}\}_{i=1}^{N}\in Z}\left\lVert W\right\rVert_{2,2}\sqrt{\sum\limits_{k=1}^{T}\left\lVert\frac{1}{N}\sum\limits_{i=1}^{N}\sigma_{i}\mathbf{u}_{i}[k]\right\rVert_{2}^{2}}\right]+\frac{1}{\sqrt{N}}\sup\limits_{\mathbf{b}\in\mathcal{B}}\left\lVert\mathbf{b}\right\rVert_{2}\\
&=\sup\limits_{W\in\mathcal{W}}\left\lVert W\right\rVert_{2,2}\mathbb{E}_{\sigma}\left[\sup\limits_{\{\mathbf{u}_{i}\}_{i=1}^{N}\in Z}\left\lVert\frac{1}{N}\sum\limits_{i=1}^{N}\sigma_{i}\mathbf{u}_{i}\right\rVert_{Y}\right]+\frac{1}{\sqrt{N}}\sup\limits_{\mathbf{b}\in\mathcal{B}}\left\lVert\mathbf{b}\right\rVert_{2},
\end{align*}
where the first inequality follows from the triangle inequality, the first
equality is the definition of the norm, the second inequality follows from
inequality (4) and the linearity of $W$, and the last
inequality is a standard inequality for matrix norms.
b) The first inequalities concerning the bias term are exactly the
same as in case a) and hold for the infinity norm as well. We only
have to deal with the term
\[
\mathbb{E}_{\sigma}\left[\sup\limits_{W\in\mathcal{W}}\sup\limits_{\{\mathbf{u}_{i}\}_{i=1}^{N}\in Z}\left\lVert\frac{1}{N}\sum\limits_{i=1}^{N}\sigma_{i}W\mathbf{u}_{i}\right\rVert_{Y}\right],
\]
but now $\mathbf{u}_{i}\in\ell^{2,2}(\mathbb{R}^{n_{u}})$ and $W\mathbf{u}_{i}\in\ell^{\infty,\infty}(\mathbb{R}^{n_{v}})$. We have
\begin{align*}
&\mathbb{E}_{\sigma}\left[\sup\limits_{W\in\mathcal{W}}\sup\limits_{\{\mathbf{u}_{i}\}_{i=1}^{N}\in Z}\left\lVert\frac{1}{N}\sum\limits_{i=1}^{N}\sigma_{i}W\mathbf{u}_{i}\right\rVert_{Y}\right]=\mathbb{E}_{\sigma}\left[\sup\limits_{W\in\mathcal{W}}\sup\limits_{\{\mathbf{u}_{i}\}_{i=1}^{N}\in Z}\left\lVert W\left(\frac{1}{N}\sum\limits_{i=1}^{N}\sigma_{i}\mathbf{u}_{i}\right)\right\rVert_{Y}\right]\\
&=\mathbb{E}_{\sigma}\left[\sup\limits_{W\in\mathcal{W}}\sup\limits_{\{\mathbf{u}_{i}\}_{i=1}^{N}\in Z}\sup\limits_{1\leq k\leq T}\left\lVert W\left(\frac{1}{N}\sum\limits_{i=1}^{N}\sigma_{i}\mathbf{u}_{i}[k]\right)\right\rVert_{\infty}\right]\\
&\leq\mathbb{E}_{\sigma}\left[\sup\limits_{W\in\mathcal{W}}\sup\limits_{\{\mathbf{u}_{i}\}_{i=1}^{N}\in Z}\sup\limits_{1\leq k\leq T}\left\lVert W\right\rVert_{2,\infty}\left\lVert\frac{1}{N}\sum\limits_{i=1}^{N}\sigma_{i}\mathbf{u}_{i}[k]\right\rVert_{2}\right]\\
&=\mathbb{E}_{\sigma}\left[\sup\limits_{W\in\mathcal{W}}\left\lVert W\right\rVert_{2,\infty}\sup\limits_{\{\mathbf{u}_{i}\}_{i=1}^{N}\in Z}\sqrt{\sup\limits_{1\leq k\leq T}\left\lVert\frac{1}{N}\sum\limits_{i=1}^{N}\sigma_{i}\mathbf{u}_{i}[k]\right\rVert_{2}^{2}}\right]
\end{align*}
≤ 𝔼 σ [ sup W ∈ 𝒲 ∥ W ∥ 2 , ∞ sup { 𝐮 i } i = 1 N ∈ Z ∑ k = 1 T ∥ 1 N ∑ i = 1 N σ i 𝐮 i [ k ] ∥ 2 2 ] absent subscript 𝔼 𝜎 delimited-[] subscript supremum 𝑊 𝒲 subscript delimited-∥∥ 𝑊 2
subscript supremum superscript subscript subscript 𝐮 𝑖 𝑖 1 𝑁 𝑍 superscript subscript 𝑘 1 𝑇 superscript subscript delimited-∥∥ 1 𝑁 superscript subscript 𝑖 1 𝑁 subscript 𝜎 𝑖 subscript 𝐮 𝑖 delimited-[] 𝑘 2 2 \displaystyle\leq\mathbb{E}_{\sigma}\left[\sup\limits_{W\in\mathcal{W}}\left%
\lVert W\right\rVert_{2,\infty}\sup\limits_{\{\mathbf{u}_{i}\}_{i=1}^{N}\in Z}%
\sqrt{\sum\limits_{k=1}^{T}\left\lVert\frac{1}{N}\sum\limits_{i=1}^{N}\sigma_{%
i}\mathbf{u}_{i}[k]\right\rVert_{2}^{2}}\right] ≤ blackboard_E start_POSTSUBSCRIPT italic_σ end_POSTSUBSCRIPT [ roman_sup start_POSTSUBSCRIPT italic_W ∈ caligraphic_W end_POSTSUBSCRIPT ∥ italic_W ∥ start_POSTSUBSCRIPT 2 , ∞ end_POSTSUBSCRIPT roman_sup start_POSTSUBSCRIPT { bold_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ∈ italic_Z end_POSTSUBSCRIPT square-root start_ARG ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ∥ divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT bold_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT [ italic_k ] ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG ]
≤ sup W ∈ 𝒲 ∥ W ∥ 2 , ∞ 𝔼 σ [ sup { 𝐮 i } i = 1 N ∈ Z ∥ 1 N ∑ i = 1 N σ i 𝐮 i ∥ ℓ 2 , 2 ( ℝ n u ) ] , absent subscript supremum 𝑊 𝒲 subscript delimited-∥∥ 𝑊 2
subscript 𝔼 𝜎 delimited-[] subscript supremum superscript subscript subscript 𝐮 𝑖 𝑖 1 𝑁 𝑍 subscript delimited-∥∥ 1 𝑁 superscript subscript 𝑖 1 𝑁 subscript 𝜎 𝑖 subscript 𝐮 𝑖 superscript ℓ 2 2
superscript ℝ subscript 𝑛 𝑢 \displaystyle\leq\sup\limits_{W\in\mathcal{W}}\left\lVert W\right\rVert_{2,%
\infty}\mathbb{E}_{\sigma}\left[\sup\limits_{\{\mathbf{u}_{i}\}_{i=1}^{N}\in Z%
}\left\lVert\frac{1}{N}\sum\limits_{i=1}^{N}\sigma_{i}\mathbf{u}_{i}\right%
\rVert_{\ell^{2,2}(\mathbb{R}^{n_{u}})}\right], ≤ roman_sup start_POSTSUBSCRIPT italic_W ∈ caligraphic_W end_POSTSUBSCRIPT ∥ italic_W ∥ start_POSTSUBSCRIPT 2 , ∞ end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT italic_σ end_POSTSUBSCRIPT [ roman_sup start_POSTSUBSCRIPT { bold_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ∈ italic_Z end_POSTSUBSCRIPT ∥ divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT bold_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT roman_ℓ start_POSTSUPERSCRIPT 2 , 2 end_POSTSUPERSCRIPT ( blackboard_R start_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) end_POSTSUBSCRIPT ] ,
where the second equality is the definition of the norm and the first inequality
is the standard inequality for induced matrix norms. The second inequality
bounds the supremum over $k$ by the sum over $k$.
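The two elementary facts used in case b) can be checked numerically. The following is a minimal sketch, assuming that $\lVert W\rVert_{2,\infty}$ denotes the norm induced from $\ell^{2}$ to $\ell^{\infty}$ (the largest Euclidean norm of a row of $W$). The helper names `norm_2_inf`, `matvec` and `check_case_b` are illustrative and do not come from the paper, and the sequence `s` stands in for the averaged signal $\frac{1}{N}\sum_{i}\sigma_{i}\mathbf{u}_{i}$.

```python
import math
import random


def norm_2_inf(W):
    """Induced (l2 -> l_inf) norm: the largest Euclidean norm of a row of W."""
    return max(math.sqrt(sum(w * w for w in row)) for row in W)


def matvec(W, x):
    """Plain matrix-vector product for W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def check_case_b(n_u=4, n_v=3, T=5, trials=200, seed=0, tol=1e-12):
    """Check sup_k ||W s[k]||_inf <= ||W||_{2,inf} sup_k ||s[k]||_2 <= ||W||_{2,inf} ||s||_{l^{2,2}}."""
    rng = random.Random(seed)
    for _ in range(trials):
        W = [[rng.uniform(-1, 1) for _ in range(n_u)] for _ in range(n_v)]
        # s plays the role of (1/N) sum_i sigma_i u_i: a sequence of T vectors.
        s = [[rng.uniform(-1, 1) for _ in range(n_u)] for _ in range(T)]
        lhs = max(max(abs(y) for y in matvec(W, s_k)) for s_k in s)
        sup_2 = max(math.sqrt(sum(v * v for v in s_k)) for s_k in s)
        l22 = math.sqrt(sum(v * v for s_k in s for v in s_k))
        if lhs > norm_2_inf(W) * sup_2 + tol or sup_2 > l22 + tol:
            return False
    return True
```

A random check of this kind does not replace the proof; it only guards against a misreading of which norm appears at each step.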
c) Again, we only need to deal with the term
\displaystyle\mathbb{E}_{\sigma}\left[\sup\limits_{W\in\mathcal{W}}\sup\limits_{\{\mathbf{u}_{i}\}_{i=1}^{N}\in Z}\left\lVert\frac{1}{N}\sum\limits_{i=1}^{N}\sigma_{i}W\mathbf{u}_{i}\right\rVert_{Y}\right],
but now $\mathbf{u}_{i}\in\ell^{\infty,\infty}(\mathbb{R}^{n_{u}})$ and $W\mathbf{u}_{i}\in\ell^{\infty,\infty}(\mathbb{R}^{n_{v}})$. We have
\displaystyle\mathbb{E}_{\sigma}\left[\sup\limits_{W\in\mathcal{W}}\sup\limits_{\{\mathbf{u}_{i}\}_{i=1}^{N}\in Z}\left\lVert\frac{1}{N}\sum\limits_{i=1}^{N}\sigma_{i}W\mathbf{u}_{i}\right\rVert_{Y}\right]=\mathbb{E}_{\sigma}\left[\sup\limits_{W\in\mathcal{W}}\sup\limits_{\{\mathbf{u}_{i}\}_{i=1}^{N}\in Z}\left\lVert W\left(\frac{1}{N}\sum\limits_{i=1}^{N}\sigma_{i}\mathbf{u}_{i}\right)\right\rVert_{Y}\right]
\displaystyle=\mathbb{E}_{\sigma}\left[\sup\limits_{W\in\mathcal{W}}\sup\limits_{\{\mathbf{u}_{i}\}_{i=1}^{N}\in Z}\sup\limits_{1\leq k\leq T}\left\lVert W\left(\frac{1}{N}\sum\limits_{i=1}^{N}\sigma_{i}\mathbf{u}_{i}[k]\right)\right\rVert_{\infty}\right]
\displaystyle\leq\mathbb{E}_{\sigma}\left[\sup\limits_{W\in\mathcal{W}}\sup\limits_{\{\mathbf{u}_{i}\}_{i=1}^{N}\in Z}\sup\limits_{1\leq k\leq T}\left\lVert W\right\rVert_{\infty,\infty}\left\lVert\frac{1}{N}\sum\limits_{i=1}^{N}\sigma_{i}\mathbf{u}_{i}[k]\right\rVert_{\infty}\right]
\displaystyle=\sup\limits_{W\in\mathcal{W}}\left\lVert W\right\rVert_{\infty,\infty}\mathbb{E}_{\sigma}\left[\sup\limits_{\{\mathbf{u}_{i}\}_{i=1}^{N}\in Z}\left\lVert\frac{1}{N}\sum\limits_{i=1}^{N}\sigma_{i}\mathbf{u}_{i}\right\rVert_{\ell^{\infty,\infty}(\mathbb{R}^{n_{u}})}\right],
where the inequality is again the standard inequality for induced matrix norms.
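For small $N$ the expectation over the Rademacher signs in case c) can be evaluated exactly by enumerating all $2^{N}$ sign vectors. The sketch below does this for a finite class of matrices and a single fixed tuple of input sequences (so the supremum over $Z$ is trivial). It assumes that $\lVert W\rVert_{\infty,\infty}$ is the norm induced from $\ell^{\infty}$ to $\ell^{\infty}$ (the largest absolute row sum). The names `norm_inf_inf`, `both_sides` and `random_instance` are illustrative and do not appear in the paper.

```python
import itertools
import random


def norm_inf_inf(W):
    """Induced (l_inf -> l_inf) norm: the largest absolute row sum of W."""
    return max(sum(abs(w) for w in row) for row in W)


def matvec(W, x):
    """Plain matrix-vector product for W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def both_sides(Ws, U):
    """Return (LHS, RHS) of the case c) bound for matrices Ws and samples U.

    U[i][k] is the vector u_i[k]; the expectation over sigma is exact.
    """
    N, T, n_u = len(U), len(U[0]), len(U[0][0])
    signs = list(itertools.product((-1, 1), repeat=N))
    lhs = rhs = 0.0
    for sigma in signs:
        # means[k] = (1/N) sum_i sigma_i u_i[k]
        means = [
            [sum(sigma[i] * U[i][k][j] for i in range(N)) / N for j in range(n_u)]
            for k in range(T)
        ]
        lhs += max(max(abs(y) for y in matvec(W, m)) for W in Ws for m in means)
        rhs += max(max(abs(v) for v in m) for m in means)
    mu = max(norm_inf_inf(W) for W in Ws)
    return lhs / len(signs), mu * rhs / len(signs)


def random_instance(num_W=4, n_v=2, n_u=3, N=4, T=3, seed=0):
    """Hypothetical test data: a few random matrices and input sequences."""
    rng = random.Random(seed)
    Ws = [[[rng.uniform(-1, 1) for _ in range(n_u)] for _ in range(n_v)]
          for _ in range(num_W)]
    U = [[[rng.uniform(-1, 1) for _ in range(n_u)] for _ in range(T)]
         for _ in range(N)]
    return Ws, U
```

Calling `both_sides(*random_instance())` returns the two sides of the inequality, and the first value should never exceed the second.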
We can see that the calculations hold if the transformations are restricted to
the ball $B_{X_{1}}(r)$ for any choice of $X_{1}$ we consider. The radius can grow
as
\displaystyle\left\lVert W\mathbf{u}+\mathbf{b}\right\rVert_{\ell^{q,q}(\mathbb{R}^{n_{v}})}\leq\left\lVert W\mathbf{u}\right\rVert_{\ell^{q,q}(\mathbb{R}^{n_{v}})}+\left\lVert\underline{\mathbf{b}}\right\rVert_{\ell^{q,q}(\mathbb{R}^{n_{v}})}\leq\left\lVert W\right\rVert_{p,q}\left\lVert\mathbf{u}\right\rVert_{\ell^{p,p}(\mathbb{R}^{n_{u}})}+\left\lVert\mathbf{b}\right\rVert_{q}.
Proof [Proof of Lemma 11]
Let $Z\subseteq X_{1}^{N}$ and $\tilde{Z}=\{\{\varphi_{1}(\mathbf{u}_{i})\}_{i=1}^{N}\mid\varphi_{1}\in\Phi_{1}\}$. We have
\displaystyle\mathbb{E}_{\sigma}\left[\sup\limits_{\varphi_{2}\in\Phi_{2}}\sup\limits_{\varphi_{1}\in\Phi_{1}}\sup\limits_{\{\mathbf{u}_{i}\}_{i=1}^{N}\in Z}\left\lVert\frac{1}{N}\sum\limits_{i=1}^{N}\sigma_{i}\varphi_{2}(\varphi_{1}(\mathbf{u}_{i}))\right\rVert_{X_{3}}\right]
\displaystyle=\mathbb{E}_{\sigma}\left[\sup\limits_{\varphi_{2}\in\Phi_{2}}\sup\limits_{\{\mathbf{v}_{i}\}_{i=1}^{N}\in\tilde{Z}}\left\lVert\frac{1}{N}\sum\limits_{i=1}^{N}\sigma_{i}\varphi_{2}(\mathbf{v}_{i})\right\rVert_{X_{3}}\right]
\displaystyle\leq\mu_{2}\mathbb{E}_{\sigma}\left[\sup\limits_{\varphi_{1}\in\Phi_{1}}\sup\limits_{\{\mathbf{u}_{i}\}_{i=1}^{N}\in Z}\left\lVert\frac{1}{N}\sum\limits_{i=1}^{N}\sigma_{i}\varphi_{1}(\mathbf{u}_{i})\right\rVert_{X_{2}}\right]+\frac{c_{2}}{\sqrt{N}}
\displaystyle\leq\mu_{2}\mu_{1}\mathbb{E}_{\sigma}\left[\sup\limits_{\{\mathbf{u}_{i}\}_{i=1}^{N}\in Z}\left\lVert\frac{1}{N}\sum\limits_{i=1}^{N}\sigma_{i}\mathbf{u}_{i}\right\rVert_{X_{1}}\right]+\mu_{2}\frac{c_{1}}{\sqrt{N}}+\frac{c_{2}}{\sqrt{N}}
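The chain above shows how the constants combine under composition: a $(\mu_{1},c_{1})$ layer followed by a $(\mu_{2},c_{2})$ layer behaves like a $(\mu_{2}\mu_{1},\,\mu_{2}c_{1}+c_{2})$ layer. A minimal sketch of this bookkeeping follows. The function names `compose` and `chain_bound` are hypothetical, and `input_complexity` stands for the expectation $\mathbb{E}_{\sigma}[\sup\lVert\frac{1}{N}\sum_{i}\sigma_{i}\mathbf{u}_{i}\rVert_{X_{1}}]$ on the input space.

```python
import math
from functools import reduce


def compose(first, second):
    """Constants of `second` applied after `first`, each given as a (mu, c) pair."""
    mu1, c1 = first
    mu2, c2 = second
    return (mu2 * mu1, mu2 * c1 + c2)


def chain_bound(layers, input_complexity, N):
    """Upper bound mu * input_complexity + c / sqrt(N) for a stack of layers.

    `layers` lists the (mu, c) pairs in the order the layers are applied.
    The identity map, with constants (1, 0), is the neutral starting element.
    """
    mu, c = reduce(compose, layers, (1.0, 0.0))
    return mu * input_complexity + c / math.sqrt(N)
```

For instance, `compose((2.0, 1.0), (3.0, 5.0))` gives `(6.0, 8.0)`, matching $\mu_{2}\mu_{1}=6$ and $\mu_{2}c_{1}+c_{2}=8$. Applying `compose` repeatedly extends the two-layer statement to deeper stacks.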
Proof [Proof of Lemma 13]
Encoder and Decoder. The Encoder is case a), while the
Decoder is case b) in Lemma 21.
SSM. As discussed in Section 3.1, an SSM is
equivalent to a linear transformation called its input-output map.
Therefore, by Lemma 21, the SSM is $(\mu,0)$-RC in both
cases, where $\mu$ is the operator norm of the input-output map. Combining
this with Lemma 3 yields the result.
Pooling.
For any $Z\subseteq\ell^{\infty,\infty}(\mathbb{R}^{n_{u}})$ we have
\displaystyle\mathbb{E}_{\sigma}\left[\sup\limits_{\{\mathbf{z}_{i}\}_{i=1}^{N}\in Z}\left\lVert\frac{1}{N}\sum\limits_{i=1}^{N}\sigma_{i}f^{\text{Pool}}(\mathbf{z}_{i})\right\rVert_{\infty}\right]
\displaystyle=\mathbb{E}_{\sigma}\left[\sup\limits_{\{\mathbf{z}_{i}\}_{i=1}^{N}\in Z}\sup\limits_{1\leq j\leq n_{u}}\left|\frac{1}{N}\sum\limits_{i=1}^{N}\sigma_{i}\left(\frac{1}{T}\sum\limits_{k=1}^{T}\mathbf{z}_{i}^{(j)}[k]\right)\right|\right]
\displaystyle=\mathbb{E}_{\sigma}\left[\sup\limits_{\{\mathbf{z}_{i}\}_{i=1}^{N}\in Z}\sup\limits_{1\leq j\leq n_{u}}\left|\frac{1}{T}\sum\limits_{k=1}^{T}\left(\frac{1}{N}\sum\limits_{i=1}^{N}\sigma_{i}\mathbf{z}_{i}^{(j)}[k]\right)\right|\right]
\displaystyle\leq\mathbb{E}_{\sigma}\left[\sup\limits_{\{\mathbf{z}_{i}\}_{i=1}^{N}\in Z}\frac{1}{T}\sum\limits_{k=1}^{T}\sup\limits_{1\leq j\leq n_{u}}\left|\frac{1}{N}\sum\limits_{i=1}^{N}\sigma_{i}\mathbf{z}_{i}^{(j)}[k]\right|\right]
\displaystyle=\mathbb{E}_{\sigma}\left[\sup\limits_{\{\mathbf{z}_{i}\}_{i=1}^{N}\in Z}\frac{1}{T}\sum\limits_{k=1}^{T}\left\lVert\frac{1}{N}\sum\limits_{i=1}^{N}\sigma_{i}\mathbf{z}_{i}[k]\right\rVert_{\infty}\right]
\displaystyle\leq\mathbb{E}_{\sigma}\left[\sup\limits_{\{\mathbf{z}_{i}\}_{i=1}^{N}\in Z}\left\lVert\frac{1}{N}\sum\limits_{i=1}^{N}\sigma_{i}\mathbf{z}_{i}\right\rVert_{\ell^{\infty,\infty}(\mathbb{R}^{n_{u}})}\right]
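Each step of the pooling chain holds for every fixed sign vector and every fixed tuple of sequences, so the inequality can be tested pointwise. The sketch below assumes, as the chain itself indicates, that $f^{\text{Pool}}$ is average pooling over the $T$ time steps. The names `pool`, `pooling_gap` and `random_samples` are illustrative.

```python
import itertools
import random


def pool(z):
    """Average pooling over time: z is a list of T vectors in R^n."""
    T, n = len(z), len(z[0])
    return [sum(z[k][j] for k in range(T)) / T for j in range(n)]


def pooling_gap(samples):
    """Largest value of LHS - RHS over all sign vectors sigma in {-1, 1}^N.

    LHS = || (1/N) sum_i sigma_i pool(z_i) ||_inf
    RHS = sup_k || (1/N) sum_i sigma_i z_i[k] ||_inf
    A non-positive result means the pooling inequality held for every sigma.
    """
    N, T, n = len(samples), len(samples[0]), len(samples[0][0])
    pooled = [pool(z) for z in samples]
    worst = float("-inf")
    for sigma in itertools.product((-1, 1), repeat=N):
        lhs = max(
            abs(sum(sigma[i] * pooled[i][j] for i in range(N)) / N)
            for j in range(n)
        )
        rhs = max(
            abs(sum(sigma[i] * samples[i][k][j] for i in range(N)) / N)
            for k in range(T)
            for j in range(n)
        )
        worst = max(worst, lhs - rhs)
    return worst


def random_samples(N=4, T=6, n=3, seed=0):
    """Hypothetical test data: N sequences of T vectors in R^n."""
    rng = random.Random(seed)
    return [[[rng.uniform(-1, 1) for _ in range(n)] for _ in range(T)]
            for _ in range(N)]
```

Since the bound holds for each sign vector separately, it also holds after taking the supremum over $Z$ and the expectation over $\sigma$, which is the statement that pooling does not increase the complexity term.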
Proof [Proof of Lemma 15]
First, we show that the function $h:(\mathbb{R}^{2},\lVert\cdot\rVert_{2})\to(\mathbb{R},|\cdot|)$ defined as $h(\mathbf{x})=x_{1}\cdot\sigma(x_{2})$ is $\sqrt{2}(K+1)$-Lipschitz on a bounded domain, that is, for all $\mathbf{x}\in\mathbb{R}^{2}$ with $|x_{i}|\leq K$, $i=1,2$. We will later specify the value of $K$ in relation to Assumption 9. Since the sigmoid $\sigma$ is 1-Lipschitz and takes values in $(0,1)$, we have
\[
\begin{aligned}
|h(\mathbf{x})-h(\mathbf{y})| &= |x_{1}\sigma(x_{2})-y_{1}\sigma(x_{2})+y_{1}\sigma(x_{2})-y_{1}\sigma(y_{2})| \\
&\leq |(x_{1}-y_{1})\sigma(x_{2})|+|y_{1}(\sigma(x_{2})-\sigma(y_{2}))| \leq |x_{1}-y_{1}|+|y_{1}||x_{2}-y_{2}| \\
&\leq \sqrt{2}(K+1)\left\lVert\mathbf{x}-\mathbf{y}\right\rVert_{2}
\end{aligned}
\]
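The last step of the chain uses $|y_{1}|\leq K$ together with $|a|+|b|\leq\sqrt{2}\lVert(a,b)\rVert_{2}$. As an informal sanity check (not part of the proof), the following sketch samples random pairs in the box $[-K,K]^{2}$ and compares the largest observed difference quotient of $h$ with the constant $\sqrt{2}(K+1)$. The value $K=3$, the sample size, and the helper name `max_lipschitz_ratio` are illustrative assumptions.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def h(x):
    # h(x) = x_1 * sigmoid(x_2), applied row-wise to an array of shape (n, 2)
    return x[:, 0] * sigmoid(x[:, 1])

def max_lipschitz_ratio(K, n_pairs=100_000, seed=0):
    # Largest observed |h(x) - h(y)| / ||x - y||_2 over random pairs in [-K, K]^2.
    # This is only an empirical lower estimate of the true Lipschitz constant.
    rng = np.random.default_rng(seed)
    x = rng.uniform(-K, K, size=(n_pairs, 2))
    y = rng.uniform(-K, K, size=(n_pairs, 2))
    dist = np.linalg.norm(x - y, axis=1)
    keep = dist > 1e-12  # guard against division by (almost) zero
    return float(np.max(np.abs(h(x) - h(y))[keep] / dist[keep]))

K = 3.0  # illustrative choice, not taken from the paper
ratio = max_lipschitz_ratio(K)
bound = float(np.sqrt(2.0) * (K + 1.0))
print(ratio, bound)
```

The observed ratio stays below the bound, as the derivation guarantees; the bound is not claimed to be tight.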
Second, we recall Corollary 4 in Maurer (2016).
Theorem 23 (Maurer (2016))
Let $\mathcal{X}$ be any set, $(\mathbf{x}_{1},\ldots,\mathbf{x}_{N})\in\mathcal{X}^{N}$, let $\mathcal{F}$ be a set of functions $f:\mathcal{X}\to\ell^{2}(\mathbb{R}^{m})$ and let $h:\ell^{2}(\mathbb{R}^{m})\to\mathbb{R}$ be an $L$-Lipschitz function. With $f_{k}$ denoting the $k$-th component function of $f$, $\sigma_{i}$ being Rademacher variables and $\sigma_{ik}$ being doubly indexed Rademacher variables, we have
\[
\mathbb{E}_{\sigma}\left[\sup_{f\in\mathcal{F}}\sum_{i=1}^{N}\sigma_{i}h(f(\mathbf{x}_{i}))\right]\leq\sqrt{2}L\,\mathbb{E}_{\sigma}\left[\sup_{f\in\mathcal{F}}\sum_{i=1}^{N}\sum_{k=1}^{m}\sigma_{ik}f_{k}(\mathbf{x}_{i})\right].
\]
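To make the statement concrete, the sketch below evaluates both sides of the inequality exactly for a small finite function class by enumerating every sign pattern, so no Monte Carlo error is involved. The class (five random functions), the sizes $N=4$, $m=2$, the bound $K=2$, and the choice $h(\mathbf{v})=v_{1}\sigma(v_{2})$ with $L=\sqrt{2}(K+1)$ are assumptions made only for illustration.

```python
import itertools
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def h(v):
    # h(v) = v_1 * sigmoid(v_2), applied along the last axis (m = 2)
    return v[..., 0] * sigmoid(v[..., 1])

def all_signs(n):
    # Every vector in {-1, +1}^n, one per row
    return np.array(list(itertools.product([-1.0, 1.0], repeat=n)))

def contraction_sides(values, L):
    # values[f, i, k] = f_k(x_i) for a finite class; shape (n_f, N, m)
    n_f, N, m = values.shape
    # Left side: E_sigma sup_f sum_i sigma_i h(f(x_i)), over all 2^N sign vectors
    lhs = np.mean(np.max(all_signs(N) @ h(values).T, axis=1))
    # Right side: sqrt(2) L E_sigma sup_f sum_{i,k} sigma_{ik} f_k(x_i),
    # over all 2^(N*m) doubly indexed sign arrays
    flat = values.reshape(n_f, N * m)
    rhs = np.sqrt(2.0) * L * np.mean(np.max(all_signs(N * m) @ flat.T, axis=1))
    return float(lhs), float(rhs)

K = 2.0  # illustrative bound on the entries of f(x_i)
rng = np.random.default_rng(0)
values = rng.uniform(-K, K, size=(5, 4, 2))  # 5 functions, N = 4, m = 2
lhs, rhs = contraction_sides(values, L=np.sqrt(2.0) * (K + 1.0))
print(lhs, rhs)
```

Exhaustive enumeration is only feasible for tiny $N$ and $m$; it is used here so that the comparison is exact rather than statistical.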
We wish to apply Theorem 23 to GLU layers.
For any $Z\subseteq\ell^{\infty,\infty}(\mathbb{R}^{n_{u}})$ we have
\[
\begin{aligned}
&\mathbb{E}_{\sigma}\left[\sup_{W\in\mathcal{W}}\sup_{\{\mathbf{z}_{i}\}_{i=1}^{N}\in Z}\left\lVert\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}GLU_{W}(\mathbf{z}_{i})\right\rVert_{\ell^{\infty,\infty}(\mathbb{R}^{n_{u}})}\right] \\
&=\mathbb{E}_{\sigma}\left[\sup_{W\in\mathcal{W}}\sup_{\{\mathbf{z}_{i}\}_{i=1}^{N}\in Z}\sup_{1\leq k\leq T}\sup_{1\leq j\leq n_{u}}\left|\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}GLU_{W}^{(j)}(\mathbf{z}_{i})[k]\right|\right]
\end{aligned}
\]
This is an alternative version of the Rademacher complexity, in which we take the absolute value of the Rademacher average. In order to apply Theorem 23, we reduce the problem to the usual Rademacher complexity. To this end, we can apply the last chain of inequalities in the proof of Proposition 6.2 in Hajek and Raginsky (2019). Concretely, by denoting $\mathbf{O}=\{\mathbf{0}\}_{i=1}^{N}$ and noticing that $GLU_{W}(0)=0$, we have
\[
\begin{aligned}
&\mathbb{E}_{\sigma}\left[\sup_{W\in\mathcal{W}}\sup_{\{\mathbf{z}_{i}\}_{i=1}^{N}\in Z}\sup_{1\leq k\leq T}\sup_{1\leq j\leq n_{u}}\left|\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}GLU_{W}^{(j)}(\mathbf{z}_{i})[k]\right|\right] \\
&\leq 2\,\mathbb{E}_{\sigma}\left[\sup_{W\in\mathcal{W}}\sup_{\{\mathbf{z}_{i}\}_{i=1}^{N}\in Z\cup\{\mathbf{O}\}}\sup_{1\leq k\leq T}\sup_{1\leq j\leq n_{u}}\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}GLU_{W}^{(j)}(\mathbf{z}_{i})[k]\right]
\end{aligned}
\]
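The reduction rests on two facts: once the zero function belongs to the class, the supremum without absolute value is nonnegative, and $-\sigma$ has the same distribution as $\sigma$. A minimal sketch of the resulting factor 2, checked exactly on a small finite class by enumerating all sign vectors, is given next. The random matrix `G`, standing in for the values $GLU_{W}^{(j)}(\mathbf{z}_{i})[k]$, and the helper names are hypothetical.

```python
import itertools
import numpy as np

def all_signs(n):
    # Every vector in {-1, +1}^n, one per row
    return np.array(list(itertools.product([-1.0, 1.0], repeat=n)))

def abs_and_signed_complexity(G):
    # G[g, i] = value of the g-th function on the i-th sample; shape (n_g, N)
    n_g, N = G.shape
    signs = all_signs(N)
    # E_sigma sup_g | (1/N) sum_i sigma_i G[g, i] |
    with_abs = np.mean(np.max(np.abs(signs @ G.T) / N, axis=1))
    # Same quantity without absolute value, over the class enlarged by the zero function
    G0 = np.vstack([G, np.zeros((1, N))])
    signed = np.mean(np.max(signs @ G0.T / N, axis=1))
    return float(with_abs), float(signed)

rng = np.random.default_rng(1)
G = rng.normal(size=(6, 5))  # 6 functions evaluated on N = 5 samples
with_abs, signed = abs_and_signed_complexity(G)
print(with_abs, 2.0 * signed)
```

Adding the zero row plays the role of enlarging $Z$ to $Z\cup\{\mathbf{O}\}$ in the display above.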
Let $\mathbf{x}_{i}=i$, $i=1,\ldots,N$, and let $\mathcal{F}=\{f_{W,\underline{z},k,j}\mid(W,\underline{z},k,j)\in\mathcal{W}\times(Z\cup\{0\})\times[T]\times[n_{u}]\}$ such that
\[
f_{W,\underline{z},k,j}(\mathbf{x}_{i})=\begin{bmatrix}GELU(\mathbf{z}_{i}[k])^{(j)}&(W(GELU(\mathbf{z}_{i}[k])))^{(j)}\end{bmatrix}^{T}
\]
for $\underline{z}=\{\mathbf{z}_{i}\}_{i=1}^{N}\in Z$.
Since $Z\subseteq(B_{\ell^{\infty,\infty}(\mathbb{R}^{n_{u}})}(K_{\mathbf{u}}))^{N}$, it follows for all $\{\mathbf{z}_{i}\}_{i=1}^{N}\in Z$ and for all $k\in\mathbb{N}$ that $\|\mathbf{z}_{i}[k]\|_{\infty}\leq K_{\mathbf{u}}$, and hence $|GELU(\mathbf{z}_{i}[k])^{(j)}|<K_{\mathbf{u}}$, leading to $|W(GELU(\mathbf{z}_{i}[k]))^{(j)}|<\sup_{W\in\mathcal{W}}\|W\|_{\infty,\infty}\cdot K_{\mathbf{u}}$. In particular, $GLU^{(j)}_{W}(\mathbf{z}_{i})[k]=h(f_{W,\underline{z},k,j}(\mathbf{x}_{i}))=h|_{B}(f_{W,\underline{z},k,j}(\mathbf{x}_{i}))$, where $h|_{B}$ is the restriction of $h$ to $B=\{x\in\mathbb{R}^{2}\mid\|x\|_{\infty}<K\}$, and hence $h|_{B}$ is $\sqrt{2}(K+1)$-Lipschitz. Therefore we can set $K=\max\{K_{\mathbf{u}},\sup_{W\in\mathcal{W}}\|W\|_{\infty,\infty}\cdot K_{\mathbf{u}}\}$.
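The construction can be illustrated numerically. The sketch below implements the GLU map as it can be read off from the identity $GLU^{(j)}_{W}(\mathbf{z}_{i})[k]=h(f_{W,\underline{z},k,j}(\mathbf{x}_{i}))$, computes $K$ for a single matrix $W$ (standing in for the supremum over $\mathcal{W}$), and checks that both coordinates of $f$ stay inside the box of radius $K$. Interpreting $\|W\|_{\infty,\infty}$ as the maximum absolute row sum is an assumption of this sketch, as are the dimensions and the random data.

```python
import math
import numpy as np

def gelu(x):
    # Exact GELU, x * Phi(x), with Phi the standard normal CDF
    erf = np.vectorize(math.erf, otypes=[float])
    return 0.5 * x * (1.0 + erf(x / math.sqrt(2.0)))

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def glu(W, z):
    # z has shape (T, n_u); at each time step k the output is
    # GELU(z[k]) * sigmoid(W @ GELU(z[k])), taken elementwise
    g = gelu(z)
    return g * sigmoid(g @ W.T)

def box_radius(W, K_u):
    # K = max{K_u, ||W|| * K_u}; ||W|| is taken to be the maximum absolute
    # row sum (induced infinity norm), which is an assumption of this sketch
    w_norm = float(np.abs(W).sum(axis=1).max())
    return max(K_u, w_norm * K_u)

rng = np.random.default_rng(2)
K_u, T, n_u = 1.5, 8, 3  # illustrative sizes
W = rng.normal(size=(n_u, n_u))
z = rng.uniform(-K_u, K_u, size=(T, n_u))

K = box_radius(W, K_u)
first = gelu(z)       # first coordinate of f, for every k and j
second = first @ W.T  # second coordinate of f, for every k and j
out = glu(W, z)
print(K, np.abs(first).max(), np.abs(second).max())
```

Since $|GELU(x)|\leq|x|$, the first coordinate is bounded by $K_{\mathbf{u}}$, and the row-sum bound controls the second coordinate, which is why $h$ only needs to be Lipschitz on the box $B$.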
We are now ready to apply Theorem 23. Together with the GLU definition and the $\sqrt{2}(K+1)$-Lipschitzness of $h|_{B}$, it yields
\[
\begin{aligned}
&2\,\mathbb{E}_{\sigma}\left[\sup_{W\in\mathcal{W}}\sup_{\{\mathbf{z}_{i}\}_{i=1}^{N}\in Z\cup\{\mathbf{O}\}}\sup_{1\leq k\leq T}\sup_{1\leq j\leq n_{u}}\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}GLU_{W}^{(j)}(\mathbf{z}_{i})[k]\right] \\
&=2\,\mathbb{E}_{\sigma}\left[\sup_{f\in\mathcal{F}}\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}h(f(\mathbf{x}_{i}))\right]\leq 4(K+1)\underbrace{\mathbb{E}_{\sigma}\left[\sup_{f\in\mathcal{F}}\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}GELU(\mathbf{z}_{i}[k])^{(j)}\right]}_{A} \\
&\quad+4(K+1)\underbrace{\mathbb{E}_{\sigma}\left[\sup_{f\in\mathcal{F}}\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}W(GELU(\mathbf{z}_{i}))^{(j)}[k]\right]}_{B}
\end{aligned}
\]
Due to the definition of GELU, its 2-Lipschitzness (Qi et al., 2023) and (Ledoux and Talagrand, 1991, Theorem 4.12), we have
\[
\begin{aligned}
A&=\mathbb{E}_{\sigma}\left[\sup_{\{\mathbf{z}_{i}\}_{i=1}^{N}\in Z\cup\{\mathbf{O}\}}\left\lVert\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}GELU(\mathbf{z}_{i})\right\rVert_{\ell^{\infty,\infty}(\mathbb{R}^{n_{u}})}\right] \\
&\leq 4\,\mathbb{E}_{\sigma}\left[\sup_{\{\mathbf{z}_{i}\}_{i=1}^{N}\in Z\cup\{\mathbf{O}\}}\left\lVert\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}\mathbf{z}_{i}\right\rVert_{\ell^{\infty,\infty}(\mathbb{R}^{n_{u}})}\right]=4\,\mathbb{E}_{\sigma}\left[\sup_{\{\mathbf{z}_{i}\}_{i=1}^{N}\in Z}\left\lVert\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}\mathbf{z}_{i}\right\rVert_{\ell^{\infty,\infty}(\mathbb{R}^{n_{u}})}\right]
\end{aligned}
\]
and
\[
\begin{aligned}
B&=\mathbb{E}_{\sigma}\left[\sup_{W\in\mathcal{W}}\sup_{\{\mathbf{z}_{i}\}_{i=1}^{N}\in Z\cup\{\mathbf{O}\}}\left\lVert\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}W(GELU(\mathbf{z}_{i}))\right\rVert_{\ell^{\infty,\infty}(\mathbb{R}^{n_{u}})}\right] \\
&\leq 4\sup_{W\in\mathcal{W}}\left\lVert W\right\rVert_{\infty,\infty}\mathbb{E}_{\sigma}\left[\sup_{\{\mathbf{z}_{i}\}_{i=1}^{N}\in Z\cup\{\mathbf{O}\}}\left\lVert\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}\mathbf{z}_{i}\right\rVert_{\ell^{\infty,\infty}(\mathbb{R}^{n_{u}})}\right] \\
&=4\sup_{W\in\mathcal{W}}\left\lVert W\right\rVert_{\infty,\infty}\mathbb{E}_{\sigma}\left[\sup_{\{\mathbf{z}_{i}\}_{i=1}^{N}\in Z}\left\lVert\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}\mathbf{z}_{i}\right\rVert_{\ell^{\infty,\infty}(\mathbb{R}^{n_{u}})}\right]
\end{aligned}
\]
Here we used the linearity of $W$ and the exact same calculation as in the
proof of Lemma 21.
By combining the inequalities above, it follows that
\begin{align*}
&\mathbb{E}_{\sigma}\left[\sup_{W\in\mathcal{W}}\sup_{\{\mathbf{z}_{i}\}_{i=1}^{N}\in Z}\sup_{1\leq k\leq T}\sup_{1\leq j\leq n_{u}}\left|\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}GLU_{W}^{(j)}(\mathbf{z}_{i})[k]\right|\right]\leq\\
&16(K+1)\left(\sup_{W\in\mathcal{W}}\left\lVert W\right\rVert_{\infty,\infty}+1\right)\mathbb{E}_{\sigma}\left[\sup_{\{\mathbf{z}_{i}\}_{i=1}^{N}\in Z}\left\lVert\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}\mathbf{z}_{i}\right\rVert_{\ell^{\infty,\infty}(\mathbb{R}^{n_{u}})}\right]
\end{align*}
Substituting the value of $K$ gives the result.
Proof [Proof of Lemma 16]
MLP with sigmoid activations.
Consider a single hidden layer $f(\mathbf{x})=\rho(g(\mathbf{x}))$, where $g(\mathbf{x})=W\mathbf{x}+\mathbf{b}$
is the preactivation and let $\mathcal{G}=\left\{g(\mathbf{x})=W\mathbf{x}+\mathbf{b}\mid W\in\mathcal{W},\mathbf{b}\in\mathcal{B}\right\}$ denote the set of possible
preactivation functions. As compared to Definition 7, we omit the
subscript from the notation of $g$. For an input sequence $\mathbf{z}\in\ell^{\infty,\infty}(\mathbb{R}^{n_{u}})$
let $g(\mathbf{z})\in\ell^{\infty,\infty}(\mathbb{R}^{n_{v}})$ mean that we apply $g$ for each timestep
independently, i.e. $g(\mathbf{z})[k]=g(\mathbf{z}[k])$. We have
\begin{align*}
&\mathbb{E}_{\sigma}\left[\sup_{g\in\mathcal{G}}\sup_{\{\mathbf{z}_{i}\}_{i=1}^{N}\in Z}\left\lVert\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}\rho(g(\mathbf{z}_{i}))\right\rVert_{\ell^{\infty,\infty}(\mathbb{R}^{n_{u}})}\right]\\
&=\mathbb{E}_{\sigma}\left[\sup_{(W,\mathbf{b})\in\mathcal{W}\times\mathcal{B}}\sup_{\{\mathbf{z}_{i}\}_{i=1}^{N}\in Z}\sup_{1\leq k\leq T}\left\lVert\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}\rho(W\mathbf{z}_{i}[k]+\mathbf{b})\right\rVert_{\infty}\right]
\end{align*}
Let $\mathbf{x}_{i}=i$, $i=1,\ldots,N$, and let $\mathcal{H}=\{h_{W,\mathbf{b},\underline{z},k}\mid(W,\mathbf{b},\underline{z},k)\in\mathcal{W}\times\mathcal{B}\times(Z\cup\{0\})\times[T]\}$ such that
$h_{W,\mathbf{b},\underline{z},k}(\mathbf{x}_{i})=g(\mathbf{z}_{i}[k])$. Under the assumption that
$\mathcal{H}$ is symmetric to the origin, meaning that $h\in\mathcal{H}$
implies $-h\in\mathcal{H}$ (equivalently $(W,\mathbf{b})\in\mathcal{W}\times\mathcal{B}$ implies $(-W,-\mathbf{b})\in\mathcal{W}\times\mathcal{B}$), we can
apply (Truong, 2022b, Theorem 9), since $\rho-\rho(0)$ is odd for the
sigmoid activation, as follows.
\begin{align*}
&\mathbb{E}_{\sigma}\left[\sup_{(W,\mathbf{b})\in\mathcal{W}\times\mathcal{B}}\sup_{\{\mathbf{z}_{i}\}_{i=1}^{N}\in Z}\sup_{1\leq k\leq T}\left\lVert\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}\rho(W\mathbf{z}_{i}[k]+\mathbf{b})\right\rVert_{\infty}\right]\\
&=\mathbb{E}_{\sigma}\left[\sup_{h\in\mathcal{H}}\left\lVert\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}\rho(h(\mathbf{x}_{i}))\right\rVert_{\infty}\right]\\
&\leq\mathbb{E}_{\sigma}\left[\sup_{h\in\mathcal{H}}\left\lVert\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}h(\mathbf{x}_{i})\right\rVert_{\infty}\right]+\frac{1}{2\sqrt{N}}\\
&=\mathbb{E}_{\sigma}\left[\sup_{(W,\mathbf{b})\in\mathcal{W}\times\mathcal{B}}\sup_{\{\mathbf{z}_{i}\}_{i=1}^{N}\in Z}\sup_{1\leq k\leq T}\left\lVert\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}(W\mathbf{z}_{i}[k]+\mathbf{b})\right\rVert_{\infty}\right]+\frac{1}{2\sqrt{N}}\\
&=\mathbb{E}_{\sigma}\left[\sup_{(W,\mathbf{b})\in\mathcal{W}\times\mathcal{B}}\sup_{\{\mathbf{z}_{i}\}_{i=1}^{N}\in Z}\left\lVert\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}(W\mathbf{z}_{i}+\mathbf{b})\right\rVert_{\ell^{\infty,\infty}(\mathbb{R}^{n_{v}})}\right]+\frac{1}{2\sqrt{N}},
\end{align*}
because the sigmoid is 1-Lipschitz and $\rho(0)=0.5$. Now we can apply Lemma 21 to get that
\begin{align*}
&\mathbb{E}_{\sigma}\left[\sup_{(W,\mathbf{b})\in\mathcal{W}\times\mathcal{B}}\sup_{\{\mathbf{z}_{i}\}_{i=1}^{N}\in Z}\left\lVert\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}(W\mathbf{z}_{i}+\mathbf{b})\right\rVert_{\ell^{\infty,\infty}(\mathbb{R}^{n_{v}})}\right]+\frac{1}{2\sqrt{N}}\\
&\leq\sup_{W\in\mathcal{W}}\left\lVert W\right\rVert_{\infty,\infty}\mathbb{E}_{\sigma}\left[\sup_{\{\mathbf{z}_{i}\}_{i=1}^{N}\in Z}\left\lVert\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}\mathbf{z}_{i}\right\rVert_{\ell^{\infty,\infty}(\mathbb{R}^{n_{u}})}\right]+\frac{1}{\sqrt{N}}\sup_{\mathbf{b}\in\mathcal{B}}\left\lVert\mathbf{b}\right\rVert_{\infty}+\frac{1}{2\sqrt{N}}
\end{align*}
Therefore, the sigmoid MLP layer is $(K_{W},K_{\mathbf{b}}+0.5)$-RC. The restriction
of an MLP to the ball $B_{\ell^{\infty,\infty}(\mathbb{R}^{n_{u}})}(r)$ maps to
the ball $B_{\ell^{\infty,\infty}(\mathbb{R}^{n_{v}})}(1)$, because of the
elementwise sigmoid activation.
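The three properties of the sigmoid used above can be checked numerically. The following is a minimal sketch, not part of the proof: the evaluation grid and the helper name `sigmoid` are assumptions made for illustration. It checks that $\rho(0)=0.5$, that $\rho-\rho(0)$ is odd, that difference quotients stay below 1 (the derivative of the sigmoid is in fact at most 1/4, so 1 is a valid Lipschitz bound), and that outputs stay inside the unit ball of the sup norm.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid rho(x) = 1 / (1 + exp(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

# Hypothetical evaluation grid on [-10, 10]; any grid would do.
grid = [i / 10.0 for i in range(-100, 101)]

# rho(0) = 0.5, the offset that produces the additive 1/(2*sqrt(N)) term.
rho0 = sigmoid(0.0)

# Largest violation of oddness of rho - rho(0); zero up to rounding.
odd_defect = max(abs((sigmoid(x) - rho0) + (sigmoid(-x) - rho0)) for x in grid)

# Largest difference quotient on neighbouring grid points: a lower estimate
# of the Lipschitz constant, which must not exceed the bound 1 used in the text.
lipschitz_estimate = max(
    abs(sigmoid(b) - sigmoid(a)) / (b - a) for a, b in zip(grid, grid[1:])
)

# Outputs lie in (0, 1), hence in the unit ball of the sup norm.
max_output = max(abs(sigmoid(x)) for x in grid)

print(rho0, odd_defect, lipschitz_estimate, max_output)
```

A grid check of this kind only illustrates the properties; it does not replace the analytic argument.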
MLP with ReLU activations.
Similarly to the sigmoid case, we assume the upper bounds $K_{W}$ and $K_{\mathbf{b}}$
exist, but we don't assume the symmetry of the parameter set. The proof is the
same as in the sigmoid case up to the first inequality. Here we can apply
(Ledoux and Talagrand, 1991, Equation 4.20) (this is the same idea as in the
proof of (Golowich et al., 2018, Lemma 2)) to get
\begin{align*}
\mathbb{E}_{\sigma}\left[\sup_{h\in\mathcal{H}}\left\lVert\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}\rho(h(\mathbf{x}_{i}))\right\rVert_{\infty}\right]\leq 4\mathbb{E}_{\sigma}\left[\sup_{h\in\mathcal{H}}\left\lVert\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}h(\mathbf{x}_{i})\right\rVert_{\infty}\right],
\end{align*}
where we used that $\rho(x)=ReLU(x)$ is 1-Lipschitz and the same logic for the
alternative definition of the Rademacher complexity as in the proof of Lemma
15, which results in a constant factor of 2. The constant 4 is then
obtained by the additional constant factor 2 from Talagrand's lemma. The rest of
the proof is identical to the sigmoid case.
The restriction of an MLP to the ball $B_{\ell^{\infty,\infty}(\mathbb{R}^{n_{u}})}(r)$ maps to the ball $B_{\ell^{\infty,\infty}(\mathbb{R}^{n_{v}})}(K_{W}r+K_{\mathbf{b}})$, because the elementwise ReLU does
not increase the infinity norm, hence we can apply Lemma 21.
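The radius claim for the ReLU layer can be illustrated on random data. The sketch below makes several assumptions that are not taken from the text: $\lVert W\rVert_{\infty,\infty}$ is read as the matrix norm induced by the infinity norm (maximum absolute row sum), the parameter sets are singletons so that $K_{W}$ and $K_{\mathbf{b}}$ are the norms of one fixed pair, and the sizes, the seed, and the names `relu_layer` and `induced_inf_norm` are hypothetical.

```python
import random

def relu_layer(W, b, z):
    """Apply x -> ReLU(W x + b) to a single vector z (one timestep)."""
    return [max(0.0, sum(w * x for w, x in zip(row, z)) + bi)
            for row, bi in zip(W, b)]

def induced_inf_norm(W):
    """Maximum absolute row sum; assumed here to be the meaning of ||W||_{inf,inf}."""
    return max(sum(abs(w) for w in row) for row in W)

random.seed(0)
n_u, n_v, T, r = 3, 4, 5, 2.0   # hypothetical sizes and input radius
W = [[random.uniform(-1, 1) for _ in range(n_u)] for _ in range(n_v)]
b = [random.uniform(-1, 1) for _ in range(n_v)]
K_W = induced_inf_norm(W)
K_b = max(abs(x) for x in b)

# A sequence z with sup_k ||z[k]||_inf <= r, processed timestep by timestep.
z = [[random.uniform(-r, r) for _ in range(n_u)] for _ in range(T)]
out = [relu_layer(W, b, zk) for zk in z]

# Sup norm over time and coordinates of the output sequence.
out_norm = max(abs(v) for step in out for v in step)

print(out_norm, K_W * r + K_b)
```

The first printed value should not exceed the second, since $|ReLU(a)|\leq|a|$ coordinatewise and $\lVert W\mathbf{z}[k]+\mathbf{b}\rVert_{\infty}\leq K_{W}r+K_{\mathbf{b}}$ under the stated reading of the norm.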
Proof [Proof of Theorem 18]
We wish to apply Theorem 20 to the set of deep SSM models
$\mathcal{F}$. Let us fix a random sample $S=\{\mathbf{u}_{1},\ldots,\mathbf{u}_{N}\}\subset\left(\ell^{2,2}(\mathbb{R}^{n_{\text{in}}})\right)^{N}$.
As the loss function is Lipschitz according to Assumption 9,
we have that for any $f\in\mathcal{F}$
\begin{align*}
|l(f(\mathbf{u}),y)|\leq 2L_{l}\max\{f(\mathbf{u}),y\}\leq 2L_{l}\max\{K_{\text{Dec}}r_{L+2},K_{y}\},
\end{align*}
thus $K_{l}\leq 2L_{l}\max\{K_{\text{Dec}}r_{L+2},K_{y}\}$.
Again by the Lipschitzness of the loss and the Contraction lemma
(Shalev-Shwartz and Ben-David, 2014, Lemma 26.9) we have
\begin{align*}
R_{S}(L_{0})\leq L_{l}\cdot R_{S}(\mathcal{F}).
\end{align*}
It is enough to bound the Rademacher complexity of $\mathcal{F}$ to
conclude the proof.
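The contraction step can be checked exactly on a toy example. The sketch below is an illustration under stated assumptions, not the setting of the theorem: the model class is a hypothetical finite set of three functions given by their values on $N=4$ sample points, the loss is the absolute error (which is 1-Lipschitz in its first argument), and the Rademacher complexity is taken in the form used by Shalev-Shwartz and Ben-David, without an absolute value, which may differ from the definition used elsewhere in this paper. The expectation over $\sigma$ is computed by enumerating all $2^{N}$ sign vectors.

```python
import itertools

def rademacher(A):
    """Exact empirical Rademacher complexity (1/N) E_sigma sup_{a in A} <sigma, a>
    of a finite set A of vectors in R^N, by enumerating all 2^N sign vectors."""
    n = len(A[0])
    total = 0.0
    for sigma in itertools.product((-1.0, 1.0), repeat=n):
        total += max(sum(s * x for s, x in zip(sigma, a)) for a in A)
    return total / (2 ** n) / n

# Hypothetical finite model class: each row lists f(u_1), ..., f(u_N) for one f.
F = [[0.3, -1.2, 0.7, 2.0], [-0.5, 0.4, 1.1, -0.9], [1.5, 0.0, -0.6, 0.2]]
y = [0.0, 1.0, -1.0, 0.5]   # hypothetical labels
L_l = 1.0                   # Lipschitz constant of t -> |t - y_i|

# Loss class: the absolute error applied coordinatewise to each f in F.
loss_class = [[abs(f_i - y_i) for f_i, y_i in zip(f, y)] for f in F]

lhs = rademacher(loss_class)
rhs = L_l * rademacher(F)
print(lhs, rhs)
```

Here `lhs` plays the role of $R_{S}(L_{0})$ and `rhs` the role of $L_{l}\cdot R_{S}(\mathcal{F})$; the Contraction lemma guarantees that the first does not exceed the second.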
Let us consider the deep SSM models as a composite of mappings as
\begin{align*}
&B_{\ell^{2,2}(\mathbb{R}^{n_{\text{in}}})}(K_{\mathbf{u}})\xrightarrow{\text{Encoder}}B_{\ell^{2,2}(\mathbb{R}^{n_{u}})}(K_{\mathbf{u}}K_{\text{Enc}})\xrightarrow{\text{B}_{1}}B_{\ell^{\infty,\infty}(\mathbb{R}^{n_{u}})}(r_{1})\xrightarrow{\text{B}_{2}}\ldots\xrightarrow{\text{B}_{L}}\\
&B_{\ell^{\infty,\infty}(\mathbb{R}^{n_{u}})}(r_{L+1})\xrightarrow{\text{Pooling}}(\mathbb{R}^{n_{u}},\left\lVert\cdot\right\rVert_{\infty})\xrightarrow{\text{Decoder}}(\mathbb{R},|\cdot|)
\end{align*}
Therefore, the SSM layer in the first SSM block is considered as a map
$B_{\ell^{2,2}(\mathbb{R}^{n_{u}})}(K_{\text{Enc}}K_{\mathbf{u}}) \to B_{\ell^{\infty,\infty}(\mathbb{R}^{n_{u}})}(r_{1})$, while the SSM
layers in the remaining SSM blocks are considered as maps
$B_{\ell^{\infty,\infty}(\mathbb{R}^{n_{u}})}(r_{i}) \to B_{\ell^{\infty,\infty}(\mathbb{R}^{n_{u}})}(r_{i+1})$. This
is needed because the Encoder is constant in time, so the
Composition Lemma would not be able to carry the $\ell^{2,2}$ norm of the
input through the chain of estimates along the entire model. This is one
of the key technical points that make it possible to establish a
time-independent bound.
By the conditions of the Theorem and the stability assumption in Assumption
9, we have that the Encoder, Decoder, Pooling, SSM and MLP
layers are each $(\mu,c)$-RC for some $\mu$ and $c$ from Lemma
13. By Lemma 11, we have that the composition
of an SSM layer and an MLP is $(\mu,c)$-RC. A residual SSM
block is then $(\mu+\alpha,c)$-RC, because
$$
\begin{aligned}
&\mathbb{E}_{\sigma}\left[\sup_{g\circ\mathcal{S}_{\Sigma}}\sup_{\{\mathbf{z}_{i}\}_{i=1}^{N}}\left\lVert\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}\bigl(g(\mathcal{S}_{\Sigma}(\mathbf{z}_{i}))+\alpha\mathbf{z}_{i}\bigr)\right\rVert_{\ell^{\infty,\infty}(\mathbb{R}^{n_{u}})}\right] \\
&\quad\leq \mathbb{E}_{\sigma}\left[\sup_{g\circ\mathcal{S}_{\Sigma}}\sup_{\{\mathbf{z}_{i}\}_{i=1}^{N}}\left\lVert\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}g(\mathcal{S}_{\Sigma}(\mathbf{z}_{i}))\right\rVert_{\ell^{\infty,\infty}(\mathbb{R}^{n_{u}})}\right]+\alpha\,\mathbb{E}_{\sigma}\left[\sup_{\{\mathbf{z}_{i}\}_{i=1}^{N}}\left\lVert\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}\mathbf{z}_{i}\right\rVert_{\ell^{\infty,\infty}(\mathbb{R}^{n_{u}})}\right] \\
&\quad\leq (\mu+\alpha)\,\mathbb{E}_{\sigma}\left[\sup_{\{\mathbf{z}_{i}\}_{i=1}^{N}}\left\lVert\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}\mathbf{z}_{i}\right\rVert_{\ell^{\infty,\infty}(\mathbb{R}^{n_{u}})}\right]+\frac{c}{\sqrt{N}}
\end{aligned}
$$
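The first inequality in the chain above can be unpacked as follows. This is only a sketch of the intermediate step, written under the assumption (not stated in this passage) that the residual weight satisfies $\alpha \geq 0$; for a general $\alpha$ one would replace it by $|\alpha|$. All norms are taken in $\ell^{\infty,\infty}(\mathbb{R}^{n_{u}})$.

```latex
% Pointwise in \sigma, g \circ \mathcal{S}_\Sigma and \{\mathbf{z}_i\}_{i=1}^N:
% triangle inequality and homogeneity of the norm (assuming \alpha \geq 0).
\left\lVert \frac{1}{N}\sum_{i=1}^{N}\sigma_i\bigl(g(\mathcal{S}_\Sigma(\mathbf{z}_i))+\alpha\mathbf{z}_i\bigr)\right\rVert
\leq
\left\lVert \frac{1}{N}\sum_{i=1}^{N}\sigma_i\, g(\mathcal{S}_\Sigma(\mathbf{z}_i))\right\rVert
+ \alpha \left\lVert \frac{1}{N}\sum_{i=1}^{N}\sigma_i\mathbf{z}_i\right\rVert .
% Taking suprema uses \sup(a+b) \leq \sup a + \sup b,
% and taking \mathbb{E}_\sigma uses linearity of the expectation.
```

The second inequality then applies the $(\mu,c)$-RC property of the composition of the SSM layer and the MLP to the first term, which contributes the factor $\mu$ and the additive term $c/\sqrt{N}$.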
Hence, by Corollary 12, the whole deep SSM model is
$(\mu,c)$-RC as a map between $X_{1}=B_{\ell^{2,2}(\mathbb{R}^{n_{\text{in}}})}(K_{\mathbf{u}})$ and $X_{2}=(\mathbb{R},|\cdot|)$. The Theorem is then a direct corollary of the following Lemma.
Lemma 24
Let $\mathcal{F}$ be a set of functions that is $(\mu,c)$-RC w.r.t.
$X_{1}=B_{\ell^{2,2}(\mathbb{R}^{n_{\text{in}}})}(K_{\mathbf{u}})$ and $X_{2}=(\mathbb{R},|\cdot|)$. The Rademacher complexity of $\mathcal{F}$
w.r.t. some dataset $S$ for which Assumption 9 holds admits
the following inequality:
$$
R_{S}(\mathcal{F})\leq\frac{\mu K_{\mathbf{u}}+c}{\sqrt{N}}.
$$
Proof
$$
\begin{aligned}
R_{S}(\mathcal{F}) &= R\left(\left\{(f(\mathbf{u}_{1}),\dots,f(\mathbf{u}_{N}))^{T}\mid f\in\mathcal{F}\right\}\right)=\mathbb{E}_{\sigma}\left[\sup_{f\in\mathcal{F}}\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}f(\mathbf{u}_{i})\right] \\
&\leq \mu\,\mathbb{E}_{\sigma}\left[\left\lVert\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}\mathbf{u}_{i}\right\rVert_{\ell^{2,2}(\mathbb{R}^{n_{\text{in}}})}\right]+\frac{c}{\sqrt{N}}
\end{aligned}
$$
By definition
$$
\begin{aligned}
\left\lVert\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}\mathbf{u}_{i}\right\rVert_{\ell^{2,2}(\mathbb{R}^{n_{\text{in}}})}
&= \sqrt{\sum_{k=1}^{T}\left\lVert\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}\mathbf{u}_{i}[k]\right\rVert_{2}^{2}} \\
&= \sqrt{\sum_{k=1}^{T}\left\langle\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}\mathbf{u}_{i}[k],\frac{1}{N}\sum_{j=1}^{N}\sigma_{j}\mathbf{u}_{j}[k]\right\rangle_{\mathbb{R}^{n_{\text{in}}}}} \\
&= \sqrt{\sum_{k=1}^{T}\frac{1}{N^{2}}\sum_{i=1}^{N}\sum_{j=1}^{N}\sigma_{i}\sigma_{j}\left\langle\mathbf{u}_{i}[k],\mathbf{u}_{j}[k]\right\rangle_{\mathbb{R}^{n_{\text{in}}}}}
\end{aligned}
$$
Therefore
$$
\begin{aligned}
\mathbb{E}_{\sigma}\left[\left\lVert\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}\mathbf{u}_{i}\right\rVert_{\ell^{2,2}(\mathbb{R}^{n_{\text{in}}})}\right]
&= \mathbb{E}_{\sigma}\left[\sqrt{\sum_{k=1}^{T}\frac{1}{N^{2}}\sum_{i=1}^{N}\sum_{j=1}^{N}\sigma_{i}\sigma_{j}\left\langle\mathbf{u}_{i}[k],\mathbf{u}_{j}[k]\right\rangle_{\mathbb{R}^{n_{\text{in}}}}}\right] \\
&\leq \sqrt{\mathbb{E}_{\sigma}\left[\sum_{k=1}^{T}\frac{1}{N^{2}}\sum_{i=1}^{N}\sum_{j=1}^{N}\sigma_{i}\sigma_{j}\left\langle\mathbf{u}_{i}[k],\mathbf{u}_{j}[k]\right\rangle_{\mathbb{R}^{n_{\text{in}}}}\right]} \\
&= \sqrt{\sum_{k=1}^{T}\frac{1}{N^{2}}\sum_{i=1}^{N}\sum_{j=1}^{N}\mathbb{E}_{\sigma}\left[\sigma_{i}\sigma_{j}\right]\left\langle\mathbf{u}_{i}[k],\mathbf{u}_{j}[k]\right\rangle_{\mathbb{R}^{n_{\text{in}}}}} \\
&= \sqrt{\sum_{k=1}^{T}\frac{1}{N^{2}}\sum_{i=1}^{N}\mathbb{E}_{\sigma}\left[\sigma_{i}^{2}\right]\left\langle\mathbf{u}_{i}[k],\mathbf{u}_{i}[k]\right\rangle_{\mathbb{R}^{n_{\text{in}}}}} \\
&= \sqrt{\frac{1}{N^{2}}\sum_{i=1}^{N}\sum_{k=1}^{T}\left\lVert\mathbf{u}_{i}[k]\right\rVert_{2}^{2}}
= \sqrt{\frac{1}{N^{2}}\sum_{i=1}^{N}\left\lVert\mathbf{u}_{i}\right\rVert^{2}_{\ell^{2,2}(\mathbb{R}^{n_{\text{in}}})}}
\leq \sqrt{\frac{1}{N^{2}}NK_{\mathbf{u}}^{2}}
\leq \frac{K_{\mathbf{u}}}{\sqrt{N}}
\end{aligned}
$$
Hence we have
$$
R_{S}(\mathcal{F})\leq\frac{\mu K_{\mathbf{u}}+c}{\sqrt{N}}
$$
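The estimate of $\mathbb{E}_{\sigma}\bigl[\lVert\frac{1}{N}\sum_{i}\sigma_{i}\mathbf{u}_{i}\rVert_{\ell^{2,2}}\bigr]$ used in the proof can be sanity-checked numerically. The sketch below is not part of the paper: it builds hypothetical toy inputs (the helper names `make_inputs` and `exact_rademacher_norm`, and the values of `N`, `T`, `D`, `K_U`, are all illustrative assumptions) and computes the expectation exactly by enumerating all $2^{N}$ sign patterns, which is feasible only for small $N$.

```python
import itertools
import math
import random


def l22_norm(seq):
    """l^{2,2} norm of a sequence given as a list of T vectors (lists of floats)."""
    return math.sqrt(sum(x * x for step in seq for x in step))


def signed_average(inputs, signs):
    """Return (1/N) * sum_i signs[i] * inputs[i], computed entrywise."""
    n = len(inputs)
    T, d = len(inputs[0]), len(inputs[0][0])
    return [
        [sum(s * u[k][j] for s, u in zip(signs, inputs)) / n for j in range(d)]
        for k in range(T)
    ]


def exact_rademacher_norm(inputs):
    """Exact E_sigma ||(1/N) sum_i sigma_i u_i||_{l^{2,2}} over all 2^N sign patterns."""
    n = len(inputs)
    total = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=n):
        total += l22_norm(signed_average(inputs, signs))
    return total / 2 ** n


def make_inputs(n, T, d, K_u, seed=0):
    """Hypothetical toy data: random sequences rescaled to have l^{2,2} norm at most K_u."""
    rng = random.Random(seed)
    inputs = []
    for _ in range(n):
        seq = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(T)]
        scale = K_u * rng.uniform(0.5, 1.0) / l22_norm(seq)
        inputs.append([[scale * x for x in step] for step in seq])
    return inputs


# Illustrative sizes only; none of these values come from the paper.
N, T, D, K_U = 8, 5, 3, 2.0
inputs = make_inputs(N, T, D, K_U)

lhs = exact_rademacher_norm(inputs)                               # exact expectation
jensen = math.sqrt(sum(l22_norm(u) ** 2 for u in inputs)) / N     # bound after the Jensen step
rhs = K_U / math.sqrt(N)                                          # final bound K_u / sqrt(N)
print(f"E_sigma norm = {lhs:.4f}, Jensen bound = {jensen:.4f}, K_u/sqrt(N) = {rhs:.4f}")
```

Increasing `T` while keeping every sequence inside the $\ell^{2,2}$ ball of radius $K_{\mathbf{u}}$ leaves the right-hand side unchanged, which is consistent with the time-independent nature of the bound.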
The constants $\mu$ and $c$ are obtained by substituting the results of Lemma
13 into Corollary 12.
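As a small illustration of how the bound of Lemma 24 might be used, the sketch below evaluates $(\mu K_{\mathbf{u}}+c)/\sqrt{N}$ and inverts it to find the smallest sample size that meets a target value. The function names and the numerical values of `mu`, `c`, `K_u` and `eps` are hypothetical; the actual constants would come from Lemma 13 and Corollary 12.

```python
import math


def rademacher_bound(mu, c, K_u, N):
    """Upper bound (mu * K_u + c) / sqrt(N) on the Rademacher complexity (Lemma 24)."""
    return (mu * K_u + c) / math.sqrt(N)


def samples_for_target(mu, c, K_u, eps):
    """Smallest integer N such that (mu * K_u + c) / sqrt(N) <= eps."""
    return math.ceil(((mu * K_u + c) / eps) ** 2)


# Hypothetical constants, chosen only for illustration.
mu, c, K_u, eps = 3.0, 1.5, 2.0, 0.25

for N in (100, 10_000):
    print(N, rademacher_bound(mu, c, K_u, N))

n_needed = samples_for_target(mu, c, K_u, eps)
print(n_needed, rademacher_bound(mu, c, K_u, n_needed))
```

Note that the sequence length $T$ does not appear in either function: under the lemma's assumptions the bound depends on the inputs only through $K_{\mathbf{u}}$.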