Pivotal Auto-Encoder via Self-Normalizing ReLU

Nelson Goldenstein, Jeremias Sulam, Yaniv Romano N. Goldenstein is with the Department of Electrical and Computer Engineering at the Technion, Israel.J. Sulam is with the Biomedical Engineering Department and the Mathematical Institute for Data Science of Johns Hopkins University, USA.Y. Romano is with the Department of Electrical and Computer Engineering and the Department of Computer Science at the Technion, Israel.
Abstract

Sparse auto-encoders are useful for extracting low-dimensional representations from high-dimensional data. However, their performance degrades sharply when the input noise at test time differs from the noise employed during training. This limitation hinders the applicability of auto-encoders in real-world scenarios where the level of noise in the input is unpredictable. In this paper, we formalize single hidden layer sparse auto-encoders as a transform learning problem. Leveraging the transform modeling interpretation, we propose an optimization problem that leads to a predictive model invariant to the noise level at test time. In other words, the same pre-trained model is able to generalize to different noise levels. The proposed optimization algorithm, derived from the square root lasso, is translated into a new, computationally efficient auto-encoding architecture. After proving that our new method is invariant to the noise level, we evaluate our approach by training networks using the proposed architecture for denoising tasks. Our experimental results demonstrate that the trained models yield a significant improvement in stability against varying types of noise compared to commonly used architectures.

Index Terms:
Sparse coding, transform learning, sparse auto-encoders, square root lasso.

I Introduction

A sparse auto-encoder is a type of artificial neural network that learns efficient data encodings through unsupervised learning [1]. The purpose of an auto-encoder is to capture the most important elements of the input to learn a lower dimensional representation for higher dimensional data, such as images [2]. It is commonly used for dimensionality reduction or feature extraction. The sparse auto-encoder architecture consists of two modules: an encoder and a decoder. The encoder compresses the input data into an encoded representation in a different domain, which is forced to be sparse. It then processes and filters the encoded data representations so that only the most important information is allowed through, preventing the model from memorizing the inputs and overfitting. The final module of the network, the decoder, decompresses the extracted sparse representations and reconstructs the data from its encoded state back to its original domain.

Interestingly, the observation that natural data can often be accurately approximated by sparse signals has been a prominent framework over the last twenty years [3, 4, 5]. Specifically, the transform model [6]—a generalized analysis sparse representation model—assumes that a signal xn𝑥superscript𝑛x\in\mathbb{R}^{n}italic_x ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT has a sparse representation zdsuperscript𝑧superscript𝑑z^{*}\in\mathbb{R}^{d}italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT over a particular transformation Wd×n𝑊superscript𝑑𝑛W\in\mathbb{R}^{d\times n}italic_W ∈ blackboard_R start_POSTSUPERSCRIPT italic_d × italic_n end_POSTSUPERSCRIPT to another domain, i.e.,

Wx=z, where z0d,formulae-sequence𝑊𝑥superscript𝑧much-less-than where subscriptnormsuperscript𝑧0𝑑Wx=z^{*},\text{ where }\|z^{*}\|_{0}\ll d,italic_W italic_x = italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT , where ∥ italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ≪ italic_d , (1)

where 0\|\cdot\|_{0}∥ ⋅ ∥ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT counts the number of nonzero elements of a vector. This representation is typically a higher dimensional signal, i.e., dn𝑑𝑛d\geqslant nitalic_d ⩾ italic_n, and this is the setting that we assume in this work. When n=d𝑛𝑑n=ditalic_n = italic_d and W𝑊Witalic_W is of full rank, the transformation forms a basis, whereas when d>n𝑑𝑛d>nitalic_d > italic_n, the transform is considered to be overcomplete. In situations where we observe a noisy version y𝑦yitalic_y of the clean signal x𝑥xitalic_x, corrupted by additive noise, the equation becomes

Wy=z+e,𝑊𝑦superscript𝑧𝑒Wy=z^{*}+e,italic_W italic_y = italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT + italic_e ,

where e𝑒eitalic_e denotes the error or residual in the transform domain.

In this context, the task of finding the sparse representation of a signal given W𝑊Witalic_W is called sparse coding. The first module of a sparse auto-encoder—the encoder—can be formally written as a transform sparse coding problem:

z^=arg minz12zWy22+λz1,^𝑧subscriptarg min𝑧12subscriptsuperscriptnorm𝑧𝑊𝑦22𝜆subscriptnorm𝑧1\widehat{z}=\operatorname*{\text{arg\,min}}_{z}\frac{1}{2}\|z-Wy\|^{2}_{2}+% \lambda\|z\|_{1},over^ start_ARG italic_z end_ARG = argmin start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 2 end_ARG ∥ italic_z - italic_W italic_y ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_λ ∥ italic_z ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , (2)

where z^^𝑧\widehat{z}over^ start_ARG italic_z end_ARG is the estimated latent space representation, W𝑊Witalic_W is a known transformation, and λ+𝜆subscript\lambda\in\mathbb{R}_{+}italic_λ ∈ blackboard_R start_POSTSUBSCRIPT + end_POSTSUBSCRIPT is a hyperparameter. The optimization problem in (2) minimizes the transform residual e𝑒eitalic_e with a sparse prior for z𝑧zitalic_z to estimate zsuperscript𝑧z^{*}italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT. The parameter λ𝜆\lambdaitalic_λ governs the trade-off between sparsity and residual error. Observe that the optimal selection of λ𝜆\lambdaitalic_λ is dependent on e𝑒eitalic_e since zWy=esuperscript𝑧𝑊𝑦𝑒z^{*}-Wy=eitalic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT - italic_W italic_y = italic_e; hence, the value of λ𝜆\lambdaitalic_λ ought to be calibrated to increase with the magnitude of e𝑒eitalic_e.

The closed form solution to the encoding problem (2) is

z^=proxλz1(Wy)=Sλ(Wy),^𝑧subscriptprox𝜆subscriptnorm𝑧1𝑊𝑦subscript𝑆𝜆𝑊𝑦\widehat{z}=\text{prox}_{\lambda\|z\|_{1}}\big{(}Wy\big{)}=S_{\lambda}\big{(}% Wy\big{)},over^ start_ARG italic_z end_ARG = prox start_POSTSUBSCRIPT italic_λ ∥ italic_z ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_W italic_y ) = italic_S start_POSTSUBSCRIPT italic_λ end_POSTSUBSCRIPT ( italic_W italic_y ) ,

where proxf(v)=arg minx(f(x)+12xv2)subscriptprox𝑓𝑣subscriptarg min𝑥𝑓𝑥12superscriptdelimited-∥∥𝑥𝑣2\text{prox}_{f}(v)=\operatorname*{\text{arg\,min}}_{x}(f(x)+\tfrac{1}{2}\lVert x% -v\rVert^{2})prox start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ( italic_v ) = argmin start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ( italic_f ( italic_x ) + divide start_ARG 1 end_ARG start_ARG 2 end_ARG ∥ italic_x - italic_v ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) is the proximal operator and Sλsubscript𝑆𝜆S_{\lambda}italic_S start_POSTSUBSCRIPT italic_λ end_POSTSUBSCRIPT is the soft-thresholding operator Sλ(x)=sign(x)max(|x|λ,0)subscript𝑆𝜆𝑥sign𝑥max𝑥𝜆0S_{\lambda}(x)=\text{sign}(x)\text{max}(|x|-\lambda,0)italic_S start_POSTSUBSCRIPT italic_λ end_POSTSUBSCRIPT ( italic_x ) = sign ( italic_x ) max ( | italic_x | - italic_λ , 0 ). If we add the assumption of non-negativity of zsuperscript𝑧z^{*}italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT, the solution can be rewritten as

z^=ReLU(Wyλ),^𝑧ReLU𝑊𝑦𝜆\widehat{z}={\text{ReLU}}(Wy-\lambda),over^ start_ARG italic_z end_ARG = ReLU ( italic_W italic_y - italic_λ ) ,

where ReLU(x)=max(x,0)ReLU𝑥max𝑥0{\text{ReLU}}(x)=\text{max}(x,0)ReLU ( italic_x ) = max ( italic_x , 0 ). From this expression we understand the influence of λ𝜆\lambdaitalic_λ on the solution and its role in filtering perturbations. Higher values of λ𝜆\lambdaitalic_λ lead to lower and sparser solutions. Therefore, as mentioned earlier, the optimal value of λ𝜆\lambdaitalic_λ is a function of the noise; the stronger the noise, the larger λ𝜆\lambdaitalic_λ must be.

The decoder can also be written as an inverse operation following the same model. Formally, given W𝑊Witalic_W and z^^𝑧\widehat{z}over^ start_ARG italic_z end_ARG, we can obtain a least squares approximation to the true signal x𝑥xitalic_x by minimizing Wxz^22subscriptsuperscriptnorm𝑊𝑥^𝑧22\|Wx-\widehat{z}\|^{2}_{2}∥ italic_W italic_x - over^ start_ARG italic_z end_ARG ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT with respect to x𝑥xitalic_x. Thus, the recovered signal is

x^=W+z^,^𝑥superscript𝑊^𝑧\widehat{x}=W^{+}\widehat{z},over^ start_ARG italic_x end_ARG = italic_W start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT over^ start_ARG italic_z end_ARG ,

where W+superscript𝑊W^{+}italic_W start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT is the pseudoinverse of W𝑊Witalic_W. In particular, if W𝑊Witalic_W has full column rank, W+=(W𝖳W)1W𝖳superscript𝑊superscriptsuperscript𝑊𝖳𝑊1superscript𝑊𝖳W^{+}=(W^{\mathsf{T}}W)^{-1}W^{\mathsf{T}}italic_W start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT = ( italic_W start_POSTSUPERSCRIPT sansserif_T end_POSTSUPERSCRIPT italic_W ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT italic_W start_POSTSUPERSCRIPT sansserif_T end_POSTSUPERSCRIPT.

The connection between the transform model (1) and sparse auto-encoders is clear. The linear transformation W𝑊Witalic_W represents the weights of any combination of linear layers, including convolutional operations, and λ𝜆\lambdaitalic_λ is the bias parameter; and both are trainable parameters of the network. This connection has been successfully applied to numerous computer vision and image processing tasks [7]. When adapted to deep learning, it has demonstrated state-of-the-art performance in various machine learning applications, such as image classification [8], online video denoising [9], semantic segmentation of images [10], super-resolution [11], clustering [12], and others.

The main problem of the presented encoder algorithm (2) is the bias parameter λ𝜆\lambdaitalic_λ. The optimal selection of λ𝜆\lambdaitalic_λ is influenced by the noise, which means its ideal value varies with differing noise levels. Indeed, the optimality conditions for the correct estimation of the sparse representation zsuperscript𝑧z^{*}italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT are not guaranteed at different noise intensities, even for a known W𝑊Witalic_W. In other words, the model’s performance may deteriorate significantly if the noise level at test time is different from the one used during training [13]. Therefore, the network must be re-trained, and a different bias must be learned for each noise level [14]. Moreover, in an environment where the noise is unknown, as in most practical cases, finding the best bias becomes infeasible: it is impossible to choose the correct bias without estimating the noise level.

Our contribution. To overcome the dependence of the bias λ𝜆\lambdaitalic_λ on the noise level, we draw inspiration from the square root lasso problem, introduced by [15] and detailed in Section II-A, and propose a modification of the transform sparse coding algorithm for the encoder module

z^=arg minzzWy2+λz1.^𝑧subscriptarg min𝑧subscriptnorm𝑧𝑊𝑦2𝜆subscriptnorm𝑧1\widehat{z}=\operatorname*{\text{arg\,min}}_{z}\|z-Wy\|_{2}+\lambda\|z\|_{1}.over^ start_ARG italic_z end_ARG = argmin start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT ∥ italic_z - italic_W italic_y ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_λ ∥ italic_z ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . (3)

Notice that the residual term is no longer quadratic. The main advantage of (3) is that the hyperparameter λ𝜆\lambdaitalic_λ is now pivotal to the noise energy. In other words, the optimal choice of λ𝜆\lambdaitalic_λ is independent of the noise level, which is difficult to estimate reliably because it is an ill-posed problem [16]. We prove in Section III that this property holds in the presence of both bounded noise and additive Gaussian noise. This stands in sharp contrast to vanilla sparse auto-encoders, where one needs to know the true standard deviation of the noise to fit the bias of the original transform sparse coding problem (2).

Furthermore, we propose an efficient and differentiable algorithm to solve the new pivotal sparse coding problem based on proximal gradient descent, as described in Section IV. Our algorithm is differentiable in the sense that it is compatible with gradient-based optimization techniques, enabling the minimization of a cost function through methods such as automatic differentiation and backpropagation. This leads to the development of a novel non-linear function called Self Normalizing ReLU (NeLU), which easily integrates into common neural network architectures. In Section V, we conduct numerical experiments using both synthetic and real data to illustrate how our approach is significantly more resilient to various noise levels.

II Related work

II-A Synthesis sparse modeling of signals

We begin by introducing a framework parallel to the analysis sparse representation model, presented in Section I, named the synthesis model. This model serves as the premise for introducing the square root lasso algorithm, which we will extend to the transform model. The synthesis model assumes that a signal xn𝑥superscript𝑛x\in\mathbb{R}^{n}italic_x ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT can be represented as a linear combination of a few columns, called atoms, from a matrix Dn×d𝐷superscript𝑛𝑑D\in\mathbb{R}^{n\times d}italic_D ∈ blackboard_R start_POSTSUPERSCRIPT italic_n × italic_d end_POSTSUPERSCRIPT named dictionary. In plain language, the signal corresponds to a multiplication of a dictionary by a sparse vector zdsuperscript𝑧superscript𝑑z^{*}\in\mathbb{R}^{d}italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT, i.e.,

x=Dz.𝑥𝐷superscript𝑧x=Dz^{*}.italic_x = italic_D italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT .

Various algorithms have been proposed to implement the sparse coding task of estimating zsuperscript𝑧z^{*}italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT given x𝑥xitalic_x [17]. These include the matching pursuit [18] and the basis pursuit [19], also called lasso in the statistics literature [20]

minz12nyDz22+λLz1.subscript𝑧12𝑛subscriptsuperscriptnorm𝑦𝐷𝑧22subscript𝜆𝐿subscriptnorm𝑧1\min_{z}\frac{1}{2n}\|y-Dz\|^{2}_{2}+\lambda_{L}\|z\|_{1}.roman_min start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 2 italic_n end_ARG ∥ italic_y - italic_D italic_z ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_λ start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT ∥ italic_z ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . (4)

As with (2), the primary drawback of lasso is that the optimal value of the parameter λLsubscript𝜆𝐿\lambda_{L}italic_λ start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT is dependent on the noise level and, therefore, must be adjusted for each specific noise level. For example, in a Gaussian noise environment, λLsubscript𝜆𝐿\lambda_{L}italic_λ start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT proportional to σlogd/n𝜎𝑑𝑛\sigma\sqrt{\log d/n}italic_σ square-root start_ARG roman_log italic_d / italic_n end_ARG is minimax optimal for signal reconstruction, x^x2subscriptnorm^𝑥𝑥2\|\widehat{x}-x\|_{2}∥ over^ start_ARG italic_x end_ARG - italic_x ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, in high dimensions [21]. Therefore, to achieve a good estimation of zsuperscript𝑧z^{*}italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT or prediction of x𝑥xitalic_x for an unknown noise level, one must estimate σ𝜎\sigmaitalic_σ.

Square root lasso. A modified version of the lasso has been proposed to solve the dependence of λ𝜆\lambdaitalic_λ on noise power, called square root lasso [15]

minz1nyDz2+λLz1,subscript𝑧1𝑛subscriptnorm𝑦𝐷𝑧2subscript𝜆𝐿subscriptnorm𝑧1\min_{z}\frac{1}{\sqrt{n}}\|y-Dz\|_{2}+\lambda_{\sqrt{L}}\|z\|_{1},roman_min start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT divide start_ARG 1 end_ARG start_ARG square-root start_ARG italic_n end_ARG end_ARG ∥ italic_y - italic_D italic_z ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_λ start_POSTSUBSCRIPT square-root start_ARG italic_L end_ARG end_POSTSUBSCRIPT ∥ italic_z ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , (5)

which takes the square root of the error term of (4). Belloni et al. [15] have proven that the square root lasso achieves minimax optimality of estimation and signal reconstruction error for a hyperparameter λLsubscript𝜆𝐿\lambda_{\sqrt{L}}italic_λ start_POSTSUBSCRIPT square-root start_ARG italic_L end_ARG end_POSTSUBSCRIPT, which is pivotal to the noise level. For instance, in the case of Gaussian noise, the minimax optimal λLsubscript𝜆𝐿\lambda_{\sqrt{L}}italic_λ start_POSTSUBSCRIPT square-root start_ARG italic_L end_ARG end_POSTSUBSCRIPT is proportional to logd/n𝑑𝑛\sqrt{\log d/n}square-root start_ARG roman_log italic_d / italic_n end_ARG, which is independent of σ𝜎\sigmaitalic_σ and leads to a constant parameter for all Gaussian distributions. Moreover, numerous algorithms have been developed to efficiently solve the problem based on its convex property [21].

Although the square root lasso is powerful and attractive, it has never been applied in the context of dictionary learning or adapted to form a novel neural network architecture. In this work, we draw an exciting connection between the square root lasso and transform modeling. The extension to synthesis model and dictionary learning is a direct consequence of the analysis of the transform learning since (3) can be seen as a particular case of (5), where D=I𝐷𝐼D=Iitalic_D = italic_I and the input signal is Wy𝑊𝑦Wyitalic_W italic_y. Further details are provided in Appendix B.

II-B Transform learning

As elaborated in Section I, the transform model and sparse auto-encoders are tightly connected. The forward pass in sparse encoders essentially acts as a sparse representation pursuit within the transform model. In this framework, the transformation is characterized by the weights within a set of layers, which may include convolutional ones. Thus, the transformation is also learned during the model training process. This method of deriving the transformation directly from the data is known as transform learning.

At its core, transform learning [22] employs a data-driven feature extractor to transform input data into a suitable representation. This approach can improve upon the limited ability of analytical transformation methods, such as wavelets, to handle data. As a result, transform learning often yields superior restoration performance compared to the analytical approaches [6].

Transform learning is comparable to dictionary learning from an analysis perspective [23]. In dictionary learning, a basis D𝐷Ditalic_D is trained to recover the data, x=Dz𝑥𝐷superscript𝑧x=Dz^{*}italic_x = italic_D italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT, from the representation zsuperscript𝑧z^{*}italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT [24]. In contrast, transform learning aims to learn a transformation W𝑊Witalic_W to generate the representation zsuperscript𝑧z^{*}italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT. Since the transformation W𝑊Witalic_W is data-dependent and the output representation zsuperscript𝑧z^{*}italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT is unknown, jointly learning both is a challenging task.

In this work, we propose a new learning problem that combines the transform model with deep learning through the interpretable framework of transform learning. Our objective is to demonstrate that our method can exhibit the interesting properties of the classical problem established in Section III. We extend our algorithm to the transform learning framework and demonstrate its effectiveness in enhancing the robustness of deep learning through experiments in Section V.

II-C Blind denoising networks

Blind denoising is the task of removing noise from an input signal when the noise magnitude is unknown at test time. This task is closely related to our objective. Several methods have been proposed to tackle this task, with the most common approach for blind denoising being to estimate the noise distribution (or simply the noise level) to identify and remove it from the signal [25, 26, 27]. However, this method lacks flexibility because it requires learning different weights for each noise level and solving the difficult problem of noise estimation.

In contrast to this approach, where all weights are learned from scratch for each noise level, some existing methods recognize that not all weights depend on the noise and only adjust the regularization parameters, such as the bias in the case of auto-encoders, for different noise levels [14]. This method reduces the overall number of network parameters to be learned, but they still need to be adjusted for each noise level.

Another standard technique, which emerged alongside the advancement of deep learning in computer vision tasks, is to train one model across a wide range of expected noise levels [28]. In this case, the denoising performance of such a model is generally inferior compared to a model trained for a specific noise distribution [29]. Moreover, it has been shown that a model trained using this method tends to focus on the average noise level of the training range, rather than learning generalizable weights for all noise levels. To address this issue, Gnansambandam et al. [29] proposed determining the optimal noise training sample distribution from a minimax risk optimization perspective. The approach proposed in [29] is orthogonal to ours, as it does not suggest modifying the network architecture but instead focuses solely on the training strategy.

In this work, we propose a new architecture for implementing (3) that is inherently noise level independent. Our theoretical study, presented in Section III, shows that the same model parameters achieve high-quality signal recovery across all noise levels when learned correctly. Indeed, our experimental results indicate that a single neural network can be trained and practically applied to handle all noise levels, without re-training or updating the bias term.

III Theory

We begin by formally restating the transform model from Section I. Specifically, we consider a sparse linear model in high dimensions for a noisy signal

y=x+ξ,𝑦𝑥𝜉y=x+\xi,italic_y = italic_x + italic_ξ ,

where xn𝑥superscript𝑛x\in\mathbb{R}^{n}italic_x ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT is the clean sparsifiable signal and ξn𝜉superscript𝑛\xi\in\mathbb{R}^{n}italic_ξ ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT denotes the random additive noise. Thus, applying a given transformation W𝑊Witalic_W to y𝑦yitalic_y yields

Wy=W(x+ξ)=Wx+Wξ=z+e.𝑊𝑦𝑊𝑥𝜉𝑊𝑥𝑊𝜉superscript𝑧𝑒Wy=W(x+\xi)=Wx+W\xi=z^{*}+e.italic_W italic_y = italic_W ( italic_x + italic_ξ ) = italic_W italic_x + italic_W italic_ξ = italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT + italic_e .

Here, the goal is to recover the clean sparse representation z=Wxsuperscript𝑧𝑊𝑥z^{*}=Wxitalic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = italic_W italic_x, while e=Wξ𝑒𝑊𝜉e=W\xiitalic_e = italic_W italic_ξ is the error.

In the following subsections, we demonstrate that the desired properties of the square root lasso also hold for the sparse encoding problem (3), given a known transform W𝑊Witalic_W. Specifically, the parameter λ𝜆\lambdaitalic_λ is pivotal to the noise level, and as a result, not only can the solution of (3) be computed efficiently, but all parameters of the problem are also independent of the noise level. We prove that the recovery of the correct support, i.e., the group of nonzero elements {i[d]:|zi|>0}conditional-set𝑖delimited-[]𝑑subscriptsuperscript𝑧𝑖0\{i\in[d]:|z^{*}_{i}|>0\}{ italic_i ∈ [ italic_d ] : | italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | > 0 }, and the bound on the estimation error, z^z2subscriptnorm^𝑧superscript𝑧2\|\widehat{z}-z^{*}\|_{2}∥ over^ start_ARG italic_z end_ARG - italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, can be extended to our proposed new optimization problem under the presence of bounded noise.

III-A Support recovery

First, we prove the recovery of the correct support in the presence of bounded noise, a prevalent scenario in the robustness literature [30, 31]. Subsequently, we extend these results to Gaussian noise within a probabilistic setting. To prove these results, the following assumptions are necessary.

Assumption 1.

The noise ξ2subscriptdelimited-∥∥𝜉2\lVert\xi\rVert_{2}∥ italic_ξ ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT is bounded.

Consequently, e2subscriptdelimited-∥∥𝑒2\lVert e\rVert_{2}∥ italic_e ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT is also bounded since

e2smaxξ2ϵ,subscriptdelimited-∥∥𝑒2subscript𝑠subscriptdelimited-∥∥𝜉2italic-ϵ\lVert e\rVert_{2}\leqslant s_{\max}\lVert\xi\rVert_{2}\triangleq\epsilon,∥ italic_e ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ⩽ italic_s start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT ∥ italic_ξ ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ≜ italic_ϵ ,

where smaxsubscript𝑠s_{\max}italic_s start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT is the largest singular value of W𝑊Witalic_W.

Assumption 2.

There exists a minimal positive number η𝜂\etaitalic_η that satisfies the condition

λz1ηϵ.𝜆subscriptnormsuperscript𝑧1𝜂italic-ϵ\lambda\|z^{*}\|_{1}\leqslant\eta\epsilon.italic_λ ∥ italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⩽ italic_η italic_ϵ .

The constant η𝜂\etaitalic_η represents the ratio between the regularization term λz1𝜆subscriptnormsuperscript𝑧1\lambda\|z^{*}\|_{1}italic_λ ∥ italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and the noise threshold ϵitalic-ϵ\epsilonitalic_ϵ, which is the value of the reconstruction error term when z^=z^𝑧superscript𝑧\widehat{z}=z^{*}over^ start_ARG italic_z end_ARG = italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT. A smaller value of η𝜂\etaitalic_η implies that the regularization term is proportionally smaller, leading to reduced shrinkage and potentially more accurate signal recovery.

Theorem 1.

Let Assumption 1 be satisfied, let η𝜂\etaitalic_η satisfy Assumption 2. Then, for

λ=ee2,𝜆subscriptdelimited-∥∥𝑒subscriptdelimited-∥∥𝑒2\lambda=\frac{\lVert e\rVert_{\infty}}{\lVert e\rVert_{2}},italic_λ = divide start_ARG ∥ italic_e ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT end_ARG start_ARG ∥ italic_e ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG ,

and z^^𝑧\widehat{z}over^ start_ARG italic_z end_ARG be the solution to (3), we get that

z^zλ(2+η)ϵ.subscriptnorm^𝑧superscript𝑧𝜆2𝜂italic-ϵ\|{\widehat{z}-z^{*}}\|_{\infty}\leqslant\lambda(2+\eta)\epsilon.∥ over^ start_ARG italic_z end_ARG - italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT ⩽ italic_λ ( 2 + italic_η ) italic_ϵ .

Moreover, if

minj[d]|zj|>2λ(2+η)ϵ,subscript𝑗delimited-[]𝑑superscriptsubscript𝑧𝑗2𝜆2𝜂italic-ϵ\min_{j\in[d]}|z_{j}^{*}|>2\lambda(2+\eta)\epsilon,roman_min start_POSTSUBSCRIPT italic_j ∈ [ italic_d ] end_POSTSUBSCRIPT | italic_z start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT | > 2 italic_λ ( 2 + italic_η ) italic_ϵ ,

then the estimated support

𝒮^={j[d]:|z^j|>λ(2+η)ϵ}^𝒮conditional-set𝑗delimited-[]𝑑subscript^𝑧𝑗𝜆2𝜂italic-ϵ\widehat{\mathcal{S}}=\{j\in[d]:|\widehat{z}_{j}|>\lambda(2+\eta)\epsilon\}over^ start_ARG caligraphic_S end_ARG = { italic_j ∈ [ italic_d ] : | over^ start_ARG italic_z end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT | > italic_λ ( 2 + italic_η ) italic_ϵ }

recovers the true sparsity pattern 𝒮={j[d]:|zj|>0}𝒮conditional-set𝑗delimited-[]𝑑subscript𝑧𝑗0\mathcal{S}=\{j\in[d]:|z_{j}|>0\}caligraphic_S = { italic_j ∈ [ italic_d ] : | italic_z start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT | > 0 } correctly, i.e.,

𝒮^=𝒮.^𝒮𝒮\widehat{\mathcal{S}}=\mathcal{S}.over^ start_ARG caligraphic_S end_ARG = caligraphic_S .
Proof of Theorem 1.

Our derivation is inspired by the one in [32]. For the solution z^^𝑧\widehat{z}over^ start_ARG italic_z end_ARG of (3), we have:

z^zsubscriptdelimited-∥∥^𝑧superscript𝑧\displaystyle\lVert\widehat{z}-z^{*}\rVert_{\infty}∥ over^ start_ARG italic_z end_ARG - italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT =e^e(e^=Wyz^)absentsubscriptdelimited-∥∥^𝑒𝑒^𝑒𝑊𝑦^𝑧\displaystyle=\lVert\widehat{e}-e\rVert_{\infty}\qquad\left(\widehat{e}=Wy-% \widehat{z}\right)= ∥ over^ start_ARG italic_e end_ARG - italic_e ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT ( over^ start_ARG italic_e end_ARG = italic_W italic_y - over^ start_ARG italic_z end_ARG )
e^+eabsentsubscriptdelimited-∥∥^𝑒subscriptdelimited-∥∥𝑒\displaystyle\leqslant\lVert\widehat{e}\rVert_{\infty}+\lVert e\rVert_{\infty}⩽ ∥ over^ start_ARG italic_e end_ARG ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT + ∥ italic_e ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT
e^+λϵ.absentsubscriptdelimited-∥∥^𝑒𝜆italic-ϵ\displaystyle\leqslant\lVert\widehat{e}\rVert_{\infty}+\lambda\epsilon.⩽ ∥ over^ start_ARG italic_e end_ARG ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT + italic_λ italic_ϵ . (6)

We bound e^subscriptdelimited-∥∥^𝑒\lVert\widehat{e}\rVert_{\infty}∥ over^ start_ARG italic_e end_ARG ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT using KKT sub-gradient optimality conditions,

e^subscriptdelimited-∥∥^𝑒\displaystyle\lVert\widehat{e}\rVert_{\infty}∥ over^ start_ARG italic_e end_ARG ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT λe^2.absent𝜆subscriptdelimited-∥∥^𝑒2\displaystyle\leqslant\lambda\lVert\widehat{e}\rVert_{2}.⩽ italic_λ ∥ over^ start_ARG italic_e end_ARG ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . (7)

It now remains to bound e^2subscriptdelimited-∥∥^𝑒2\lVert\widehat{e}\rVert_{2}∥ over^ start_ARG italic_e end_ARG ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, which is done with 2. By minimality of the estimator,

e^2+λz^1subscriptdelimited-∥∥^𝑒2𝜆subscriptdelimited-∥∥^𝑧1\displaystyle\lVert\widehat{e}\rVert_{2}+\lambda\lVert\widehat{z}\rVert_{1}∥ over^ start_ARG italic_e end_ARG ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_λ ∥ over^ start_ARG italic_z end_ARG ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT e2+λz1absentsubscriptdelimited-∥∥𝑒2𝜆subscriptdelimited-∥∥superscript𝑧1\displaystyle\leqslant\lVert e\rVert_{2}+\lambda\lVert z^{*}\rVert_{1}⩽ ∥ italic_e ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_λ ∥ italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT
e^2subscriptdelimited-∥∥^𝑒2\displaystyle\lVert\widehat{e}\rVert_{2}∥ over^ start_ARG italic_e end_ARG ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT e2+λz1absentsubscriptdelimited-∥∥𝑒2𝜆subscriptdelimited-∥∥superscript𝑧1\displaystyle\leqslant\lVert e\rVert_{2}+\lambda\lVert z^{*}\rVert_{1}⩽ ∥ italic_e ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_λ ∥ italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT
ϵ+ηϵ.absentitalic-ϵ𝜂italic-ϵ\displaystyle\leqslant\epsilon+\eta\epsilon.⩽ italic_ϵ + italic_η italic_ϵ . (8)

Combining equations (III-A), (7) and (III-A), we get:

z^ze^+λϵλ(η+1)ϵ+λϵ.subscriptdelimited-∥∥^𝑧superscript𝑧subscriptdelimited-∥∥^𝑒𝜆italic-ϵ𝜆𝜂1italic-ϵ𝜆italic-ϵ\displaystyle\lVert\widehat{z}-z^{*}\rVert_{\infty}\leqslant\lVert\widehat{e}% \rVert_{\infty}+\lambda\epsilon\leqslant\lambda(\eta+1)\epsilon+\lambda\epsilon.∥ over^ start_ARG italic_z end_ARG - italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT ⩽ ∥ over^ start_ARG italic_e end_ARG ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT + italic_λ italic_ϵ ⩽ italic_λ ( italic_η + 1 ) italic_ϵ + italic_λ italic_ϵ .

This proves the bound on z^zsubscriptdelimited-∥∥^𝑧superscript𝑧\lVert\widehat{z}-z^{*}\rVert_{\infty}∥ over^ start_ARG italic_z end_ARG - italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT. Finally, the correct support recovery follows directly from Theorem 4.1 in [33]. ∎

Remark 1.

It is essential to highlight that λ𝜆\lambdaitalic_λ maintains the same value across different noise levels. For example, let ξ1subscript𝜉1\xi_{1}italic_ξ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT be a realization from a standard normal distribution 𝒩(0,I)𝒩0𝐼\mathcal{N}(0,I)caligraphic_N ( 0 , italic_I ) and define ξ2=σξ1subscript𝜉2𝜎subscript𝜉1\xi_{2}=\sigma\xi_{1}italic_ξ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = italic_σ italic_ξ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, for any standard deviation σ+𝜎subscript\sigma\in\mathbb{R}_{+}italic_σ ∈ blackboard_R start_POSTSUBSCRIPT + end_POSTSUBSCRIPT. Consequently, we obtain

λ=ee2=Wξ1Wξ12=Wξ2Wξ22.𝜆subscriptdelimited-∥∥𝑒subscriptdelimited-∥∥𝑒2subscriptdelimited-∥∥𝑊subscript𝜉1subscriptdelimited-∥∥𝑊subscript𝜉12subscriptdelimited-∥∥𝑊subscript𝜉2subscriptdelimited-∥∥𝑊subscript𝜉22\lambda=\frac{\lVert e\rVert_{\infty}}{\lVert e\rVert_{2}}=\frac{\lVert W\xi_{% 1}\rVert_{\infty}}{\lVert W\xi_{1}\rVert_{2}}=\frac{\lVert W\xi_{2}\rVert_{% \infty}}{\lVert W\xi_{2}\rVert_{2}}.italic_λ = divide start_ARG ∥ italic_e ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT end_ARG start_ARG ∥ italic_e ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG = divide start_ARG ∥ italic_W italic_ξ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT end_ARG start_ARG ∥ italic_W italic_ξ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG = divide start_ARG ∥ italic_W italic_ξ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT end_ARG start_ARG ∥ italic_W italic_ξ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG .

This observation underscores the importance of our choice of λ𝜆\lambdaitalic_λ, as it is independent of σ𝜎\sigmaitalic_σ and remains consistent across various noise levels.

On a different note, we can deduce that the correct support can be fully recovered if the signal-to-noise ratio is sufficiently high, as formally stated in Theorem 1. Furthermore, λ𝜆\lambdaitalic_λ can be conveniently bounded in all cases by

1nλ=ee21.1𝑛𝜆subscriptdelimited-∥∥𝑒subscriptdelimited-∥∥𝑒21\frac{1}{\sqrt{n}}\leqslant\lambda=\frac{\lVert e\rVert_{\infty}}{\lVert e% \rVert_{2}}\leqslant 1.divide start_ARG 1 end_ARG start_ARG square-root start_ARG italic_n end_ARG end_ARG ⩽ italic_λ = divide start_ARG ∥ italic_e ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT end_ARG start_ARG ∥ italic_e ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG ⩽ 1 .

A bound for λ𝜆\lambdaitalic_λ can be used to guarantee that no false positive support error occurs. Improved bounds can be achieved with additional assumptions on the noise. In fact, we investigate this aspect for the Gaussian case in Section III-C.

III-B Estimation error

We now proceed to show that the estimation error, z^z2subscriptnorm^𝑧superscript𝑧2\|\widehat{z}-z^{*}\|_{2}∥ over^ start_ARG italic_z end_ARG - italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, can also be bounded with the same choice of parameter λ𝜆\lambdaitalic_λ, under the same Assumptions 1 and 2.

Theorem 2.

Let 1 be satisfied and let η𝜂\etaitalic_η satisfy 2. Then, for z^^𝑧\widehat{z}over^ start_ARG italic_z end_ARG, the solution to (3), we get

z^z2(2+η)ϵ.subscriptnorm^𝑧superscript𝑧22𝜂italic-ϵ\|\widehat{z}-z^{*}\|_{2}\leqslant(2+\eta)\epsilon.∥ over^ start_ARG italic_z end_ARG - italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ⩽ ( 2 + italic_η ) italic_ϵ .
Proof of Theorem 2.

From the optimality of the solution:

z^Wy2+λz^1zWy2+λz1.subscriptnorm^𝑧𝑊𝑦2𝜆subscriptnorm^𝑧1subscriptnormsuperscript𝑧𝑊𝑦2𝜆subscriptnormsuperscript𝑧1\|\widehat{z}-Wy\|_{2}+\lambda\|\widehat{z}\|_{1}\leqslant\|z^{*}-Wy\|_{2}+% \lambda\|z^{*}\|_{1}.∥ over^ start_ARG italic_z end_ARG - italic_W italic_y ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_λ ∥ over^ start_ARG italic_z end_ARG ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⩽ ∥ italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT - italic_W italic_y ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_λ ∥ italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT .

Using Wy=z+e𝑊𝑦superscript𝑧𝑒Wy=z^{*}+eitalic_W italic_y = italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT + italic_e, and applying the reverse triangle inequality, we get

z^ze2subscriptnorm^𝑧superscript𝑧𝑒2\displaystyle\|\widehat{z}-z^{*}-e\|_{2}∥ over^ start_ARG italic_z end_ARG - italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT - italic_e ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT e2+λz1λz^1absentsubscriptnorm𝑒2𝜆subscriptnormsuperscript𝑧1𝜆subscriptnorm^𝑧1\displaystyle\leqslant\|e\|_{2}+\lambda\|z^{*}\|_{1}-\lambda\|\widehat{z}\|_{1}⩽ ∥ italic_e ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_λ ∥ italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - italic_λ ∥ over^ start_ARG italic_z end_ARG ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT
z^z2e2subscriptnorm^𝑧superscript𝑧2subscriptnorm𝑒2\displaystyle\|\widehat{z}-z^{*}\|_{2}-\|e\|_{2}∥ over^ start_ARG italic_z end_ARG - italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT - ∥ italic_e ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT e2+λz1λz^1absentsubscriptnorm𝑒2𝜆subscriptnormsuperscript𝑧1𝜆subscriptnorm^𝑧1\displaystyle\leqslant\|e\|_{2}+\lambda\|z^{*}\|_{1}-\lambda\|\widehat{z}\|_{1}⩽ ∥ italic_e ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_λ ∥ italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - italic_λ ∥ over^ start_ARG italic_z end_ARG ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT
z^z2subscriptnorm^𝑧superscript𝑧2\displaystyle\|\widehat{z}-z^{*}\|_{2}∥ over^ start_ARG italic_z end_ARG - italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT 2e2+λ(z1z^1).absent2subscriptnorm𝑒2𝜆subscriptnormsuperscript𝑧1subscriptnorm^𝑧1\displaystyle\leqslant 2\|e\|_{2}+\lambda\left(\|z^{*}\|_{1}-\|\widehat{z}\|_{% 1}\right).⩽ 2 ∥ italic_e ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_λ ( ∥ italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - ∥ over^ start_ARG italic_z end_ARG ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) .

Finally, using Assumptions 1 and 2, we have

z^z2e+λz1(2+η)ϵ,norm^𝑧superscript𝑧2norm𝑒𝜆subscriptnormsuperscript𝑧12𝜂italic-ϵ\|\widehat{z}-z^{*}\|\leqslant 2\|e\|+\lambda\|z^{*}\|_{1}\leqslant(2+\eta)\epsilon,∥ over^ start_ARG italic_z end_ARG - italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ ⩽ 2 ∥ italic_e ∥ + italic_λ ∥ italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⩽ ( 2 + italic_η ) italic_ϵ ,

which concludes the proof. ∎

In Appendix F, we present an empirical validation of Theorem 2.

III-C Gaussian noise

We now extend the results of Section III-A to the Gaussian case. The key point of this analysis is that we can use practical values for λ𝜆\lambdaitalic_λ, which can be computed independently of the noise level. We show that these λ𝜆\lambdaitalic_λ values are suitable for any additive Gaussian noise in the input and are thus pivotal to its standard deviation σ𝜎\sigmaitalic_σ.

Assumption 3.

The entries of ξ𝜉\xiitalic_ξ are i.i.d. 𝒩(0,σ2)𝒩0superscript𝜎2\mathcal{N}(0,\sigma^{2})caligraphic_N ( 0 , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) random variables.

Assumption 4.

The rows of W𝑊Witalic_W are normalized to unit 2subscript2\ell_{2}roman_ℓ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT norm.

Theorem 3.

Let 3 and 4 be satisfied, let η𝜂\etaitalic_η satisfy 2. Then, set

λ=a2logdn,𝜆𝑎2𝑑𝑛\lambda=a\sqrt{\frac{2\log d}{n}},italic_λ = italic_a square-root start_ARG divide start_ARG 2 roman_log italic_d end_ARG start_ARG italic_n end_ARG end_ARG ,

where a22𝑎22a\geqslant 2\sqrt{2}italic_a ⩾ 2 square-root start_ARG 2 end_ARG is a constant.

With probability at least 12d1a2/8(1+e2)en/2412superscript𝑑1superscript𝑎281superscript𝑒2superscript𝑒𝑛241-2d^{1-a^{2}/8}-(1+e^{2})e^{-n/24}1 - 2 italic_d start_POSTSUPERSCRIPT 1 - italic_a start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT / 8 end_POSTSUPERSCRIPT - ( 1 + italic_e start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) italic_e start_POSTSUPERSCRIPT - italic_n / 24 end_POSTSUPERSCRIPT, we have

z^zλ(2+2η+1smin)nsmaxσ,subscriptdelimited-∥∥^𝑧superscript𝑧𝜆22𝜂1subscript𝑠𝑛subscript𝑠𝜎\lVert\widehat{z}-z^{*}\rVert_{\infty}\leqslant\lambda(2+2\eta+\tfrac{1}{s_{% \min}})\sqrt{n}s_{\max}\sigma,∥ over^ start_ARG italic_z end_ARG - italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT ⩽ italic_λ ( 2 + 2 italic_η + divide start_ARG 1 end_ARG start_ARG italic_s start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT end_ARG ) square-root start_ARG italic_n end_ARG italic_s start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT italic_σ ,

where sminsubscript𝑠s_{\min}italic_s start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT and smaxsubscript𝑠s_{\max}italic_s start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT are the minimum and maximum singular values of W𝑊Witalic_W, respectively.

Proof of Theorem 3.

Let 𝒜𝒜\mathcal{A}caligraphic_A be the event

𝒜={ee2λ2smin}{sminσ2<e2n<2smaxσ}.𝒜subscriptdelimited-∥∥𝑒subscriptdelimited-∥∥𝑒2𝜆2subscript𝑠subscript𝑠𝜎2subscriptdelimited-∥∥𝑒2𝑛2subscript𝑠𝜎\mathcal{A}=\left\{\tfrac{\lVert e\rVert_{\infty}}{\lVert e\rVert_{2}}% \leqslant\tfrac{\lambda}{2s_{\min}}\right\}\cap\left\{s_{\min}\tfrac{\sigma}{% \sqrt{2}}<\tfrac{\lVert e\rVert_{2}}{\sqrt{n}}<2s_{\max}\sigma\right\}.caligraphic_A = { divide start_ARG ∥ italic_e ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT end_ARG start_ARG ∥ italic_e ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG ⩽ divide start_ARG italic_λ end_ARG start_ARG 2 italic_s start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT end_ARG } ∩ { italic_s start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT divide start_ARG italic_σ end_ARG start_ARG square-root start_ARG 2 end_ARG end_ARG < divide start_ARG ∥ italic_e ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG start_ARG square-root start_ARG italic_n end_ARG end_ARG < 2 italic_s start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT italic_σ } .

From Lemma 1 presented in Appendix A, it is deduced that (𝒜)12d1a2/8(1+e2)en/24𝒜12superscript𝑑1superscript𝑎281superscript𝑒2superscript𝑒𝑛24\mathbb{P}(\mathcal{A})\geqslant 1-2d^{1-a^{2}/8}-(1+e^{2})e^{-n/24}blackboard_P ( caligraphic_A ) ⩾ 1 - 2 italic_d start_POSTSUPERSCRIPT 1 - italic_a start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT / 8 end_POSTSUPERSCRIPT - ( 1 + italic_e start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) italic_e start_POSTSUPERSCRIPT - italic_n / 24 end_POSTSUPERSCRIPT. For our model and under the event 𝒜𝒜\mathcal{A}caligraphic_A, we have

z^zsubscriptdelimited-∥∥^𝑧superscript𝑧\displaystyle\lVert\widehat{z}-z^{*}\rVert_{\infty}∥ over^ start_ARG italic_z end_ARG - italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT e^+eabsentsubscriptdelimited-∥∥^𝑒subscriptdelimited-∥∥𝑒\displaystyle\leqslant\lVert\widehat{e}\rVert_{\infty}+\lVert e\rVert_{\infty}⩽ ∥ over^ start_ARG italic_e end_ARG ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT + ∥ italic_e ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT
e^+λsmaxsminnσ.absentsubscriptdelimited-∥∥^𝑒𝜆subscript𝑠subscript𝑠𝑛𝜎\displaystyle\leqslant\lVert\widehat{e}\rVert_{\infty}+\lambda\frac{s_{\max}}{% s_{\min}}\sqrt{n}\sigma.⩽ ∥ over^ start_ARG italic_e end_ARG ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT + italic_λ divide start_ARG italic_s start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT end_ARG start_ARG italic_s start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT end_ARG square-root start_ARG italic_n end_ARG italic_σ . (9)

By minimality of the estimator and 2,

e^2subscriptdelimited-∥∥^𝑒2\displaystyle\lVert\widehat{e}\rVert_{2}∥ over^ start_ARG italic_e end_ARG ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT e2+λz1absentsubscriptdelimited-∥∥𝑒2𝜆subscriptdelimited-∥∥superscript𝑧1\displaystyle\leqslant\lVert e\rVert_{2}+\lambda\lVert z^{*}\rVert_{1}⩽ ∥ italic_e ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_λ ∥ italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT
2nsmaxσ+2ηnsmaxσ.absent2𝑛subscript𝑠𝜎2𝜂𝑛subscript𝑠𝜎\displaystyle\leqslant 2\sqrt{n}s_{\max}\sigma+2\eta\sqrt{n}s_{\max}\sigma.⩽ 2 square-root start_ARG italic_n end_ARG italic_s start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT italic_σ + 2 italic_η square-root start_ARG italic_n end_ARG italic_s start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT italic_σ . (10)

Combining equations (7), (III-C) and (III-C), we get:

z^zλ(2nsmaxσ+2ηnsmaxσ)+λsmaxsminnσsubscriptdelimited-∥∥^𝑧superscript𝑧𝜆2𝑛subscript𝑠𝜎2𝜂𝑛subscript𝑠𝜎𝜆subscript𝑠subscript𝑠𝑛𝜎\displaystyle\lVert\widehat{z}-z^{*}\rVert_{\infty}\leqslant\lambda\left(2% \sqrt{n}s_{\max}\sigma+2\eta\sqrt{n}s_{\max}\sigma\right)+\lambda\frac{s_{\max% }}{s_{\min}}\sqrt{n}\sigma∥ over^ start_ARG italic_z end_ARG - italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT ⩽ italic_λ ( 2 square-root start_ARG italic_n end_ARG italic_s start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT italic_σ + 2 italic_η square-root start_ARG italic_n end_ARG italic_s start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT italic_σ ) + italic_λ divide start_ARG italic_s start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT end_ARG start_ARG italic_s start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT end_ARG square-root start_ARG italic_n end_ARG italic_σ
z^zλ(2+2η+1smin)nsmaxσ.subscriptdelimited-∥∥^𝑧superscript𝑧𝜆22𝜂1subscript𝑠𝑛subscript𝑠𝜎\displaystyle\lVert\widehat{z}-z^{*}\rVert_{\infty}\leqslant\lambda(2+2\eta+% \tfrac{1}{s_{\min}})\sqrt{n}s_{\max}\sigma.∥ over^ start_ARG italic_z end_ARG - italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT ⩽ italic_λ ( 2 + 2 italic_η + divide start_ARG 1 end_ARG start_ARG italic_s start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT end_ARG ) square-root start_ARG italic_n end_ARG italic_s start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT italic_σ .

IV Computational algorithm

IV-A An iterative solver

In this section, we introduce an iterative optimization algorithm for minimizing (3) that can be efficiently implemented and formulated as a novel sparse auto-encoder architecture. It is worth noting that this objective function corresponds to a convex optimization problem. Therefore, it inherits not only all the theoretical properties of convex optimization problems, but also the algorithms that can be used to solve it, such as the interior point method [34] or the alternating direction method of multipliers [35].

In [21], the authors studied the geometric structure of the square root lasso problem and concluded that the 2subscript2\ell_{2}roman_ℓ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT loss function is non-differentiable only in extreme cases of overfitting. In practice, this situation is rare when the data are corrupted by noise and a sufficiently large regularization parameter λ𝜆\lambdaitalic_λ is used to produce a sparse solution. Consequently, the data fitting term in the objective function behaves as a strongly convex and smooth function.

Algorithm 1 Self Normalizing ReLU (NeLU)

Input:

  y¯Wy¯𝑦𝑊𝑦\bar{y}\leftarrow Wyover¯ start_ARG italic_y end_ARG ← italic_W italic_y, where W𝑊Witalic_W is the transform and y𝑦yitalic_y its input signal.

Parameters:

  λ𝜆\lambdaitalic_λ – bias.
  β𝛽\betaitalic_β – step size.

Output:

  The estimated representation z^^𝑧\widehat{z}over^ start_ARG italic_z end_ARG.

Process:

  z^0^𝑧0\widehat{z}\leftarrow 0over^ start_ARG italic_z end_ARG ← 0
  While not converged:
    z^Sβλ{z^βz^y¯z^y¯2}^𝑧subscript𝑆𝛽𝜆^𝑧𝛽^𝑧¯𝑦subscriptnorm^𝑧¯𝑦2\widehat{z}\leftarrow S_{\beta\lambda}\left\{\widehat{z}-\beta\frac{\widehat{z% }-\bar{y}}{\|\widehat{z}-\bar{y}\|_{2}}\right\}over^ start_ARG italic_z end_ARG ← italic_S start_POSTSUBSCRIPT italic_β italic_λ end_POSTSUBSCRIPT { over^ start_ARG italic_z end_ARG - italic_β divide start_ARG over^ start_ARG italic_z end_ARG - over¯ start_ARG italic_y end_ARG end_ARG start_ARG ∥ over^ start_ARG italic_z end_ARG - over¯ start_ARG italic_y end_ARG ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG }
  return z^^𝑧\widehat{z}over^ start_ARG italic_z end_ARG

Leveraging these attractive geometric properties, we can use proximal gradient descent to iteratively minimize (3). The theoretical analysis in [21] shows that such an optimization algorithm achieves fast local linear convergence. The same theoretical justification also applies to (3) since our optimization problem can be viewed as a simpler instance of the square root lasso, presented in Section II-A. Therefore, we propose adapting proximal gradient descent to the problem studied here, as described in Algorithm 1, which we have named Self Normalizing ReLU or NeLU for short.

The algorithm works by continuously refining the solution via iterative updates, progressing in the direction opposite to the gradient of the objective function. Each iterative update incorporates a proximal operator, which introduces a penalty term to the objective function, thereby promoting sparsity. In the specific problem at hand, the proximal operator takes the form of a soft-thresholding operator, as outlined in Section I. This operator performs a shrinkage operation on the variables, setting any values below a specified absolute threshold to zero. Importantly, the soft-thresholding operator can be replaced with ReLU to enforce nonnegative representations. The iterative process continues until the desired level of convergence is achieved.

IV-B Transform learning

We propose to adapt Algorithm 1 for transform learning by unrolling the algorithm into a layered neural network architecture, following the approach presented in [36]. The idea is to unfold an iterative algorithm and construct it as a network architecture, map** each iteration to a single operation and stacking a finite number of operations on top of each other. This approach enables the incorporation of a wide range of mathematical techniques into deep learning models [37, 38, 39]. Specifically, we achieve this unfolding by limiting the number of iterations in Algorithm 1 to N𝑁Nitalic_N iterations. The resulting architecture is depicted in Figure 1.

Refer to caption
Figure 1: The NeLU architecture: A recurrent sparse encoder model, unrolled for a predetermined number of iterations.

In addition, we propose to improve the performance of the algorithm by employing Nesterov acceleration [40]. Nesterov acceleration is a variant of momentum that speeds up the convergence of gradient descent algorithms, and has demonstrated efficacy in various contexts. By incorporating it into Algorithm 1, we aim to achieve superior performance in transform learning tasks, based on the understanding that accelerated gradient descent converges faster and operates effectively with shallower networks, which are easier to train.

Algorithm 2 Accelerated NeLU

Input:

  y¯Wy¯𝑦𝑊𝑦\bar{y}\leftarrow Wyover¯ start_ARG italic_y end_ARG ← italic_W italic_y, where W𝑊Witalic_W is the transform and y𝑦yitalic_y its input signal.

Learnable weights:

  λ𝜆\lambdaitalic_λ – bias.
  β𝛽\betaitalic_β – step size.
  α𝛼\alphaitalic_α – momentum factor.

Hyperparameters:

  N𝑁Nitalic_N – number of iterations.

Output:

  The estimated representation z^^𝑧\widehat{z}over^ start_ARG italic_z end_ARG.

Process:

  z^0^𝑧0\widehat{z}\leftarrow 0over^ start_ARG italic_z end_ARG ← 0
  v0𝑣0v\leftarrow 0italic_v ← 0
  For i=1:N:𝑖1𝑁i=1:Nitalic_i = 1 : italic_N
    vαvβz^+αvy¯z^+αvy¯2𝑣𝛼𝑣𝛽^𝑧𝛼𝑣¯𝑦subscriptnorm^𝑧𝛼𝑣¯𝑦2v\leftarrow\alpha v-\beta\frac{\widehat{z}+\alpha v-\bar{y}}{\|\widehat{z}+% \alpha v-\bar{y}\|_{2}}italic_v ← italic_α italic_v - italic_β divide start_ARG over^ start_ARG italic_z end_ARG + italic_α italic_v - over¯ start_ARG italic_y end_ARG end_ARG start_ARG ∥ over^ start_ARG italic_z end_ARG + italic_α italic_v - over¯ start_ARG italic_y end_ARG ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG
    z^ReLU(z^+vβλ)^𝑧ReLU^𝑧𝑣𝛽𝜆\widehat{z}\leftarrow{\text{ReLU}}\left(\widehat{z}+v-\beta\lambda\right)over^ start_ARG italic_z end_ARG ← ReLU ( over^ start_ARG italic_z end_ARG + italic_v - italic_β italic_λ )
  return z^^𝑧\widehat{z}over^ start_ARG italic_z end_ARG

Finally, we note that all parameters of the resulting algorithm, including W𝑊Witalic_W, λ𝜆\lambdaitalic_λ, β𝛽\betaitalic_β, and α𝛼\alphaitalic_α, can be trained end-to-end. This means that the network can be trained on a dataset to learn the optimal values for these parameters, allowing it to perform well on various transform learning tasks. The final accelerated algorithm is presented in Algorithm 2.

Refer to caption
(a)
Refer to caption
(b)
Figure 2: Experimental results for analytical transform with synthetic data. (a) Mean squared error (MSE) of 2subscript2\ell_{2}roman_ℓ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT estimation error as a function of the noise level σ𝜎\sigmaitalic_σ, evaluated in both settings. In the oracle setting, the regularization parameter is tuned to achieve the smallest estimation error. In the theoretical setting, we use λ=12ee2𝜆12subscriptdelimited-∥∥𝑒subscriptdelimited-∥∥𝑒2\lambda=\frac{1}{2}\frac{\lVert e\rVert_{\infty}}{\lVert e\rVert_{2}}italic_λ = divide start_ARG 1 end_ARG start_ARG 2 end_ARG divide start_ARG ∥ italic_e ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT end_ARG start_ARG ∥ italic_e ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG in Algorithm 1. (b) λ𝜆\lambdaitalic_λ values used for each algorithm in the previous graph. Note that the optimal λ𝜆\lambdaitalic_λ values for Algorithm 1 are constant, while they are linear for the traditional algorithm. The standard errors are below 0.02 and thus barely visible.

At each iteration, the algorithm computes the gradient of the objective function with respect to the model parameters at a point in the direction of the momentum and updates the momentum in the opposite direction of the gradient. The solution is then updated based to the momentum, taking into account the proximal operator. This operator introduces regularization to prevent overfitting and improve the generalization performance of the model.

V Experiments

In this section, we present experimental results to evaluate the effectiveness of our proposed method under three different settings. First, in Section V-A, we use synthetic data to compare the performance of the soft-thresholding algorithm (2) with that of our proposed algorithm (3), which we minimize using the iterative approach detailed in Algorithm 1, given a known transformation. Next, in Section V-B, we use the same synthetic data to assess the trainable version of our method, where the transformation matrix is also learned, as summarized in Algorithm 2. Here, we compare our method with a baseline model based on a standard sparse auto-encoder. Lastly, in Section V-C, we evaluate the performance of our trainable Algorithm 2 against a baseline convolutional neural network in the task of image denoising.

V-A Synthetic data

First, we present experiments conducted on synthetic data to demonstrate the advantage of our proposed method over the traditional sparse encoder algorithm (2). We assume that the transformation W𝑊Witalic_W is known and construct a 100×100100100100\times 100100 × 100 random matrix, where each entry is sampled from the standard normal distribution, and then normalize the rows to have unit 2subscript2\ell_{2}roman_ℓ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT norm to satisfy Assumption 4. Next, we generate the input signal by creating a vector zsuperscript𝑧z^{*}italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT with fixed sparsity level, following the procedure described in [41], to obtain a signal consistent with the transform model, such that Wx0=5subscriptnorm𝑊𝑥05\|Wx\|_{0}=5∥ italic_W italic_x ∥ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 5. Finally, we contaminate the signal with i.i.d. Gaussian noise of level σ𝜎\sigmaitalic_σ to produce the measurements y=x+ξ𝑦𝑥𝜉y=x+\xiitalic_y = italic_x + italic_ξ.

We evaluate the estimation error, z^z2subscriptnorm^𝑧superscript𝑧2\|\widehat{z}-z^{*}\|_{2}∥ over^ start_ARG italic_z end_ARG - italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, which measures the distance between the estimated signal z^^𝑧\widehat{z}over^ start_ARG italic_z end_ARG and the true signal zsuperscript𝑧z^{*}italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT, as a function of the noise standard deviation σ𝜎\sigmaitalic_σ. We compare the solutions obtained by minimizing (2) and (3) in two settings. In the first setting, we perform an oracle cross-validation that sweeps over a range of parameters λ𝜆\lambdaitalic_λ to find the regularization parameter that minimizes the estimation error for each algorithm. Importantly, this setting is infeasible since it requires the ground truth data to calculate the estimation error. Nevertheless, it reveals the best performance that one can hope to achieve. In the second setting, we consider a more realistic scenario where we compare the performance of the proposed Algorithm 1 using the theoretical value of λ𝜆\lambdaitalic_λ divided by 2, following Belloni’s empirical improvement [15].

Figure 2 shows that both algorithms achieve similar performance in the oracle setting. However, the optimal value of λ𝜆\lambdaitalic_λ for the soft-thresholding algorithm (2) is linearly proportional to the noise level σ𝜎\sigmaitalic_σ, while the optimal value for the proposed algorithm (3) is pivotal to it. This conclusion is consistent with our theoretical analysis presented in Section III. Additionally, we observe that the theoretical value of λ𝜆\lambdaitalic_λ for the proposed algorithm, described in Theorem 1, is very close to the actual optimal value. This indicates that the proposed optimization problem may be a reasonable alternative to the current soft-thresholding algorithm, which forms classic encoder architectures, in practical situations where the noise level is unknown.

V-B Trainable transforms

Trainable transforms often outperform analytical transformations, such as total variation and wavelets, in most signal processing applications [6]. This is because the transformation used in signal processing is frequently unknown and must be inferred from the data. This motivates the use of neural networks to simultaneously learn the transformation and the sparse representation of the data.

The first learning task is supervised sparse coding, where the input consists of signals y𝑦yitalic_y determined by the model and their corresponding synthetic sparse vectors zsuperscript𝑧z^{*}italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT generated by a sparsifying transform, as described in Section V-A. In this case, the goal of the neural network is to learn the transformation W𝑊Witalic_W stored in its weights and accurately identify the corresponding sparse output vector given a set of input-output pairs. Mathematically, this can be expressed as an end-to-end training scheme, minimizing the cost function

minW,λ,α,βz^z22,subscript𝑊𝜆𝛼𝛽subscriptsuperscriptnorm^𝑧superscript𝑧22\min_{W,\lambda,\alpha,\beta}\|\widehat{z}-z^{*}\|^{2}_{2},roman_min start_POSTSUBSCRIPT italic_W , italic_λ , italic_α , italic_β end_POSTSUBSCRIPT ∥ over^ start_ARG italic_z end_ARG - italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ,

where z^^𝑧\widehat{z}over^ start_ARG italic_z end_ARG is the output of the network outlined in Figure 1.

Refer to caption
Figure 3: Synthetic supervised sparse coding: a comparison of mean squared error (MSE) for estimation error between a two-layer sparse encoder architecture with NeLU and a similar architecture with soft-thresholding, trained on data with a fixed noise level of 0.1. The performance is evaluated at different noise levels, averaged over 2048 realizations of the data.

To compare the effectiveness of the proposed approach, two different neural network architectures are used: one based on Algorithm 2, and a baseline sparse encoder architecture. Both networks consist of a linear layer followed by a thresholding layer. In the baseline version, the thresholding layer is the soft-thresholding non-linearity. In contrast, our proposed architecture uses the Accelerated NeLU non-linearity as presented in Algorithm 2, with the soft-thresholding operator. Both networks are trained using the AdamW optimizer [42] and minimize the mean squared error (MSE) loss with a fixed noise standard deviation of σ=0.1𝜎0.1\sigma=0.1italic_σ = 0.1. After training, we evaluate the performance of the fitted networks on different noise levels, σ𝜎\sigmaitalic_σ, than those used during training. The results, displayed in Figure 3, suggest that NeLU is significantly more robust to unseen noise levels than soft-thresholding, indicating that the NeLU non-linearity results in a predictive model that generalizes to other noise levels without additional training.

We also demonstrate the effectiveness of the proposed model in signal denoising applications. In this task, the input is a noisy signal y=x+ξ𝑦𝑥𝜉y=x+\xiitalic_y = italic_x + italic_ξ and the goal is to produce a cleaned version of the signal, x^=W+z^^𝑥superscript𝑊^𝑧\widehat{x}=W^{+}\widehat{z}over^ start_ARG italic_x end_ARG = italic_W start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT over^ start_ARG italic_z end_ARG, by removing the additive noise. To this end, a final linear layer is added to each network according to the sparse model. The first two layers perform sparse coding, i.e., they estimate the sparse representation, while the last layer projects the sparse estimation back into the input space. Sparse data is generated in the same manner as before and the MSE loss is minimized. The results, shown in Figure 4, again reveal that the NeLU leads to more stable recovery than the soft-thresholding non-linearity.

Refer to caption
Figure 4: Synthetic sparse signal denoising: a comparison of the mean squared error (MSE) for the reconstruction error, x^x^𝑥𝑥\widehat{x}-xover^ start_ARG italic_x end_ARG - italic_x. Other details are the same as in Figure 3.

V-C Natural images

In this experiment, the aforementioned networks are employed to perform natural image denoising using a patch averaging technique based on the Convolution Sparse Coding model [43]. To accomplish this, each input image is replicated stride times and translated across every dimension, producing slight shifts of the original for each replica. As a result, a single image transforms into a set of stride2superscriptstride2\textit{stride}^{2}stride start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT slightly offset variations. These shifted versions of the input are then processed collectively by the network, resulting in intermediate (shifted) denoised versions of the same input image. Finally, these intermediate denoised output images are shifted back and averaged to yield the final reconstructed output image. This process is visualized in Figure 5 for a one-dimensional (1D) signal. For more details, see [43].

The datasets and preprocessing procedures are adopted from [43]. Specifically, we use clean training images from the Waterloo Exploration dataset [44], and a validation set consisting of 432432432432 images from BSD [45]. Noisy images are generated by adding white Gaussian noise with a constant standard deviation σ=15𝜎15\sigma=15italic_σ = 15. In each iteration, we randomly crop a patch of size 1282superscript1282128^{2}128 start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT from an image and obtain a random realization of the noise.

In this setting, we replace the linear layers with convolution and deconvolution layers, respectively. Concurrently, the soft-thresholding operator is substituted by ReLU as the proximal operator. The models learn 175 filters of dimensions 11×11111111\times 1111 × 11 and a stride of 8. We utilize the AdamW optimizer with a learning rate of 21022superscript1022\cdot 10^{-2}2 ⋅ 10 start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT, which is reduced by a factor of 0.7 after every 50 epochs. Additionally, the optimizer’s ϵitalic-ϵ\epsilonitalic_ϵ parameter is set to 103superscript10310^{-3}10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT to ensure stability. The models are trained for 300 epochs, minimizing the MSE loss.

To evaluate the performance of the models, we use the BSD68 dataset, which is distinct from the validation set. The experimental results, as shown in Table I, allow us to compare the performance of each model on the test dataset at varying noise levels. We can see that the proposed Algorithm 2 layer outperforms the ReLU activation function for virtually all noise levels, and the performance gap widens as the noise level deviates further from the trained noise level.

TABLE I: PSNR comparison of sparse auto-encoder models with NeLU and ReLU activations for natural image denoising. The models are trained on clean-noisy image pairs with a fixed noise level of σ=15𝜎15\sigma=15italic_σ = 15 and evaluated on the test set.
σ𝜎\sigmaitalic_σ 15 25 35 50 75 90 105 120
Noise 24.61 20.17 17.25 14.15 10.63 9.05 7.70 6.55
ReLU 28.47 25.89 23.49 20.58 17.07 15.48 14.13 12.99
NeLU 28.65 25.93 23.48 20.69 17.62 16.37 15.36 14.57

VI Conclusion

In this work, we proposed a novel sparse auto-encoder architecture as an alternative to traditional auto-encoder architectures. We offer a novel activation function, called Self Normalizing ReLU (NeLU), which is the solution of a square root lasso problem under a transform learning formulation. Importantly, as we showed in Section III, the bias parameter of our proposed NeLU layer is pivotal (i.e., invariant) to the noise level in the input signal. This feature leads to an activation function that is significantly more robust to varying noise levels in terms of signal recovery and denoising, both on synthetic data as well as in real imaging settings. Our research showcases how theoretical understanding of neural networks can give rise to improved algorithms, derived from theoretical insights and analysis.

Several open questions present opportunities for future directions. While our paper focuses on establishing foundational theory, future efforts might apply these insights on a larger scale to develop a state-of-the-art network. A potential direction would be to extrapolate the model to a multilayer architecture, building upon the work reported in [46]. The multilayer expansion strategy proposed by [47] also offers an attractive option. Incorporating additional layers into the model, and possibly broadening the analysis to convolutional neural network (CNN) architectures, could result in improved theoretical bounds and performance.

Acknowledgments

N.G. and Y.R. were supported by the Israel Science Foundation (grant No. 729/21). Y.R. also thanks the Career Advancement Fellowship, Technion, for providing research support. J.S. is supported by NSF grant CCF 2007649.

Appendix A Lemmas

Lemma 1.

Consider a Gaussian vector ξ𝒩(0,σ2I)similar-to𝜉𝒩0superscript𝜎2𝐼\xi\sim\mathcal{N}(0,\sigma^{2}I)italic_ξ ∼ caligraphic_N ( 0 , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_I ) and a deterministic matrix W𝑊Witalic_W with normalized rows, where sminsubscript𝑠s_{\min}italic_s start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT and smaxsubscript𝑠s_{\max}italic_s start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT denote the minimum and maximum singular values of W𝑊Witalic_W, respectively. Then,

  1. 1.

    Let 𝒞1{1nWξλ2}subscript𝒞11𝑛subscriptdelimited-∥∥𝑊𝜉𝜆2\mathcal{C}_{1}\triangleq\left\{\frac{1}{\sqrt{n}}\lVert W\xi\rVert_{\infty}% \leqslant\frac{\lambda}{2}\right\}caligraphic_C start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ≜ { divide start_ARG 1 end_ARG start_ARG square-root start_ARG italic_n end_ARG end_ARG ∥ italic_W italic_ξ ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT ⩽ divide start_ARG italic_λ end_ARG start_ARG 2 end_ARG }. Take λ=aσlogdn𝜆𝑎𝜎𝑑𝑛\lambda=a\sigma\sqrt{\frac{\log d}{n}}italic_λ = italic_a italic_σ square-root start_ARG divide start_ARG roman_log italic_d end_ARG start_ARG italic_n end_ARG end_ARG and a>22𝑎22a>2\sqrt{2}italic_a > 2 square-root start_ARG 2 end_ARG, then:

    (𝒞1)12d1a2/8.subscript𝒞112superscript𝑑1superscript𝑎28\mathbb{P}(\mathcal{C}_{1})\geqslant 1-2d^{1-a^{2}/8}.blackboard_P ( caligraphic_C start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ⩾ 1 - 2 italic_d start_POSTSUPERSCRIPT 1 - italic_a start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT / 8 end_POSTSUPERSCRIPT .
  2. 2.

    Let 𝒞2{1nWξλ2σ2}subscript𝒞21𝑛subscriptdelimited-∥∥𝑊𝜉𝜆2𝜎2\mathcal{C}_{2}\triangleq\left\{\frac{1}{\sqrt{n}}\lVert W\xi\rVert_{\infty}% \leqslant\frac{\lambda}{2}\frac{\sigma}{\sqrt{2}}\right\}caligraphic_C start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ≜ { divide start_ARG 1 end_ARG start_ARG square-root start_ARG italic_n end_ARG end_ARG ∥ italic_W italic_ξ ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT ⩽ divide start_ARG italic_λ end_ARG start_ARG 2 end_ARG divide start_ARG italic_σ end_ARG start_ARG square-root start_ARG 2 end_ARG end_ARG }. Take λ=a2logdn𝜆𝑎2𝑑𝑛\lambda=a\sqrt{\frac{2\log d}{n}}italic_λ = italic_a square-root start_ARG divide start_ARG 2 roman_log italic_d end_ARG start_ARG italic_n end_ARG end_ARG and a>22𝑎22a>2\sqrt{2}italic_a > 2 square-root start_ARG 2 end_ARG, then:

    (𝒞2)12d1a2/8.subscript𝒞212superscript𝑑1superscript𝑎28\mathbb{P}(\mathcal{C}_{2})\geqslant 1-2d^{1-a^{2}/8}.blackboard_P ( caligraphic_C start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ⩾ 1 - 2 italic_d start_POSTSUPERSCRIPT 1 - italic_a start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT / 8 end_POSTSUPERSCRIPT .
  3. 3.

    Let 𝒞3{sminσ2<Wξ2n<2smaxσ}subscript𝒞3subscript𝑠𝜎2subscriptdelimited-∥∥𝑊𝜉2𝑛2subscript𝑠𝜎\mathcal{C}_{3}\triangleq\left\{s_{\min}\frac{\sigma}{\sqrt{2}}<\frac{\lVert W% \xi\rVert_{2}}{\sqrt{n}}<2s_{\max}\sigma\right\}caligraphic_C start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ≜ { italic_s start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT divide start_ARG italic_σ end_ARG start_ARG square-root start_ARG 2 end_ARG end_ARG < divide start_ARG ∥ italic_W italic_ξ ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG start_ARG square-root start_ARG italic_n end_ARG end_ARG < 2 italic_s start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT italic_σ }, then:

    (𝒞3)1(1+e2)en/24.subscript𝒞311superscript𝑒2superscript𝑒𝑛24\mathbb{P}(\mathcal{C}_{3})\geqslant 1-(1+e^{2})e^{-n/24}.blackboard_P ( caligraphic_C start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ) ⩾ 1 - ( 1 + italic_e start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) italic_e start_POSTSUPERSCRIPT - italic_n / 24 end_POSTSUPERSCRIPT .
  4. 4.

    Let

    𝒜={WξWξ2λ2smin}{sminσ2<Wξ2n<2smaxσ},𝒜subscriptdelimited-∥∥𝑊𝜉subscriptdelimited-∥∥𝑊𝜉2𝜆2subscript𝑠subscript𝑠𝜎2subscriptdelimited-∥∥𝑊𝜉2𝑛2subscript𝑠𝜎\mathcal{A}=\left\{\frac{\lVert W\xi\rVert_{\infty}}{\lVert W\xi\rVert_{2}}% \leqslant\frac{\lambda}{2s_{\min}}\right\}\cap\\ \left\{s_{\min}\frac{\sigma}{\sqrt{2}}<\frac{\lVert W\xi\rVert_{2}}{\sqrt{n}}<% 2s_{\max}\sigma\right\},start_ROW start_CELL caligraphic_A = { divide start_ARG ∥ italic_W italic_ξ ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT end_ARG start_ARG ∥ italic_W italic_ξ ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG ⩽ divide start_ARG italic_λ end_ARG start_ARG 2 italic_s start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT end_ARG } ∩ end_CELL end_ROW start_ROW start_CELL { italic_s start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT divide start_ARG italic_σ end_ARG start_ARG square-root start_ARG 2 end_ARG end_ARG < divide start_ARG ∥ italic_W italic_ξ ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG start_ARG square-root start_ARG italic_n end_ARG end_ARG < 2 italic_s start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT italic_σ } , end_CELL end_ROW

    then:

    (𝒜)1(𝒞2c)(𝒞3c).𝒜1subscriptsuperscript𝒞𝑐2subscriptsuperscript𝒞𝑐3\mathbb{P}(\mathcal{A})\geqslant 1-\mathbb{P}(\mathcal{C}^{c}_{2})-\mathbb{P}(% \mathcal{C}^{c}_{3}).blackboard_P ( caligraphic_A ) ⩾ 1 - blackboard_P ( caligraphic_C start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) - blackboard_P ( caligraphic_C start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ) .
Proof.

Item 1: Since ξ𝜉\xiitalic_ξ is isotropic, the law of d𝖳ξsuperscript𝑑𝖳𝜉d^{\mathsf{T}}\xiitalic_d start_POSTSUPERSCRIPT sansserif_T end_POSTSUPERSCRIPT italic_ξ is the same for all vectors dn𝑑superscript𝑛d\in\mathbb{R}^{n}italic_d ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT of the same norm. In particular, Wiξsubscript𝑊𝑖𝜉W_{i}\xiitalic_W start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_ξ, where Wisubscript𝑊𝑖W_{i}italic_W start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the i𝑖iitalic_ith row of W𝑊Witalic_W, and ξ1subscript𝜉1\xi_{1}italic_ξ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT have the same law.

(𝒞1c)superscriptsubscript𝒞1𝑐\displaystyle\mathbb{P}\left(\mathcal{C}_{1}^{c}\right)blackboard_P ( caligraphic_C start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT ) id(|Wiξ|nλ/2)absentsubscriptsuperscript𝑑𝑖subscript𝑊𝑖𝜉𝑛𝜆2\displaystyle\leqslant\sum^{d}_{i}\mathbb{P}\left(\left|W_{i}\xi\right|% \geqslant\sqrt{n}\lambda/2\right)⩽ ∑ start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT blackboard_P ( | italic_W start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_ξ | ⩾ square-root start_ARG italic_n end_ARG italic_λ / 2 )
d(|ξ1|nλ/2)absent𝑑subscript𝜉1𝑛𝜆2\displaystyle\leqslant d\,\mathbb{P}\left(|\xi_{1}|\geqslant\sqrt{n}\lambda/2\right)⩽ italic_d blackboard_P ( | italic_ξ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT | ⩾ square-root start_ARG italic_n end_ARG italic_λ / 2 ) (Wi2=1)subscriptdelimited-∥∥subscript𝑊𝑖21\displaystyle(\lVert W_{i}\rVert_{2}=1)( ∥ italic_W start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 1 )
2dexp(nλ28σ2)absent2𝑑𝑛superscript𝜆28superscript𝜎2\displaystyle\leqslant 2d\exp\left(-\frac{n\lambda^{2}}{8\sigma^{2}}\right)⩽ 2 italic_d roman_exp ( - divide start_ARG italic_n italic_λ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 8 italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG ) (Hoeffding’s Inequality)
2d1A28absent2superscript𝑑1superscript𝐴28\displaystyle\leqslant 2d^{1-\frac{A^{2}}{8}}⩽ 2 italic_d start_POSTSUPERSCRIPT 1 - divide start_ARG italic_A start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 8 end_ARG end_POSTSUPERSCRIPT (λ=Aσlogdn).𝜆𝐴𝜎𝑑𝑛\displaystyle(\lambda=A\sigma\sqrt{\frac{\log d}{n}}).( italic_λ = italic_A italic_σ square-root start_ARG divide start_ARG roman_log italic_d end_ARG start_ARG italic_n end_ARG end_ARG ) .

Item 2 is a direct consequence of Item 1.

Item 3: Giraud [48] controls the event

{σ2ξ2n(212)σ}.𝜎2subscriptdelimited-∥∥𝜉2𝑛212𝜎\left\{\frac{\sigma}{\sqrt{2}}\leqslant\frac{\lVert\xi\rVert_{2}}{\sqrt{n}}% \leqslant\left(2-\frac{1}{\sqrt{2}}\right)\sigma\right\}.{ divide start_ARG italic_σ end_ARG start_ARG square-root start_ARG 2 end_ARG end_ARG ⩽ divide start_ARG ∥ italic_ξ ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG start_ARG square-root start_ARG italic_n end_ARG end_ARG ⩽ ( 2 - divide start_ARG 1 end_ARG start_ARG square-root start_ARG 2 end_ARG end_ARG ) italic_σ } .

with probability (1+e2)en/241superscript𝑒2superscript𝑒𝑛24(1+e^{2})e^{-n/24}( 1 + italic_e start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) italic_e start_POSTSUPERSCRIPT - italic_n / 24 end_POSTSUPERSCRIPT. Therefore, we can control our event by

sminξ2nWξ2nsmaxξ2n.subscript𝑠subscriptdelimited-∥∥𝜉2𝑛subscriptdelimited-∥∥𝑊𝜉2𝑛subscript𝑠subscriptdelimited-∥∥𝜉2𝑛s_{\min}\frac{\lVert\xi\rVert_{2}}{\sqrt{n}}\leqslant\frac{\lVert W\xi\rVert_{% 2}}{\sqrt{n}}\leqslant s_{\max}\frac{\lVert\xi\rVert_{2}}{\sqrt{n}}.italic_s start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT divide start_ARG ∥ italic_ξ ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG start_ARG square-root start_ARG italic_n end_ARG end_ARG ⩽ divide start_ARG ∥ italic_W italic_ξ ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG start_ARG square-root start_ARG italic_n end_ARG end_ARG ⩽ italic_s start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT divide start_ARG ∥ italic_ξ ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG start_ARG square-root start_ARG italic_n end_ARG end_ARG .

Proof of Item 4 is done using items 2 and 3. Indeed we have

𝒜{sminσ2<Wξ2n<2smaxσ}{Wξnλσ22}=𝒞2𝒞3.superset-of𝒜subscript𝑠𝜎2subscriptdelimited-∥∥𝑊𝜉2𝑛2subscript𝑠𝜎subscriptdelimited-∥∥𝑊𝜉𝑛𝜆𝜎22subscript𝒞2subscript𝒞3\mathcal{A}\supset\left\{s_{\min}\frac{\sigma}{\sqrt{2}}<\frac{\lVert W\xi% \rVert_{2}}{\sqrt{n}}<2s_{\max}\sigma\right\}\cap\\ \left\{\frac{\lVert W\xi\rVert_{\infty}}{\sqrt{n}}\leqslant\frac{\lambda\sigma% }{2\sqrt{2}}\right\}=\mathcal{C}_{2}\cap\mathcal{C}_{3}.start_ROW start_CELL caligraphic_A ⊃ { italic_s start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT divide start_ARG italic_σ end_ARG start_ARG square-root start_ARG 2 end_ARG end_ARG < divide start_ARG ∥ italic_W italic_ξ ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG start_ARG square-root start_ARG italic_n end_ARG end_ARG < 2 italic_s start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT italic_σ } ∩ end_CELL end_ROW start_ROW start_CELL { divide start_ARG ∥ italic_W italic_ξ ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT end_ARG start_ARG square-root start_ARG italic_n end_ARG end_ARG ⩽ divide start_ARG italic_λ italic_σ end_ARG start_ARG 2 square-root start_ARG 2 end_ARG end_ARG } = caligraphic_C start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∩ caligraphic_C start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT . end_CELL end_ROW

Hence (𝒜)1(𝒞2c)(𝒞3c)𝒜1subscriptsuperscript𝒞𝑐2subscriptsuperscript𝒞𝑐3\mathbb{P}(\mathcal{A})\geqslant 1-\mathbb{P}(\mathcal{C}^{c}_{2})-\mathbb{P}(% \mathcal{C}^{c}_{3})blackboard_P ( caligraphic_A ) ⩾ 1 - blackboard_P ( caligraphic_C start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) - blackboard_P ( caligraphic_C start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ). ∎

Remark 2.

The bounds and probabilities presented in this analysis can be further refined through a more rigorous examination. Nevertheless, these bounds are sufficient for our primary objective, which is to demonstrate the pivotalness of λ𝜆\lambdaitalic_λ in the Gaussian scenario.

Appendix B Extension to synthesis model

Note that if we define D=W𝖳𝐷superscript𝑊𝖳D=W^{\mathsf{T}}italic_D = italic_W start_POSTSUPERSCRIPT sansserif_T end_POSTSUPERSCRIPT,

z¯=D𝖳Dz.¯𝑧superscript𝐷𝖳𝐷superscript𝑧\bar{z}=D^{\mathsf{T}}Dz^{*}.over¯ start_ARG italic_z end_ARG = italic_D start_POSTSUPERSCRIPT sansserif_T end_POSTSUPERSCRIPT italic_D italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT .

Then the problem (3) can be rewritten as

minzzy¯2+λz1,\min_{z}\lVert z-\bar{y}\rVert_{2}+\lambda\|z\|_{1},roman_min start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT ∥ italic_z - over¯ start_ARG italic_y end_ARG ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_λ ∥ italic_z ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ,

where y¯=D𝖳y=z¯+e¯𝑦superscript𝐷𝖳𝑦¯𝑧𝑒\bar{y}=D^{\mathsf{T}}y=\bar{z}+eover¯ start_ARG italic_y end_ARG = italic_D start_POSTSUPERSCRIPT sansserif_T end_POSTSUPERSCRIPT italic_y = over¯ start_ARG italic_z end_ARG + italic_e. This is identical to the square root lasso when the dictionary is the identity matrix. Therefore, it can be solved for z¯¯𝑧\bar{z}over¯ start_ARG italic_z end_ARG using all the tools available for the square root lasso, as previously studied in [15, 32]. We extend the previous results of the paper to obtain bounds for the synthesis model.

Theorem 4.

Following the assumptions defined in Theorem 1, we have

z^zλ(2+η)ϵ+ρμ(D)z,subscriptnorm^𝑧superscript𝑧𝜆2𝜂italic-ϵ𝜌𝜇𝐷subscriptnormsuperscript𝑧\|{\widehat{z}-z^{*}}\|_{\infty}\leqslant\lambda(2+\eta)\epsilon+\rho\,\mu(D)% \|z^{*}\|_{\infty},∥ over^ start_ARG italic_z end_ARG - italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT ⩽ italic_λ ( 2 + italic_η ) italic_ϵ + italic_ρ italic_μ ( italic_D ) ∥ italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT ,

where ρ=z0𝜌subscriptnormsuperscript𝑧0\rho=\|z^{*}\|_{0}italic_ρ = ∥ italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT and μ(D)=maxij|Di𝖳Dj|𝜇𝐷subscript𝑖𝑗superscriptsubscript𝐷𝑖𝖳subscript𝐷𝑗\mu(D)=\max_{i\neq j}|D_{i}^{\mathsf{T}}D_{j}|italic_μ ( italic_D ) = roman_max start_POSTSUBSCRIPT italic_i ≠ italic_j end_POSTSUBSCRIPT | italic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT sansserif_T end_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT |. Moreover, if

minj[d]|zj|>2λ(2+η)ϵ+2ρμ(D)z,subscript𝑗delimited-[]𝑑superscriptsubscript𝑧𝑗2𝜆2𝜂italic-ϵ2𝜌𝜇𝐷subscriptnormsuperscript𝑧\min_{j\in[d]}|z_{j}^{*}|>2\lambda(2+\eta)\epsilon+2\rho\,\mu(D)\|z^{*}\|_{% \infty},roman_min start_POSTSUBSCRIPT italic_j ∈ [ italic_d ] end_POSTSUBSCRIPT | italic_z start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT | > 2 italic_λ ( 2 + italic_η ) italic_ϵ + 2 italic_ρ italic_μ ( italic_D ) ∥ italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT , (11)

then the estimated support

𝒮^={j[d]:|z^j|>λ(2+η)ϵ+ρμ(D)z}^𝒮conditional-set𝑗delimited-[]𝑑subscript^𝑧𝑗𝜆2𝜂italic-ϵ𝜌𝜇𝐷subscriptnormsuperscript𝑧\widehat{\mathcal{S}}=\{j\in[d]:|\widehat{z}_{j}|>\lambda(2+\eta)\epsilon+\rho% \,\mu(D)\|z^{*}\|_{\infty}\}over^ start_ARG caligraphic_S end_ARG = { italic_j ∈ [ italic_d ] : | over^ start_ARG italic_z end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT | > italic_λ ( 2 + italic_η ) italic_ϵ + italic_ρ italic_μ ( italic_D ) ∥ italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT }

recovers the true sparsity pattern correctly, i.e., 𝒮^=𝒮^𝒮𝒮\widehat{\mathcal{S}}=\mathcal{S}over^ start_ARG caligraphic_S end_ARG = caligraphic_S.

Proof of Theorem 4.

Using Theorem 1 and the sparsity of zsuperscript𝑧z^{*}italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT, we have

z^zsubscriptdelimited-∥∥^𝑧superscript𝑧\displaystyle\lVert\widehat{z}-z^{*}\rVert_{\infty}∥ over^ start_ARG italic_z end_ARG - italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT z^z¯+z¯zabsentsubscriptdelimited-∥∥^𝑧¯𝑧subscriptdelimited-∥∥¯𝑧superscript𝑧\displaystyle\leqslant\lVert\widehat{z}-\bar{z}\rVert_{\infty}+\lVert\bar{z}-z% ^{*}\rVert_{\infty}⩽ ∥ over^ start_ARG italic_z end_ARG - over¯ start_ARG italic_z end_ARG ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT + ∥ over¯ start_ARG italic_z end_ARG - italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT
λ(2+η)ϵ+(ID𝖳D)zabsent𝜆2𝜂italic-ϵsubscriptdelimited-∥∥𝐼superscript𝐷𝖳𝐷superscript𝑧\displaystyle\leqslant\lambda(2+\eta)\epsilon+\lVert(I-D^{\mathsf{T}}D)z^{*}% \rVert_{\infty}⩽ italic_λ ( 2 + italic_η ) italic_ϵ + ∥ ( italic_I - italic_D start_POSTSUPERSCRIPT sansserif_T end_POSTSUPERSCRIPT italic_D ) italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT
=λ(2+η)ϵ+maxi|(ID𝖳D)𝒮z𝒮|iabsent𝜆2𝜂italic-ϵsubscript𝑖subscriptsubscript𝐼superscript𝐷𝖳𝐷𝒮subscriptsuperscript𝑧𝒮𝑖\displaystyle=\lambda(2+\eta)\epsilon+\max_{i}|(I-D^{\mathsf{T}}D)_{\mathcal{S% }}z^{*}_{\mathcal{S}}|_{i}= italic_λ ( 2 + italic_η ) italic_ϵ + roman_max start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | ( italic_I - italic_D start_POSTSUPERSCRIPT sansserif_T end_POSTSUPERSCRIPT italic_D ) start_POSTSUBSCRIPT caligraphic_S end_POSTSUBSCRIPT italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT caligraphic_S end_POSTSUBSCRIPT | start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT
λ(2+η)ϵ+maxi(ID𝖳D)𝒮i1zabsent𝜆2𝜂italic-ϵsubscript𝑖subscriptnormsubscript𝐼superscript𝐷𝖳𝐷𝒮𝑖1subscriptnormsuperscript𝑧\displaystyle\leqslant\lambda(2+\eta)\epsilon+\max_{i}\|(I-D^{\mathsf{T}}D)_{% \mathcal{S}i}\|_{1}\|z^{*}\|_{\infty}⩽ italic_λ ( 2 + italic_η ) italic_ϵ + roman_max start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∥ ( italic_I - italic_D start_POSTSUPERSCRIPT sansserif_T end_POSTSUPERSCRIPT italic_D ) start_POSTSUBSCRIPT caligraphic_S italic_i end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∥ italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT
λ(2+η)ϵ+ρμ(D)z.absent𝜆2𝜂italic-ϵ𝜌𝜇𝐷subscriptnormsuperscript𝑧\displaystyle\leqslant\lambda(2+\eta)\epsilon+\rho\,\mu(D)\|z^{*}\|_{\infty}.⩽ italic_λ ( 2 + italic_η ) italic_ϵ + italic_ρ italic_μ ( italic_D ) ∥ italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT .

This proves the bound on z^zsubscriptdelimited-∥∥^𝑧superscript𝑧\lVert\widehat{z}-z^{*}\rVert_{\infty}∥ over^ start_ARG italic_z end_ARG - italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT. Then, the support recovery property easily follows as in Theorem 1. ∎

Corollary 1.

From the condition (11), we can derive a bound on the sparsity of the signal for correct support recovery:

|zmin|>2λ(2+η)ϵ+2ρμ(D)|zmax|,subscriptsuperscript𝑧2𝜆2𝜂italic-ϵ2𝜌𝜇𝐷subscriptsuperscript𝑧\displaystyle|z^{*}_{\min}|>2\lambda(2+\eta)\epsilon+2\rho\,\mu(D)|z^{*}_{\max% }|,| italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT | > 2 italic_λ ( 2 + italic_η ) italic_ϵ + 2 italic_ρ italic_μ ( italic_D ) | italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT | ,
ρ<|zmin|2μ(D)|zmax|λ(2+η)ϵμ(D)|zmax|,𝜌subscriptsuperscript𝑧2𝜇𝐷subscriptsuperscript𝑧𝜆2𝜂italic-ϵ𝜇𝐷subscriptsuperscript𝑧\displaystyle\rho<\frac{|z^{*}_{\min}|}{2\mu(D)|z^{*}_{\max}|}-\frac{\lambda(2% +\eta)\epsilon}{\mu(D)|z^{*}_{\max}|},italic_ρ < divide start_ARG | italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT | end_ARG start_ARG 2 italic_μ ( italic_D ) | italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT | end_ARG - divide start_ARG italic_λ ( 2 + italic_η ) italic_ϵ end_ARG start_ARG italic_μ ( italic_D ) | italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT | end_ARG ,

where |zmin|subscriptsuperscript𝑧|z^{*}_{\min}|| italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT | and |zmax|subscriptsuperscript𝑧|z^{*}_{\max}|| italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT | are the minimum and maximum absolute values of the entries of zsuperscript𝑧z^{*}italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT, respectively. This bound closely resembles the optimality condition of the thresholding lasso algorithm [46].

Theorem 5.

Following the assumptions defined in Theorem 2, then, we get

z^z2(2+η)ϵ+μ(D)dz1.subscriptnorm^𝑧superscript𝑧22𝜂italic-ϵ𝜇𝐷𝑑subscriptnormsuperscript𝑧1\|{\widehat{z}-z^{*}}\|_{2}\leqslant(2+\eta)\epsilon+\mu(D)\sqrt{d}\|z^{*}\|_{% 1}.∥ over^ start_ARG italic_z end_ARG - italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ⩽ ( 2 + italic_η ) italic_ϵ + italic_μ ( italic_D ) square-root start_ARG italic_d end_ARG ∥ italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT .
Proof of Theorem 5.

Using Theorem 2 and the sparsity of zsuperscript𝑧z^{*}italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT, we have:

z^z2subscriptdelimited-∥∥^𝑧superscript𝑧2\displaystyle\lVert\widehat{z}-z^{*}\rVert_{2}∥ over^ start_ARG italic_z end_ARG - italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT z^z¯2+z¯z2absentsubscriptdelimited-∥∥^𝑧¯𝑧2subscriptdelimited-∥∥¯𝑧superscript𝑧2\displaystyle\leqslant\lVert\widehat{z}-\bar{z}\rVert_{2}+\lVert\bar{z}-z^{*}% \rVert_{2}⩽ ∥ over^ start_ARG italic_z end_ARG - over¯ start_ARG italic_z end_ARG ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + ∥ over¯ start_ARG italic_z end_ARG - italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT
(2+η)ϵ+(ID𝖳D)z2absent2𝜂italic-ϵsubscriptdelimited-∥∥𝐼superscript𝐷𝖳𝐷superscript𝑧2\displaystyle\leqslant(2+\eta)\epsilon+\lVert(I-D^{\mathsf{T}}D)z^{*}\rVert_{2}⩽ ( 2 + italic_η ) italic_ϵ + ∥ ( italic_I - italic_D start_POSTSUPERSCRIPT sansserif_T end_POSTSUPERSCRIPT italic_D ) italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT
(2+η)ϵ+i=1dμ(D)2z12absent2𝜂italic-ϵsubscriptsuperscript𝑑𝑖1𝜇superscript𝐷2superscriptsubscriptnormsuperscript𝑧12\displaystyle\leqslant(2+\eta)\epsilon+\sqrt{\sum^{d}_{i=1}{\mu(D)^{2}\|z^{*}% \|_{1}^{2}}}⩽ ( 2 + italic_η ) italic_ϵ + square-root start_ARG ∑ start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT italic_μ ( italic_D ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG
(2+η)ϵ+μ(D)dz1.absent2𝜂italic-ϵ𝜇𝐷𝑑subscriptnormsuperscript𝑧1\displaystyle\leqslant(2+\eta)\epsilon+\mu(D)\sqrt{d}\|z^{*}\|_{1}.⩽ ( 2 + italic_η ) italic_ϵ + italic_μ ( italic_D ) square-root start_ARG italic_d end_ARG ∥ italic_z start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT .

Appendix C Illustration of experiment in section V-C

In the real-data experiment described in Section V-C, we follow the approach of Simon and Elad for deploying the Convolutional Sparse Coding (CSC) model [1]. The process of this experiment is illustrated in Figure 5.

Refer to caption
Figure 5: 1D example of the process utilized in the experiment in Section V-C when stride=2stride2\textit{stride}=2stride = 2. Each signal is replicated stride times and subsequently translated, yielding slight shifts of the original for each replica. This process effectively transforms a single image into a collection of stride variations, each exhibiting a slight spatial offset. The final output is an average of all denoised shifts.

Appendix D Details of the analytical transform experiment - subscript\ell_{\infty}roman_ℓ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT error

In this section, we delve deeper into the performance of the algorithms from the main experiment, focusing on the subscript\ell_{\infty}roman_ℓ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT estimation error. This analysis complements our earlier discussion centered around the 2subscript2\ell_{2}roman_ℓ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT error, as shown in Figure 2.

Refer to caption
Figure 6: The subscript\ell_{\infty}roman_ℓ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT estimation error of each algorithm in the experiment presented in Figure 2. The theoretical NeLU achieves a comparatively lower subscript\ell_{\infty}roman_ℓ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT error because the oracle algorithms were optimized using the 2subscript2\ell_{2}roman_ℓ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT error. The standard errors are less than 0.02 and are thus barely noticeable.

As observed in Figure 6, the theoretical NeLU outperforms other methods, yielding a lower subscript\ell_{\infty}roman_ℓ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT error. This performance can be attributed to the fact that the oracle algorithms, in the primary experiment, were fine-tuned using the 2subscript2\ell_{2}roman_ℓ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT error criterion. A parallel behaviour is observed for the 2subscript2\ell_{2}roman_ℓ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT error when optimizing with respect to the subscript\ell_{\infty}roman_ℓ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT error.

Appendix E Qualitative analysis

This appendix provides a visual representation of the outcomes from the experiments detailed in Section V-C. The emphasis here is on a qualitative assessment of images processed using the networks specified in our study. Such an analysis aids comprehending the practical efficacy of the applied denoising methods.

Figure 7 showcases a comprehensive comparison, juxtaposing the original images with their noisy counterparts and the subsequent denoised versions. This layout facilitates a direct visual evaluation of the noise reduction capabilities of the networks. For each image displayed, a specific patch has been chosen for detailed analysis. Specifically, for each image, the selection includes the original image marked with the patch location, the unaltered patch, the corresponding noisy patch (the original with added Gaussian noise), and the denoised version processed by each network. Additionally, a heatmap of the residual error is presented, enabling a more precise and detailed comparison.

Refer to caption
Figure 7: Qualitative analysis of denoising networks from Section V-C. For each image, a selected patch is displayed in three states: the original, the version with Gaussian noise (σ=25𝜎25\sigma=25italic_σ = 25 for the first image, σ=50𝜎50\sigma=50italic_σ = 50 for the second and third), and the denoised output from each evaluated network. Additionally, heatmaps depict the residual errors, with warmer colors indicating larger absolute errors in those pixels. The Peak Signal-to-Noise Ratio (PSNR) values are also reported below each network’s output to quantitatively assess the denoising performance.

The residual error visualized in the figure highlights the enhancements our algorithm achieved, particularly in handling previously unseen noise levels. This visual representation serves not only as a validation of the algorithm’s effectiveness but also offers insights into its potential limitations and areas for future improvement.

Appendix F Empirical validation of Theorem 2

In this appendix, we extend our investigation to empirically validate the theoretical estimation error bound introduced within our theoretical framework. To accomplish this, we replicate the experimental setup delineated in Section V-A. Herein, we evaluate the performance of our proposed algorithm (3) in juxtaposition with the theoretical threshold. Employing the iterative process specified in Algorithm 1, and considering a given transformation, we subject the algorithm to rigorous evaluation across a range of noise levels. For each noise level, the experiment encompasses 20 independent trials, incorporating diverse instances of signal and noise, with λ𝜆\lambdaitalic_λ set to ee2subscriptdelimited-∥∥𝑒subscriptdelimited-∥∥𝑒2\frac{\lVert e\rVert_{\infty}}{\lVert e\rVert_{2}}divide start_ARG ∥ italic_e ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT end_ARG start_ARG ∥ italic_e ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG.

Refer to caption
Figure 8: Empirical validation of the theoretical estimation error bound. The figure showcases the MSE of the 2subscript2\ell_{2}roman_ℓ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT estimation error as a function of the noise level σ𝜎\sigmaitalic_σ. For each noise level, 20 independent trials were conducted with distinct combinations of signal and noise, with λ=ee2𝜆subscriptdelimited-∥∥𝑒subscriptdelimited-∥∥𝑒2\lambda=\frac{\lVert e\rVert_{\infty}}{\lVert e\rVert_{2}}italic_λ = divide start_ARG ∥ italic_e ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT end_ARG start_ARG ∥ italic_e ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG.

The outcomes of these experiments are illustrated in Figure 8. These experiments corroborate the assertion that the reconstruction error adheres to the bounds established in Theorem 2, thereby reinforcing the theoretical underpinnings of our methodology. Despite being conservative, the framework provides a reliable method for ensuring model robustness across different noise levels.

Further investigation into the exact optimal value of λ𝜆\lambdaitalic_λ, considering factors such as the sparsity level of the signals, presents a promising avenue for future research. It is our hope that these insights will contribute to a deeper understanding of the interplay between theoretical analysis and empirical application in the domain of sparse autoencoders.

References

  • [1] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning.   MIT press, 2016.
  • [2] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, P.-A. Manzagol, and L. Bottou, “Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion,” Journal of Machine Learning Research, vol. 11, no. 12, 2010.
  • [3] Y. Romano and M. Elad, “Boosting of Image Denoising Algorithms,” SIAM Journal on Imaging Sciences, vol. 8, no. 2, pp. 1187–1219, 2015.
  • [4] J. Yang, J. Wright, T. S. Huang, and Y. Ma, “Image Super-Resolution Via Sparse Representation,” IEEE Transactions on Image Processing, vol. 19, no. 11, pp. 2861–2873, 2010.
  • [5] V. Papyan, Y. Romano, J. Sulam, and M. Elad, “Theoretical Foundations of Deep Learning via Sparse Representations: A Multilayer Sparse Model and Its Connection to Convolutional Neural Networks,” IEEE Signal Processing Magazine, vol. 35, no. 4, pp. 72–89, 2018.
  • [6] S. Ravishankar and Y. Bresler, “Learning Sparsifying Transforms,” IEEE Transactions on Signal Processing, vol. 61, no. 5, pp. 1072–1086, 2012.
  • [7] ——, “Data-driven adaptation of a union of sparsifying transforms for blind compressed sensing MRI reconstruction,” in Wavelets and Sparsity, vol. 9597.   SPIE, 2015, pp. 247–256.
  • [8] J. Maggu, E. Chouzenoux, G. Chierchia, and A. Majumdar, “Convolutional Transform Learning,” in Neural Information Processing.   Springer, 2018, pp. 162–174.
  • [9] B. Wen, S. Ravishankar, and Y. Bresler, “VIDOSAT: High-Dimensional Sparsifying Transform Learning for Online Video Denoising,” IEEE Transactions on Image processing, vol. 28, no. 4, pp. 1691–1704, 2018.
  • [10] ——, “FRIST—Flip** and rotation invariant sparsifying transform learning and applications,” Inverse Problems, vol. 33, no. 7, p. 074007, 2017.
  • [11] A. Gigie, A. A. Kumar, A. Majumdar, K. Kumar, and M. G. Chandra, “Joint Coupled Transform Learning Framework for Multimodal Image Super-Resolution,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2021, pp. 1640–1644.
  • [12] J. Maggu, A. Majumdar, E. Chouzenoux, and G. Chierchia, “Deep Convolutional Transform Learning,” in Neural Information Processing.   Springer, 2020, pp. 300–307.
  • [13] S. Mohan, Z. Kadkhodaie, E. P. Simoncelli, and C. Fernandez-Granda, “Robust And Interpretable Blind Image Denoising Via Bias-Free Convolutional Neural Networks,” in International Conference on Learning Representations, 2019.
  • [14] B. Lecouat, J. Ponce, and J. Mairal, “Fully Trainable and Interpretable Non-Local Sparse Models for Image Restoration,” in European Conference on Computer Vision.   Springer, 2020, pp. 238–254.
  • [15] A. Belloni, V. Chernozhukov, and L. Wang, “Square-Root Lasso: Pivotal Recovery of Sparse Signals via Conic Programming,” Biometrika, vol. 98, no. 4, pp. 791–806, 2011.
  • [16] A. Nakamura and M. Kobayashi, “Noise-Level Estimation from Single Color Image Using Correlations Between Textures in RGB Channels,” arXiv preprint arXiv:1904.02566, 2019.
  • [17] H. Lee, A. Battle, R. Raina, and A. Ng, “Efficient Sparse Coding Algorithms,” Advances in Neural Information Processing Systems, vol. 19, 2006.
  • [18] S. G. Mallat and Z. Zhang, “Matching pursuits with time-frequency dictionaries,” IEEE Transactions on Signal Processing, vol. 41, no. 12, pp. 3397–3415, 1993.
  • [19] S. S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic Decomposition by Basis Pursuit,” SIAM Journal on Scientific Computing, vol. 43, no. 1, pp. 129–159, 2001.
  • [20] R. Tibshirani, “Regression Shrinkage and Selection via the Lasso,” Journal of the Royal Statistical Society Series B: Statistical Methodology, vol. 58, no. 1, pp. 267–288, 1996.
  • [21] X. Li, H. Jiang, J. Haupt, R. Arora, H. Liu, M. Hong, and T. Zhao, “On Fast Convergence of Proximal Algorithms for SQRT-Lasso Optimization: Don’t Worry About its Nonsmooth Loss Function,” in Uncertainty in Artificial Intelligence.   PMLR, 2020, pp. 49–59.
  • [22] S. Ravishankar, B. Wen, and Y. Bresler, “Online Sparsifying Transform Learning—Part I: Algorithms,” IEEE Journal of Selected Topics in Signal Processing, vol. 9, no. 4, pp. 625–636, 2015.
  • [23] A. Aberdam, J. Sulam, and M. Elad, “Multi-Layer Sparse Coding: The Holistic Way,” SIAM Journal on Mathematics of Data Science, vol. 1, no. 1, pp. 46–77, 2019.
  • [24] K. Kreutz-Delgado, J. F. Murray, B. D. Rao, K. Engan, T.-W. Lee, and T. J. Sejnowski, “Dictionary Learning Algorithms for Sparse Representation,” Neural Computation, vol. 15, no. 2, pp. 349–396, 2003.
  • [25] K. Zhang, W. Zuo, and L. Zhang, “FFDNet: Toward a Fast and Flexible Solution for CNN-Based Image Denoising,” IEEE Transactions on Image Processing, vol. 27, no. 9, pp. 4608–4622, 2018.
  • [26] M. Zhussip, S. Soltanayev, and S. Y. Chun, “Training Deep Learning Based Image Denoisers From Undersampled Measurements Without Ground Truth and Without Image Prior,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 10 255–10 264.
  • [27] M. Chen, Y. Chang, S. Cao, and L. Yan, “Learning Blind Denoising Network for Noisy Image Deblurring,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2020, pp. 2533–2537.
  • [28] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a Gaussian Denoiser: Residual Learning of Deep CNN for Image Denoising,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.
  • [29] A. Gnanasambandam and S. Chan, “One Size Fits All: Can We Train One Denoiser for All Noise Levels?” in International Conference on Machine Learning.   PMLR, 2020, pp. 3576–3586.
  • [30] R. Tempo, “Robust estimation and filtering in the presence of bounded noise,” IEEE Transactions on Automatic Control, vol. 33, no. 9, pp. 864–867, 1988.
  • [31] Y. Romano, A. Aberdam, J. Sulam, and M. Elad, “Adversarial Noise Attacks of Deep Learning Architectures: Stability Analysis via Sparse-Modeled Signals,” Journal of Mathematical Imaging and Vision, vol. 62, pp. 313–327, 2020.
  • [32] M. Massias, Q. Bertrand, A. Gramfort, and J. Salmon, “Support recovery and sup-norm convergence rates for sparse pivotal estimation,” in International Conference on Artificial Intelligence and Statistics.   PMLR, 2020, pp. 2655–2665.
  • [33] K. Lounici, M. Pontil, A. Tsybakov, and S. Van De Geer, “Taking Advantage of Sparsity in Multi-Task Learning,” in The 22nd Conference on Learning Theory, 2009.
  • [34] F. A. Potra and S. J. Wright, “Interior-point methods,” Journal of Computational and Applied Mathematics, vol. 124, no. 1-2, pp. 281–302, 2000.
  • [35] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers,” Foundations and Trends in Machine learning, vol. 3, no. 1, pp. 1–122, 2011.
  • [36] K. Gregor and Y. LeCun, “Learning Fast Approximations of Sparse Coding,” in International Conference on Machine Learning, 2010, pp. 399–406.
  • [37] V. Monga, Y. Li, and Y. C. Eldar, “Algorithm Unrolling: Interpretable, Efficient Deep Learning for Signal and Image Processing,” IEEE Signal Processing Magazine, vol. 38, no. 2, pp. 18–44, 2021.
  • [38] K. Zhang, L. V. Gool, and R. Timofte, “Deep Unfolding Network for Image Super-Resolution,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3217–3226.
  • [39] J. Liu, Y. Sun, W. Gan, X. Xu, B. Wohlberg, and U. S. Kamilov, “SGD-Net: Efficient Model-Based Deep Learning With Theoretical Guarantees,” IEEE Transactions on Computational Imaging, vol. 7, pp. 598–610, 2021.
  • [40] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momentum in deep learning,” in International Conference on Machine Learning.   PMLR, 2013, pp. 1139–1147.
  • [41] J. Sulam, A. Aberdam, A. Beck, and M. Elad, “On Multi-Layer Basis Pursuit, Efficient Algorithms and Convolutional Neural Networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 8, pp. 1968–1980, 2019.
  • [42] I. Loshchilov and F. Hutter, “Decoupled Weight Decay Regularization,” in International Conference on Learning Representations, 2018.
  • [43] D. Simon and M. Elad, “Rethinking the CSC Model for Natural Images,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [44] K. Ma, Z. Duanmu, Q. Wu, Z. Wang, H. Yong, H. Li, and L. Zhang, “Waterloo Exploration Database: New challenges for image quality assessment models,” IEEE Transactions on Image Processing, vol. 26, no. 2, pp. 1004–1016, 2016.
  • [45] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik, “Contour Detection and Hierarchical Image Segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 5, pp. 898–916, 2010.
  • [46] V. Papyan, Y. Romano, and M. Elad, “Convolutional Neural Networks Analyzed via Convolutional Sparse Coding,” Journal of Machine Learning Research, vol. 18, no. 1, pp. 2887–2938, 2017.
  • [47] J. Sulam, V. Papyan, Y. Romano, and M. Elad, “Multilayer Convolutional Sparse Modeling: Pursuit and Dictionary Learning,” IEEE Transactions on Signal Processing, vol. 66, no. 15, pp. 4090–4104, 2018.
  • [48] C. Giraud, Introduction to high-dimensional statistics.   CRC Press, 2021.