School of Computer, University of Birmingham
Email: [email protected], [email protected]
https://github.com/krulllab/COSDD

Unsupervised Denoising for Signal-Dependent and Row-Correlated Imaging Noise

Benjamin Salmon (ORCID: 0000-0002-5919-0158), Alexander Krull (ORCID: 0000-0002-7778-7169)

Abstract

Accurate analysis of microscopy images is hindered by the presence of noise. This noise is usually signal-dependent and often additionally correlated along rows or columns of pixels. Current self- and unsupervised denoisers can address signal-dependent noise, but none can reliably remove noise that is also row- or column-correlated. Here, we present the first fully unsupervised deep learning-based denoiser capable of handling imaging noise that is row-correlated as well as signal-dependent. Our approach uses a Variational Autoencoder (VAE) with a specially designed autoregressive decoder. This decoder is capable of modeling row-correlated and signal-dependent noise but is incapable of independently modeling underlying clean signal. The VAE therefore produces latent variables containing only clean signal information, and these are mapped back into image space using a proposed second decoder network. Our method does not require a pre-trained noise model and can be trained from scratch using unpaired noisy data. We show that our approach achieves competitive results when applied to a range of different sensor types and imaging modalities.

Figure 1: Noise is often assumed to be spatially uncorrelated and/or signal-independent, such as additive white Gaussian (AWG) noise. Such assumptions greatly facilitate self/unsupervised denoising. Unfortunately, real noise in microscopy is usually neither. We present the first unsupervised denoising approach for this challenging scenario.

1 Introduction

Imaging is often affected by the presence of noise, posing a challenge in the processing of recorded data. This is especially true in scientific imaging where technology is routinely pushed to the boundary of what is possible. Consequently, data is often pre-processed in an attempt to remove imaging noise before downstream analysis. Over the years, a variety of denoising approaches have been devised [24], including traditional filter-based methods [5] and later, supervised deep learning-based methods [44], which use training data to learn a mapping from noisy to clean images.

Supervised deep learning methods excel with respect to the quality of their output. However, these methods require paired training data, typically consisting of pairs of noisy and clean images. Unlike in applications with natural images (photographs), such paired data is often unobtainable in scientific imaging problems such as microscopy [17, 41, 45]. As a consequence, the applicability of supervised denoising methods in many areas of scientific imaging is limited.

Self- and unsupervised methods (e.g. [1, 21, 36, 35, 8]) have been proposed as a solution to the lack of high quality training data. They can be trained directly on the data that is to be denoised and have substantially improved the practical applicability of denoising in scientific imaging. These methods separate imaging noise from the underlying signal by making assumptions about the statistical nature of the noise. Typically, they assume the noise is $(i)$ signal-independent (purely additive and occurring independently of the underlying signal) [39], or $(ii)$ spatially uncorrelated (unstructured and occurring independently in each pixel) [21, 22, 36, 16, 38]. We show examples of $(i)$ and $(ii)$ in Fig. 1.

In practice, $(i)$ is often broken by the presence of Poisson shot noise [24], where the variance increases as the intensity of the underlying signal increases. Moreover, many popular scientific cameras and imaging setups break $(ii)$ by producing row- or column-correlated noise. For example, the scientific Complementary Metal-Oxide-Semiconductor (sCMOS) [31] cameras that are popular in optical microscopy [29] have separate amplifiers for each column of pixels, leading to correlated noise within each column [47]. The detectors used in infrared imaging systems, e.g. microbolometers, also commonly use separate column amplifiers and suffer from similar noise structures [7]. Depending on their settings, Electron Multiplying Charge-Coupled Device (EMCCD) [6] cameras can produce horizontally correlated read-out noise [2]. Similarly, scanning-based imaging methods such as scanning transmission electron microscopy (STEM) [33] are prone to line artifacts caused by the slow reaction time of readout electronics [28]. Examples of noisy images from these modalities along with plots of their spatial autocorrelation can be found in Fig. 3. These types of noise cannot be removed with basic self- or unsupervised methods. Recently, variants of these methods for spatially correlated noise have been proposed, but these are limited to locally correlated noise and can come at the expense of reduced reconstruction quality [2, 35].

In this paper, we present the first unsupervised deep learning-based denoiser capable of reliably removing signal-dependent noise that is correlated along rows or columns of pixels, as it commonly occurs in microscopy data. The approach is illustrated in Fig. 2. Our method requires neither examples of noise-free images, which can be impossible to obtain (e.g. [42]), nor pre-trained noise models [36, 39, 22] or hand-crafted priors [42, 5, 10]. Furthermore, we do not rely on blind-spot approaches [2] or subsampling [16, 26] techniques that degrade image quality and therefore limit denoising performance.

The backbone of our approach is the Variational Autoencoder (VAE) [19], which trains a latent variable model of the noisy image data $\mathbf{x}$ as $p_{\theta}(\mathbf{x})\propto p_{\mathbf{\theta}}(\mathbf{x}|\mathbf{z})p_{\mathbf{\theta}}(\mathbf{z})$, where $\theta$ refers to the model parameters. We guide our VAE to represent clean, noise-free images with latent variables $\mathbf{z}$ and to model the noise generation process with the decoder $p_{\mathbf{\theta}}(\mathbf{x}|\mathbf{z})$.

Our method for this is based on the representation learning technique proposed by Chen et al. [4], who revealed how a VAE prefers to avoid using latent variables $\mathbf{z}$ to describe structures that could instead be modeled locally by its decoder $p_{\mathbf{\theta}}(\mathbf{x}|\mathbf{z})$ . For an autoregressive (AR) decoder such as PixelCNN [32, 11], this includes any inter-pixel dependencies that lie within its receptive field. Consequently, latent variables will only represent information that the decoder cannot model, i.e., dependencies that span beyond its receptive field.

We take advantage of this behaviour by designing a decoder that is capable of modeling exactly what we want to be excluded from our latent variables – the imaging noise – while being incapable of modeling the remainder of the data. Specifically, we use an AR decoder that can only model axis-aligned structures because its receptive field spans only a row or column of the image. This trains our model to exclude row- or column-correlated noise from the latent variables, while encouraging it to include the statistics of the underlying signal.

We then propose a second network, termed signal decoder, that is trained to map these latent variables back into image space, thereby producing denoised images. Following Lehtinen et al. [27], who showed that noisy images can be used as training targets for denoisers, we train the signal decoder alongside the VAE by using the original noisy data as training target.

To summarize, our main contributions are:

• We present an unsupervised deep learning-based denoiser for signal-dependent noise that is correlated along rows or columns of pixels.

• By equipping a VAE with a 1-dimensional AR decoder, we guide it to represent clean images with latent variables and model noise content with the decoder.

• We introduce a novel architecture using a signal decoder that is trained in a second step to predict the signal from our latent variables.

• We apply our method to microscopy data recorded with various imaging modalities, as well as other datasets, achieving state-of-the-art denoising results compared to other unsupervised methods.

2 Background

2.1 Image formation and imaging noise

We can express image formation as a two step process. The clean image, or as we also refer to it, the signal, $\mathbf{s}=(s_{1,1},\dots,s_{N,M})$ , is drawn from a distribution $p(\mathbf{s})$ and then subjected to noise, producing the noisy image $\mathbf{x}=(x_{1,1},\dots,x_{N,M})$ as drawn from the noise distribution $p(\mathbf{x}|\mathbf{s})$ . Here, $s_{i,j}$ and $x_{i,j}$ correspond to the respective pixel values at position $(i,j)$ , and $N$ and $M$ are the number of rows and columns in the image, respectively.

Generally, we assume that the imaging noise is zero-centered, that is, the expected value of a noisy image equals the signal, $\mathbb{E}_{p(\mathbf{x}|\mathbf{s})}[\mathbf{x}]=\mathbf{s}$ . Note that zero-centered noise is a weak assumption that is widely shared in literature (e.g. [21, 25, 8]). Another way of looking at this assumption is that the sensor or camera is assumed to behave linearly so that an increase in $s_{i,j}$ (e.g. by increasing the light intensity) will, on average, result in a proportional increase in $x_{i,j}$ .

Additive white Gaussian noise: Arguably, the most basic traditionally used noise model is additive white Gaussian (AWG) noise, see e.g. [5, 3, 30]. We can write the probability distribution for noisy observations in an AWG noise model as

$$p(\mathbf{x}|\mathbf{s})=\prod_{i=1}^{N}\prod_{j=1}^{M}p(x_{i,j}|s_{i,j}), \tag{1}$$

formulated as a product over pixels. In general, we refer to any type of noise model that can be factorised according to Eq. 1 as spatially uncorrelated. In AWG noise, $p(x_{i,j}|s_{i,j})$ corresponds to a normal distribution centered at $s_{i,j}$ , with variance fixed for all $s_{i,j}$ .

As a consequence, the AWG noise itself, $\mathbf{n}=\mathbf{x}-\mathbf{s}$ , follows a fixed variance normal distribution centered at zero and is independent of the underlying signal. In general, we call noise that does not depend on the underlying signal, such that $p(\mathbf{n}|\mathbf{s})=p(\mathbf{n})$ , signal-independent. Types of noise for which this is not the case will be referred to as signal-dependent.

Denoising approaches have assumed noise to be both spatially uncorrelated and signal-independent. Unfortunately, real imaging noise rarely follows these assumptions.

Poisson shot noise: Most imaging systems are at least to some degree affected by Poisson shot noise [24]. A Poisson noise model follows Eq. 1, with $p(x_{i,j}|s_{i,j})$ corresponding to a Poisson distribution with rate $\lambda=s_{i,j}$ . A popular variant on this model is the Poisson-Gaussian noise model (e.g. [46]), where $p(x_{i,j}|s_{i,j})$ is assumed to be a combination of Poisson shot noise and Gaussian read-out noise.

Poisson and Poisson-Gaussian noise models are spatially uncorrelated but signal-dependent, since the variance of the Poisson component in each pixel depends on the pixel’s signal $s_{i,j}$ .

Row-correlated noise: While Poisson-Gaussian noise models are a popular choice in describing noise distributions, many imaging systems (e.g. EMCCD, sCMOS, or scanning microscopes) can produce noise that does not conform to Eq. 1, but is instead correlated along rows (or columns) of pixels. We propose to describe such noise as

$$p(\mathbf{x}|\mathbf{s})=\prod_{i=1}^{N}\prod_{j=1}^{M}p(x_{i,j}|\mathbf{s},x_{i,1},\dots,x_{i,j-1}), \tag{2}$$

where $p(x_{i,j}|\mathbf{s},x_{i,1},\dots,x_{i,j-1})$ is the distribution of possible noisy pixel values $x_{i,j}$ conditioned on the signal, as well as all “previous” values in the same row, $i$ . While this model can describe interactions between pixels within the same row, pixels in different rows are (conditionally) independent (given a signal $\mathbf{s}$ ). Note that in this formulation a pixel $x_{i,j}$ can depend on not only the signal in pixel $(i,j)$ , as with shot noise, but on the entire image for more complex interactions.

In our work, we consider this type of noise as described in Eq. 2, which we believe is a good model for many real scientific imaging data.
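To make the noise model of Eq. 2 concrete, the following is a minimal sketch of one way to simulate noise that is both signal-dependent and row-correlated. It is an illustration only, not the simulation used for the datasets in this paper: the function names, the AR(1) form of the correlated term, and the values of `rho` and `read_std` are all assumptions chosen for the example.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson sample with Knuth's algorithm (suitable for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def add_row_correlated_noise(signal, rho=0.8, read_std=2.0, seed=0):
    """Corrupt a 2D signal (a list of rows) with shot noise plus an AR(1) row term.

    rho and read_std are illustrative values, not taken from the paper.
    """
    rng = random.Random(seed)
    noisy = []
    for row in signal:
        correlated = 0.0  # reset per row: rows stay conditionally independent
        noisy_row = []
        for s in row:
            # Signal-dependent part: zero-mean, variance grows with s.
            shot = sample_poisson(s, rng) - s
            # Row-correlated part: carries noise over from earlier pixels in the row.
            correlated = rho * correlated + rng.gauss(0.0, read_std)
            noisy_row.append(s + shot + correlated)
        noisy.append(noisy_row)
    return noisy
```

Both noise components have zero mean, so the sketch respects the zero-centered assumption above, and because the correlated term is reset at the start of every row, pixels in different rows remain independent given the signal.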

2.2 Latent variable models and Variational Autoencoders

A latent variable model, with parameters $\theta$, defines a probability distribution, $p_{\theta}(\mathbf{x})\propto p_{\mathbf{\theta}}(\mathbf{x}|\mathbf{z})p_{\mathbf{\theta}}(\mathbf{z})$, over observed variables $\mathbf{x}$ via latent variables $\mathbf{z}$. This can be used to represent a data generation process where a value $\mathbf{z}$ is first sampled from the prior $p_{\mathbf{\theta}}(\mathbf{z})$ and a value $\mathbf{x}$ is then sampled from the conditional distribution $p_{\theta}(\mathbf{x}|\mathbf{z})$.

A VAE can be used to simultaneously optimize the model’s parameters and approximate the posterior, $p_{\theta}(\mathbf{z}|\mathbf{x})$ , via maximization of a lower bound on the marginal log-likelihood,

$$\mathcal{L}(\theta,\phi)=\mathbb{E}_{q_{\mathbf{\phi}}(\mathbf{z}|\mathbf{x})}[\log p_{\mathbf{\theta}}(\mathbf{x}|\mathbf{z})]-D_{KL}(q_{\mathbf{\phi}}(\mathbf{z}|\mathbf{x})\|p_{\mathbf{\theta}}(\mathbf{z})), \tag{3}$$

where the second term on the RHS is the Kullback-Leibler divergence [23] from the true prior to an approximate posterior $q_{\mathbf{\phi}}(\mathbf{z}|\mathbf{x})$ . In Eq. 3, the first term is known as the reconstruction error and the second term is known as the regularizer. Accordingly, $q_{\mathbf{\phi}}(\mathbf{z}|\mathbf{x})$ is known as the encoder and $p_{\mathbf{\theta}}(\mathbf{x}|\mathbf{z})$ is known as the decoder.
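As an illustration of the two terms in Eq. 3, the sketch below evaluates the regularizer in closed form for the common special case of a diagonal Gaussian approximate posterior and a standard normal prior. This special case is an assumption made for the example; the hierarchical model used in this paper has a more elaborate prior. The function names are hypothetical.

```python
import math


def kl_diag_gaussian_to_standard_normal(mean, log_var):
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mean, log_var)
    )


def elbo_estimate(log_likelihood, mean, log_var):
    """Single-sample estimate of Eq. 3: reconstruction term minus regularizer.

    log_likelihood is log p(x|z) evaluated at one z drawn from q(z|x).
    """
    return log_likelihood - kl_diag_gaussian_to_standard_normal(mean, log_var)
```

The regularizer is zero exactly when the approximate posterior equals the prior, which is the situation in which the latent variables carry no information about $\mathbf{x}$.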

2.3 VAEs and the division of labour

When designing a latent variable model for image data, the decoder can be made autoregressive, where the distribution of each pixel is conditioned on the value of previous pixels in row-major order, as well as on $\mathbf{z}$ ,

$$p_{\mathbf{\theta}}(\mathbf{x}|\mathbf{z})=\prod^{N}_{i=1}\prod^{M}_{j=1}p_{\theta}(x_{i,j}|\mathbf{z},x_{<(i,j)}). \tag{4}$$

We refer to $x_{<(i,j)}$ as the full AR receptive field, and its shape is shown in Fig. 2b. This modification is intended to make the model more expressive. However, the decoder is now powerful enough to model the entire data distribution locally, to the extent that it is unclear which aspects of the data should be encoded in $\mathbf{z}$ and which should be modeled by the decoder.

In practice, when using a VAE, there seems to be a preference to model as much content as possible with the decoder. He et al. [14] argued this behaviour is caused by the approximate posterior lagging behind the true posterior in the early stages of training, causing the parameters to get stuck in a local optimum, where the decoder models the data without using the latent variables.

Alternatively, Chen et al. [4] reasoned that with most practical VAEs, ignoring the latent variables achieves a tighter lower bound on the marginal log-likelihood. By only using the latent variables to express what the decoder cannot model, the true posterior is brought closer to the prior and can be more closely matched by the relatively inflexible approximate posterior. The authors then demonstrated how this behaviour can be used to control the division of labour in a VAE. Specifically, they designed an AR decoder that is capable of modeling information that they do not want captured in the latent variables, but is incapable of modeling the information that they do want captured in the latent variables.

In this paper, we design an AR receptive field that only captures the correlations common in scientific imaging noise. This allows us to train a latent variable model of noisy image data where latent variables explain the signal content and the decoder models only the noise generation process. We then use the approximate posterior to sample latent variables, each representing one of the clean signals that could possibly underlie a given noisy image.

3 Related work

3.1 Unsupervised denoising

Existing unsupervised deep learning-based denoisers extend the latent variable model to include a known, explicit noise model. The joint distribution is then defined as $p_{\theta}(\mathbf{x},\mathbf{s},\mathbf{z})=p(\mathbf{x}|\mathbf{s})p_{\theta}(\mathbf{s}|\mathbf{z})p_{\theta}(\mathbf{z})$, where $p_{\theta}(\mathbf{s}|\mathbf{z})$ is a deterministic distribution with support only at $g_{\theta}(\mathbf{z})$ and $p(\mathbf{x}|\mathbf{s})=p(\mathbf{x}|\mathbf{s},\mathbf{z})$. The parameters of this model can be optimized by a VAE after a slight modification to the objective in Eq. 3, changing the first term to $\mathbb{E}_{q_{\mathbf{\phi}}(\mathbf{z}|\mathbf{x})}[\log p(\mathbf{x}|\mathbf{s}=g_{\theta}(\mathbf{z}))]$.

In DivNoising [36], the noise model follows Eq. 1 and is learned from calibration data or co-learned during training. Later, HDN [35] was proposed as an extension to DivNoising that used a VAE with a hierarchy of latent variables [40]. It was found that short range spatially correlated noise structures were modeled by only the bottom levels of the hierarchy, and that they could be removed by preventing these latent variables from using information from the encoded input. This technique is known as HDN₃₆.

Next, Autonoise [39] was proposed for removing spatially correlated but signal-independent noise by replacing the pixel-independent noise model in HDN with a CNN-based AR decoder [32], $p(\mathbf{x}|\mathbf{s})=\prod^{N}_{i=1}\prod^{M}_{j=1}p(x_{i,j}|\mathbf{s},x_{<(i,j)})$.

While this method can be used to remove noise of any spatial correlation, it cannot be applied to signal-dependent noise, making it inapplicable to most real-world data. The reason is that the noise model must be pre-trained using suitable calibration data. To learn only the structure of the noise, the required calibration data are samples of pure noise, which can be obtained by, e.g., imaging without light. To also learn how the noise and signal are correlated, paired noisy and clean images with a range of signal content would be needed. This is the same requirement as for supervised methods, and such data is rarely available.

3.2 Self-supervised denoising

Self-supervised deep learning-based denoisers use a noisy image as its own training target, but guide the network to learn a denoising function by corrupting the input in some way. One technique [1, 21, 43] introduces blind-spots, forcing the network to predict a pixel’s value from surrounding pixels. Another trains the network to predict a subset of randomly sampled pixels from another subset of randomly sampled pixels [16]. For photon-counting data, GAP [20] proposed removing photons and training the network to predict them with a Poisson distribution. These techniques exploit a property of spatially uncorrelated noise, which is that the noise in one pixel cannot be predicted from the noise in other pixels.

To address noise that is spatially correlated, Structured Noise2Void (SN2V) [2] extended the blind-spot approach to also mask pixels containing noise that is correlated with the noise in the pixel being predicted. Lee et al. [26] also extended the blind-spot approach, but by sub-sampling pixels to break up noise structures, making the noise effectively spatially uncorrelated and ready for traditional blind-spot denoising. This sub-sampling technique is, however, only applicable to the relatively short range correlations that might be found in consumer photography, not the longer range correlations that are common in microscopy.

4 Method

Figure 2: a) A Variational Lossy Autoencoder [4] (solid arrows) is trained to model the distribution of noisy images $\mathbf{x}$. The autoregressive decoder models the noise component of the images while the latent code models only the clean signal component $\mathbf{s}$. In a second step (dashed arrows), our novel *signal decoder* is trained to map latent variables into image space, producing an estimate of the signal underlying $\mathbf{x}$. b) To ensure that the decoder models only imaging noise and the latent code captures only the signal, we modify the decoder’s receptive field. While a full autoregressive decoder includes pixels above and to the left (shown in blue), our decoder’s receptive field corresponds to the row-correlated structure of the imaging noise and includes only pixels in the same row as the pixel being modeled.

We propose a VAE-based unsupervised image denoiser for noise that is both signal-dependent and correlated along rows or columns of pixels. It is trained using only noisy images and does not require a pre-trained noise model. We restrict the receptive field of the decoder so it can model the correlations of the noise content but not the correlations of the underlying signal. Following the insights of Chen et al. [4] (Sec. 2.3), the decoder will therefore learn to model the noise, leaving the underlying signal to be encoded in the latent variable $\mathbf{z}$. We then propose a method for taking these latent variables and mapping them back into image space to obtain denoised estimates of clean signals. A full outline of our method can be found in Fig. 2.

4.1 Autoregressive receptive fields for noise

Our AR decoder has a 1-dimensional receptive field that is sufficient for modeling row/column correlated noise (Eq. 2) as it commonly occurs in EMCCD, sCMOS, microbolometer detectors, SCM or STEM (see Sec. 1) by spanning pixels in the same row or column as $x_{i,j}$ . See Fig. 2b for a visual representation.

To remove row-correlated noise, the first step in our denoising process is to train a VAE to model noisy image data with the objective function in Eq. 3, where $p_{\mathbf{\theta}}(\mathbf{x}|\mathbf{z})$ is factorized as,

$$p_{\mathbf{\theta}}(\mathbf{x}|\mathbf{z})=\prod_{i=1}^{N}\prod_{j=1}^{M}p_{\theta}(x_{i,j}|\mathbf{z},x_{i,1},\dots,x_{i,j-1}). \tag{5}$$

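To remove column-correlated noise, the factorization instead runs along columns, so that each pixel is conditioned on $\mathbf{z}$ and on the previous pixels in its own column, $x_{1,j},\dots,x_{i-1,j}$.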

We find that this factorization is insufficient for modeling signal content, which is highly correlated in all directions. Consequently, the VAE learns to encode signal content in its latent variables. An experimental investigation of the effects of changing receptive field size can be found in Sec. 5.4, and details on how we construct this receptive field with our AR decoder architecture can be found in the supplementary material.
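The difference between the full AR receptive field of Eq. 4 and the row-restricted one of Eq. 5 can be illustrated by listing the pixel positions each decoder is allowed to condition on. This is a minimal sketch with hypothetical function names. It only enumerates indices and says nothing about how the receptive field is realised in the network, which is described in the supplementary material. The `length` argument is an assumption that mirrors the finite receptive field length discussed in Sec. 5.2.

```python
def row_receptive_field(i, j, length):
    """Positions a row-restricted decoder may condition on for pixel (i, j).

    Only earlier pixels in the same row, at most `length` of them (cf. Eq. 5).
    """
    return [(i, k) for k in range(max(0, j - length), j)]


def full_receptive_field(i, j, num_cols):
    """Positions visible to a full AR decoder (cf. Eq. 4): every pixel that
    precedes (i, j) in row-major order."""
    above = [(r, c) for r in range(i) for c in range(num_cols)]
    left = [(i, c) for c in range(j)]
    return above + left


# Example: pixel (1, 2) in an image with 4 columns.
print(row_receptive_field(1, 2, length=40))
print(full_receptive_field(1, 2, num_cols=4))
```

In the row-restricted case no position from another row ever appears, so any structure that spans several rows, as image content typically does, cannot be explained by such a decoder and has to pass through $\mathbf{z}$.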

4.2 Decoding the signal

Once our VAE has been trained, its latent space will represent clean signals, i.e., each $\mathbf{z}$ contains all the information about an $\mathbf{s}$ . We would now like to use the model for denoising by inferring possible clean signals $\mathbf{s}$ for a given noisy image $\mathbf{x}$ . Unfortunately, unlike previous methods [36, 35, 39], we cannot directly sample clean images from our encoder. Rather, samples from $q_{\mathbf{\phi}}(\mathbf{z}|\mathbf{x})$ will directly correspond to a clean signal $\mathbf{s}$ . We denote the signal corresponding to a value of $\mathbf{z}$ as $\mathbf{s}(\mathbf{z})$ . For an experimental validation of this deterministic relationship, please refer to the supplementary material. To obtain denoised images, we approximate $\mathbf{s}(\mathbf{z})$ with an additional regression network, termed the signal decoder, $f_{\nu}(\mathbf{z})\approx\mathbf{s}(\mathbf{z})$ . In the following, we describe how $f_{\nu}(\mathbf{z})$ is trained despite not having access to training pairs $(\mathbf{z}^{k},\mathbf{s}(\mathbf{z}^{k}))$ .

In fact, training the signal decoder only requires pairs $(\mathbf{z}^{k},\mathbf{x}^{k})$, where $\mathbf{x}^{k}$ is a noisy image and $\mathbf{z}^{k}\sim q_{\phi}(\mathbf{z}|\mathbf{x}^{k})$. If we assume that the approximate posterior is accurate, such that $q_{\mathbf{\phi}}(\mathbf{z}|\mathbf{x})\approx p_{\mathbf{\theta}}(\mathbf{z}|\mathbf{x})$, these pairs can be equivalently thought of as sampled from the joint distribution, $(\mathbf{z}^{k},\mathbf{x}^{k})\sim p_{\theta}(\mathbf{z},\mathbf{x})$.

Least squares regression analysis tells us that optimizing the signal decoder with the $L2$ loss,

$$\mathcal{L}(\nu)=|f_{\nu}(\mathbf{z}^{k})-\mathbf{x}^{k}|^{2}, \tag{6}$$

will train it to approximate $\mathbb{E}_{p_{\theta}(\mathbf{x}|\mathbf{z})}[\mathbf{x}]$ for any $\mathbf{z}$ [13]. Since $\mathbf{z}$ contains no more or less information about $\mathbf{x}$ than $\mathbf{s}(\mathbf{z})$, the signal decoder will equivalently be approximating $\mathbb{E}_{p_{\theta}(\mathbf{x}|\mathbf{s}(\mathbf{z}))}[\mathbf{x}]$. Recalling that imaging noise is zero-centered (Sec. 2.1), the expected value of a noisy image given an underlying signal is that signal. Therefore, $f_{\nu}(\mathbf{z})\approx\mathbb{E}_{p_{\theta}(\mathbf{x}|\mathbf{s}(\mathbf{z}))}[\mathbf{x}]=\mathbf{s}(\mathbf{z})$. This is similar to Noise2Noise [27], where the regressor for $\mathbf{s}$ is trained using noisy targets $\mathbf{x}$.
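The least-squares step can be spelled out with the standard decomposition of the expected squared error. The symbols $\mathbf{f}$ (an arbitrary candidate output for a fixed $\mathbf{z}$) and $\bar{\mathbf{x}}=\mathbb{E}_{p_{\theta}(\mathbf{x}|\mathbf{z})}[\mathbf{x}]$ are introduced here only for this illustration, and all expectations are taken over $p_{\theta}(\mathbf{x}|\mathbf{z})$.

```latex
\begin{aligned}
\mathbb{E}\left[|\mathbf{f}-\mathbf{x}|^{2}\right]
&= |\mathbf{f}-\bar{\mathbf{x}}|^{2}
 + 2(\mathbf{f}-\bar{\mathbf{x}})^{\top}\,\mathbb{E}\left[\bar{\mathbf{x}}-\mathbf{x}\right]
 + \mathbb{E}\left[|\bar{\mathbf{x}}-\mathbf{x}|^{2}\right] \\
&= |\mathbf{f}-\bar{\mathbf{x}}|^{2}
 + \mathbb{E}\left[|\bar{\mathbf{x}}-\mathbf{x}|^{2}\right]
\end{aligned}
```

The cross term vanishes because $\mathbb{E}[\bar{\mathbf{x}}-\mathbf{x}]=\mathbf{0}$, and the remaining expectation does not depend on $\mathbf{f}$, so the loss is minimized at $\mathbf{f}=\bar{\mathbf{x}}$. The noise in the training targets therefore only adds a constant to the objective and does not shift its minimizer.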

Even though the signal decoder would naturally be trained in a second stage after the main VAE is finished, in practice we co-trained it alongside the main VAE. At every training step, the sampled latent variable is fed to both decoders, but only the loss from the AR decoder is allowed to backpropagate to the encoder. This method of training is simply for convenience and we did not observe any changes in performance compared to a signal decoder that is trained separately, after the main VAE.

4.3 Inference

With both the VAE and signal decoder trained, we can denoise an image $\mathbf{x}$ in a two step process. We first sample a latent variable $\mathbf{z}\sim q_{\mathbf{\phi}}(\mathbf{z}|\mathbf{x})$ and then obtain a clean image by decoding $\hat{\mathbf{s}}=f_{\nu}(\mathbf{z})$ . Similarly to [36, 35, 39] the result constitutes a random possible solution $\hat{\mathbf{s}}\sim p(\mathbf{s}|\mathbf{x})$ . To obtain a consensus solution, we follow [36, 35, 39] in averaging a large number of such samples. In the following experiments, all results are the mean of 100 samples, both for our method and the baseline HDN₃₆.
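The two-step inference procedure with sample averaging can be sketched as follows. The callables `sample_latent` and `signal_decoder` are hypothetical stand-ins for the trained encoder and signal decoder, and images are represented as flat lists of pixel values purely to keep the example self-contained.

```python
def denoise(noisy_image, sample_latent, signal_decoder, num_samples=100):
    """Average `num_samples` decoded signals into a consensus estimate.

    sample_latent(noisy_image) -> one draw z ~ q(z|x)       (hypothetical)
    signal_decoder(z)          -> flat list of pixel values  (hypothetical)
    """
    if num_samples < 1:
        raise ValueError("num_samples must be at least 1")
    total = None
    for _ in range(num_samples):
        estimate = signal_decoder(sample_latent(noisy_image))
        if total is None:
            total = list(estimate)
        else:
            total = [t + e for t, e in zip(total, estimate)]
    return [t / num_samples for t in total]
```

Each individual `estimate` is one plausible clean image, and the returned pixel-wise mean corresponds to the consensus solution described above.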

5 Experiments

Table 1: Comparison with baseline methods using Peak Signal-to-Noise Ratio (PSNR), higher is better. Best results for methods that do not require clean images are printed in bold, and best results overall are underlined. The dataset that contains spatially uncorrelated noise and the methods that address only unstructured noise are marked with an asterisk*. CARE is a supervised denoiser requiring examples of clean images, unlike the other baselines.

| Method | EMCCD: Conv. A | EMCCD: Conv. B | EMCCD: Mouse Actin | SCM: Mouse Nuclei* | SCM: Actin Conf. | SCM: Mito Conf. | Simulated: FFHQ Stripe | Simulated: FFHQ Checkerb. |
|---|---|---|---|---|---|---|---|---|
| HDN* | - | 37.39 | 34.12 | 36.87 | - | - | - | - |
| HDN large* | - | - | - | 38.12 | - | - | - | - |
| SN2V | 30.29 | 31.67 | 32.80 | 36.62 | 23.30 | 26.41 | 29.52 | 19.11 |
| HDN₃₆ | 31.41 | 37.21 | - | - | 26.84 | 26.51 | 32.64 | 29.61 |
| HDN₃₆ large | 31.80 | 37.92 | 34.14 | - | 27.17 | 26.15 | 34.54 | 25.51 |
| Ours small | 34.42 | 38.99 | 36.50 | 39.56 | 27.35 | 27.49 | 32.87 | 33.51 |
| Ours large | 37.49 | 44.10 | 39.23 | 42.98 | 27.41 | 27.50 | 35.66 | 36.27 |
| CARE | 31.56 | 36.71 | 34.20 | 36.58 | 29.44 | 27.55 | 36.46 | 36.89 |

5.1 Datasets

We tested the performance of our proposed denoiser on real noisy images captured by five different imaging modalities that commonly suffer from row-correlated noise. The first is the EMCCD sensor, for which we have three fluorescence microscopy datasets with known ground truth: Convallaria A [2], Convallaria B [37] and Mouse Actin [37]. The second is Scanning Confocal Microscopy (SCM), for which we have three fluorescence microscopy datasets with known ground truth: Mouse Nuclei [37], Actin Confocal [12] and Mito Confocal [12]. Note that Mouse Nuclei contains spatially uncorrelated noise, but was included to demonstrate that the proposed method is still applicable to spatially uncorrelated noise without modification. The third is the sCMOS sensor, for which we have one fluorescence dataset with unknown ground truth: Embryo [9]. The fourth is the microbolometer, for which we have one infrared imaging dataset with unknown ground truth: IR [34]. Lastly, we include a STEM dataset with unknown ground truth: STEM [15]. For details on dataset size and train/test splits, please see each original publication.

In addition, we created two datasets by corrupting the Flickr Faces HQ thumbnails dataset [18] with simulated noise. For FFHQ - Stripe, the images were corrupted by a combination of additive white Gaussian noise, Poisson shot noise and additive white Gaussian noise that had undergone a horizontal Gaussian blur. For FFHQ - Checkerboard, the images were corrupted by a combination of a vertical checkerboard pattern and Gaussian noise with an inverse signal dependence. For full details on how the simulated noise cases were created, please refer to the supplementary material.

5.2 Baselines and architecture

There is a scarcity in the literature of deep learning-based denoisers that can be applied to signal-dependent row-correlated noise and do not require examples of clean images or extensive calibration data. Currently, the only methods that meet these requirements are the self-supervised denoiser Structured Noise2Void (SN2V) [2] and the unsupervised denoiser Hierarchical DivNoising36 (HDN₃₆) [35]. We use these as baseline methods along with the supervised denoiser Content Aware Image Restoration (CARE) [44] for datasets where paired training images are available. For details on how each baseline is implemented in this paper, please refer to the supplementary.

Of the denoisers not requiring paired images, HDN₃₆’s performance is best, but the model implemented in the baseline’s publication uses significantly fewer parameters than ours (7 million compared with our 25 million). We therefore evaluate an additional version, termed HDN₃₆ large, with a similar number of parameters to ours, obtained by increasing the number of latent dimensions from 32 to 64.

It should be noted that the Convallaria B and Mouse Actin datasets had been treated as spatially uncorrelated by Prakash et al. [35] when testing the denoiser HDN. We additionally report those results in Tab. 1. It should also be noted that the noise in the Mouse Nuclei dataset is spatially uncorrelated, so it was denoised with HDN and its higher-parameter version, HDN large, which was constructed in the same way as HDN₃₆ large.

Our method requires a choice of orientation and size for the AR decoder’s receptive field. Orientation was determined by examining the spatial autocorrelation of noise samples, with these plots reported in Fig. 3, and following the ablation study in Sec. 5.4, we always used a receptive field length of 40 pixels.
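One simple way to compare horizontal and vertical noise correlation is to compute the correlation between a noise sample and a copy of itself shifted by one pixel in each direction. The sketch below does this for a 2D noise sample given as a list of rows. It assumes that a noise-only sample is available, and the function names and the synthetic test noise are hypothetical; the autocorrelation plots in Fig. 3 may have been produced differently.

```python
import random


def lag_correlation(noise, row_lag, col_lag):
    """Pearson correlation between a 2D noise sample and a shifted copy of itself."""
    n_rows, n_cols = len(noise), len(noise[0])
    pairs = [
        (noise[r][c], noise[r + row_lag][c + col_lag])
        for r in range(n_rows - row_lag)
        for c in range(n_cols - col_lag)
    ]
    mean_a = sum(a for a, _ in pairs) / len(pairs)
    mean_b = sum(b for _, b in pairs) / len(pairs)
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    return cov / (var_a * var_b) ** 0.5


def choose_orientation(noise):
    """Return 'row' if horizontal neighbours are more correlated, else 'column'."""
    horizontal = abs(lag_correlation(noise, 0, 1))
    vertical = abs(lag_correlation(noise, 1, 0))
    return "row" if horizontal >= vertical else "column"


def make_row_correlated_noise(size, rho, seed=0):
    """Synthetic test noise: an independent AR(1) process along each row."""
    rng = random.Random(seed)
    noise = []
    for _ in range(size):
        value, row = 0.0, []
        for _ in range(size):
            value = rho * value + rng.gauss(0.0, 1.0)
            row.append(value)
        noise.append(row)
    return noise


print(choose_orientation(make_row_correlated_noise(64, 0.8)))
```

A one-pixel lag is only a coarse summary. Inspecting longer lags, as in a full autocorrelation plot, additionally indicates how far the correlation extends and hence how long the receptive field needs to be.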

As for our model’s encoder, latent variables are produced by a fully convolutional hierarchical VAE [40], with 14 levels to its hierarchy. In addition to the full sized version, we evaluate a smaller version that requires approximately 6GB (as opposed to 20GB) of GPU memory to train. This was achieved by reducing the number of latent variables from 14 to 6 and reducing the number of latent dimensions from 64 to 32. We refer to these models as Ours small and Ours large, for the lower memory and higher memory versions respectively. Please see the supplementary materials for full architecture and training details.

5.3 Comparing denoising performance

Quantitative results measuring the Peak Signal-to-Noise Ratio (PSNR) in dB are reported in Tab. 1 and qualitative results are reported in Fig. 3. Note that HDN₃₆ failed to train with the Mouse Actin and Embryo datasets. Out of the methods that do not require paired images, Ours large achieved the highest PSNR across all datasets, even beating the supervised CARE on four of six microscopy datasets. Ours small then had the second highest PSNR for an unpaired method on all datasets except FFHQ - Stripe, where it was beaten by HDN₃₆ large, although it should be noted that HDN₃₆ large requires almost 20GB of GPU memory to train while Ours small requires only 6GB.
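For reference, PSNR can be computed from the mean squared error between a denoised image and its ground truth, as in the minimal sketch below. The choice of peak value is an assumption: here it defaults to the dynamic range of the ground truth, which may differ from the convention used to produce Tab. 1. Images are flat lists of pixel values to keep the example self-contained.

```python
import math


def psnr(ground_truth, estimate, data_range=None):
    """Peak Signal-to-Noise Ratio in dB for two equally sized flat pixel lists."""
    if data_range is None:
        # Assumption: use the dynamic range of the ground truth as the peak value.
        data_range = max(ground_truth) - min(ground_truth)
    mse = sum((g - e) ** 2 for g, e in zip(ground_truth, estimate)) / len(ground_truth)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(data_range ** 2 / mse)
```

Because PSNR is logarithmic in the mean squared error, a gain of 3 dB corresponds to roughly halving that error.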

Turning to the qualitative results in Fig. 3, we see that Ours large denoised images from each dataset without leaving behind any artifacts, whereas HDN₃₆ large could not remove the correlated component of the noise from the STEM or the FFHQ - Checkerboard datasets. SN2V left artifacts on all datasets with spatially correlated noise, but not on the uncorrelated Mouse Nuclei dataset. It can also be seen that Ours large produced the sharpest and most accurate images, with the exception of Mito Confocal, for which HDN₃₆ large was sharper.

5.4 Ablation study - receptive field size

As stated in Sec. 4, to model the noise correlations addressed in this paper, the receptive field of a VAE’s AR decoder must span pixels in the same row or column as the pixel being predicted. To investigate the effect of receptive field (RF) size, measured in number of pixels, we denoised the FFHQ - Checkerboard dataset using a range of RF lengths, from 10 pixels to 120 pixels, and measured the effect on PSNR. We also include the effect of training a model with a full AR receptive field, as shown in Fig. 2b. The results are reported in Fig. 4.

The study shows that, when the receptive field spans 40 pixels, the AR decoder is able to model this spatially correlated noise, which in turn allows the encoder to remove it. Moreover, the decoder does not seem to model more of the signal as the receptive field grows. If it did, we would expect a steady drop in PSNR as denoised images lose signal content.
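To make the receptive field shapes compared in this study concrete, the following minimal sketch builds boolean masks marking which pixels an AR decoder may condition on when predicting a given pixel. It is illustrative only. The function names are hypothetical, and we assume that a row-oriented field covers the `length` pixels immediately preceding the target in its row and that a full field follows raster-scan order; the actual decoder implementation may differ.

```python
import numpy as np

def row_receptive_field(shape, pixel, length):
    """Mask of the `length` preceding pixels in the same row (assumed layout)."""
    h, w = shape
    i, j = pixel
    mask = np.zeros((h, w), dtype=bool)
    start = max(0, j - length)
    mask[i, start:j] = True  # strictly earlier pixels in row i; the target is excluded
    return mask

def full_receptive_field(shape, pixel):
    """Mask of all pixels that precede the target in raster-scan order (assumed)."""
    h, w = shape
    i, j = pixel
    mask = np.zeros((h, w), dtype=bool)
    mask[:i, :] = True   # every row above the target
    mask[i, :j] = True   # earlier pixels in the target's own row
    return mask

row_mask = row_receptive_field((64, 64), pixel=(10, 50), length=40)
full_mask = full_receptive_field((64, 64), pixel=(10, 50))
print(row_mask.sum(), full_mask.sum())
```

The row-restricted mask never touches neighbouring rows, so the decoder sees too little context to reconstruct two-dimensional image structure, while the full mask exposes most of the image above and to the left of the target.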

However, as the image on the right of Fig. 4 shows, the signal will be modeled by an AR decoder with a full receptive field. Latent variables will therefore carry no information about $\mathbf{x}$, causing the signal decoder to minimize Eq. 6 by predicting the mean of the training set. This is to be expected, as when the latent variable $\mathbf{z}$ carries no information about $\mathbf{x}$, $\mathbb{E}_{p_{\theta}(\mathbf{x}|\mathbf{z})}[\mathbf{x}]=\mathbb{E}_{p_{\theta}(\mathbf{x})}[\mathbf{x}]$.

5.5 Ablation study - noise reconstruction

If the VAE’s decoder is modeling only the noise component of images, encoding an image with the VAE and sampling from the AR decoder should yield an image with the same underlying signal but a different random sample of noise. If the decoder’s model of the noise is accurate, the sampled noise should exhibit the same autocorrelation and signal-dependence characteristics as the noise in the original image. An investigation into this is reported in Fig. 5, where a noisy image from the Convallaria A dataset was encoded by a trained VAE, then a latent variable was sampled and used by the VAE’s AR decoder to sample a reconstruction of the original image. The reconstructed noise exhibits spatial autocorrelation and signal dependence very similar to the real noise, indicating that our AR decoder has learnt an accurate model of the noise.
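One way to quantify signal dependence for such a comparison is to bin pixels by signal intensity and measure the noise variance within each bin. The sketch below is a minimal illustration under stated assumptions: the helper `noise_variance_by_signal` is hypothetical, an estimate of the clean signal is assumed to be available, and the Poisson example is synthetic rather than data from the Convallaria A dataset.

```python
import numpy as np

def noise_variance_by_signal(signal, noisy, n_bins=10):
    """Noise variance within equal-population bins of signal intensity (hypothetical helper)."""
    signal = np.asarray(signal, dtype=np.float64).ravel()
    noise = np.asarray(noisy, dtype=np.float64).ravel() - signal
    # Quantile edges give bins with roughly equal pixel counts.
    # Note: heavily tied signal values could leave some bins empty.
    edges = np.quantile(signal, np.linspace(0.0, 1.0, n_bins + 1))
    idx = np.digitize(signal, edges[1:-1])  # bin index in 0 .. n_bins - 1
    centres = np.array([signal[idx == b].mean() for b in range(n_bins)])
    variances = np.array([noise[idx == b].var() for b in range(n_bins)])
    return centres, variances

# Synthetic example: Poisson noise, whose variance equals the signal level.
rng = np.random.default_rng(0)
clean = rng.uniform(1.0, 100.0, size=(128, 128))
observed = rng.poisson(clean)

centres, variances = noise_variance_by_signal(clean, observed)
for c, v in zip(centres, variances):
    print(f"signal {c:6.1f}  noise variance {v:6.1f}")
```

Applying the same measurement to the real noisy image and to a sample from the AR decoder would yield two curves that can be compared directly.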

6 Conclusion

We proposed an unsupervised VAE-based denoising algorithm for noise that is correlated along rows or columns of pixels and is signal-dependent. By engineering the receptive field of the AR decoder, the VAE’s latent variables are encouraged to represent the signal content of an image while discarding the noise. We then presented a novel signal decoder that is trained to map this latent variable into an estimate of the clean image. The algorithm outperforms both the self-supervised denoiser SN2V [2] and the unsupervised denoiser HDN₃₆ [35].

Our method is suited to noise with correlations that run parallel to the axes of the image. Such noise commonly occurs in microscopy. Often, users in labs are unable to find suitable unsupervised methods to remove such noise. We release our code as open source and strongly believe that the scientific imaging community will apply and adapt our methods in a variety of applications.

While we have achieved our results using simple 1-dimensional receptive fields, some imaging modalities produce noise that is correlated in multiple directions, therefore requiring a differently shaped receptive field. We found that extending the receptive field to cover both a row and column of pixels allows the decoder to model some aspects of the signal, making the technique unsuitable for removing noise correlated in two dimensions. However, we believe that techniques other than shaping the AR decoder receptive field could be discovered to limit the decoder’s modeling capabilities. We hope that future work will further improve the theoretical understanding of the method and allow us to utilize its full potential.

Acknowledgements We would like to thank Ales Leonardis (University of Birmingham) for a helpful discussion on the writing of the paper. The computations described in this paper were performed using the University of Birmingham’s BlueBEAR HPC service, which provides a High Performance Computing service to the University’s research community. See http://www.birmingham.ac.uk/bear for more details.

References

[1] Batson, J., Royer, L.: Noise2self: Blind denoising by self-supervision. In: International Conference on Machine Learning. pp. 524–533. PMLR (2019)
[2] Broaddus, C., Krull, A., Weigert, M., Schmidt, U., Myers, G.: Removing structured noise with self-supervised blind-spot networks. In: 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI). pp. 159–163. IEEE (2020)
[3] Buades, A., Coll, B., Morel, J.M.: A non-local algorithm for image denoising. In: 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05). vol. 2, pp. 60–65. IEEE (2005)
[4] Chen, X., Kingma, D.P., Salimans, T., Duan, Y., Dhariwal, P., Schulman, J., Sutskever, I., Abbeel, P.: Variational lossy autoencoder. In: International Conference on Learning Representations (2017)
[5] Dabov, K., Foi, A., Katkovnik, V., Egiazarian, K.: Image denoising with block-matching and 3d filtering. In: Image processing: algorithms and systems, neural networks, and machine learning. vol. 6064, pp. 354–365. SPIE (2006)
[6] Denvir, D.J., Conroy, E.: Electron multiplying ccds. In: Opto-Ireland 2002: Optical Metrology, Imaging, and Machine Vision. vol. 4877, pp. 55–68. SPIE (2003)
[7] Dupont, B., Dupret, A., Belhaire, E., Villard, P.: Fpn sources in bolometric infrared detectors. IEEE Sensors Journal 9(8), 944–952 (2009)
[8] Eom, M., Han, S., Park, P., Kim, G., Cho, E.S., Sim, J., Lee, K.H., Kim, S., Tian, H., Böhm, U.L., et al.: Statistically unbiased prediction enables accurate denoising of voltage imaging data. Nature Methods pp. 1–12 (2023)
[9] Glaser, A.K., Bishop, K.W., Barner, L.A., Susaki, E.A., Kubota, S.I., Gao, G., Serafin, R.B., Balaram, P., Turschak, E., Nicovich, P.R., et al.: A hybrid open-top light-sheet microscope for versatile multi-scale imaging of cleared tissues. Nature methods 19(5), 613–619 (2022)
[10] Gu, S., Zhang, L., Zuo, W., Feng, X.: Weighted nuclear norm minimization with application to image denoising. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2862–2869 (2014)
[11] Gulrajani, I., Kumar, K., Ahmed, F., Taiga, A.A., Visin, F., Vazquez, D., Courville, A.: Pixelvae: A latent variable model for natural images. arXiv preprint arXiv:1611.05013 (2016)
[12] Hagen, G.M., Bendesky, J., Machado, R., Nguyen, T.A., Kumar, T., Ventura, J.: Fluorescence microscopy datasets for training deep neural networks. GigaScience 10(5), giab032 (2021)
[13] Hastie, T., Tibshirani, R., Friedman, J.H.: The elements of statistical learning: data mining, inference, and prediction, pp. 18–22. Springer (2009)
[14] He, J., Spokoyny, D., Neubig, G., Berg-Kirkpatrick, T.: Lagging inference networks and posterior collapse in variational autoencoders. In: International Conference on Learning Representations (2019), https://openreview.net/forum?id=rylDfnCqF7
[15] Henninen, T.R., Bon, M., Wang, F., Passerone, D., Erni, R.: The structure of sub-nm platinum clusters at elevated temperatures. Angewandte Chemie International Edition 59(2), 839–845 (2020)
[16] Huang, T., Li, S., Jia, X., Lu, H., Liu, J.: Neighbor2neighbor: Self-supervised denoising from single noisy images. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 14781–14790 (2021)
[17] Izadi, S., Sutton, D., Hamarneh, G.: Image denoising in the deep learning era. Artificial Intelligence Review 56(7), 5929–5974 (2023)
[18] Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4401–4410 (2019)
[19] Kingma, D.P., Welling, M.: Auto-encoding variational bayes. stat 1050, 1 (2014)
[20] Krull, A., Basevi, H., Salmon, B., Zeug, A., Müller, F., Tonks, S., Muppala, L., Leonardis, A.: Image denoising and the generative accumulation of photons. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 1528–1537 (2024)
[21] Krull, A., Buchholz, T.O., Jug, F.: Noise2void-learning denoising from single noisy images. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 2129–2137 (2019)
[22] Krull, A., Vičar, T., Prakash, M., Lalit, M., Jug, F.: Probabilistic noise2void: Unsupervised content-aware denoising. Frontiers in Computer Science 2, 5 (2020)
[23] Kullback, S., Leibler, R.A.: On information and sufficiency. The annals of mathematical statistics 22(1), 79–86 (1951)
[24] Laine, R.F., Jacquemet, G., Krull, A.: Imaging in focus: an introduction to denoising bioimages in the era of deep learning. The international journal of biochemistry & cell biology 140, 106077 (2021)
[25] Laine, S., Karras, T., Lehtinen, J., Aila, T.: High-quality self-supervised deep image denoising. Advances in Neural Information Processing Systems 32 (2019)
[26] Lee, W., Son, S., Lee, K.M.: Ap-bsn: Self-supervised denoising for real-world images via asymmetric pd and blind-spot network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17725–17734 (2022)
[27] Lehtinen, J., Munkberg, J., Hasselgren, J., Laine, S., Karras, T., Aittala, M., Aila, T.: Noise2noise: Learning image restoration without clean data. In: International Conference on Machine Learning. pp. 2965–2974. PMLR (2018)
[28] Liao, Y.: Practical electron microscopy and database. An Online Book (2006)
[29] Mandracchia, B., Hua, X., Guo, C., Son, J., Urner, T., Jia, S.: Fast and accurate scmos noise correction for fluorescence microscopy. Nature communications 11(1), 94 (2020)
[30] Metzler, C.A., Mousavi, A., Heckel, R., Baraniuk, R.G.: Unsupervised learning with stein’s unbiased risk estimator. arXiv preprint arXiv:1805.10531 (2018)
[31] Moomaw, B.: Camera technologies for low light imaging: overview and relative advantages. Methods in cell biology 114, 243–283 (2013)
[32] Van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., et al.: Conditional image generation with pixelcnn decoders. Advances in neural information processing systems 29 (2016)
[33] Pennycook, S.J., Nellist, P.D.: Scanning transmission electron microscopy: imaging and analysis. Springer Science & Business Media (2011)
[34] Portmann, J., Lynen, S., Chli, M., Siegwart, R.: People detection and tracking from aerial thermal views. In: 2014 IEEE international conference on robotics and automation (ICRA). pp. 1794–1800. IEEE (2014)
[35] Prakash, M., Delbracio, M., Milanfar, P., Jug, F.: Interpretable unsupervised diversity denoising and artefact removal. In: International Conference on Learning Representations (2021)
[36] Prakash, M., Krull, A., Jug, F.: Fully unsupervised diversity denoising with convolutional variational autoencoders. arXiv preprint arXiv:2006.06072 (2020)
[37] Prakash, M., Lalit, M., Tomancak, P., Krull, A., Jug, F.: Fully unsupervised probabilistic noise2void. In: 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI). pp. 154–158. IEEE (2020)
[38] Quan, Y., Chen, M., Pang, T., Ji, H.: Self2self with dropout: Learning self-supervised denoising from single image. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 1890–1898 (2020)
[39] Salmon, B., Krull, A.: Towards structured noise models for unsupervised denoising. In: European Conference on Computer Vision. pp. 379–394. Springer (2022)
[40] Sønderby, C.K., Raiko, T., Maaløe, L., Sønderby, S.K., Winther, O.: Ladder variational autoencoders. Advances in neural information processing systems 29 (2016)
[41] Song, T.A., Yang, F., Dutta, J.: Noise2void: unsupervised denoising of pet images. Physics in Medicine & Biology 66(21), 214002 (2021)
[42] Wang, F., Henninen, T.R., Keller, D., Erni, R.: Noise2atom: unsupervised denoising for scanning transmission electron microscopy images. Applied Microscopy 50(1), 1–9 (2020)
[43] Wang, Z., Liu, J., Li, G., Han, H.: Blind2unblind: Self-supervised image denoising with visible blind spots. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2027–2036 (2022)
[44] Weigert, M., Schmidt, U., Boothe, T., Müller, A., Dibrov, A., Jain, A., Wilhelm, B., Schmidt, D., Broaddus, C., Culley, S., et al.: Content-aware image restoration: pushing the limits of fluorescence microscopy. Nature methods 15(12), 1090–1097 (2018)
[45] Xu, J., Adalsteinsson, E.: Deformed2self: Self-supervised denoising for dynamic medical imaging. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part II 24. pp. 25–35. Springer (2021)
[46] Zhang, Y., Zhu, Y., Nichols, E., Wang, Q., Zhang, S., Smith, C., Howard, S.: A poisson-gaussian denoising dataset with real fluorescence microscopy images. In: CVPR (2019)
[47] Zhang, Z., Wang, Y., Piestun, R., Huang, Z.L.: Characterizing and correcting camera noise in back-illuminated scmos cameras. Optics Express 29(5), 6668–6690 (2021)