License: CC BY-NC-SA 4.0
arXiv:2310.07887v2 [eess.IV] 10 Apr 2024
School of Computer, University of Birmingham

https://github.com/krulllab/COSDD

Unsupervised Denoising for Signal-Dependent and Row-Correlated Imaging Noise

Benjamin Salmon 0000-0002-5919-0158    Alexander Krull 0000-0002-7778-7169
Abstract

Accurate analysis of microscopy images is hindered by the presence of noise. This noise is usually signal-dependent and often additionally correlated along rows or columns of pixels. Current self- and unsupervised denoisers can address signal-dependent noise, but none can reliably remove noise that is also row- or column-correlated. Here, we present the first fully unsupervised deep learning-based denoiser capable of handling imaging noise that is row-correlated as well as signal-dependent. Our approach uses a Variational Autoencoder (VAE) with a specially designed autoregressive decoder. This decoder is capable of modeling row-correlated and signal-dependent noise but is incapable of independently modeling the underlying clean signal. The VAE therefore produces latent variables containing only clean signal information, and these are mapped back into image space using a proposed second decoder network. Our method does not require a pre-trained noise model and can be trained from scratch using unpaired noisy data. We show that our approach achieves competitive results when applied to a range of different sensor types and imaging modalities.

Figure 1: Noise is often assumed to be spatially uncorrelated and/or signal-independent, such as additive white Gaussian (AWG) noise. Such assumptions greatly facilitate self/unsupervised denoising. Unfortunately, real noise in microscopy is usually neither. We present the first unsupervised denoising approach for this challenging scenario.

1 Introduction

Imaging is often affected by the presence of noise, posing a challenge in the processing of recorded data. This is especially true in scientific imaging where technology is routinely pushed to the boundary of what is possible. Consequently, data is often pre-processed in an attempt to remove imaging noise before downstream analysis. Over the years, a variety of denoising approaches have been devised [24], including traditional filter-based methods [5] and later, supervised deep learning-based methods [44], which use training data to learn a mapping from noisy to clean images.

Supervised deep learning methods excel with respect to the quality of their output. However, these methods require paired training data, typically consisting of pairs of noisy and clean images. Unlike in applications with natural images (photographs), such paired data is often unobtainable in scientific imaging problems such as microscopy [17, 41, 45]. As a consequence, the applicability of supervised denoising methods in many areas of scientific imaging is limited.

Self- and unsupervised methods (e.g. [1, 21, 36, 35, 8]) have been proposed as a solution to the lack of high-quality training data. They can be trained directly on the data that is to be denoised and have substantially improved the practical applicability of denoising in scientific imaging. These methods separate imaging noise from the underlying signal by making assumptions about the statistical nature of the noise. Typically, they assume the noise is (i) signal-independent (purely additive and occurs separately of the underlying signal) [39], or (ii) spatially uncorrelated (unstructured and occurs separately for each pixel) [21, 22, 36, 16, 38]. We show examples of (i) and (ii) in Fig. 1.

In practice, (i) is often broken by the presence of Poisson shot noise [24], where the variance increases as the intensity of the underlying signal increases. Moreover, many popular scientific cameras and imaging setups break (ii) by producing row- or column-correlated noise. For example, the scientific Complementary Metal-Oxide-Semiconductor (sCMOS) [31] cameras that are popular in optical microscopy [29] have separate amplifiers for each column of pixels, leading to correlated noise within each column [47]. The detectors used in infrared imaging systems, e.g. microbolometers, also commonly use separate column amplifiers and suffer from similar noise structures [7]. Depending on their settings, Electron Multiplying Charge-Coupled Device (EMCCD) [6] cameras can produce horizontally correlated read-out noise [2]. Similarly, scanning-based imaging methods such as scanning transmission electron microscopy (STEM) [33] are prone to line artifacts caused by the slow reaction time of readout electronics [28]. Examples of noisy images from these modalities along with plots of their spatial autocorrelation can be found in Fig. 3. These types of noise cannot be removed with basic self- or unsupervised methods. Recently, variants of these methods for spatially correlated noise have been proposed, but these are limited to locally correlated noise and can come at the expense of reduced reconstruction quality [2, 35].

In this paper, we present the first unsupervised deep learning-based denoiser capable of reliably removing signal-dependent noise that is correlated along rows or columns of pixels, as it commonly occurs in microscopy data. The approach is illustrated in Fig. 2. Our method requires neither examples of noise-free images, which can be impossible to obtain (e.g. [42]), nor pre-trained noise models [36, 39, 22] or hand-crafted priors [42, 5, 10]. Furthermore, we do not rely on blind-spot approaches [2] or subsampling [16, 26] techniques that degrade image quality and therefore limit denoising performance.

The backbone of our approach is the Variational Autoencoder (VAE) [19], which trains a latent variable model of the noisy image data $\mathbf{x}$ as $p_{\theta}(\mathbf{x})\propto p_{\theta}(\mathbf{x}|\mathbf{z})p_{\theta}(\mathbf{z})$, where $\theta$ refers to the model parameters. We guide our VAE to represent clean, noise-free images with latent variables $\mathbf{z}$ and to model the noise generation process with the decoder $p_{\theta}(\mathbf{x}|\mathbf{z})$.

Our method for this is based on the representation learning technique proposed by Chen et al. [4], who revealed how a VAE prefers to avoid using latent variables $\mathbf{z}$ to describe structures that could instead be modeled locally by its decoder $p_{\theta}(\mathbf{x}|\mathbf{z})$. For an autoregressive (AR) decoder such as PixelCNN [32, 11], this includes any inter-pixel dependencies that lie within its receptive field. Consequently, latent variables will only represent information that the decoder cannot model, i.e., dependencies that span beyond its receptive field.

We take advantage of this behaviour by designing a decoder that is capable of modeling exactly what we want to be excluded from our latent variables – the imaging noise – while being incapable of modeling the remainder of the data. Specifically, we use an AR decoder that can only model axis-aligned structures because its receptive field spans only a row or column of the image. This trains our model to exclude row- or column-correlated noise from the latent variables, while encouraging it to include the statistics of the underlying signal.

We then propose a second network, termed signal decoder, that is trained to map these latent variables back into image space, thereby producing denoised images. Following Lehtinen et al. [27], who showed that noisy images can be used as training targets for denoisers, we train the signal decoder alongside the VAE by using the original noisy data as training target.

To summarize, our main contributions are:

  • We present an unsupervised deep learning-based denoiser for signal-dependent noise that is correlated along rows or columns of pixels.

  • By equipping a VAE with a 1-dimensional AR decoder, we guide it to represent clean images with latent variables and model noise content with the decoder.

  • We introduce a novel architecture using a signal decoder that is trained in a second step to predict the signal from our latent variables.

  • We apply our method to microscopy data recorded with various imaging modalities, as well as other datasets, achieving state-of-the-art denoising results compared to other unsupervised methods.

2 Background

2.1 Image formation and imaging noise

We can express image formation as a two-step process. The clean image, or as we also refer to it, the signal, $\mathbf{s}=(s_{1,1},\dots,s_{N,M})$, is drawn from a distribution $p(\mathbf{s})$ and then subjected to noise, producing the noisy image $\mathbf{x}=(x_{1,1},\dots,x_{N,M})$ as drawn from the noise distribution $p(\mathbf{x}|\mathbf{s})$. Here, $s_{i,j}$ and $x_{i,j}$ correspond to the respective pixel values at position $(i,j)$, and $N$ and $M$ are the number of rows and columns in the image, respectively.

Generally, we assume that the imaging noise is zero-centered, that is, the expected value of a noisy image equals the signal, $\mathbb{E}_{p(\mathbf{x}|\mathbf{s})}[\mathbf{x}]=\mathbf{s}$. Note that zero-centered noise is a weak assumption that is widely shared in literature (e.g. [21, 25, 8]). Another way of looking at this assumption is that the sensor or camera is assumed to behave linearly so that an increase in $s_{i,j}$ (e.g. by increasing the light intensity) will, on average, result in a proportional increase in $x_{i,j}$.

Additive white Gaussian noise: Arguably, the most basic traditionally used noise model is additive white Gaussian (AWG) noise, see e.g. [5, 3, 30]. We can write the probability distribution for noisy observations in an AWG noise model as

$$p(\mathbf{x}|\mathbf{s})=\prod_{i=1}^{N}\prod_{j=1}^{M}p(x_{i,j}|s_{i,j}), \tag{1}$$

formulated as a product over pixels. In general, we refer to any type of noise model that can be factorised according to Eq. 1 as spatially uncorrelated. In AWG noise, $p(x_{i,j}|s_{i,j})$ corresponds to a normal distribution centered at $s_{i,j}$, with variance fixed for all $s_{i,j}$.

As a consequence, the AWG noise itself, $\mathbf{n}=\mathbf{x}-\mathbf{s}$, follows a fixed variance normal distribution centered at zero and is independent of the underlying signal. In general, we call noise that does not depend on the underlying signal, such that $p(\mathbf{n}|\mathbf{s})=p(\mathbf{n})$, signal-independent. Types of noise for which this is not the case will be referred to as signal-dependent.

Denoising approaches have often assumed noise to be both spatially uncorrelated and signal-independent. Unfortunately, real imaging noise rarely follows these assumptions.

Poisson shot noise: Most imaging systems are at least to some degree affected by Poisson shot noise [24]. A Poisson noise model follows Eq. 1, with $p(x_{i,j}|s_{i,j})$ corresponding to a Poisson distribution with rate $\lambda=s_{i,j}$. A popular variant on this model is the Poisson-Gaussian noise model (e.g. [46]), where $p(x_{i,j}|s_{i,j})$ is assumed to be a combination of Poisson shot noise and Gaussian read-out noise.

Poisson and Poisson-Gaussian noise models are spatially uncorrelated but signal-dependent, since the variance of the Poisson component in each pixel depends on the pixel's signal $s_{i,j}$.
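The difference between signal-independent and signal-dependent noise can be checked numerically. The following is a minimal sketch, not part of the original method: it draws many noisy observations of a single pixel under an AWG model and under a Poisson model and compares the empirical variance of $\mathbf{n}=\mathbf{x}-\mathbf{s}$. The pixel intensities, the AWG standard deviation `sigma`, and the helper names `sample_poisson` and `noise_variance` are illustrative assumptions.

```python
import math
import random
import statistics


def sample_poisson(rate, rng):
    """Draw one Poisson sample with Knuth's method (adequate for small rates)."""
    limit = math.exp(-rate)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def noise_variance(signal, model, rng, n_samples=5000, sigma=3.0):
    """Empirical variance of n = x - s for one pixel with clean value `signal`."""
    if model == "awg":
        # Fixed-variance Gaussian noise: does not look at `signal` at all.
        noise = [rng.gauss(0.0, sigma) for _ in range(n_samples)]
    elif model == "poisson":
        # x ~ Poisson(rate = s), so the noise is x - s.
        noise = [sample_poisson(signal, rng) - signal for _ in range(n_samples)]
    else:
        raise ValueError(f"unknown noise model: {model}")
    return statistics.pvariance(noise)


rng = random.Random(0)
signals = [2.0, 20.0, 50.0]  # hypothetical clean pixel intensities
awg_var = [noise_variance(s, "awg", rng) for s in signals]
poisson_var = [noise_variance(s, "poisson", rng) for s in signals]

for s, va, vp in zip(signals, awg_var, poisson_var):
    print(f"s={s:5.1f}  AWG variance={va:6.2f}  Poisson variance={vp:6.2f}")
```

Under these assumptions the AWG variance stays near `sigma ** 2` for every intensity, while the Poisson variance tracks the signal, which is what makes shot noise signal-dependent.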

Row-correlated noise: While Poisson-Gaussian noise models are a popular choice in describing noise distributions, many imaging systems (e.g. EMCCD, sCMOS, or scanning microscopes) can produce noise that does not conform to Eq. 1, but is instead correlated along rows (or columns) of pixels. We propose to describe such noise as

$$p(\mathbf{x}|\mathbf{s})=\prod_{i=1}^{N}\prod_{j=1}^{M}p(x_{i,j}|\mathbf{s},x_{i,1},\dots,x_{i,j-1}), \tag{2}$$

where $p(x_{i,j}|\mathbf{s},x_{i,1},\dots,x_{i,j-1})$ is the distribution of possible noisy pixel values $x_{i,j}$ conditioned on the signal, as well as all "previous" values in the same row, $i$. While this model can describe interactions between pixels within the same row, pixels in different rows are (conditionally) independent (given a signal $\mathbf{s}$). Note that in this formulation a pixel $x_{i,j}$ can depend on not only the signal in pixel $(i,j)$, as with shot noise, but on the entire image for more complex interactions.

In our work, we consider this type of noise as described in Eq. 2, which we believe is a good model for much real scientific imaging data.
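To make the row-correlated model of Eq. 2 concrete, the sketch below samples from one simple instance of it. This is an illustrative assumption rather than a noise model taken from the paper: the noise in each row follows a first-order autoregressive process whose innovation variance grows with the signal, so it is both row-correlated and signal-dependent. The parameters `rho` and `gain` and all function names are hypothetical.

```python
import math
import random
import statistics


def sample_row_correlated(signal, rho=0.8, gain=0.5, seed=0):
    """Sample x ~ p(x|s) with AR(1) noise along each row.

    The innovation at pixel (i, j) has variance gain * s[i][j], so the noise
    is signal-dependent. Rows are sampled independently of one another.
    """
    rng = random.Random(seed)
    noisy = []
    for row in signal:
        previous_noise = 0.0
        noisy_row = []
        for s in row:
            innovation = rng.gauss(0.0, math.sqrt(gain * s))
            # Depends on the previous pixel in the same row via its noise.
            noise = rho * previous_noise + innovation
            noisy_row.append(s + noise)
            previous_noise = noise
        noisy.append(noisy_row)
    return noisy


def correlation(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


N, M = 128, 128
# Hypothetical clean image whose brightness increases from top to bottom.
signal = [[5.0 + 50.0 * i / (N - 1)] * M for i in range(N)]
noisy = sample_row_correlated(signal)
noise = [[x - s for x, s in zip(xr, sr)] for xr, sr in zip(noisy, signal)]

# Correlation of the noise between horizontal and vertical neighbours.
horizontal = correlation(
    [noise[i][j] for i in range(N) for j in range(M - 1)],
    [noise[i][j + 1] for i in range(N) for j in range(M - 1)],
)
vertical = correlation(
    [noise[i][j] for i in range(N - 1) for j in range(M)],
    [noise[i + 1][j] for i in range(N - 1) for j in range(M)],
)
dark_var = statistics.pvariance([n for row in noise[:16] for n in row])
bright_var = statistics.pvariance([n for row in noise[-16:] for n in row])

print(f"horizontal correlation: {horizontal:.2f}, vertical correlation: {vertical:.2f}")
print(f"noise variance in dark rows: {dark_var:.1f}, in bright rows: {bright_var:.1f}")
```

In this toy setting, neighbouring pixels in the same row carry strongly correlated noise while neighbours in different rows do not, and the noise variance rises with brightness.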

2.2 Latent variable models and Variational Autoencoders

A latent variable model, with parameters $\theta$, defines a probability distribution, $p_{\theta}(\mathbf{x})\propto p_{\theta}(\mathbf{x}|\mathbf{z})p_{\theta}(\mathbf{z})$, over observed variables $\mathbf{x}$ via latent variables $\mathbf{z}$. This can be used to represent a data generation process where a value $\mathbf{z}$ is first sampled from the prior $p_{\theta}(\mathbf{z})$ and a value $\mathbf{x}$ is sampled from the conditional distribution $p_{\theta}(\mathbf{x}|\mathbf{z})$.

A VAE can be used to simultaneously optimize the model's parameters and approximate the posterior, $p_{\theta}(\mathbf{z}|\mathbf{x})$, via maximization of a lower bound on the marginal log-likelihood,

$$\mathcal{L}(\theta,\phi)=\mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}[\log p_{\theta}(\mathbf{x}|\mathbf{z})]-D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x})\,\|\,p_{\theta}(\mathbf{z})), \tag{3}$$

where the second term on the RHS is the Kullback-Leibler divergence [23] from the true prior to an approximate posterior $q_{\phi}(\mathbf{z}|\mathbf{x})$. In Eq. 3, the first term is known as the reconstruction error and the second term is known as the regularizer. Accordingly, $q_{\phi}(\mathbf{z}|\mathbf{x})$ is known as the encoder and $p_{\theta}(\mathbf{x}|\mathbf{z})$ is known as the decoder.
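The lower bound in Eq. 3 can be evaluated exactly for a small toy model, which helps show why it is a bound. The sketch below assumes a scalar latent with prior $p(z)=\mathcal{N}(0,1)$, decoder $p(x|z)=\mathcal{N}(z,1)$ and a Gaussian encoder. These choices are made only so that every term has a closed form; they are not the networks used in this paper.

```python
import math


def gaussian_kl(mu, log_var):
    """KL( N(mu, exp(log_var)) || N(0, 1) ) for a scalar latent, in closed form."""
    return 0.5 * (mu ** 2 + math.exp(log_var) - 1.0 - log_var)


def elbo(x, mu, log_var):
    """Eq. 3 for the toy model p(z)=N(0,1), p(x|z)=N(z,1), q(z|x)=N(mu, exp(log_var))."""
    var = math.exp(log_var)
    # E_q[log p(x|z)] is analytic here because E_q[(x - z)^2] = (x - mu)^2 + var.
    reconstruction = -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - mu) ** 2 + var)
    return reconstruction - gaussian_kl(mu, log_var)


def log_marginal(x):
    """log p(x) for the toy model, where marginally x ~ N(0, 2)."""
    return -0.5 * math.log(2.0 * math.pi * 2.0) - x ** 2 / 4.0


x = 1.3  # an arbitrary observation
# Encoder equal to the true posterior N(x/2, 1/2): the bound is tight.
tight = elbo(x, mu=x / 2.0, log_var=math.log(0.5))
# Encoder equal to the prior N(0, 1): the KL term vanishes but the bound is loose.
loose = elbo(x, mu=0.0, log_var=0.0)

print(f"log p(x) = {log_marginal(x):.4f}")
print(f"ELBO with true posterior = {tight:.4f}")
print(f"ELBO with prior as encoder = {loose:.4f}")
```

In this toy model the bound equals the marginal log-likelihood when the encoder matches the true posterior and falls below it otherwise.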

2.3 VAEs and the division of labour

When designing a latent variable model for image data, the decoder can be made autoregressive, where the distribution of each pixel is conditioned on the value of previous pixels in row-major order, as well as on $\mathbf{z}$,

$$p_{\theta}(\mathbf{x}|\mathbf{z})=\prod_{i=1}^{N}\prod_{j=1}^{M}p_{\theta}(x_{i,j}|\mathbf{z},x_{<(i,j)}). \tag{4}$$

We refer to $x_{<(i,j)}$ as the full AR receptive field, and its shape is shown in Fig. 2b. This modification is intended to make the model more expressive. However, the decoder is now powerful enough to model the entire data distribution locally, to the extent that it is unclear which aspects of the data should be encoded in $\mathbf{z}$ and which should be modeled by the decoder.

In practice, when using a VAE, there seems to be a preference to model as much content as possible with the decoder. He et al. [14] argued this behaviour is caused by the approximate posterior lagging behind the true posterior in the early stages of training, causing the parameters to get stuck in a local optimum, where the decoder models the data without using the latent variables.

Alternatively, Chen et al. [4] reasoned that with most practical VAEs, ignoring the latent variables achieves a tighter lower bound on the marginal log-likelihood. By only using the latent variables to express what the decoder cannot model, the true posterior is brought closer to the prior and can be more closely matched by the relatively inflexible approximate posterior. The authors then demonstrated how this behaviour can be used to control the division of labour in a VAE. Specifically, they designed an AR decoder that is capable of modeling information that they do not want captured in the latent variables, but is incapable of modeling the information that they do want captured in the latent variables.

In this paper, we design an AR receptive field that only captures the correlations common in scientific imaging noise. This allows us to train a latent variable model of noisy image data where latent variables explain the signal content and the decoder models only the noise generation process. We then use the approximate posterior to sample latent variables, each representing one of the clean signals that could possibly underlie a given noisy image.

3 Related work

3.1 Unsupervised denoising

Existing unsupervised deep learning-based denoisers extend the latent variable model to include a known, explicit noise model. The joint distribution is then defined as $p_{\theta}(\mathbf{x},\mathbf{s},\mathbf{z})=p(\mathbf{x}|\mathbf{s})p_{\theta}(\mathbf{s}|\mathbf{z})p_{\theta}(\mathbf{z})$, where $p_{\theta}(\mathbf{s}|\mathbf{z})$ is a deterministic distribution with support only at $g_{\theta}(\mathbf{z})$ and $p(\mathbf{x}|\mathbf{s})=p(\mathbf{x}|\mathbf{s},\mathbf{z})$. The parameters of this model can be optimized by a VAE after a slight modification to the objective in Eq. 3, changing the first term to $\mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}[\log p(\mathbf{x}|\mathbf{s}=g_{\theta}(\mathbf{z}))]$.

In DivNoising [36], the noise model follows Eq. 1 and is learned from calibration data or co-learned during training. Later, HDN [35] was proposed as an extension to DivNoising that used a VAE with a hierarchy of latent variables [40]. It was found that short range spatially correlated noise structures were modeled by only the bottom levels of the hierarchy, and they could be removed by preventing these latent variables from using information from the encoded input. This technique is known as HDN36.

Next, Autonoise [39] was proposed for removing spatially correlated but signal-independent noise by replacing the pixel-independent noise model in HDN with a CNN-based AR decoder [32], $p(\mathbf{x}|\mathbf{s})=\prod_{i=1}^{N}\prod_{j=1}^{M}p(x_{i,j}|\mathbf{s},x_{<(i,j)})$.

While this method can be used to remove noise of any spatial correlation, it cannot be applied to signal-dependent noise, making it inapplicable to most real-world data. The reason for this is that the noise model must be pre-trained using suitable calibration data. To learn only the structure of the noise, the required calibration data consists of samples of pure noise, which can be obtained by, e.g., imaging without light. To also learn how the noise and signal are correlated, paired noisy and clean images with a range of signal content would be needed. This is the same requirement as for supervised methods and is rarely available.

3.2 Self-supervised denoising

Self-supervised deep learning-based denoisers use a noisy image as its own training target, but guide the network to learn a denoising function by corrupting the input in some way. One technique [1, 21, 43] introduces blind-spots, forcing the network to predict a pixel’s value from surrounding pixels. Another trains the network to predict a subset of randomly sampled pixels from another subset of randomly sampled pixels [16]. For photon-counting data, GAP et al. [20] proposed removing photons and training the network to predict them with a Poisson distribution. These techniques exploit a property of spatially uncorrelated noise, which is that the noise in one pixel cannot be predicted from the noise in other pixels.

To address noise that is spatially correlated, Structured Noise2Void (SN2V) [2] extended the blind-spot approach to also mask pixels containing noise that is correlated with the noise in the pixel being predicted. Lee et al. [26] also extended the blind-spot approach, but by sub-sampling pixels to break up noise structures, making the noise effectively spatially uncorrelated and ready for traditional blind-spot denoising. However, this sub-sampling technique is only applicable to relatively short range correlations that might be found in consumer photography, not the longer range correlations that are common in microscopy.

4 Method

Figure 2: a): A Variational Lossy Autoencoder [4] (solid arrows) is trained to model the distribution of noisy images $\mathbf{x}$. The autoregressive decoder models the noise component of the images while the latent code models only the clean signal component $\mathbf{s}$. In a second step (dashed arrows), our novel signal decoder is trained to map latent variables into image space, producing an estimate of the signal underlying $\mathbf{x}$. b): To ensure that the decoder models only imaging noise and the latent code captures only the signal, we modify the decoder's receptive field. While a full autoregressive decoder includes pixels above and to the left (shown in blue), our decoder's receptive field corresponds to the row-correlated structure of the imaging noise and includes only pixels in the same row as the pixel being modeled.

We propose a VAE-based unsupervised image denoiser for noise that is both signal-dependent and correlated along rows or columns of pixels. It is trained using only noisy images and does not require a pre-trained noise model. We restrict the receptive field of the decoder so it can model the correlations of the noise content but not the correlations of the underlying signal. Following the insights of Chen et al. [4] (Sec. 2.3), the decoder will therefore learn to model the noise, leaving the underlying signal to be encoded in the latent variable $\mathbf{z}$. We then propose a method for taking these latent variables and mapping them back into image space to obtain denoised estimates of clean signals. A full outline of our method can be found in Fig. 2.

4.1 Autoregressive receptive fields for noise

Our AR decoder has a 1-dimensional receptive field that is sufficient for modeling row/column correlated noise (Eq. 2) as it commonly occurs in EMCCD, sCMOS, microbolometer detectors, SCM or STEM (see Sec. 1) by spanning pixels in the same row or column as $x_{i,j}$. See Fig. 2b for a visual representation.

To remove row-correlated noise, the first step in our denoising process is to train a VAE to model noisy image data with the objective function in Eq. 3, where $p_{\theta}(\mathbf{x}|\mathbf{z})$ is factorized as,

$$p_{\theta}(\mathbf{x}|\mathbf{z})=\prod_{i=1}^{N}\prod_{j=1}^{M}p_{\theta}(x_{i,j}|\mathbf{z},x_{i,1},\dots,x_{i,j-1}). \tag{5}$$

To remove column-correlated noise, the factorization over pixels is perpendicular.
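The difference between the full AR receptive field and the 1-dimensional variants can be spelled out by listing which pixels each factorization conditions on. The following sketch is purely illustrative: it enumerates pixel positions and does not reproduce the CNN-based decoder architecture, which is described in the supplementary material. The function names and the 0-indexed convention are assumptions.

```python
def receptive_field(i, j, n_rows, n_cols, kind):
    """Return the set of pixel positions that pixel (i, j) is conditioned on.

    Positions are 0-indexed here, whereas the text uses 1-indexed pixels.
    """
    if kind == "full":
        # Every earlier pixel in row-major order (the full AR receptive field).
        return {(r, c) for r in range(n_rows) for c in range(n_cols)
                if (r, c) < (i, j)}
    if kind == "row":
        # Earlier pixels in the same row only (the row-wise factorization).
        return {(i, c) for c in range(j)}
    if kind == "column":
        # Perpendicular variant: earlier pixels in the same column only.
        return {(r, j) for r in range(i)}
    raise ValueError(f"unknown receptive field: {kind}")


def render(i, j, n_rows, n_cols, kind):
    """Draw the field as text: '#' = visible, 'o' = current pixel, '.' = hidden."""
    field = receptive_field(i, j, n_rows, n_cols, kind)
    lines = []
    for r in range(n_rows):
        lines.append("".join(
            "o" if (r, c) == (i, j) else "#" if (r, c) in field else "."
            for c in range(n_cols)))
    return "\n".join(lines)


for kind in ("full", "row", "column"):
    print(kind)
    print(render(2, 3, 4, 5, kind))
```

Both 1-dimensional fields are subsets of the full field. Restricting the decoder in this way leaves it able to see axis-aligned dependencies while hiding the two-dimensional context it would need to model signal content.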

We find that this factorization is insufficient for modeling signal content, which is highly correlated in all directions. Consequently, the VAE learns to encode signal content in its latent variables. An experimental investigation of the effects of changing receptive field size can be found in Sec. 5.4, and details on how we construct this receptive field with our AR decoder architecture can be found in the supplementary material.

4.2 Decoding the signal

Once our VAE has been trained, its latent space will represent clean signals, i.e., each $\mathbf{z}$ contains all the information about an $\mathbf{s}$. We would now like to use the model for denoising by inferring possible clean signals $\mathbf{s}$ for a given noisy image $\mathbf{x}$. Unfortunately, unlike previous methods [36, 35, 39], we cannot directly sample clean images from our encoder. Rather, samples from $q_{\phi}(\mathbf{z}|\mathbf{x})$ will directly correspond to a clean signal $\mathbf{s}$. We denote the signal corresponding to a value of $\mathbf{z}$ as $\mathbf{s}(\mathbf{z})$. For an experimental validation of this deterministic relationship, please refer to the supplementary material. To obtain denoised images, we approximate $\mathbf{s}(\mathbf{z})$ with an additional regression network, termed the signal decoder, $f_{\nu}(\mathbf{z})\approx\mathbf{s}(\mathbf{z})$. In the following, we describe how $f_{\nu}(\mathbf{z})$ is trained despite not having access to training pairs $(\mathbf{z}^{k},\mathbf{s}(\mathbf{z}^{k}))$.

In fact, training the signal decoder only requires pairs $(\mathbf{z}^{k},\mathbf{x}^{k})$, where $\mathbf{x}^{k}$ is a noisy image and $\mathbf{z}^{k}\sim q_{\phi}(\mathbf{z}|\mathbf{x}^{k})$. If we assume that the approximate posterior is accurate, such that $q_{\phi}(\mathbf{z}|\mathbf{x})\approx p_{\theta}(\mathbf{z}|\mathbf{x})$, these pairs can be equivalently thought of as sampled from the joint distribution, $(\mathbf{z}^{k},\mathbf{x}^{k})\sim p_{\theta}(\mathbf{z},\mathbf{x})$.

Least squares regression analysis tells us that optimizing the signal decoder with the $L2$ loss,

$$\mathcal{L}(\nu)=|f_{\nu}(\mathbf{z}^{k})-\mathbf{x}^{k}|^{2}, \tag{6}$$

will train it to approximate $\mathbb{E}_{p_{\theta}(\mathbf{x}|\mathbf{z})}[\mathbf{x}]$ for any $\mathbf{z}$ [13]. Since $\mathbf{z}$ contains no more or less information about $\mathbf{x}$ than $\mathbf{s}(\mathbf{z})$, the signal decoder will equivalently be approximating $\mathbb{E}_{p_{\theta}(\mathbf{x}|\mathbf{s}(\mathbf{z}))}[\mathbf{x}]$. Recalling that imaging noise is zero-centered (Sec. 2.1), the expected value of a noisy image given an underlying signal is that signal. Therefore, $f_{\nu}(\mathbf{z})\approx\mathbb{E}_{p_{\theta}(\mathbf{x}|\mathbf{s}(\mathbf{z}))}[\mathbf{x}]=\mathbf{s}(\mathbf{z})$. This is similar to Noise2Noise [27], where the regressor for $\mathbf{s}$ is trained using noisy targets $\mathbf{x}$.
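The least-squares step can be written out explicitly. The shorthand $\mu(\mathbf{z})=\mathbb{E}_{p_{\theta}(\mathbf{x}|\mathbf{z})}[\mathbf{x}]$ is introduced here only for this derivation. For a fixed $\mathbf{z}$, the expected loss of Eq. 6 decomposes as:

```latex
\begin{aligned}
\mathbb{E}_{p_{\theta}(\mathbf{x}|\mathbf{z})}\!\left[|f_{\nu}(\mathbf{z})-\mathbf{x}|^{2}\right]
&= |f_{\nu}(\mathbf{z})-\mu(\mathbf{z})|^{2}
 + 2\,\big(f_{\nu}(\mathbf{z})-\mu(\mathbf{z})\big)^{\top}
   \mathbb{E}_{p_{\theta}(\mathbf{x}|\mathbf{z})}\!\left[\mu(\mathbf{z})-\mathbf{x}\right]
 + \mathbb{E}_{p_{\theta}(\mathbf{x}|\mathbf{z})}\!\left[|\mu(\mathbf{z})-\mathbf{x}|^{2}\right] \\
&= |f_{\nu}(\mathbf{z})-\mu(\mathbf{z})|^{2}
 + \mathbb{E}_{p_{\theta}(\mathbf{x}|\mathbf{z})}\!\left[|\mu(\mathbf{z})-\mathbf{x}|^{2}\right].
\end{aligned}
```

The cross term vanishes because $\mathbb{E}_{p_{\theta}(\mathbf{x}|\mathbf{z})}[\mu(\mathbf{z})-\mathbf{x}]=0$ by definition of $\mu$, and the remaining variance term does not depend on $\nu$. The loss is therefore minimized at $f_{\nu}(\mathbf{z})=\mu(\mathbf{z})$. Writing $\mathbf{x}=\mathbf{s}(\mathbf{z})+\mathbf{n}$ with zero-centered noise $\mathbf{n}$ then gives $\mu(\mathbf{z})=\mathbf{s}(\mathbf{z})$.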

Even though the signal decoder would naturally be trained in a second stage after the main VAE is finished, in practice we co-trained it alongside the main VAE. At every training step, the sampled latent variable is fed to both decoders, but only the loss from the AR decoder is allowed to backpropagate to the encoder. This method of training is simply for convenience and we did not observe any changes in performance compared to a signal decoder that is trained separately, after the main VAE.

4.3 Inference

With both the VAE and signal decoder trained, we can denoise an image $\mathbf{x}$ in a two-step process. We first sample a latent variable $\mathbf{z}\sim q_{\phi}(\mathbf{z}|\mathbf{x})$ and then obtain a clean image by decoding $\hat{\mathbf{s}}=f_{\nu}(\mathbf{z})$. Similarly to [36, 35, 39], the result constitutes a random possible solution $\hat{\mathbf{s}}\sim p(\mathbf{s}|\mathbf{x})$. To obtain a consensus solution, we follow [36, 35, 39] in averaging a large number of such samples. In the following experiments, all results are the mean of 100 samples, both for our method and the baseline HDN36.
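The two-step inference procedure and the averaging of samples can be sketched as follows. The callables `sample_latent` and `signal_decoder` are hypothetical placeholders for the trained encoder $q_{\phi}$ and signal decoder $f_{\nu}$, and images are represented as flat lists of pixel values purely to keep the sketch self-contained.

```python
def denoise(x, sample_latent, signal_decoder, n_samples=100):
    """Return the pixel-wise mean of n_samples decoded signal estimates.

    sample_latent(x) should draw z ~ q_phi(z | x); signal_decoder(z)
    should return s_hat = f_nu(z) as a flat list with len(x) entries.
    """
    total = [0.0] * len(x)
    for _ in range(n_samples):
        z = sample_latent(x)          # step 1: sample a latent variable
        s_hat = signal_decoder(z)     # step 2: decode one possible signal
        for idx, value in enumerate(s_hat):
            total[idx] += value
    return [t / n_samples for t in total]
```

Each individual `s_hat` is one plausible solution; the returned mean is the consensus estimate.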

5 Experiments

Table 1: Comparison with baseline methods using Peak Signal-to-Noise Ratio (PSNR), higher is better. Best results for methods that do not require clean images are printed in bold, and best results overall are underlined. The dataset that contains spatially uncorrelated noise and the methods that address only unstructured noise are marked with an asterisk*. CARE is a supervised denoiser requiring examples of clean images, unlike the other baselines.

| Method | EMCCD: Conv. A | EMCCD: Conv. B | EMCCD: Mouse Actin | SCM: Mouse Nuclei* | SCM: Actin Conf. | SCM: Mito Conf. | Simulated: FFHQ Stripe | Simulated: FFHQ Checkerb. |
|---|---|---|---|---|---|---|---|---|
| HDN* | - | 37.39 | 34.12 | 36.87 | - | - | - | - |
| HDN large* | - | - | - | 38.12 | - | - | - | - |
| SN2V | 30.29 | 31.67 | 32.80 | 36.62 | 23.30 | 26.41 | 29.52 | 19.11 |
| HDN36 | 31.41 | 37.21 | - | - | 26.84 | 26.51 | 32.64 | 29.61 |
| HDN36 large | 31.80 | 37.92 | 34.14 | - | 27.17 | 26.15 | 34.54 | 25.51 |
| Ours small | 34.42 | 38.99 | 36.50 | 39.56 | 27.35 | 27.49 | 32.87 | 33.51 |
| Ours large | 37.49 | 44.10 | 39.23 | 42.98 | 27.41 | 27.50 | 35.66 | 36.27 |
| CARE | 31.56 | 36.71 | 34.20 | 36.58 | 29.44 | 27.55 | 36.46 | 36.89 |

5.1 Datasets

We tested the performance of our proposed denoiser on real noisy images captured by five different imaging modalities that commonly suffer from row-correlated noise. The first is the EMCCD sensor for which we have three fluorescence microscopy datasets with known ground truth: Convallaria A [2], Convallaria B [37] and Mouse Actin [37]. The second is Scanning Confocal Microscopy (SCM), for which we have three fluorescence microscopy datasets with known ground truth: Mouse Nuclei [37], Actin Confocal [12] and Mito Confocal [12]. Note that Mouse Nuclei contains spatially uncorrelated noise, but was included to demonstrate that the proposed method is still applicable to spatially uncorrelated noise without modification. There is also the sCMOS sensor for which we have one fluorescence dataset with unknown ground truth: Embryo [9], followed by the microbolometer for which we have one infrared imaging dataset with unknown ground truth: IR [34]. Lastly, we include a STEM dataset with unknown ground truth: STEM [15]. For details on dataset size and train/test splits, please see each original publication.

In addition, we created two datasets by corrupting the Flickr Faces HQ thumbnails dataset [18] with simulated noise. For FFHQ - Stripes, the images were corrupted by a combination of additive white Gaussian noise, Poisson shot noise and additive white Gaussian noise that had undergone a horizontal Gaussian blur. For FFHQ - Checkerboard, the images were corrupted by a combination of a vertical checkerboard pattern and Gaussian noise with an inverse signal dependence. For full details on how the simulated noise cases were created, please refer to the supplementary.
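As an illustration of the kind of corruption described for FFHQ - Stripes, the sketch below combines Poisson shot noise, additive white Gaussian noise and white Gaussian noise blurred along rows. It is not the procedure used to build the dataset, which is given in the supplementary: the standard deviations, the blur width, the clamped edge handling and all function names are assumptions chosen for illustration.

```python
import math
import random


def gaussian_kernel(sigma, radius):
    """Normalized 1D Gaussian kernel with 2 * radius + 1 taps."""
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image, kernel):
    """Convolve every row with a 1D kernel, clamping indices at the edges."""
    radius = len(kernel) // 2
    out = []
    for row in image:
        n = len(row)
        out.append([
            sum(w * row[min(max(j + k - radius, 0), n - 1)] for k, w in enumerate(kernel))
            for j in range(n)
        ])
    return out


def sample_poisson(lam, rng):
    """Knuth's algorithm; suitable for non-negative rates below a few hundred."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def add_stripe_noise(signal, rng, white_std=5.0, stripe_std=10.0, blur_sigma=3.0):
    """Corrupt a clean image (list of rows) with three noise components."""
    h, w = len(signal), len(signal[0])
    white = [[rng.gauss(0.0, stripe_std) for _ in range(w)] for _ in range(h)]
    # Blurring white noise horizontally makes it correlated along rows.
    stripes = blur_rows(white, gaussian_kernel(blur_sigma, radius=3 * int(blur_sigma)))
    return [
        [sample_poisson(signal[i][j], rng)      # signal-dependent shot noise
         + rng.gauss(0.0, white_std)            # uncorrelated Gaussian noise
         + stripes[i][j]                        # row-correlated Gaussian noise
         for j in range(w)]
        for i in range(h)
    ]
```

A call such as `add_stripe_noise(clean, random.Random(0))` returns a noisy image of the same size, with each component zero-centered around the clean signal.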

5.2 Baselines and architecture

There is a scarcity in the literature of deep learning-based denoisers that can be applied to signal-dependent row-correlated noise and do not require examples of clean images or extensive calibration data. Currently, the only methods that meet these requirements are the self-supervised denoiser Structured Noise2Void (SN2V) [2] and the unsupervised denoiser Hierarchical DivNoising36 (HDN36) [35]. We use these as baseline methods along with the supervised denoiser Content Aware Image Restoration (CARE) [44] for datasets where paired training images are available. For details on how each baseline is implemented in this paper, please refer to the supplementary.

Of the denoisers not requiring paired images, HDN36’s performance is best, but the model implemented in the baseline’s publication uses significantly fewer parameters than ours (7 million to 25 million). We therefore evaluate an additional version, termed HDN36 large, with a similar number of parameters made by increasing the number of latent dimensions from 32 to 64.

It should be noted that the Convallaria B and Mouse Actin datasets had been treated as spatially uncorrelated by Prakash et al. [35] when testing the denoiser HDN. We additionally report those results in Tab. 1. It should also be noted that the noise in the Mouse Nuclei dataset is spatially uncorrelated, so was denoised by HDN and its higher parameter version HDN large, which was made in the same way as HDN36 large.

Our method requires a choice of orientation and size for the AR decoder’s receptive field. Orientation was determined by examining the spatial autocorrelation of noise samples, with these plots reported in Fig. 3, and following the ablation study in Sec. 5.4, we always used a receptive field length of 40 pixels.
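One simple way to compare correlation along rows and columns is a lag-1 autocorrelation, sketched below. This is an illustrative simplification rather than the analysis behind Fig. 3: it looks at a single lag only, assumes a noise sample is already available (for example from a signal-free region, which is an assumption here), and assumes the sample is not constant. The function names are hypothetical.

```python
def lag1_autocorrelation(noise, axis):
    """Normalized lag-1 autocorrelation of a 2D noise sample.

    axis="row" compares horizontally adjacent pixels,
    axis="column" compares vertically adjacent pixels.
    """
    h, w = len(noise), len(noise[0])
    mean = sum(sum(row) for row in noise) / (h * w)
    var = sum((v - mean) ** 2 for row in noise for v in row) / (h * w)
    di, dj = (0, 1) if axis == "row" else (1, 0)
    products = [
        (noise[i][j] - mean) * (noise[i + di][j + dj] - mean)
        for i in range(h - di)
        for j in range(w - dj)
    ]
    return sum(products) / len(products) / var


def choose_orientation(noise):
    """Pick the axis along which neighbouring noise values are more correlated."""
    row_corr = abs(lag1_autocorrelation(noise, "row"))
    col_corr = abs(lag1_autocorrelation(noise, "column"))
    return "row" if row_corr >= col_corr else "column"
```

The returned orientation indicates whether the receptive field should span a row or a column.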

As for our model’s encoder, latent variables are produced by a fully convolutional hierarchical VAE [40], with 14 levels to its hierarchy. In addition to the full sized version, we evaluate a smaller version that requires approximately 6GB (as opposed to 20GB) of GPU memory to train. This was achieved by reducing the number of latent variables from 14 to 6 and reducing the number of latent dimensions from 64 to 32. We refer to these models as Ours small and Ours large, for the lower memory and higher memory versions respectively. Please see the supplementary materials for full architecture and training details.

5.3 Comparing denoising performance

Figure 3: Visual results from our method and the two unpaired baselines on all datasets. The spatial autocorrelation of the noise is overlaid on each noisy image, with red indicating positive correlation and blue indicating negative correlation. The direction of the correlation is given by the orientation of the autocorrelation bar. Additionally, the signal dependence of the noise in each dataset is shown in the graphs in the right-hand column. On the horizontal axis of these graphs is the clean signal intensity as a percentage of the maximum, while on the vertical axis is the variance of noisy pixel values recorded for these signal intensities. Ground truth must be used to calculate signal dependence, so orange lines are used where denoised images from our method are used as pseudo-ground truth.

Quantitative results measuring the Peak Signal-to-Noise Ratio (PSNR) in dB are reported in Tab. 1 and qualitative results are reported in Fig. 3. Note that HDN36 failed to train with the Mouse Actin and the Embryo dataset. Out of the methods that do not require paired images, Ours large achieved the highest PSNR across all datasets, even beating the supervised CARE on four of six microscopy datasets. Ours small then had the second highest PSNR for an unpaired method on all datasets except FFHQ - Stripe, where it was beaten by HDN36 large, although it should be noted that HDN36 large requires almost 20GB of GPU memory to train while Ours small requires only 6GB.
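For reference, PSNR in dB can be computed as in the sketch below. The choice of `data_range` (the peak value in the numerator) is an assumption left to the caller, since the convention used for Tab. 1 is not restated in this section; images are flat lists of pixel values for brevity.

```python
import math


def psnr(clean, denoised, data_range):
    """Peak Signal-to-Noise Ratio in dB between two equally sized images."""
    mse = sum((c - d) ** 2 for c, d in zip(clean, denoised)) / len(clean)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(data_range ** 2 / mse)
```

Because the scale is logarithmic, a gain of 3 dB corresponds to roughly halving the mean squared error.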

Turning to the qualitative results in Fig. 3, we see that Ours large denoised images from each dataset without leaving behind any artifacts, whereas HDN36 large could not remove the correlated component of the noise from the STEM or the FFHQ - Checkerboard dataset. SN2V left artifacts on all datasets with spatially correlated noise, but not for the uncorrelated Mouse Nuclei dataset. It can also be seen that Ours large produced the sharpest and most accurate images, with the exception of Mito Confocal, for which HDN36 large was sharper.

5.4 Ablation study - receptive field size


Figure 4: Here, we denoised the FFHQ - Checkerboard dataset 5 times, varying the number of pixels covered by the AR decoder’s receptive field. The PSNR of images denoised by each model was then calculated. Images show denoising results for different receptive field sizes with PSNR overlaid. Additionally, an image denoised using a full AR receptive field is included on the right along with its PSNR. In this situation, the signal decoder is given completely uninformative inputs and learns to output the mean of the entire training dataset.

As stated in Sec. 4, to model the noise correlations addressed in this paper, the receptive field of a VAE’s AR decoder must span pixels in the same row or column as the pixel being predicted. To investigate the effect of receptive field size (number of pixels), we denoised the FFHQ - Checkerboard dataset using a range of RF lengths, from 10 pixels to 120 pixels, and measured the effect on PSNR. We also include the effect of training a model with a full AR receptive field, as shown in Fig. 2b. The results are reported in Fig. 4.

The study shows that the AR decoder is able to model this spatially correlated noise, and therefore have the noise removed by the encoder, when the receptive field spans 40 pixels. Moreover, the decoder does not seem to model more of the signal as the receptive field grows. If it did, we would expect a steady drop in PSNR as denoised images lose signal content.

However, as the image on the right of Fig. 4 shows, the signal will be modeled by an AR decoder with a full receptive field. Latent variables will therefore carry no information about $\mathbf{x}$, causing the signal decoder to minimize Eq. 6 by predicting the mean of the training set. This is to be expected, as when the latent variable $\mathbf{z}$ carries no information about $\mathbf{x}$, $\mathbb{E}_{p_{\theta}(\mathbf{x}|\mathbf{z})}[\mathbf{x}]=\mathbb{E}_{p_{\theta}(\mathbf{x})}[\mathbf{x}]$.

5.5 Ablation study - noise reconstruction


Figure 5: A noisy image from the Convallaria A dataset was encoded and decoded to produce a reconstructed observation and an artificial noise sample. We examine this sampled noise to check that its spatial autocorrelation and signal dependence match those of the real noise, indicating that the AR decoder accurately models the noise.

If the VAE’s decoder is modeling only the noise component of images, encoding an image with the VAE and sampling from the AR decoder should yield an image with the same underlying signal but a different random sample of noise. If the decoder’s model of the noise is accurate, the sampled noise should exhibit the same autocorrelation and signal-dependence characteristics as the noise in the original image. An investigation into this is reported in Fig. 5, where a noisy image from the Convallaria A dataset was encoded by a trained VAE, then a latent variable was sampled and used by the VAE’s AR decoder to sample a reconstruction of the original image. The reconstructed noise exhibits spatial autocorrelation and signal dependence very similar to the real noise, indicating that our AR decoder has learnt an accurate model of the noise.
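The signal-dependence check can be sketched as a binned variance estimate. This is a simplified illustration, not the exact analysis behind Fig. 5: it assumes zero-centered noise, so the variance in each bin is the mean squared residual, and it takes a signal estimate (ground truth or a denoised image used as pseudo-ground truth) as given. The function name and the bin count are hypothetical.

```python
def noise_variance_by_intensity(signal, noisy, n_bins=10):
    """Estimate noise variance as a function of signal intensity.

    signal and noisy are flat lists of matching pixels. Returns a list of
    (bin_center, variance) tuples for every non-empty intensity bin.
    """
    lo, hi = min(signal), max(signal)
    width = (hi - lo) / n_bins or 1.0   # guard against a constant signal
    bins = [[] for _ in range(n_bins)]
    for s, x in zip(signal, noisy):
        idx = min(int((s - lo) / width), n_bins - 1)
        bins[idx].append(x - s)         # residual = noise sample
    curve = []
    for b, residuals in enumerate(bins):
        if residuals:
            variance = sum(r * r for r in residuals) / len(residuals)
            curve.append((lo + (b + 0.5) * width, variance))
    return curve
```

Applying the function once to the real noisy image and once to the AR decoder's reconstruction, with the same signal estimate, gives two curves that should overlap if the decoder's noise model is accurate.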

6 Conclusion

We proposed an unsupervised VAE-based denoising algorithm for noise that is correlated along rows or columns of pixels and is signal-dependent. By engineering the receptive field of the AR decoder, the VAE’s latent variables are encouraged to represent the signal content of an image while discarding the noise. We then presented a novel signal decoder that is trained to map this latent variable into an estimate of the clean image. The algorithm outperforms both the self-supervised denoiser SN2V [2] and the unsupervised denoiser HDN36 [35].

Our method is suited to noise with correlations that run parallel to the axes of the image. Such noise commonly occurs in microscopy. Often, users in labs are unable to find suitable unsupervised methods to remove such noise. We release our code as open source and strongly believe that the scientific imaging community will apply and adapt our methods in a variety of applications.

While we have achieved our results using simple 1-dimensional receptive fields, some imaging modalities produce noise that is correlated in multiple directions, therefore requiring a differently shaped receptive field. We found that extending the receptive field to cover both a row and column of pixels allows the decoder to model some aspects of the signal, making the technique unsuitable for removing noise correlated in two dimensions. However, we believe that techniques other than shaping the AR decoder receptive field could be discovered to limit the decoder’s modelling capabilities. We hope that future work will further improve the theoretical understanding of the method and allow us to utilize its full potential.

Acknowledgements We would like to thank Ales Leonardis (University of Birmingham) for a helpful discussion on the writing of the paper. The computations described in this paper were performed using the University of Birmingham’s BlueBEAR HPC service, which provides a High Performance Computing service to the University’s research community. See http://www.birmingham.ac.uk/bear for more details.

References

  • [1] Batson, J., Royer, L.: Noise2self: Blind denoising by self-supervision. In: International Conference on Machine Learning. pp. 524–533. PMLR (2019)
  • [2] Broaddus, C., Krull, A., Weigert, M., Schmidt, U., Myers, G.: Removing structured noise with self-supervised blind-spot networks. In: 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI). pp. 159–163. IEEE (2020)
  • [3] Buades, A., Coll, B., Morel, J.M.: A non-local algorithm for image denoising. In: 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05). vol. 2, pp. 60–65. IEEE (2005)
  • [4] Chen, X., Kingma, D.P., Salimans, T., Duan, Y., Dhariwal, P., Schulman, J., Sutskever, I., Abbeel, P.: Variational lossy autoencoder. In: International Conference on Learning Representations (2017)
  • [5] Dabov, K., Foi, A., Katkovnik, V., Egiazarian, K.: Image denoising with block-matching and 3d filtering. In: Image processing: algorithms and systems, neural networks, and machine learning. vol. 6064, pp. 354–365. SPIE (2006)
  • [6] Denvir, D.J., Conroy, E.: Electron multiplying ccds. In: Opto-Ireland 2002: Optical Metrology, Imaging, and Machine Vision. vol. 4877, pp. 55–68. SPIE (2003)
  • [7] Dupont, B., Dupret, A., Belhaire, E., Villard, P.: Fpn sources in bolometric infrared detectors. IEEE Sensors Journal 9(8), 944–952 (2009)
  • [8] Eom, M., Han, S., Park, P., Kim, G., Cho, E.S., Sim, J., Lee, K.H., Kim, S., Tian, H., Böhm, U.L., et al.: Statistically unbiased prediction enables accurate denoising of voltage imaging data. Nature Methods pp. 1–12 (2023)
  • [9] Glaser, A.K., Bishop, K.W., Barner, L.A., Susaki, E.A., Kubota, S.I., Gao, G., Serafin, R.B., Balaram, P., Turschak, E., Nicovich, P.R., et al.: A hybrid open-top light-sheet microscope for versatile multi-scale imaging of cleared tissues. Nature methods 19(5), 613–619 (2022)
  • [10] Gu, S., Zhang, L., Zuo, W., Feng, X.: Weighted nuclear norm minimization with application to image denoising. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2862–2869 (2014)
  • [11] Gulrajani, I., Kumar, K., Ahmed, F., Taiga, A.A., Visin, F., Vazquez, D., Courville, A.: Pixelvae: A latent variable model for natural images. arXiv preprint arXiv:1611.05013 (2016)
  • [12] Hagen, G.M., Bendesky, J., Machado, R., Nguyen, T.A., Kumar, T., Ventura, J.: Fluorescence microscopy datasets for training deep neural networks. GigaScience 10(5), giab032 (2021)
  • [13] Hastie, T., Tibshirani, R., Friedman, J.H.: The elements of statistical learning: data mining, inference, and prediction, pp. 18–22. Springer (2009)
  • [14] He, J., Spokoyny, D., Neubig, G., Berg-Kirkpatrick, T.: Lagging inference networks and posterior collapse in variational autoencoders. In: International Conference on Learning Representations (2019), https://openreview.net/forum?id=rylDfnCqF7
  • [15] Henninen, T.R., Bon, M., Wang, F., Passerone, D., Erni, R.: The structure of sub-nm platinum clusters at elevated temperatures. Angewandte Chemie International Edition 59(2), 839–845 (2020)
  • [16] Huang, T., Li, S., Jia, X., Lu, H., Liu, J.: Neighbor2neighbor: Self-supervised denoising from single noisy images. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 14781–14790 (2021)
  • [17] Izadi, S., Sutton, D., Hamarneh, G.: Image denoising in the deep learning era. Artificial Intelligence Review 56(7), 5929–5974 (2023)
  • [18] Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4401–4410 (2019)
  • [19] Kingma, D.P., Welling, M.: Auto-encoding variational bayes. stat 1050,  1 (2014)
  • [20] Krull, A., Basevi, H., Salmon, B., Zeug, A., Müller, F., Tonks, S., Muppala, L., Leonardis, A.: Image denoising and the generative accumulation of photons. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 1528–1537 (2024)
  • [21] Krull, A., Buchholz, T.O., Jug, F.: Noise2void-learning denoising from single noisy images. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 2129–2137 (2019)
  • [22] Krull, A., Vičar, T., Prakash, M., Lalit, M., Jug, F.: Probabilistic noise2void: Unsupervised content-aware denoising. Frontiers in Computer Science 2,  5 (2020)
  • [23] Kullback, S., Leibler, R.A.: On information and sufficiency. The annals of mathematical statistics 22(1), 79–86 (1951)
  • [24] Laine, R.F., Jacquemet, G., Krull, A.: Imaging in focus: an introduction to denoising bioimages in the era of deep learning. The international journal of biochemistry & cell biology 140, 106077 (2021)
  • [25] Laine, S., Karras, T., Lehtinen, J., Aila, T.: High-quality self-supervised deep image denoising. Advances in Neural Information Processing Systems 32 (2019)
  • [26] Lee, W., Son, S., Lee, K.M.: Ap-bsn: Self-supervised denoising for real-world images via asymmetric pd and blind-spot network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17725–17734 (2022)
  • [27] Lehtinen, J., Munkberg, J., Hasselgren, J., Laine, S., Karras, T., Aittala, M., Aila, T.: Noise2noise: Learning image restoration without clean data. In: International Conference on Machine Learning. pp. 2965–2974. PMLR (2018)
  • [28] Liao, Y.: Practical electron microscopy and database. An Online Book (2006)
  • [29] Mandracchia, B., Hua, X., Guo, C., Son, J., Urner, T., Jia, S.: Fast and accurate scmos noise correction for fluorescence microscopy. Nature communications 11(1),  94 (2020)
  • [30] Metzler, C.A., Mousavi, A., Heckel, R., Baraniuk, R.G.: Unsupervised learning with stein’s unbiased risk estimator. arXiv preprint arXiv:1805.10531 (2018)
  • [31] Moomaw, B.: Camera technologies for low light imaging: overview and relative advantages. Methods in cell biology 114, 243–283 (2013)
  • [32] Van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., et al.: Conditional image generation with pixelcnn decoders. Advances in neural information processing systems 29 (2016)
  • [33] Pennycook, S.J., Nellist, P.D.: Scanning transmission electron microscopy: imaging and analysis. Springer Science & Business Media (2011)
  • [34] Portmann, J., Lynen, S., Chli, M., Siegwart, R.: People detection and tracking from aerial thermal views. In: 2014 IEEE international conference on robotics and automation (ICRA). pp. 1794–1800. IEEE (2014)
  • [35] Prakash, M., Delbracio, M., Milanfar, P., Jug, F.: Interpretable unsupervised diversity denoising and artefact removal. In: International Conference on Learning Representations (2021)
  • [36] Prakash, M., Krull, A., Jug, F.: Fully unsupervised diversity denoising with convolutional variational autoencoders. arXiv preprint arXiv:2006.06072 (2020)
  • [37] Prakash, M., Lalit, M., Tomancak, P., Krull, A., Jug, F.: Fully unsupervised probabilistic noise2void. In: 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI). pp. 154–158. IEEE (2020)
  • [38] Quan, Y., Chen, M., Pang, T., Ji, H.: Self2self with dropout: Learning self-supervised denoising from single image. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 1890–1898 (2020)
  • [39] Salmon, B., Krull, A.: Towards structured noise models for unsupervised denoising. In: European Conference on Computer Vision. pp. 379–394. Springer (2022)
  • [40] Sønderby, C.K., Raiko, T., Maaløe, L., Sønderby, S.K., Winther, O.: Ladder variational autoencoders. Advances in neural information processing systems 29 (2016)
  • [41] Song, T.A., Yang, F., Dutta, J.: Noise2void: unsupervised denoising of pet images. Physics in Medicine & Biology 66(21), 214002 (2021)
  • [42] Wang, F., Henninen, T.R., Keller, D., Erni, R.: Noise2atom: unsupervised denoising for scanning transmission electron microscopy images. Applied Microscopy 50(1),  1–9 (2020)
  • [43] Wang, Z., Liu, J., Li, G., Han, H.: Blind2unblind: Self-supervised image denoising with visible blind spots. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2027–2036 (2022)
  • [44] Weigert, M., Schmidt, U., Boothe, T., Müller, A., Dibrov, A., Jain, A., Wilhelm, B., Schmidt, D., Broaddus, C., Culley, S., et al.: Content-aware image restoration: pushing the limits of fluorescence microscopy. Nature methods 15(12), 1090–1097 (2018)
  • [45] Xu, J., Adalsteinsson, E.: Deformed2self: Self-supervised denoising for dynamic medical imaging. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part II 24. pp. 25–35. Springer (2021)
  • [46] Zhang, Y., Zhu, Y., Nichols, E., Wang, Q., Zhang, S., Smith, C., Howard, S.: A poisson-gaussian denoising dataset with real fluorescence microscopy images. In: CVPR (2019)
  • [47] Zhang, Z., Wang, Y., Piestun, R., Huang, Z.L.: Characterizing and correcting camera noise in back-illuminated scmos cameras. Optics Express 29(5), 6668–6690 (2021)