License: CC BY 4.0
arXiv:2312.14977v1 [cs.LG] 21 Dec 2023

Diffusion Models for Generative Artificial Intelligence: An Introduction for Applied Mathematicians

Catherine F. Higham, School of Computing Science, University of Glasgow, Sir Alwyn Williams Building, Glasgow, G12 8QQ. Supported by EPSRC grant EP/T00097X/1.
Desmond J. Higham, School of Mathematics, University of Edinburgh, EH9 3FD, UK. Supported by EPSRC grant EP/V046527/1.
Peter Grindrod CBE, Mathematical Institute, University of Oxford, Oxford, United Kingdom. Supported by EPSRC grant EP/R018472/1.
Abstract

Generative artificial intelligence (AI) refers to algorithms that create synthetic but realistic output. Diffusion models currently offer state of the art performance in generative AI for images. They also form a key component in more general tools, including text-to-image generators and large language models. Diffusion models work by adding noise to the available training data and then learning how to reverse the process. The reverse operation may then be applied to new random data in order to produce new outputs. We provide a brief introduction to diffusion models for applied mathematicians and statisticians. Our key aims are (a) to present illustrative computational examples, (b) to give a careful derivation of the underlying mathematical formulas involved, and (c) to draw a connection with partial differential equation (PDE) diffusion models. We provide code for the computational experiments. We hope that this topic will be of interest to advanced undergraduate students and postgraduate students. Portions of the material may also provide useful motivational examples for those who teach courses in stochastic processes, inference, machine learning, PDEs or scientific computing.

1 Motivation

Generative artificial intelligence (AI) models are designed to create new outputs that are similar to the examples on which they were trained. Over the past decade or so, advancements in generative AI have included the development of variational autoencoders [9, 22], generative adversarial networks [19] and transformers [38]. In this work we focus on denoising diffusion probabilistic models [15]; for simplicity we use the term diffusion models. They currently represent the state of the art in image generation [7], and form a key part of more sophisticated tools such as DALL-E 2 and 3 [31]. We refer to [4, 6, 8, 29], and the references therein, for details of the historical developments that have led to the current state of the art in generative AI.

The somewhat counterintuitive, but deceptively powerful, idea behind diffusion models is to destroy the training data by adding noise. While doing so, the model learns how to reverse the process. In this way, the final model is able to start with new, easily generated, random samples and denoise them, thereby generating new, synthetic examples. The task of building and applying a simple, yet impressive, model can be described very succinctly—see Algorithms 1 and 2 in section 5. However, deriving the expressions that go into these algorithms is not so straightforward, and we believe that there is a niche for a careful and accessible mathematically-oriented treatment.

Our intended readership is advanced undergraduate students and postgraduate students in mathematics or related disciplines. The material should be suitable for independent study, and there are many directions in which this material can be followed up: the literature is rapidly expanding, and new extensions and connections are being discovered at a pace. We also hope that portions of this material will provide higher education professionals with topical and engaging examples that can be slipped into courses on stochastics, numerics, PDEs or data science.

We aimed to keep the prerequisites to a minimum; these are

  • for sections 2–5 ideas from statistics: mean, variance, Gaussian distribution, Markov chains, conditional probability,

  • for sections 4 and 5 ideas from deep learning: the stochastic gradient method, artificial neural networks,

  • for section 7 ideas from PDEs: multivariate calculus, the divergence theorem, spectral analysis.

We focus here on the task of image generation. We describe a bare bones form of diffusion model, explain carefully how the key mathematical expressions arise, and illustrate the concept via computational examples. The key reference for this article is [15], which built on [35] and received more than ten citations per day during the year 2023. We also found [25] to be a very useful resource.

In section 2 we present some pictures that give a feel for the idea of diffusion models in generative AI. We then provide details of the relevant forward and backward processes in sections 3 and 4, respectively, which leads to the algorithms presented in section 5.

We emphasize that this is a very active and fast-moving research topic with connections to many related areas. In section 6 we provide some links to the relevant literature. That section also highlights wider issues around performance evaluation, computational expense, copyright, privacy, ethics, bias, explainability and robustness.

We finish in section 7 with more speculative material that suggests a connection between stable diffusion models and deterministic PDEs, providing a link to more traditional applied mathematics.

2 Illustration

A diffusion model [15] aims to generate realistic-looking images. It works by

(i) taking an existing image and iteratively adding noise until the original information is lost,

(ii) learning how to reconstruct the original image by iteratively removing the noise.

After training, we can then use the reverse diffusion process to generate a realistic image from a new, random, starting point—remove the noise and see what emerges.

One way to conceptualize this method is to imagine an (unknown) probability distribution over the collection of all natural images. We hope to sample from this distribution; more likely images should be chosen with higher probability. We don’t have access to this magic probability distribution. Instead, we have training data; that is, examples of natural images. We also have a pseudorandom number generator that allows us to sample from a standard Gaussian distribution. In item (i) above, we are doing the easy part, mapping from the image distribution to the Gaussian distribution. In item (ii) we learn the inverse operator, mapping from the Gaussian distribution to the image distribution. This allows us to convert Gaussian samples into images.

We illustrate the idea of a diffusion model trained on images from the widely studied MNIST data set [23]. Here, each image represents a handwritten digit from the set $\{0,1,2,\ldots,9\}$. These low resolution images are black-and-white with $28\times 28$ pixels, resized by the model to $32\times 32$. Figure 1 shows a representative collection of $64$ images.

Figures 2–4 were produced with a diffusion model based on a MathWorks tutorial at

https://uk.mathworks.com/help/deeplearning/ug/generate-images-using-diffusion.html

Figure 2 illustrates the forward process that is used in the training phase. At time $t=0$ we have an MNIST image. At each integer time $t=0,1,2,\ldots,499$ Gaussian noise is added. At $t=500$ there is no visible evidence of the original image.

Figure 1: Representative set of $64$ images from the MNIST data set [23].
Figure 2: Result of forward map noising over time.

Figure 3 shows the effect of the backward process that is available after training. The top left panel displays nine randomly chosen final time $t=500$ images: pure noise matrices consisting of independent Gaussian samples. We show the effect of applying the backward, denoising process as time is reversed. At $t=0$ the model has produced new, synthetic examples that, in at least eight of the nine cases, correspond to handwritten digits. We emphasize that labels were not used in the training process. In this simple, unconditional model there is no way to control which (if any) of the $t=0$ images will resemble any particular category of digit.

Figure 3: Result of backward map denoising over time, for $9$ different random choices at $t=500$.

In Figure 4 we show the results from a larger experiment. Here we used the trained diffusion model to generate images from 500 independent time $t=500$ choices. For this figure, we separated the images into categories using an independent convolutional neural network classifier that was trained separately on real MNIST data. Since we have no control over how many of the 500 images will appear in each class, the number of synthetic outputs in each category varies considerably.

Figure 4: $500$ synthetic images generated by the diffusion model from different time $t=500$ noise samples. These have been sorted into classes by a convolutional neural network classifier that was trained separately on real MNIST data.

We finish with two experiments which illustrate that the backward, denoising process is both stochastic and unpredictable. In Figure 5 we show the images generated after applying nine independent denoising runs to the same Gaussian at $t=500$. We see that the denoising process can produce considerably different results from a single source of randomness. In Figure 6 we perform a similar experiment where the $t=500$ data emerges from the training set. On the left we show a training image undergoing the forward, noising process up to time $t=500$. On the right we show the results from nine independent denoising runs on this time $t=500$ data. We see that none of the synthetically generated images resemble the original.

Figure 5: Nine images (right) created from the same Gaussian noise at $t=500$ (left).
Figure 6: Nine images (right) created from the noisy version (left $t=500$) of one original image (left $t=0$).

3 Forwards

We begin this section with some background on Gaussian random variables; see a standard text such as [3, 27] for more details. When dealing with Gaussians, we will always consider the multivariate, isotropic case. We denote the probability density at a point $\mathbf{x}\in\mathbb{R}^{d}$ by $\mathcal{N}(\mathbf{x};\boldsymbol{\mu},\sigma^{2}\mathbf{I})$, where

$$\mathcal{N}(\mathbf{x};\boldsymbol{\mu},\sigma^{2}\mathbf{I}):=\frac{1}{(2\pi\sigma^{2})^{d/2}}\exp\left(-\frac{1}{2\sigma^{2}}(\mathbf{x}-\boldsymbol{\mu})^{T}(\mathbf{x}-\boldsymbol{\mu})\right). \tag{1}$$

Here, $\boldsymbol{\mu}\in\mathbb{R}^{d}$ is the mean, and we will refer to $\sigma^{2}$ as the variance, since the corresponding covariance matrix has the form $\sigma^{2}\mathbf{I}$, with $\mathbf{I}\in\mathbb{R}^{d\times d}$ denoting the identity matrix. Such Gaussian random variables have the important property that their sums remain Gaussian, with means and variances combining additively: the sum of two independent Gaussians with means $\boldsymbol{\mu}_{1}$ and $\boldsymbol{\mu}_{2}$ and variances $\sigma_{1}^{2}$ and $\sigma_{2}^{2}$ is a Gaussian random variable with mean $\boldsymbol{\mu}_{1}+\boldsymbol{\mu}_{2}$ and variance $\sigma_{1}^{2}+\sigma_{2}^{2}$. The term standard Gaussian refers to the case where the mean is $\mathbf{0}\in\mathbb{R}^{d}$ and the variance is $1$.
Multiplying a standard Gaussian by the scalar $\sigma$ and shifting by $\boldsymbol{\mu}\in\mathbb{R}^{d}$ produces a Gaussian with mean $\boldsymbol{\mu}$ and variance $\sigma^{2}$. It follows that if $\mathbf{y}$ and $\mathbf{z}$ are independent standard Gaussians, and $a$ and $b$ are scalars, then $a\mathbf{y}+b\mathbf{z}$ is Gaussian with mean zero and variance $a^{2}+b^{2}$; so $a\mathbf{y}+b\mathbf{z}$ can be sampled as $\sqrt{a^{2}+b^{2}}\,\boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon}$ is a standard Gaussian.
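The additivity property can be checked empirically. The following is a minimal sketch, not taken from the original experiments, that uses only the Python standard library and scalar ($d=1$) samples; the helper names `sample_combination` and `mean_and_variance`, the constants $a=0.6$, $b=1.5$ and the sample size are illustrative assumptions.

```python
import math
import random

def sample_combination(a, b, n, rng):
    """Draw n samples of a*y + b*z for independent standard Gaussians y and z."""
    return [a * rng.gauss(0.0, 1.0) + b * rng.gauss(0.0, 1.0) for _ in range(n)]

def mean_and_variance(samples):
    """Return the empirical mean and (population) variance of a list of floats."""
    n = len(samples)
    mean = math.fsum(samples) / n
    variance = math.fsum((s - mean) ** 2 for s in samples) / n
    return mean, variance

rng = random.Random(0)           # fixed seed so the run is repeatable
a, b, n = 0.6, 1.5, 200_000

# Two-Gaussian version: a*y + b*z
combined = sample_combination(a, b, n, rng)
# Single-Gaussian version: sqrt(a^2 + b^2) * eps
single = [math.sqrt(a * a + b * b) * rng.gauss(0.0, 1.0) for _ in range(n)]

target_variance = a * a + b * b
mean_c, var_c = mean_and_variance(combined)
mean_s, var_s = mean_and_variance(single)

print("target variance:", target_variance)
print("a*y + b*z      :", mean_c, var_c)
print("sqrt(a^2+b^2)*e:", mean_s, var_s)
```

Both empirical variances should agree with $a^{2}+b^{2}$ up to sampling error, which is the fact used repeatedly when noise terms are merged later in this section.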

We consider images that can be described by $d$ real numbers, typically pixel values, and we collect these into a vector in $\mathbb{R}^{d}$. In practice pixel values might be constrained (for example, only integers between $0$ and $255$ might be allowed), but we ignore this issue here for simplicity.

Given an image $\mathbf{x}_{0}\in\mathbb{R}^{d}$, the forward process iteratively adds noise to create a sequence $\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{T}$ according to the rule

$$\mathbf{x}_{t}=\sqrt{1-\beta_{t}}\,\mathbf{x}_{t-1}+\sqrt{\beta_{t}}\,\boldsymbol{\epsilon}_{t}. \tag{2}$$

Here, each $\boldsymbol{\epsilon}_{t}$ is an independent standard Gaussian and the scalar parameter $\beta_{t}$ is between zero and one. The sequence $\beta_{1},\beta_{2},\ldots,\beta_{T}$, known as the variance schedule, is predetermined. For example, in [15], linearly increasing values from $\beta_{1}=10^{-4}$ to $\beta_{T}=0.02$ are used. Since $\beta_{t}$ here is increasing, more noise is added as the forward process evolves. It is useful to think of $t$ as a time-like variable. At time zero we have an image and at time $T$ we effectively have pure Gaussian noise.
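As a minimal sketch of the update (2) under a linear variance schedule with the end points quoted from [15], the following standard-library Python code noises a small vector step by step. The helper names `linear_schedule` and `forward_step`, the choice $T=500$ (matching the illustrations in section 2) and the three-component toy "image" are assumptions made for illustration; a practical implementation would operate on arrays of pixel values.

```python
import math
import random

def linear_schedule(beta_1, beta_T, T):
    """Return [beta_1, ..., beta_T], equally spaced (index 0 holds beta_1)."""
    if T == 1:
        return [beta_1]
    step = (beta_T - beta_1) / (T - 1)
    return [beta_1 + i * step for i in range(T)]

def forward_step(x_prev, beta_t, rng):
    """One application of (2): scale x_{t-1} and add fresh Gaussian noise."""
    scale = math.sqrt(1.0 - beta_t)
    noise_scale = math.sqrt(beta_t)
    return [scale * v + noise_scale * rng.gauss(0.0, 1.0) for v in x_prev]

rng = random.Random(1)          # fixed seed so the run is repeatable
T = 500                         # assumed number of steps
betas = linear_schedule(1e-4, 0.02, T)

x = [0.0, 0.5, 1.0]             # toy "image" with d = 3 pixel values
for beta_t in betas:            # t = 1, 2, ..., T
    x = forward_step(x, beta_t, rng)

print(betas[0], betas[-1])      # first and last entries of the schedule
print(x)                        # after T steps: dominated by accumulated noise
```

Each call to `forward_step` draws new noise, so repeated runs from the same starting vector with different seeds give different final values.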

The process (2) defines a discrete time Markov process, and the associated transition density may be written

$$q(\mathbf{x}_{t}\,|\,\mathbf{x}_{t-1})=\mathcal{N}(\mathbf{x}_{t};\sqrt{1-\beta_{t}}\,\mathbf{x}_{t-1},\beta_{t}\mathbf{I}). \tag{3}$$

This quantifies the probability of observing $\mathbf{x}_{t}$ at time $t$, given $\mathbf{x}_{t-1}$ at time $t-1$.
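To make (3) concrete, the sketch below evaluates the logarithm of the transition density, using the usual normalizing constant $(2\pi\beta_{t})^{-d/2}$ for an isotropic Gaussian with covariance $\beta_{t}\mathbf{I}$. Working with the logarithm is a choice made here to avoid underflow when $d$ is large; the function name `log_transition_density` and the test vectors are hypothetical.

```python
import math

def log_transition_density(x_t, x_prev, beta_t):
    """log q(x_t | x_{t-1}) for the Gaussian in (3), with covariance beta_t * I."""
    d = len(x_t)
    scale = math.sqrt(1.0 - beta_t)
    sq_dist = sum((a - scale * b) ** 2 for a, b in zip(x_t, x_prev))
    return -0.5 * d * math.log(2.0 * math.pi * beta_t) - sq_dist / (2.0 * beta_t)

x_prev = [0.2, -0.1, 0.7]        # assumed previous state, d = 3
beta_t = 0.01

# The conditional mean of x_t is sqrt(1 - beta_t) * x_prev.
mean = [math.sqrt(1.0 - beta_t) * v for v in x_prev]
nearby = [m + 0.05 for m in mean]
far = [m + 0.5 for m in mean]

log_q_mean = log_transition_density(mean, x_prev, beta_t)
log_q_nearby = log_transition_density(nearby, x_prev, beta_t)
log_q_far = log_transition_density(far, x_prev, beta_t)

print(log_q_mean)     # largest value: evaluated at the conditional mean
print(log_q_nearby)   # slightly smaller
print(log_q_far)      # much smaller: far from the conditional mean
```

Because $\beta_{t}$ is small, the density is sharply concentrated around $\sqrt{1-\beta_{t}}\,\mathbf{x}_{t-1}$, which is why a single forward step changes the image only slightly.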

Updating over one time step in the forward process (2) is straightforward; just scale the current value and add Gaussian noise. For later use, it is helpful to know that stepping from time zero to a general time $t$ is possible with a single leap. To see this, we introduce $\alpha_{t}=1-\beta_{t}$ so that

$$\mathbf{x}_{t}=\sqrt{\alpha_{t}}\,\mathbf{x}_{t-1}+\sqrt{1-\alpha_{t}}\,\boldsymbol{\epsilon}_{t}. \tag{4}$$

Then, applying (4) again, we have

$$\begin{aligned}
\mathbf{x}_{t} &= \sqrt{\alpha_{t}}\left(\sqrt{\alpha_{t-1}}\,\mathbf{x}_{t-2}+\sqrt{1-\alpha_{t-1}}\,\boldsymbol{\epsilon}_{t-1}\right)+\sqrt{1-\alpha_{t}}\,\boldsymbol{\epsilon}_{t}\\
&= \sqrt{\alpha_{t}\alpha_{t-1}}\,\mathbf{x}_{t-2}+\sqrt{\alpha_{t}}\sqrt{1-\alpha_{t-1}}\,\boldsymbol{\epsilon}_{t-1}+\sqrt{1-\alpha_{t}}\,\boldsymbol{\epsilon}_{t}.
\end{aligned} \tag{5}$$

Using the properties of Gaussians mentioned at the start of this section, we see that $\sqrt{\alpha_{t}}\sqrt{1-\alpha_{t-1}}\,\boldsymbol{\epsilon}_{t-1}+\sqrt{1-\alpha_{t}}\,\boldsymbol{\epsilon}_{t}$ can be combined into a single Gaussian. In this way, (5) may be written

$$\mathbf{x}_{t}=\sqrt{\alpha_{t}\alpha_{t-1}}\,\mathbf{x}_{t-2}+\sqrt{1-\alpha_{t}\alpha_{t-1}}\,\boldsymbol{\epsilon}_{t,t-2},$$

where $\boldsymbol{\epsilon}_{t,t-2}$ is a standard Gaussian.
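For completeness, the variance calculation behind this combination can be written out. The two noise terms are independent with mean zero, so their sum is Gaussian with mean zero and variance equal to the sum of the squared coefficients:

```latex
\left(\sqrt{\alpha_{t}}\sqrt{1-\alpha_{t-1}}\right)^{2}+\left(\sqrt{1-\alpha_{t}}\right)^{2}
  = \alpha_{t}-\alpha_{t}\alpha_{t-1}+1-\alpha_{t}
  = 1-\alpha_{t}\alpha_{t-1}.
```

This is the quantity that appears under the second square root in the two-step formula.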

Proceeding inductively, suppose that for some $k$ between $t-2$ and $1$

$$\mathbf{x}_{t}=\sqrt{\alpha_{t}\alpha_{t-1}\ldots\alpha_{k+1}}\,\mathbf{x}_{k}+\sqrt{1-\alpha_{t}\alpha_{t-1}\ldots\alpha_{k+1}}\,\boldsymbol{\epsilon}_{t,k}, \tag{6}$$

where $\boldsymbol{\epsilon}_{t,k}$ is a standard Gaussian. Then replacing $\mathbf{x}_{k}$ using (4)

$$\mathbf{x}_{t}=\sqrt{\alpha_{t}\alpha_{t-1}\ldots\alpha_{k+1}}\,\left(\sqrt{\alpha_{k}}\,\mathbf{x}_{k-1}+\sqrt{1-\alpha_{k}}\,\boldsymbol{\epsilon}_{k}\right)+\sqrt{1-\alpha_{t}\alpha_{t-1}\ldots\alpha_{k+1}}\,\boldsymbol{\epsilon}_{t,k}.$$

Again replacing the sum of two independent Gaussians by a single, appropriate Gaussian, we have

$$\begin{aligned}
\mathbf{x}_{t} &= \sqrt{\alpha_{t}\alpha_{t-1}\ldots\alpha_{k}}\,\mathbf{x}_{k-1}+\sqrt{\alpha_{t}\alpha_{t-1}\ldots\alpha_{k+1}(1-\alpha_{k})+1-\alpha_{t}\alpha_{t-1}\ldots\alpha_{k+1}}\,\boldsymbol{\epsilon}_{t,k-1}\\
&= \sqrt{\alpha_{t}\alpha_{t-1}\ldots\alpha_{k}}\,\mathbf{x}_{k-1}+\sqrt{1-\alpha_{t}\alpha_{t-1}\ldots\alpha_{k}}\,\boldsymbol{\epsilon}_{t,k-1},
\end{aligned}$$

where $\boldsymbol{\epsilon}_{t,k-1}$ is a standard Gaussian. Hence, the form (6) is valid all the way down to $k=0$. So, letting

$$\overline{\alpha}_{t}=\prod_{i=1}^{t}\alpha_{i}, \tag{7}$$

we may write

$$\mathbf{x}_{t}=\sqrt{\overline{\alpha}_{t}}\,\mathbf{x}_{0}+\sqrt{1-\overline{\alpha}_{t}}\,\bar{\boldsymbol{\epsilon}}_{t}, \tag{8}$$

where $\bar{\boldsymbol{\epsilon}}_{t}$ is a standard Gaussian. We may therefore step directly from time $0$ to any later time $t$ using a single Gaussian. This proves convenient for the analysis in section 4 and also for the training algorithm discussed in section 5.
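The single-leap formula (8) can be compared with repeated use of the one-step rule (2). The sketch below works with a scalar $x_{0}$ for simplicity and assumes a linear schedule with $T=500$; the function names, the starting value $x_{0}=2$, the time $t=50$ and the number of runs are all illustrative choices rather than settings from the original experiments.

```python
import math
import random

def cumulative_alphas(betas):
    """Return [alpha_bar_1, ..., alpha_bar_T], alpha_bar_t = prod_{i<=t} (1 - beta_i)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def sample_direct(x0, t, alpha_bars, rng):
    """Scalar x_t from x_0 in a single leap, as in (8); t runs from 1 to T."""
    a_bar = alpha_bars[t - 1]
    return math.sqrt(a_bar) * x0 + math.sqrt(1.0 - a_bar) * rng.gauss(0.0, 1.0)

def sample_stepwise(x0, t, betas, rng):
    """Scalar x_t from x_0 by applying the one-step rule (2) t times."""
    x = x0
    for beta in betas[:t]:
        x = math.sqrt(1.0 - beta) * x + math.sqrt(beta) * rng.gauss(0.0, 1.0)
    return x

def mean_and_variance(samples):
    """Return the empirical mean and (population) variance of a list of floats."""
    n = len(samples)
    mean = math.fsum(samples) / n
    return mean, math.fsum((s - mean) ** 2 for s in samples) / n

T = 500                                                    # assumed number of steps
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]
alpha_bars = cumulative_alphas(betas)

rng = random.Random(2)                                     # fixed seed
x0, t, n_runs = 2.0, 50, 20_000
direct_stats = mean_and_variance(
    [sample_direct(x0, t, alpha_bars, rng) for _ in range(n_runs)])
stepwise_stats = mean_and_variance(
    [sample_stepwise(x0, t, betas, rng) for _ in range(n_runs)])
predicted = (math.sqrt(alpha_bars[t - 1]) * x0, 1.0 - alpha_bars[t - 1])

print("predicted (mean, variance):", predicted)
print("single leap               :", direct_stats)
print("step by step              :", stepwise_stats)
print("alpha_bar_T               :", alpha_bars[-1])       # small: little signal left
```

The two sampling routes should give matching statistics up to sampling error, while the single leap needs one Gaussian draw instead of $t$.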

In terms of a transition density, (8) shows that

$$q(\mathbf{x}_{t}\,|\,\mathbf{x}_{0}):=\mathcal{N}(\mathbf{x}_{t};\sqrt{\overline{\alpha}_{t}}\,\mathbf{x}_{0},(1-\overline{\alpha}_{t})\mathbf{I}). \tag{9}$$

4 Backwards

We now consider the reverse process. We are interested in the probability of $\mathbf{x}_{t-1}$ given $\mathbf{x}_{t}$ and $\mathbf{x}_{0}$; that is, $q(\mathbf{x}_{t-1}\,|\,\mathbf{x}_{t},\mathbf{x}_{0})$. To proceed we will make use of a result in conditional probability theory known as the product rule, [3, 27], which for our purposes may be written

$$P(A,B,C)=P(A\,|\,B,C)\,P(B,C)=P(A\,|\,B,C)\,P(B\,|\,C)\,P(C).$$

By symmetry, we also have

$$P(A,B,C)=P(B,A,C)=P(B\,|\,A,C)\,P(A,C)=P(B\,|\,A,C)\,P(A\,|\,C)\,P(C).$$

Hence,

P(A|B,C)=P(B|A,C)P(A|C)P(B|C).𝑃conditional𝐴𝐵𝐶𝑃conditional𝐵𝐴𝐶𝑃conditional𝐴𝐶𝑃conditional𝐵𝐶P(A\,|\,B,C)=\frac{P(B\,|\,A,C)\,P(A\,|\,C)}{P(B\,|\,C)}.italic_P ( italic_A | italic_B , italic_C ) = divide start_ARG italic_P ( italic_B | italic_A , italic_C ) italic_P ( italic_A | italic_C ) end_ARG start_ARG italic_P ( italic_B | italic_C ) end_ARG .
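The conditional form of Bayes' rule just derived can be checked numerically. The following is a minimal sketch, assuming a hypothetical random joint distribution over three binary events stored as an array `p[a, b, c]`; the array and variable names are illustrative and do not come from the paper.

```python
import numpy as np

# Hypothetical joint distribution over three binary events, stored as p[a, b, c].
rng = np.random.default_rng(0)
p = rng.random((2, 2, 2))
p /= p.sum()  # normalize so the entries sum to one

p_c = p.sum(axis=(0, 1))   # P(C)
p_ac = p.sum(axis=1)       # P(A, C), indexed [a, c]
p_bc = p.sum(axis=0)       # P(B, C), indexed [b, c]

a_given_bc = p / p_bc[None, :, :]   # P(A | B, C)
b_given_ac = p / p_ac[:, None, :]   # P(B | A, C)
a_given_c = p_ac / p_c[None, :]     # P(A | C), indexed [a, c]
b_given_c = p_bc / p_c[None, :]     # P(B | C), indexed [b, c]

# Right-hand side of the conditional Bayes rule, broadcast to shape [a, b, c].
rhs = b_given_ac * a_given_c[:, None, :] / b_given_c[None, :, :]
bayes_gap = np.max(np.abs(a_given_bc - rhs))
print(bayes_gap)  # largest discrepancy between the two sides
```

The printed gap should be at the level of floating-point rounding, since both sides reduce to $P(A,B,C)/P(B,C)$.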

We will use this in the form

$$q(\mathbf{x}_{t-1}\,|\,\mathbf{x}_{t},\mathbf{x}_{0}) = \frac{q(\mathbf{x}_{t}\,|\,\mathbf{x}_{t-1},\mathbf{x}_{0})\,q(\mathbf{x}_{t-1}\,|\,\mathbf{x}_{0})}{q(\mathbf{x}_{t}\,|\,\mathbf{x}_{0})}. \tag{10}$$

So now we focus on the quantities appearing on the right hand side of (10).

By the Markovian nature of the forward process, from (3),

$$q(\mathbf{x}_{t}\,|\,\mathbf{x}_{t-1},\mathbf{x}_{0}) = q(\mathbf{x}_{t}\,|\,\mathbf{x}_{t-1}) = \mathcal{N}\left(\mathbf{x}_{t};\sqrt{\alpha_{t}}\,\mathbf{x}_{t-1},(1-\alpha_{t})\mathbf{I}\right). \tag{11}$$

Making use of (9) for $\mathbf{x}_{t}$ and $\mathbf{x}_{t-1}$, we then see that

$$q(\mathbf{x}_{t-1}\,|\,\mathbf{x}_{t},\mathbf{x}_{0}) = \frac{\mathcal{N}\left(\mathbf{x}_{t};\sqrt{\alpha_{t}}\,\mathbf{x}_{t-1},(1-\alpha_{t})\mathbf{I}\right)\,\mathcal{N}\left(\mathbf{x}_{t-1};\sqrt{\overline{\alpha}_{t-1}}\,\mathbf{x}_{0},(1-\overline{\alpha}_{t-1})\mathbf{I}\right)}{\mathcal{N}\left(\mathbf{x}_{t};\sqrt{\overline{\alpha}_{t}}\,\mathbf{x}_{0},(1-\overline{\alpha}_{t})\mathbf{I}\right)}. \tag{12}$$

From the definition (1), and ignoring the normalizing constants, we see that this expression has the form

$$\begin{aligned} \exp\Bigl( & -\tfrac{1}{2}\frac{(\mathbf{x}_{t}-\sqrt{\alpha_{t}}\,\mathbf{x}_{t-1})^{T}(\mathbf{x}_{t}-\sqrt{\alpha_{t}}\,\mathbf{x}_{t-1})}{1-\alpha_{t}} - \tfrac{1}{2}\frac{(\mathbf{x}_{t-1}-\sqrt{\overline{\alpha}_{t-1}}\,\mathbf{x}_{0})^{T}(\mathbf{x}_{t-1}-\sqrt{\overline{\alpha}_{t-1}}\,\mathbf{x}_{0})}{1-\overline{\alpha}_{t-1}} \\ & + \tfrac{1}{2}\frac{(\mathbf{x}_{t}-\sqrt{\overline{\alpha}_{t}}\,\mathbf{x}_{0})^{T}(\mathbf{x}_{t}-\sqrt{\overline{\alpha}_{t}}\,\mathbf{x}_{0})}{1-\overline{\alpha}_{t}} \Bigr). \end{aligned} \tag{13}$$

We will show that this expression matches

$$\mathcal{N}\left(\mathbf{x}_{t-1};\mu_{q}(\mathbf{x}_{t},\mathbf{x}_{0}),\sigma^{2}_{q}(t)\mathbf{I}\right), \tag{14}$$

for appropriate $\mu_{q}(\mathbf{x}_{t},\mathbf{x}_{0})$ and $\sigma^{2}_{q}(t)$. From (1), we can find $\sigma^{2}_{q}(t)$ by considering the coefficient of $-\mathbf{x}_{t-1}^{T}\mathbf{x}_{t-1}$ in the exponent of (13). This coefficient is given by

$$\tfrac{1}{2}\frac{\alpha_{t}}{1-\alpha_{t}} + \tfrac{1}{2}\frac{1}{1-\overline{\alpha}_{t-1}} = \tfrac{1}{2}\frac{\alpha_{t}(1-\overline{\alpha}_{t-1})+1-\alpha_{t}}{(1-\alpha_{t})(1-\overline{\alpha}_{t-1})} = \tfrac{1}{2}\left(\frac{1-\overline{\alpha}_{t}}{(1-\alpha_{t})(1-\overline{\alpha}_{t-1})}\right),$$

where we used $\alpha_{t}\,\overline{\alpha}_{t-1}=\overline{\alpha}_{t}$ from (7). Hence,

$$\sigma^{2}_{q}(t) = \frac{(1-\alpha_{t})(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}}. \tag{15}$$

Using the functional form (1) again, we can find $\mu_{q}(\mathbf{x}_{t},\mathbf{x}_{0})$ by considering the vector, say $\mathbf{v}$, such that $\mathbf{x}_{t-1}^{T}\mathbf{v}$ is the cross term, linear in $\mathbf{x}_{t-1}$, in the exponent of (13). We see that

$$\frac{\mu_{q}(\mathbf{x}_{t},\mathbf{x}_{0})}{\sigma^{2}_{q}(t)} = \mathbf{v} = \frac{\sqrt{\alpha_{t}}\,\mathbf{x}_{t}}{1-\alpha_{t}} + \frac{\sqrt{\overline{\alpha}_{t-1}}\,\mathbf{x}_{0}}{1-\overline{\alpha}_{t-1}}.$$

Hence, using (15),

$$\mu_{q}(\mathbf{x}_{t},\mathbf{x}_{0}) = \frac{\sqrt{\alpha_{t}}\,(1-\overline{\alpha}_{t-1})\,\mathbf{x}_{t} + \sqrt{\overline{\alpha}_{t-1}}\,(1-\alpha_{t})\,\mathbf{x}_{0}}{1-\overline{\alpha}_{t}}. \tag{16}$$

We wish to compute a sample from the distribution in (14). This will allow us to perform the required transition along the backwards process. Our approach is to estimate the mean in (14) and then shift with an appropriate Gaussian in order to match the required variance.
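Formulas (15) and (16) can be cross-checked against the textbook rule for conditioning a bivariate Gaussian, applied one coordinate at a time. The sketch below assumes a linear schedule $\beta_t$ from $10^{-4}$ to $0.02$ with $T=1000$ and $\alpha_t = 1-\beta_t$; this schedule is not specified in this section, and the function names are illustrative.

```python
import numpy as np

# Assumed noise schedule (not specified in this section): linear beta_t, T = 1000.
T = 1000
beta = np.linspace(1e-4, 0.02, T)
alpha = 1.0 - beta
# alpha_bar[t] = alpha_1 * ... * alpha_t, with alpha_bar[0] = 1 (empty product).
alpha_bar = np.concatenate(([1.0], np.cumprod(alpha)))

def posterior_params(x_t, x_0, t):
    """Mean (16) and variance (15) of q(x_{t-1} | x_t, x_0), for 1 <= t <= T."""
    a_t, ab_t, ab_prev = alpha[t - 1], alpha_bar[t], alpha_bar[t - 1]
    var = (1.0 - a_t) * (1.0 - ab_prev) / (1.0 - ab_t)
    mean = (np.sqrt(a_t) * (1.0 - ab_prev) * x_t
            + np.sqrt(ab_prev) * (1.0 - a_t) * x_0) / (1.0 - ab_t)
    return mean, var

def conditioning_params(x_t, x_0, t):
    """Same quantities from the standard bivariate Gaussian conditioning formula."""
    a_t, ab_t, ab_prev = alpha[t - 1], alpha_bar[t], alpha_bar[t - 1]
    cov = np.sqrt(a_t) * (1.0 - ab_prev)   # Cov(x_{t-1}, x_t | x_0)
    var_t = 1.0 - ab_t                     # Var(x_t | x_0), from (9)
    mean = np.sqrt(ab_prev) * x_0 + cov / var_t * (x_t - np.sqrt(ab_t) * x_0)
    var = (1.0 - ab_prev) - cov**2 / var_t
    return mean, var

m1, v1 = posterior_params(x_t=0.3, x_0=-1.2, t=500)
m2, v2 = conditioning_params(x_t=0.3, x_0=-1.2, t=500)
print(m1 - m2, v1 - v2)  # both differences should be at rounding level
```

Because the covariance in (14) is a multiple of the identity, the scalar check applies to each pixel independently.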

If we know $\mathbf{x}_{t}$ and $\bar{\boldsymbol{\epsilon}}_{t}$ in (8) then we may write

$$\mathbf{x}_{0} = \frac{\mathbf{x}_{t}-\sqrt{1-\overline{\alpha}_{t}}\,\bar{\boldsymbol{\epsilon}}_{t}}{\sqrt{\overline{\alpha}_{t}}}.$$

Substituting this expression for $\mathbf{x}_{0}$ into (16) we see that the mean of $\mathbf{x}_{t-1}$, given $\mathbf{x}_{t}$ and $\mathbf{x}_{0}$, takes the form

$$\mu_{q}(\mathbf{x}_{t},\mathbf{x}_{0}) = \frac{\sqrt{\alpha_{t}}\,(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}}\,\mathbf{x}_{t} + \frac{\sqrt{\overline{\alpha}_{t-1}}\,(1-\alpha_{t})}{(1-\overline{\alpha}_{t})\sqrt{\overline{\alpha}_{t}}}\,\mathbf{x}_{t} - \frac{\sqrt{\overline{\alpha}_{t-1}}\,(1-\alpha_{t})\sqrt{1-\overline{\alpha}_{t}}}{(1-\overline{\alpha}_{t})\sqrt{\overline{\alpha}_{t}}}\,\bar{\boldsymbol{\epsilon}}_{t}. \tag{17}$$

Noting from (7) that $\overline{\alpha}_{t-1}/\overline{\alpha}_{t}=1/\alpha_{t}$ and $\alpha_{t}\times\overline{\alpha}_{t-1}=\overline{\alpha}_{t}$, we find that in (17) the coefficient of $\mathbf{x}_{t}$ simplifies as follows:

$$\frac{\sqrt{\alpha_{t}}\,(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}} + \frac{\sqrt{\overline{\alpha}_{t-1}}\,(1-\alpha_{t})}{(1-\overline{\alpha}_{t})\sqrt{\overline{\alpha}_{t}}} = \frac{1}{\sqrt{\alpha_{t}}\,(1-\overline{\alpha}_{t})}\left(\alpha_{t}(1-\overline{\alpha}_{t-1})+1-\alpha_{t}\right) = \frac{1}{\sqrt{\alpha_{t}}}.$$

Similarly, the coefficient of $\bar{\boldsymbol{\epsilon}}_{t}$ in (17) simplifies to

$$-\frac{\sqrt{\overline{\alpha}_{t-1}}\,(1-\alpha_{t})\sqrt{1-\overline{\alpha}_{t}}}{(1-\overline{\alpha}_{t})\sqrt{\overline{\alpha}_{t}}} = -\frac{1-\alpha_{t}}{\sqrt{\alpha_{t}}\sqrt{1-\overline{\alpha}_{t}}}.$$

Hence, (17) may be written

$$\mu_{q}(\mathbf{x}_{t},\mathbf{x}_{0}) = \frac{1}{\sqrt{\alpha_{t}}}\left(\mathbf{x}_{t}-\frac{1-\alpha_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\,\bar{\boldsymbol{\epsilon}}_{t}\right). \tag{18}$$
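The equivalence of (16) and (18) can be confirmed numerically by generating $\mathbf{x}_{t}$ from a known $\mathbf{x}_{0}$ and known noise. This is a minimal sketch, assuming a linear $\beta_t$ schedule with $\alpha_t = 1-\beta_t$ and $T=1000$ (not specified in this section) and a made-up four-pixel "image".

```python
import numpy as np

rng = np.random.default_rng(1)
T = 1000
alpha = 1.0 - np.linspace(1e-4, 0.02, T)                 # assumed schedule
alpha_bar = np.concatenate(([1.0], np.cumprod(alpha)))   # alpha_bar[0] = 1

t = 250
a_t, ab_t, ab_prev = alpha[t - 1], alpha_bar[t], alpha_bar[t - 1]

x_0 = rng.standard_normal(4)   # stand-in "image" with 4 pixels
eps = rng.standard_normal(4)   # noise playing the role of eps_bar_t
x_t = np.sqrt(ab_t) * x_0 + np.sqrt(1.0 - ab_t) * eps    # forward sample, as in (8)

# Mean written in terms of x_t and x_0, equation (16).
mu_16 = (np.sqrt(a_t) * (1.0 - ab_prev) * x_t
         + np.sqrt(ab_prev) * (1.0 - a_t) * x_0) / (1.0 - ab_t)
# Mean written in terms of x_t and the noise, equation (18).
mu_18 = (x_t - (1.0 - a_t) / np.sqrt(1.0 - ab_t) * eps) / np.sqrt(a_t)

mean_gap = np.max(np.abs(mu_16 - mu_18))
print(mean_gap)  # largest componentwise difference between the two forms
```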

The missing ingredient here is $\bar{\boldsymbol{\epsilon}}_{t}$, the noise that drove the transition from $\mathbf{x}_{0}$ to $\mathbf{x}_{t}$. To deal with this we will train a neural network to predict $\bar{\boldsymbol{\epsilon}}_{t}$. After training, the network will be a black box which takes as input

  • a value of $t$ and a noisy image $\mathbf{x}_{t}$

and returns

  • a prediction of $\bar{\boldsymbol{\epsilon}}_{t}$.

We will denote the prediction by the function $\boldsymbol{\epsilon}_{\theta}$, where $\theta$ represents the parameters in the neural network; these will be learned during the training phase. In each training step, we select an image $\mathbf{x}_{0}$ from the training set, take a Gaussian $\bar{\boldsymbol{\epsilon}}_{t}$ and form a sample of $\mathbf{x}_{t}$ using (8). The job of the network is to make the output $\boldsymbol{\epsilon}_{\theta}$ as close as possible to $\bar{\boldsymbol{\epsilon}}_{t}$.

Recalling the expression (14) for the required transition density, using the neural network prediction $\boldsymbol{\epsilon}_{\theta}$ in the expression (18) for the mean, and adjusting the variance using (15), we will obtain $\mathbf{x}_{t-1}$ from

$$\mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_{t}}}\left(\mathbf{x}_{t}-\frac{1-\alpha_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\,\boldsymbol{\epsilon}_{\theta}\right) + \sigma_{q}(t)\,\mathbf{z}, \tag{19}$$

where $\mathbf{z}$ is a standard Gaussian. This allows us to run the denoising process from $t=T$ to $t=0$.

Having set up the required expressions, in the next section we outline the resulting training and sampling algorithms.

5 Algorithms

The training process is summarized in Algorithm 1. Here we are applying a basic stochastic gradient method [13]; in step 5 the network parameters are updated using a least-squares loss function applied to a single, randomly chosen training image. This simple least-squares formulation can be justified from a likelihood perspective [4, 15, 16, 34]. The network architecture used for the experiments in section 2 combines residual and attention blocks in a U-Net [33] type structure, motivated by the choice in [15]. Overall, that network has 12.9 million parameters across 205 layers.

Algorithm 1 Training with the forward process [15]
1: repeat
2:     $\mathbf{x}_{0}\sim q(\mathbf{x}_{0})$  ▷ choose an image from training set
3:     $t\sim\mathrm{Uniform}(\{1,2,\ldots,T\})$
4:     $\boldsymbol{\epsilon}\sim\mathrm{N}(\mathbf{0},\mathbf{I})$  ▷ standard Gaussian sample
5:     Take gradient step w.r.t. $\theta$ on $\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_{\theta}(\sqrt{\overline{\alpha}_{t}}\,\mathbf{x}_{0}+\sqrt{1-\overline{\alpha}_{t}}\,\boldsymbol{\epsilon},t)\|_{2}^{2}$
6: until converged
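Steps 2 to 5 of Algorithm 1 can be sketched in a few lines. The code below only evaluates the step-5 objective for one random draw; the gradient step itself would normally be delegated to an automatic differentiation framework and is omitted. The schedule (linear $\beta_t$, $T=1000$), the function name `training_loss`, the stand-in `zero_model` and the fake 8x8 images are all assumptions made for illustration.

```python
import numpy as np

T = 1000
alpha = 1.0 - np.linspace(1e-4, 0.02, T)   # assumed schedule, alpha_t = 1 - beta_t
alpha_bar = np.cumprod(alpha)              # alpha_bar[t - 1] holds alpha_bar_t

def training_loss(eps_model, training_set, rng):
    """Evaluate the step-5 objective for one random draw (steps 2 to 4)."""
    x_0 = training_set[rng.integers(len(training_set))]    # step 2: pick an image
    t = int(rng.integers(1, T + 1))                         # step 3: t in {1, ..., T}
    eps = rng.standard_normal(x_0.shape)                    # step 4: Gaussian noise
    ab_t = alpha_bar[t - 1]
    x_t = np.sqrt(ab_t) * x_0 + np.sqrt(1.0 - ab_t) * eps   # noisy image
    residual = eps - eps_model(x_t, t)
    return float(np.sum(residual**2))                       # squared 2-norm

# Hypothetical stand-in for the network: always predicts zero noise.
def zero_model(x_t, t):
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
images = rng.random((10, 8, 8))            # ten fake 8x8 "images"
loss = training_loss(zero_model, images, rng)
print(loss)
```

Passing the model as a callable with signature `(x_t, t)` mirrors the black-box description of $\boldsymbol{\epsilon}_{\theta}$ above, so a trained network could be substituted without changing the loop.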

Algorithm 2 summarizes the sampling process. Here we define $\sigma_{q}(1)=0$, so that only the mean estimate based on (18) is used at $t=1$.

Algorithm 2 Sampling with the backward process [15]
1: $\mathbf{x}_{T}\sim\mathrm{N}(\mathbf{0},\mathbf{I})$  ▷ standard Gaussian sample
2: for $t=T,T-1,\ldots,1$ do
3:     $\mathbf{z}\sim\mathrm{N}(\mathbf{0},\mathbf{I})$  ▷ standard Gaussian sample
4:     $\mathbf{x}_{t-1}=\frac{1}{\sqrt{\alpha_{t}}}\left(\mathbf{x}_{t}-\frac{1-\alpha_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\,\boldsymbol{\epsilon}_{\theta}\right)+\sigma_{q}(t)\,\mathbf{z}$
5: end for
6: return $\mathbf{x}_{0}$
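Algorithm 2 translates almost line for line into code. The sketch below assumes a linear $\beta_t$ schedule with $T=1000$ and uses the convention $\overline{\alpha}_{0}=1$, under which (15) gives $\sigma_{q}(1)=0$ automatically. In place of a trained network it uses a hypothetical `oracle` predictor that is exact when the training set consists of a single image `c`; in that special case the sampler should return `c` up to rounding, which makes the loop easy to test.

```python
import numpy as np

T = 1000
alpha = 1.0 - np.linspace(1e-4, 0.02, T)                 # assumed schedule
alpha_bar = np.concatenate(([1.0], np.cumprod(alpha)))   # alpha_bar[0] = 1

def sample(eps_model, shape, rng):
    """Run the backward process of Algorithm 2 and return x_0."""
    x = rng.standard_normal(shape)                       # step 1: x_T
    for t in range(T, 0, -1):                            # step 2: t = T, ..., 1
        z = rng.standard_normal(shape)                   # step 3
        a_t, ab_t, ab_prev = alpha[t - 1], alpha_bar[t], alpha_bar[t - 1]
        # sigma_q(t) from (15); it vanishes at t = 1 because alpha_bar[0] = 1.
        sigma = np.sqrt((1.0 - a_t) * (1.0 - ab_prev) / (1.0 - ab_t))
        mean = (x - (1.0 - a_t) / np.sqrt(1.0 - ab_t) * eps_model(x, t)) / np.sqrt(a_t)
        x = mean + sigma * z                             # step 4, equation (19)
    return x

# Hypothetical exact predictor for a "data set" holding the single image c:
# x_t = sqrt(alpha_bar_t) c + sqrt(1 - alpha_bar_t) eps then determines eps.
c = np.array([0.2, -0.7, 1.5])

def oracle(x_t, t):
    return (x_t - np.sqrt(alpha_bar[t]) * c) / np.sqrt(1.0 - alpha_bar[t])

x_0 = sample(oracle, c.shape, np.random.default_rng(0))
print(x_0)
```

With a real network, `eps_model` would wrap the trained $\boldsymbol{\epsilon}_{\theta}$, and each generated image costs $T$ network evaluations.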

6 Furthermore

In this section we touch upon some issues that may have occurred to the reader, and provide references where further information may be found.

How do we judge the performance of generative AI? A generative model must balance the contradictory aims of producing outputs that are plausible (like the training data) and novel (not like the training data). Any attempt to quantify performance must involve somewhat arbitrary choices that allow this circle to be squared. A popular quantitative measure, which focuses on the plausibility aspect, is Fréchet Inception Distance [12]. This measure approximates and compares the probability distributions of the real and synthetic image spaces, under a Gaussian assumption. Some studies also make use of subjective human opinions, which raises new issues, including reproducibility and representativeness.
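To make the Gaussian assumption concrete, the following sketch computes the standard closed-form Fréchet distance between two Gaussians fitted to feature arrays, $\|\mu_a-\mu_b\|_2^2 + \mathrm{Tr}\bigl(\Sigma_a+\Sigma_b-2(\Sigma_a\Sigma_b)^{1/2}\bigr)$. In Fréchet Inception Distance the features come from a pretrained Inception network; here they are arbitrary random arrays, and the function name is illustrative.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance (squared form) between Gaussians fitted to two feature arrays.

    Each input has shape (n_samples, n_features). In FID the features would come
    from a pretrained Inception network; here they are arbitrary arrays.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # The trace of (cov_a cov_b)^{1/2} equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and nonnegative in exact arithmetic.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
real = rng.standard_normal((500, 4))   # placeholder "real" features
fake = real + 2.0                      # same spread, mean shifted by 2 per coordinate
fd = frechet_distance(real, fake)
print(fd)
```

Since the two sets share a covariance and differ only by a shift of 2 in each of four coordinates, the result should be close to 16; identical sets give a distance close to zero.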

What are useful applications of diffusion models? Given that the internet already stores a bewildering array of real images, it is reasonable to ask whether the world needs synthetic examples, however realistic. Nevertheless, in some domains representative artificial data is valuable. In medical imaging, for example, synthetically generated data may help address scarcity, class imbalance and privacy concerns in educational settings [21]. Perhaps the biggest attraction of diffusion models lies in their use within larger systems. A diffusion model for image generation may be viewed as a representation of the hidden, or latent, distribution of real-world images. By conditioning or guiding the image generation according to user-specified requirements, it is then possible to tailor the output to meet certain goals [2, 10, 17, 45]. For example, diffusion forms part of several systems with text-to-image capabilities, including OpenAI’s DALL-E 2 [31], Stability.ai’s DreamStudio [32] and Google’s Imagen [34]. In-painting and overwriting unwanted pixels are also possible [29, 44].

Stable diffusion may also be exploited within ChatGPT-style large language models; an example is Stability.ai’s StableLM-3B-4E1T [39].

How computationally expensive is it to train and employ a diffusion model? For the simple low-resolution examples in section 2, using a pretrained network to produce new images is feasible on a standard desktop machine. However, high resolution image generation with a state-of-the-art pretrained diffusion model is a “high resource intensive and slow task that prohibits interactive experience for the users and results in huge computational cost on expensive GPUs” [1]. The size of many diffusion based models also raises storage issues: “generating high-resolution images with diffusion models is often infeasible on consumer-grade GPUs due to the excessive memory requirements” [29].

Training is a greater challenge. For the examples in section 2 we trained the network for 500 epochs in under 35 minutes on a single NVIDIA GeForce RTX 3090 GPU. It is reported in [40] that training the model in [7] consumes 150-1000 days of NVIDIA V100 GPU time. StableLM-3B-4E1T [39] is a 3 billion parameter language model trained on 1 trillion tokens of diverse English and code datasets; a 7 billion parameter version was later released. Developing smaller-scale versions of such models, or applying the models to compressed latent spaces, is therefore an active area [32, 43].

In terms of power usage when a trained model is deployed, Luccioni et al. [24] estimated that “the most carbon-intensive image generation model (stable-diffusion-xl-base-1.0) generates 1,594 grams of CO2 for 1,000 inferences, which is roughly the equivalent to 4.1 miles driven by an average gasoline-powered passenger vehicle.”

Is it a coincidence that (4) and (19) look similar to a numerical discretization of a stochastic differential equation? It is natural to compare (4) and (19) with the Euler–Maruyama method [14], and indeed there are variations of the forward diffusion model that have a direct correspondence with stochastic differential equations [15, 25, 29, 37]. The reverse process may also be associated with backward stochastic differential equations [40].
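
The resemblance can be checked numerically. The sketch below assumes that the forward step (4) has the standard form $\mathbf{x}_{t}=\sqrt{1-\beta_{t}}\,\mathbf{x}_{t-1}+\sqrt{\beta_{t}}\,\boldsymbol{\epsilon}$ with $\beta_{t}=1-\alpha_{t}$, and compares it with one Euler–Maruyama step for the stochastic differential equation $d\mathbf{x}=-\tfrac{1}{2}\beta(t)\,\mathbf{x}\,dt+\sqrt{\beta(t)}\,d\mathbf{W}$ considered in [37], where the step size is absorbed into $\beta$. The function names are hypothetical. Since $\sqrt{1-\beta}=1-\beta/2+O(\beta^{2})$, the two updates driven by the same noise should differ by $O(\beta^{2})$.

```python
import numpy as np

def ddpm_forward_step(x_prev, beta, eps):
    """Assumed form of the forward noising step, with beta = 1 - alpha_t."""
    return np.sqrt(1.0 - beta) * x_prev + np.sqrt(beta) * eps

def euler_maruyama_step(x_prev, beta, eps):
    """One Euler-Maruyama step for dx = -0.5*beta*x dt + sqrt(beta) dW, with dt absorbed in beta."""
    return x_prev - 0.5 * beta * x_prev + np.sqrt(beta) * eps

def step_gap(beta, x, eps):
    """Largest componentwise difference between the two updates."""
    return np.max(np.abs(ddpm_forward_step(x, beta, eps) - euler_maruyama_step(x, beta, eps)))

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
eps = rng.standard_normal(5)
for beta in (1e-1, 1e-2, 1e-3):
    # The gap shrinks by roughly a factor of 100 each time beta shrinks by a factor of 10.
    print(beta, step_gap(beta, x, eps))
```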

What about the dark side: ethics, privacy, bias and related concerns? Carlini et al. [5] showed that diffusion models have a tendency to memorize and reproduce training images. For tests on Stable Diffusion [32] and Imagen [34] they were able to “extract over a hundred near-identical replicas of training images that range from personally identifiable photos to trademarked logos.” Somepalli et al. [36] also found examples where a diffusion model “blatantly copies” from training data. The resulting harms to professional artists are considered in [20]; these include “reputational damage, economic loss, plagiarism and copyright infringement.” When we move into the realm of text-to-image algorithms there are many further issues to consider, including fairness, toxicity and trust [11].

The figures in section 2 indicate that the output from a simple diffusion model is difficult to predict and hence to interpret. In particular, very different results can be generated from the same input. Explainable AI is a serious challenge in this setting.

On a more general note, any machine learning algorithm is likely to reflect the biases, omissions and errors in the training set [28]. See [18] for a proposed framework for data transparency.

We also mention that discussions around ethics in this field often assume that AI is, or will become, all-powerful, thereby overlooking empirical observations that these systems may fail to operate as intended: the so-called fallacy of AI functionality [30]. So, as well as the important question of what tasks AI should be used for, we must also ask what tasks AI can reliably perform. This latter issue is ripe for mathematical and statistical contributions.

Using generative AI to create content (text, images, music, videos, and so on) that is difficult or impossible to discriminate from human generated content may allow fakery and conspiracy theories to undermine societal safety and benefits. This begets novel risks that are already upon us, identified in part by the inaugural AI Safety Summit which met at Bletchley Park in November 2023 (see https://www.gov.uk/government/publications/ai-safety-summit-2023-chairs-statement-2-november/chairs-summary-of-the-ai-safety-summit-2023-bletchley-park). Arguably, some of the decade-long data science focus on ethics and privacy should have been redirected towards the societal risk of fake truths, the widespread inability to discriminate between human generated and synthetic content, and the introduction of bias. These risks now require an in-depth consideration, as we seek to uncover and tackle the full range of possibilities. An understanding of the mathematical foundations of generative AI methods will be a key to ensuring transparency.

7 PDEs

For many applied mathematicians, diffusion is synonymous with certain parabolic PDEs. Here we present some speculative material that aims to draw a PDE connection with the process described in section 3. The notion of continuously re-normalizing a diffusion process takes us outside the realm of standard textbook analysis, and opens up some issues that may be of independent interest. Depending on our choice of basic PDE there are several ways to ensure that the norm of some derivative of the solution remains unchanged over time. Here we illustrate this general idea by continuously re-scaling to preserve the norm of the gradient of the solution, that is, the total variation, over the domain.

Consider a real valued field $u(x,t)$, where $x\in\Omega$, a bounded domain in $\mathbb{R}^{d}$ with a piecewise smooth boundary, $\partial\Omega$, and time $t\geq 0$, satisfying

$$u_{t}=\Delta u+r(t)u,\quad x\in\Omega,\qquad \nabla u\cdot\mathbf{n}=0,\quad x\in\partial\Omega, \tag{20}$$

for a suitable given initial condition $u(x,0)=u_{0}(x)$. Here $r(t)>0$ is a shadow time-dependent variable (akin to a Lagrange multiplier), which continuously rescales $u$ so that the $L^{2}$ norm of the gradient of $u$ is preserved. More explicitly, $r(t)$ must be such that

$$\int_{\Omega}\|\nabla u\|^{2}\,dx:=\int_{\Omega}\nabla u\cdot\nabla u\,dx\equiv R\quad(\mathrm{constant}),\quad t\geq 0. \tag{21}$$

Here $R>0$ is determined from the initial condition.

Taking the gradient of (20), and then forming the scalar product with $\nabla u$, and integrating over $\Omega$, we obtain

$$\frac{1}{2}\,\frac{d}{dt}\int_{\Omega}\nabla u\cdot\nabla u\,dx=\int_{\Omega}\nabla u\cdot\nabla(\Delta u)\,dx+r(t)\int_{\Omega}\nabla u\cdot\nabla u\,dx.$$

But, by direct calculation,

$$\nabla\cdot(\nabla u\,\Delta u)=(\Delta u)^{2}+\nabla u\cdot\nabla(\Delta u).$$

So, using the divergence theorem [26] together with the no-flux boundary condition, in order to ensure that $R$ is constant we must set

$$R\,r(t)=\int_{\Omega}(\Delta u)^{2}\,dx\geq 0.$$

Thus we may write (20) and (21) as the nonlinear integro-differential equation

$$u_{t}=\Delta u+\frac{u}{R}\int_{\Omega}(\Delta u)^{2}\,dx,\quad x\in\Omega,\qquad \nabla u\cdot\mathbf{n}=0,\quad x\in\partial\Omega. \tag{22}$$

Here, as before, $R$ in (22) is set by the initial condition: $R=\int_{\Omega}\|\nabla u_{0}(x)\|^{2}\,dx$.
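
For completeness, the step that leads to the expression for $r(t)$ can be written out as follows; this is a sketch that uses only the identities stated above. Integrating the product-rule identity over $\Omega$ and applying the divergence theorem with the no-flux boundary condition gives

$$\int_{\Omega}\nabla u\cdot\nabla(\Delta u)\,dx=\int_{\partial\Omega}\Delta u\,\nabla u\cdot\mathbf{n}\,dS-\int_{\Omega}(\Delta u)^{2}\,dx=-\int_{\Omega}(\Delta u)^{2}\,dx.$$

Substituting this into the identity for the time derivative of the gradient norm, and requiring that derivative to vanish as in (21), we find

$$0=-\int_{\Omega}(\Delta u)^{2}\,dx+r(t)\,R,$$

which is the stated formula for $R\,r(t)$.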

For any $R>0$, the constrained equation, (20) and (21), has infinitely many possible steady states, each of which is of the form

$$u=\mu_{k}\phi_{k}(x)\quad\text{and}\quad r=\lambda_{k},$$

where $(\phi_{k}(x),\lambda_{k})$ is the $k$th ($k=0,1,2,\ldots$) eigenfunction-eigenvalue pair for the Laplacian on $\Omega$ with no-flux boundary conditions. However, $\mu_{k}$ must satisfy

$$\mu_{k}^{2}=\frac{R}{\int_{\Omega}\|\nabla\phi_{k}\|^{2}\,dx},$$

and so $k\geq 1$, since the simplest eigenfunction satisfies $\|\nabla\phi_{0}\|\equiv 0$.

Now consider small perturbations around the $k$th steady state. We set

$$\begin{aligned}u&=\mu_{k}\left(\phi_{k}(x)+\epsilon e^{\sigma t}v(x)\right)+O(\epsilon^{2}),\\ r(t)&=\lambda_{k}+\epsilon\beta e^{\sigma t}+O(\epsilon^{2}),\end{aligned}$$

for some $v(x)$ and constants $(\sigma,\beta)$ to be determined. Substituting these expressions into (20), to $O(\epsilon)$ we obtain

$$0=\Delta v+(\lambda_{k}-\sigma)v+\beta\phi_{k},\quad x\in\Omega,\qquad \nabla v\cdot\mathbf{n}=0,\quad x\in\partial\Omega. \tag{23}$$

Now setting $v=\phi_{\tilde{k}}$, for $k\neq\tilde{k}$ (since the eigenfunctions form an orthonormal basis for the solution space), (23) yields

$$0=(\lambda_{k}-\lambda_{\tilde{k}}-\sigma)\phi_{\tilde{k}}+\beta\phi_{k},\quad x\in\Omega,$$

so that

$$\sigma=\lambda_{k}-\lambda_{\tilde{k}}\quad\text{and}\quad\beta=0.$$

The condition on $\sigma$ implies that each steady state is stable with respect to perturbations in all higher eigenmodes, where $k<\tilde{k}$, yet is unstable with respect to any perturbations in lower eigenmodes, $k>\tilde{k}$. Thus over a long time the solution must decay to the first eigenmode, $\mu_{1}\phi_{1}(x)$.

This is borne out in the numerical experiment shown in Figure 7. Here, for a one dimensional domain $\Omega=[0,1]$, we have $\phi_{k}(x)=\sqrt{2}\cos k\pi x$ and $\lambda_{k}=k^{2}\pi^{2}$.

Figure 7: Numerical solution profiles of (20) and (21) where $d=1$ and $\Omega=[0,1]$, shown at successive times on a $\log$ timescale ($t=$ 1.5 (blue), 15 (yellow), 150 (green), and 1500 (red)), where any $\phi_{0}$ component has been set to zero. In this case each profile has $\int_{\Omega}\|\nabla u\|^{2}\,dx\equiv R\approx 11.15$. The $\phi_{1}$ mode dominates at larger times.
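
The long-time behaviour can also be explored with a short spectral computation. This is a sketch under stated assumptions, and not necessarily the method used to produce Figure 7. Because $r(t)$ multiplies $u$ uniformly in space, a solution of (20) can be written as $u=c(t)\,w$, where $w$ solves the heat equation $w_{t}=\Delta w$ with the same boundary condition and the scalar $c(t)$ is fixed by (21). On $\Omega=[0,1]$, with $u=\sum_{k\geq 1}a_{k}(t)\phi_{k}(x)$, this gives $a_{k}(t)\propto a_{k}(0)e^{-\lambda_{k}t}$, with $\int_{\Omega}\|\nabla u\|^{2}\,dx=\sum_{k}\lambda_{k}a_{k}^{2}$ and $\int_{\Omega}(\Delta u)^{2}\,dx=\sum_{k}\lambda_{k}^{2}a_{k}^{2}$. The function name and the initial coefficients below are hypothetical, and we assume that the $\phi_{1}$ coefficient of the initial condition is nonzero.

```python
import numpy as np

def renormalized_heat_modes(a0, t):
    """Cosine-mode coefficients a_k(t), k = 1..K, and multiplier r(t) for (20)-(21) on [0, 1]."""
    a0 = np.asarray(a0, dtype=float)
    k = np.arange(1, a0.size + 1)
    lam = (k * np.pi) ** 2                       # lambda_k = k^2 pi^2
    R = np.sum(lam * a0**2)                      # gradient norm fixed by the initial condition
    # Heat-equation decay, shifted by lambda_1 so that the first mode never underflows.
    w = a0 * np.exp(-(lam - lam[0]) * t)
    scale = np.sqrt(R / np.sum(lam * w**2))      # rescaling that restores constraint (21)
    a = scale * w
    r = np.sum(lam**2 * a**2) / R                # from R r(t) = integral of (Laplacian u)^2
    return a, r

a0 = np.array([0.3, -0.5, 0.2, 0.1])             # hypothetical initial coefficients, no phi_0 part
for t in (0.0, 0.01, 0.1, 1.0):
    a, r = renormalized_heat_modes(a0, t)
    print(t, np.round(a, 4), round(r, 4))
```

In this example the higher-mode coefficients decay while the $\phi_{1}$ coefficient grows to absorb the whole of $R$, and $r(t)$ approaches $\lambda_{1}=\pi^{2}$, in line with the stability analysis above.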

This framework and others in physics (see for example [41, 42]) suggest a number of ways that generative AI might exploit, or be explained by, PDE theory. The central challenge is being able to run the processes in reverse in order to generate plausible content from randomness.

Of course in the example above the parabolicity of the forward dynamic evolution, (20) and (21), means that, formally, the backward equation is ill-posed, there being no global solution guaranteed, with both discontinuities and point masses possibly occurring as time moves backwards. Hence any backward approximation should require some mollification, perhaps via a numerical solution that leverages finite resolution and norm preserving properties.

In general one may also wish to consider (i) how these norm-preserving (re-scaling) diffusion systems work when they are subject to stochastic forcing over a finite time, and (ii) how such processes might be run backwards in seeking to recover or approximate the initial conditions under various Markovian assumptions.

Data Statement

Code for the experiments presented here will be made available upon publication.

References

  • [1] S. Agarwal, S. Mitra, S. Chakraborty, S. Karanam, K. Mukherjee, and S. Saini, Approximate caching for efficiently serving diffusion models, arXiv:2312.04429, (2023).
  • [2] A. Bansal, H.-M. Chu, A. Schwarzschild, S. Sengupta, M. Goldblum, J. Geiping, and T. Goldstein, Universal guidance for diffusion models, in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2023, pp. 843–852.
  • [3] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), Springer, Berlin, 2007.
  • [4] Y. Cao, S. Li, Y. Liu, Z. Yan, Y. Dai, P. S. Yu, and L. Sun, A comprehensive survey of AI-generated content (AIGC): A history of generative AI from GAN to ChatGPT, arXiv:2303.04226, (2023).
  • [5] N. Carlini, J. Hayes, M. Nasr, M. Jagielski, V. Sehwag, F. Tramèr, B. Balle, D. Ippolito, and E. Wallace, Extracting training data from diffusion models, in Proceedings of the 32nd USENIX Conference on Security Symposium, SEC ’23, USA, 2023, USENIX Association.
  • [6] F.-A. Croitoru, V. Hondru, R. T. Ionescu, and M. Shah, Diffusion models in vision: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence, 45 (2023), pp. 10850–10869.
  • [7] P. Dhariwal and A. Nichol, Diffusion models beat GANs on image synthesis, in Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, eds., vol. 34, Curran Associates, Inc., 2021, pp. 8780–8794.
  • [8] S. Feuerriegel, J. Hartmann, C. Janiesch, and P. Zschech, Generative AI, Business and Information Systems Engineering, (2023).
  • [9] L. Girin, S. Leglaive, X. Bie, J. Diard, T. Hueber, and X. Alameda-Pineda, Dynamical variational autoencoders: A comprehensive review, Foundations and Trends in Machine Learning, 15 (2021), p. 1–175.
  • [10] R. Gozalo-Brizuela and E. C. Garrido-Merchán, A survey of generative AI applications, arXiv:2306.02781, (2023).
  • [11] S. Hao, P. Kumar, S. Laszlo, S. Poddar, B. Radharapu, and R. Shelby, Safety and fairness for content moderation in generative models, in CVPR Workshop on Ethical Considerations in Creative applications of Computer Vision, 2023.
  • [12] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, GANs trained by a two time-scale update rule converge to a local Nash equilibrium, in Advances in Neural Information Processing Systems, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, eds., Long Beach, CA, USA, 2017, pp. 6626–6637.
  • [13] C. F. Higham and D. J. Higham, Deep learning: An introduction for applied mathematicians, SIAM Review, 61 (2019), pp. 860–891.
  • [14] D. J. Higham and P. E. Kloeden, An introduction to the numerical simulation of stochastic differential equations, SIAM, Philadelphia, 2021.
  • [15] J. Ho, A. Jain, and P. Abbeel, Denoising diffusion probabilistic models, in Proceedings of the 34th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2020, Curran Associates Inc.
  • [16] J. Ho, C. Saharia, W. Chan, D. J. Fleet, M. Norouzi, and T. Salimans, Cascaded diffusion models for high fidelity image generation, J. Mach. Learn. Res., 23 (2022).
  • [17] J. Ho and T. Salimans, Classifier-free diffusion guidance, in NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications.
  • [18] B. Hutchinson, A. Smart, A. Hanna, E. Denton, C. Greer, O. Kjartansson, P. Barnes, and M. Mitchell, Towards accountability for machine learning datasets: Practices from software engineering and infrastructure, in Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, New York, NY, USA, 2021, Association for Computing Machinery, pp. 560–575.
  • [19] G. Iglesias, E. Talavera, and A. Díaz-Álvarez, A survey on GANs for computer vision: Recent research, analysis and taxonomy, Computer Science Review, 48 (2023), p. 100553.
  • [20] H. H. Jiang, L. Brown, J. Cheng, M. Khan, A. Gupta, D. Workman, A. Hanna, J. Flowers, and T. Gebru, AI art and its impact on artists, in Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’23, New York, NY, USA, 2023, Association for Computing Machinery, pp. 363–374.
  • [21] A. Kazerouni, E. K. Aghdam, M. Heidari, R. Azad, M. Fayyaz, I. Hacihaliloglu, and D. Merhof, Diffusion models in medical imaging: A comprehensive survey, Medical Image Analysis, 88 (2023), p. 102846.
  • [22] D. P. Kingma and M. Welling, Auto-encoding variational Bayes, in 2nd International Conference on Learning Representations, 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.
  • [23] Y. LeCun, C. Cortes, and C. J. C. Burges, The MNIST database of handwritten digits.
  • [24] A. S. Luccioni, Y. Jernite, and E. Strubell, Power hungry processing: Watts driving the cost of AI deployment?, arXiv:2311.16863, (2023).
  • [25] C. Luo, Understanding diffusion models: A unified perspective, arXiv:2208.11970, (2022).
  • [26] P. C. Matthews, Vector Calculus, Springer, Berlin, 1998.
  • [27] K. P. Murphy, Probabilistic Machine Learning: An introduction, MIT Press, Boston, 2022.
  • [28] A. Paullada, I. D. Raji, E. M. Bender, E. Denton, and A. Hanna, Data and its (dis) contents: A survey of dataset development and use in machine learning research, Patterns, 2 (2021).
  • [29] R. Po, W. Yifan, V. Golyanik, K. Aberman, J. T. Barron, A. H. Bermano, E. R. Chan, T. Dekel, A. Holynski, A. Kanazawa, C. K. Liu, L. Liu, B. Mildenhall, M. Nießner, B. Ommer, C. Theobalt, P. Wonka, and G. Wetzstein, State of the art on diffusion models for visual computing, arXiv:2310.07204, (2023).
  • [30] I. D. Raji, I. E. Kumar, A. Horowitz, and A. Selbst, The fallacy of AI functionality, in Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’22, New York, NY, USA, 2022, Association for Computing Machinery, pp. 959–972.
  • [31] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, Hierarchical text-conditional image generation with CLIP latents, arXiv:2204.06125, (2022).
  • [32] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, High-resolution image synthesis with latent diffusion models, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10684–10695.
  • [33] O. Ronneberger, P. Fischer, and T. Brox, U-Net: Convolutional networks for biomedical image segmentation, in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi, eds., Cham, 2015, Springer International Publishing, pp. 234–241.
  • [34] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans, J. Ho, D. J. Fleet, and M. Norouzi, Photorealistic text-to-image diffusion models with deep language understanding, in Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, eds., vol. 35, Curran Associates, Inc., 2022, pp. 36479–36494.
  • [35] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, Deep unsupervised learning using nonequilibrium thermodynamics, in Proceedings of the 32nd International Conference on Machine Learning, F. Bach and D. Blei, eds., vol. 37 of Proceedings of Machine Learning Research, Lille, France, 2015, PMLR, pp. 2256–2265.
  • [36] G. Somepalli, V. Singla, M. Goldblum, J. Geiping, and T. Goldstein, Diffusion art or digital forgery? Investigating data replication in diffusion models, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023, pp. 6048–6058.
  • [37] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, Score-based generative modeling through stochastic differential equations, in International Conference on Learning Representations, 2021.
  • [38] Y. Tay, M. Dehghani, D. Bahri, and D. Metzler, Efficient transformers: A survey, ACM Comput. Surv., 55 (2022).
  • [39] J. Tow, M. Bellagente, D. Mahan, and C. R. Ruiz, StableLM-3B-4E1T, tech. rep., Stability-AI, 2023.
  • [40] Z. Wang, Score-based generative modeling through backward stochastic differential equations: Inversion and generation, arXiv:2311.16863, (2023).
  • [41] Y. Xu, Z. Liu, M. Tegmark, and T. Jaakkola, Poisson flow generative models, in Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, eds., vol. 35, Curran Associates, Inc., 2022, pp. 16782–16795.
  • [42] Y. Xu, Z. Liu, Y. Tian, S. Tong, M. Tegmark, and T. Jaakkola, PFGM++: Unlocking the potential of physics-inspired generative models, in Proceedings of the 40th International Conference on Machine Learning, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, eds., vol. 202 of Proceedings of Machine Learning Research, 2023, pp. 38566–38591.
  • [43] X. Yang, D. Zhou, J. Feng, and X. Wang, Diffusion probabilistic model made slim, in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 22552–22562.
  • [44] A. B. Yildirim, V. Baday, E. Erdem, A. Erdem, and A. Dundar, Inst-Inpaint: Instructing to remove objects with diffusion models, arXiv:2304.03246, (2023).
  • [45] L. Zhang, A. Rao, and M. Agrawala, Adding conditional control to text-to-image diffusion models, in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3836–3847.