Diffusion Models for Generative Artificial Intelligence: An Introduction for Applied Mathematicians

Catherine F. Higham School of Computing Science, University of Glasgow, Sir Alwyn Williams Building, Glasgow, G12 8QQ. Supported by EPSRC grant EP/T00097X/1. ([email protected]) Desmond J. Higham School of Mathematics, University of Edinburgh, EH9 3FD, UK. Supported by EPSRC grant EP/V046527/1. ([email protected]) Peter Grindrod CBE Mathematical Institute, University of Oxford, Oxford, United Kingdom. Supported by EPSRC grant EP/R018472/1. ([email protected])

Abstract

Generative artificial intelligence (AI) refers to algorithms that create synthetic but realistic output. Diffusion models currently offer state of the art performance in generative AI for images. They also form a key component in more general tools, including text-to-image generators and large language models. Diffusion models work by adding noise to the available training data and then learning how to reverse the process. The reverse operation may then be applied to new random data in order to produce new outputs. We provide a brief introduction to diffusion models for applied mathematicians and statisticians. Our key aims are (a) to present illustrative computational examples, (b) to give a careful derivation of the underlying mathematical formulas involved, and (c) to draw a connection with partial differential equation (PDE) diffusion models. We provide code for the computational experiments. We hope that this topic will be of interest to advanced undergraduate students and postgraduate students. Portions of the material may also provide useful motivational examples for those who teach courses in stochastic processes, inference, machine learning, PDEs or scientific computing.

1 Motivation

Generative artificial intelligence (AI) models are designed to create new outputs that are similar to the examples on which they were trained. Over the past decade or so, advancements in generative AI have included the development of variational autoencoders [9, 22], generative adversarial networks [19] and transformers [38]. In this work we focus on denoising diffusion probabilistic models [15]; for simplicity we use the term diffusion models. They currently represent the state of the art in image generation [7], and form a key part of more sophisticated tools such as DALL-E 2 and 3 [31]. We refer to [4, 6, 8, 29], and the references therein, for details of the historical developments that have led to the current state of the art in generative AI.

The somewhat counterintuitive, but deceptively powerful, idea behind diffusion models is to destroy the training data by adding noise. While doing so, the model learns how to reverse the process. In this way, the final model is able to start with new, easily generated, random samples and denoise them, thereby generating new, synthetic examples. The task of building and applying a simple, yet impressive, model can be described very succinctly—see Algorithms 1 and 2 in section 5. However, deriving the expressions that go into these algorithms is not so straightforward, and we believe that there is a niche for a careful and accessible mathematically-oriented treatment.

Our intended readership is advanced undergraduate students and postgraduate students in mathematics or related disciplines. The material should be suitable for independent study, and there are many directions in which this material can be followed up: the literature is rapidly expanding, and new extensions and connections are being discovered at pace. We also hope that portions of this material will provide higher education professionals with topical and engaging examples that can be slipped into courses on stochastics, numerics, PDEs or data science.

We aimed to keep the prerequisites to a minimum; these are

• for sections 2–5 ideas from statistics: mean, variance, Gaussian distribution, Markov chains, conditional probability,
• for sections 4 and 5 ideas from deep learning: the stochastic gradient method, artificial neural networks,
• for section 7 ideas from PDEs: multivariate calculus, the divergence theorem, spectral analysis.

We focus here on the task of image generation. We describe a bare bones form of diffusion model, explain carefully how the key mathematical expressions arise, and illustrate the concept via computational examples. The key reference for this article is [15], which built on [35] and received more than ten citations per day during the year 2023. We also found [25] to be a very useful resource.

In section 2 we present some pictures that give a feel for the idea of diffusion models in generative AI. We then provide details of the relevant forward and backward processes in sections 3 and 4, respectively, which leads to the algorithms presented in section 5.

We emphasize that this is a very active and fast-moving research topic with connections to many related areas. In section 6 we provide some links to the relevant literature. That section also highlights wider issues around performance evaluation, computational expense, copyright, privacy, ethics, bias, explainability and robustness.

We finish in section 7 with more speculative material that suggests a connection between stable diffusion models and deterministic PDEs, providing a link to more traditional applied mathematics.

2 Illustration

A diffusion model [15] aims to generate realistic-looking images. It works by

(i): taking an existing image and iteratively adding noise until the original information is lost,
(ii): learning how to reconstruct the original image by iteratively removing the noise.

After training, we can then use the reverse diffusion process to generate a realistic image from a new, random, starting point—remove the noise and see what emerges.

One way to conceptualize this method is to imagine an (unknown) probability distribution over the collection of all natural images. We hope to sample from this distribution; more likely images should be chosen with higher probability. We don’t have access to this magic probability distribution. Instead, we have training data; that is, examples of natural images. We also have a pseudorandom number generator that allows us to sample from a standard Gaussian distribution. In item (i) above, we are doing the easy part, mapping from the image distribution to the Gaussian distribution. In item (ii) we learn the inverse operator, mapping from the Gaussian distribution to the image distribution. This allows us to convert Gaussian samples into images.

We illustrate the idea of a diffusion model trained on images from the widely studied MNIST data set [23]. Here, each image represents a handwritten digit from the set $\{0,1,2,\ldots,9\}$ . These low resolution images are black-and-white with $28\times 28$ pixels, resized by the model to $32\times 32$ . Figure 1 shows a representative collection of $64$ images.

Figures 2–4 were produced with a diffusion model based on a Mathworks tutorial at

https://uk.mathworks.com/help/deeplearning/ug/generate-images-using-diffusion.html

Figure 2 illustrates the forward process that is used in the training phase. At time $t=0$ we have an MNIST image. At each integer time $t=0,1,2,\ldots,499$ Gaussian noise is added. At $t=500$ there is no visible evidence of the original image.

Figure 1: Representative set of $64$ images from the MNIST data set [23].

Figure 3 shows the effect of the backward process that is available after training. The top left panel displays nine randomly chosen final time $t=500$ images—pure noise matrices consisting of independent Gaussian samples. We show the effect of applying the backward, denoising process as time is reversed. At $t=0$ the model has produced new, synthetic examples that, in at least eight of the nine cases, correspond to handwritten digits. We emphasize that labels were not used in the training process. In this simple, unconditional model there is no way to control which (if any) of the $t=0$ images will resemble any particular category of digit.

In Figure 4 we show the results from a larger experiment. Here we used the trained diffusion model to generate images from 500 independent time $t=500$ choices. For this figure, we separated the images into categories using an independent convolutional neural network classifier that was trained separately on real MNIST data. Since we have no control over how many of the 500 images will appear in each class, the number of synthetic outputs in each category varies considerably.

We finish with two experiments which illustrate that the backward, denoising process is both stochastic and unpredictable. In Figure 5 we show the images generated after applying nine independent denoising runs to the same Gaussian at $t=500$ . We see that the denoising process can produce considerably different results from a single source of randomness. In Figure 6 we perform a similar experiment where the $t=500$ data emerges from the training set. On the left we show a training image undergoing the forward, noising process up to time $t=500$ . On the right we show the results from nine independent denoising runs on this time $t=500$ data. We see that none of the synthetically generated images resemble the original.

3 Forwards

We begin this section with some background on Gaussian random variables; see a standard text such as [3, 27] for more details. When dealing with Gaussians, we will always consider the multivariate, isotropic case. We denote the probability density at a point $\mathbf{x}\in\mathbb{R}^{d}$ by $\mathcal{N}(\mathbf{x};\boldsymbol{\mu},\sigma^{2}\mathbf{I})$, where

\mathcal{N}(\mathbf{x};\boldsymbol{\mu},\sigma^{2}\mathbf{I}):=\frac{1}{(2\pi\sigma^{2})^{d/2}}\exp\left(-{\textstyle{{\frac{1}{2\sigma^{2}}}}}(\mathbf{x}-\boldsymbol{\mu})^{T}(\mathbf{x}-\boldsymbol{\mu})\right).

(1)

Here, $\boldsymbol{\mu}\in\mathbb{R}^{d}$ is the mean, and we will refer to $\sigma^{2}$ as the variance, since the corresponding covariance matrix has the form $\sigma^{2}\mathbf{I}$ , with $\mathbf{I}\in\mathbb{R}^{d\times d}$ denoting the identity matrix. Such Gaussian random variables have the important property that their sums remain Gaussian, with means and variances combining additively: the sum of two independent Gaussians with means $\boldsymbol{\mu}_{1}$ and $\boldsymbol{\mu}_{2}$ and variances $\sigma_{1}^{2}$ and $\sigma_{2}^{2}$ is a Gaussian random variable with mean $\boldsymbol{\mu}_{1}+\boldsymbol{\mu}_{2}$ and variance $\sigma_{1}^{2}+\sigma_{2}^{2}$ . The term standard Gaussian refers to the case where the mean is $\mathbf{0}\in\mathbb{R}^{d}$ and the variance is $1$ . Multiplying a standard Gaussian by the scalar $\sigma$ and shifting by $\boldsymbol{\mu}\in\mathbb{R}^{d}$ produces a Gaussian with mean $\boldsymbol{\mu}$ and variance $\sigma^{2}$ . It follows that if $\mathbf{y}$ and $\mathbf{z}$ are independent standard Gaussians, and $a$ and $b$ are scalars, then $a\mathbf{y}+b\mathbf{z}$ is Gaussian with mean zero and variance $a^{2}+b^{2}$ ; so $a\mathbf{y}+b\mathbf{z}$ can be sampled as $\sqrt{a^{2}+b^{2}}\,\boldsymbol{\epsilon}$ , where $\boldsymbol{\epsilon}$ is a standard Gaussian.
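
The additive property can be checked empirically. The following is a minimal sketch in the scalar case $d=1$; the values of `a`, `b`, the sample size and the seed are illustrative assumptions, not taken from the text.

```python
import random

rng = random.Random(0)   # fixed seed so the check is repeatable

a, b = 3.0, 4.0          # illustrative scalars, so a^2 + b^2 = 25
n_samples = 200_000

# Scalar case (d = 1): form a*y + b*z from independent standard Gaussians y and z.
combined = [a * rng.gauss(0.0, 1.0) + b * rng.gauss(0.0, 1.0) for _ in range(n_samples)]

mean = sum(combined) / n_samples
variance = sum((s - mean) ** 2 for s in combined) / (n_samples - 1)

print(f"sample mean     = {mean:.3f}   (theory: 0)")
print(f"sample variance = {variance:.3f}   (theory: {a * a + b * b:.3f})")
```

The sample statistics should be close to mean zero and variance $a^{2}+b^{2}$, which is what allows $a\mathbf{y}+b\mathbf{z}$ to be replaced by $\sqrt{a^{2}+b^{2}}\,\boldsymbol{\epsilon}$ in the derivations that follow.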

We consider images that can be described by $d$ real numbers, typically pixel values, and we collect these into a vector in $\mathbb{R}^{d}$ . In practice pixel values might be constrained—for example only integers between $0$ and $255$ might be allowed—but we ignore this issue here for simplicity.

Given an image $\mathbf{x}_{0}\in\mathbb{R}^{d}$ , the forward process iteratively adds noise to create a sequence $\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{T}$ according to the rule

\mathbf{x}_{t}=\sqrt{1-\beta_{t}}\,\mathbf{x}_{t-1}+\sqrt{\beta_{t}}\,\boldsymbol{\epsilon}_{t}.

(2)

Here, each $\boldsymbol{\epsilon}_{t}$ is an independent standard Gaussian and the scalar parameter $\beta_{t}$ is between zero and one. The sequence $\beta_{1},\beta_{2},\ldots,\beta_{T}$ , known as the variance schedule, is predetermined. For example, in [15], linearly increasing values from $\beta_{1}={10}^{-4}$ to $\beta_{T}=0.02$ are used. Since $\beta_{t}$ here is increasing, more noise is added as the forward process evolves. It is useful to think of $t$ as a time-like variable. At time zero we have an image and at time $T$ we effectively have pure Gaussian noise.
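
The rule (2) with a linearly increasing variance schedule can be sketched in a few lines. This is an illustration only: the choice $T=1000$, the four-entry toy vector standing in for an image, and the function names `linear_beta_schedule` and `forward_step` are assumptions introduced here.

```python
import math
import random

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Return [beta_1, ..., beta_T], linearly spaced between the two end values."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def forward_step(x_prev, beta_t, rng):
    """One application of rule (2) to a vector stored as a list of floats."""
    scale = math.sqrt(1.0 - beta_t)
    noise_scale = math.sqrt(beta_t)
    return [scale * x + noise_scale * rng.gauss(0.0, 1.0) for x in x_prev]

rng = random.Random(1)
T = 1000                      # assumed number of time steps
betas = linear_beta_schedule(T)

x = [1.0, -1.0, 0.5, 0.0]     # toy "image" with d = 4 pixel values
for beta_t in betas:
    x = forward_step(x, beta_t, rng)

print("x_T =", [round(v, 3) for v in x])
```

After the loop, the entries of `x` should look like independent standard Gaussian samples, with little trace of the starting values.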

The process (2) defines a discrete time Markov process, and the associated transition density may be written

q(\mathbf{x}_{t}\,|\,\mathbf{x}_{t-1})=\mathcal{N}(\mathbf{x}_{t};\sqrt{1-\beta_{t}}\,\mathbf{x}_{t-1},\beta_{t}\mathbf{I}).

(3)

This quantifies the probability of observing $\mathbf{x}_{t}$ at time $t$ , given $\mathbf{x}_{t-1}$ at time $t-1$ .

Updating over one time step in the forward process (2) is straightforward; just scale the current value and add Gaussian noise. For later use, it is helpful to know that stepping from time zero to a general time $t$ is possible with a single leap. To see this, we introduce $\alpha_{t}=1-\beta_{t}$ so that

\mathbf{x}_{t}=\sqrt{\alpha_{t}}\,\mathbf{x}_{t-1}+\sqrt{1-\alpha_{t}}\,\boldsymbol{\epsilon}_{t}.

(4)

Then, applying (4) again, we have

\begin{aligned}
\mathbf{x}_{t} &= \sqrt{\alpha_{t}}\left(\sqrt{\alpha_{t-1}}\,\mathbf{x}_{t-2}+\sqrt{1-\alpha_{t-1}}\,\boldsymbol{\epsilon}_{t-1}\right)+\sqrt{1-\alpha_{t}}\,\boldsymbol{\epsilon}_{t} \\
&= \sqrt{\alpha_{t}\alpha_{t-1}}\,\mathbf{x}_{t-2}+\sqrt{\alpha_{t}}\sqrt{1-\alpha_{t-1}}\,\boldsymbol{\epsilon}_{t-1}+\sqrt{1-\alpha_{t}}\,\boldsymbol{\epsilon}_{t}.
\end{aligned}

(5)

Using the properties of Gaussians mentioned at the start of this section, we see that $\sqrt{\alpha_{t}}\sqrt{1-\alpha_{t-1}}\,\boldsymbol{\epsilon}_{t-1}+\sqrt{1-\alpha_{t}}\,\boldsymbol{\epsilon}_{t}$ can be combined into a single Gaussian with mean zero and variance $\alpha_{t}(1-\alpha_{t-1})+1-\alpha_{t}=1-\alpha_{t}\alpha_{t-1}$. In this way, (5) may be written

\mathbf{x}_{t}=\sqrt{\alpha_{t}\alpha_{t-1}}\,\mathbf{x}_{t-2}+\sqrt{1-\alpha_{t}\alpha_{t-1}}\,\boldsymbol{\epsilon}_{t,t-2},

where $\boldsymbol{\epsilon}_{t,t-2}$ is a standard Gaussian.

Proceeding inductively, suppose that for some $k$ between $t-2$ and $1$

\mathbf{x}_{t}=\sqrt{\alpha_{t}\alpha_{t-1}\ldots\alpha_{k+1}}\,\mathbf{x}_{k}+\sqrt{1-\alpha_{t}\alpha_{t-1}\ldots\alpha_{k+1}}\,\epsilon_{t,k},

(6)

where $\epsilon_{t,k}$ is a standard Gaussian. Then replacing $\mathbf{x}_{k}$ using (4)

\mathbf{x}_{t}=\sqrt{\alpha_{t}\alpha_{t-1}\ldots\alpha_{k+1}}\,\left(\sqrt{\alpha_{k}}\,\mathbf{x}_{k-1}+\sqrt{1-\alpha_{k}}\,\boldsymbol{\epsilon}_{k}\right)+\sqrt{1-\alpha_{t}\alpha_{t-1}\ldots\alpha_{k+1}}\,\epsilon_{t,k}.

Again replacing the sum of two independent Gaussians by a single, appropriate Gaussian, we have

\begin{aligned}
\mathbf{x}_{t} &= \sqrt{\alpha_{t}\alpha_{t-1}\ldots\alpha_{k}}\,\mathbf{x}_{k-1}+\sqrt{\alpha_{t}\alpha_{t-1}\ldots\alpha_{k+1}(1-\alpha_{k})+1-\alpha_{t}\alpha_{t-1}\ldots\alpha_{k+1}}\,\epsilon_{t,k-1} \\
&= \sqrt{\alpha_{t}\alpha_{t-1}\ldots\alpha_{k}}\,\mathbf{x}_{k-1}+\sqrt{1-\alpha_{t}\alpha_{t-1}\ldots\alpha_{k}}\,\epsilon_{t,k-1},
\end{aligned}

where $\epsilon_{t,k-1}$ is a standard Gaussian. Hence, the form (6) is valid all the way down to $k=0$ . So, letting

\overline{\alpha}_{t}=\prod_{i=1}^{t}\alpha_{i},

(7)

we may write

\mathbf{x}_{t}=\sqrt{\overline{\alpha}_{t}}\,\mathbf{x}_{0}+\sqrt{1-\overline{\alpha}_{t}}\,\bar{\boldsymbol{\epsilon}}_{t},

(8)

where $\bar{\boldsymbol{\epsilon}}_{t}$ is a standard Gaussian. We may therefore step directly from time $0$ to any later time $t$ using a single Gaussian. This proves convenient for the analysis in section 4 and also for the training algorithm discussed in section 5.
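
The single leap (8) can be sketched as follows, under the same assumed setting as before ($T=1000$ with a linear schedule from $10^{-4}$ to $0.02$); the helper names `alpha_bars` and `sample_x_t` are hypothetical. The printed values of $\overline{\alpha}_{t}$ indicate how quickly the weight on $\mathbf{x}_{0}$ decays.

```python
import math
import random

def alpha_bars(betas):
    """Cumulative products (7): alpha_bar_t = prod_{i<=t} (1 - beta_i)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

def sample_x_t(x0, alpha_bar_t, rng):
    """Direct leap (8) from x_0 to x_t; also returns the noise that was used."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * e for x, e in zip(x0, eps)], eps

T = 1000                                            # assumed number of time steps
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
abar = alpha_bars(betas)

rng = random.Random(2)
x0 = [1.0, -1.0, 0.5, 0.0]                          # toy "image" with d = 4
x_mid, _ = sample_x_t(x0, abar[T // 2 - 1], rng)    # t = T/2 (lists are 0-based)

print(f"alpha_bar at t=1   : {abar[0]:.6f}")
print(f"alpha_bar at t=T/2 : {abar[T // 2 - 1]:.6f}")
print(f"alpha_bar at t=T   : {abar[-1]:.2e}")
```

With this schedule $\overline{\alpha}_{T}$ is of order $10^{-5}$, so by (8) the time $T$ sample is very nearly a standard Gaussian, whatever the starting image.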

In terms of a transition density, (8) shows that

q(\mathbf{x}_{t}\,|\,\mathbf{x}_{0})=\mathcal{N}(\mathbf{x}_{t};\sqrt{\overline{\alpha}_{t}}\,\mathbf{x}_{0},(1-\overline{\alpha}_{t})\mathbf{I}).

(9)

4 Backwards

We now consider the reverse process. We are interested in the probability of $\mathbf{x}_{t-1}$ given $\mathbf{x}_{t}$ and $\mathbf{x}_{0}$ ; that is, $q(\mathbf{x}_{t-1}\,|\,\mathbf{x}_{t},\mathbf{x}_{0})$ . To proceed we will make use of a result in conditional probability theory known as the product rule, [3, 27], which for our purposes may be written

P(A,B,C)=P(A\,|\,B,C)\,P(B,C)=P(A\,|\,B,C)\,P(B\,|\,C)\,P(C).

By symmetry, we also have

P(A,B,C)=P(B,A,C)=P(B\,|\,A,C)\,P(A,C)=P(B\,|\,A,C)\,P(A\,|\,C)\,P(C).

Hence,

P(A\,|\,B,C)=\frac{P(B\,|\,A,C)\,P(A\,|\,C)}{P(B\,|\,C)}.

We will use this in the form

q(\mathbf{x}_{t-1}\,|\,\mathbf{x}_{t},\mathbf{x}_{0})=\frac{q(\mathbf{x}_{t}\,|\,\mathbf{x}_{t-1},\mathbf{x}_{0})\,q(\mathbf{x}_{t-1}\,|\,\mathbf{x}_{0})}{q(\mathbf{x}_{t}\,|\,\mathbf{x}_{0})}.

(10)

So now we focus on the quantities appearing on the right hand side of (10).

By the Markovian nature of the forward process, from (3),

q(\mathbf{x}_{t}\,|\,\mathbf{x}_{t-1},\mathbf{x}_{0})=q(\mathbf{x}_{t}\,|\,\mathbf{x}_{t-1})=\mathcal{N}(\mathbf{x}_{t};\sqrt{\alpha_{t}}\,\mathbf{x}_{t-1},(1-\alpha_{t})\mathbf{I}).

(11)

Making use of (9) for $\mathbf{x}_{t}$ and $\mathbf{x}_{t-1}$ , we then see that

q(\mathbf{x}_{t-1}\,|\,\mathbf{x}_{t},\mathbf{x}_{0})=\frac{\mathcal{N}(\mathbf{x}_{t};\sqrt{\alpha_{t}}\,\mathbf{x}_{t-1},(1-\alpha_{t})\mathbf{I})\,\mathcal{N}(\mathbf{x}_{t-1};\sqrt{\overline{\alpha}_{t-1}}\,\mathbf{x}_{0},(1-\overline{\alpha}_{t-1})\mathbf{I})}{\mathcal{N}(\mathbf{x}_{t};\sqrt{\overline{\alpha}_{t}}\,\mathbf{x}_{0},(1-\overline{\alpha}_{t})\mathbf{I})}.

(12)

From the definition (1), and ignoring the normalizing constants, we see that this expression has the form

\begin{aligned}
\exp\Bigg(&-{\textstyle{{\frac{1}{2}}}}\frac{(\mathbf{x}_{t}-\sqrt{\alpha_{t}}\,\mathbf{x}_{t-1})^{T}(\mathbf{x}_{t}-\sqrt{\alpha_{t}}\,\mathbf{x}_{t-1})}{1-\alpha_{t}}-{\textstyle{{\frac{1}{2}}}}\frac{(\mathbf{x}_{t-1}-\sqrt{\overline{\alpha}_{t-1}}\,\mathbf{x}_{0})^{T}(\mathbf{x}_{t-1}-\sqrt{\overline{\alpha}_{t-1}}\,\mathbf{x}_{0})}{1-\overline{\alpha}_{t-1}} \\
&+{\textstyle{{\frac{1}{2}}}}\frac{(\mathbf{x}_{t}-\sqrt{\overline{\alpha}_{t}}\,\mathbf{x}_{0})^{T}(\mathbf{x}_{t}-\sqrt{\overline{\alpha}_{t}}\,\mathbf{x}_{0})}{1-\overline{\alpha}_{t}}\Bigg).
\end{aligned}

(13)

We will show that this expression matches

\mathcal{N}\left(\mathbf{x}_{t-1};\mu_{q}(\mathbf{x}_{t},\mathbf{x}_{0}),\sigma^{2}_{q}(t)\mathbf{I}\right),

(14)

for appropriate $\mu_{q}(\mathbf{x}_{t},\mathbf{x}_{0})$ and $\sigma^{2}_{q}(t)$. From (1), we can find $\sigma^{2}_{q}(t)$ by considering the coefficient of $-\mathbf{x}_{t-1}^{T}\mathbf{x}_{t-1}$ in the exponent of (13). This coefficient is given by

{\textstyle{{\frac{1}{2}}}}\frac{\alpha_{t}}{1-\alpha_{t}}+{\textstyle{{\frac{1}{2}}}}\frac{1}{1-\overline{\alpha}_{t-1}}={\textstyle{{\frac{1}{2}}}}\frac{\alpha_{t}(1-\overline{\alpha}_{t-1})+1-\alpha_{t}}{(1-\alpha_{t})(1-\overline{\alpha}_{t-1})}={\textstyle{{\frac{1}{2}}}}\left(\frac{1-\overline{\alpha}_{t}}{(1-\alpha_{t})(1-\overline{\alpha}_{t-1})}\right),

where we used $\alpha_{t}\,\overline{\alpha}_{t-1}=\overline{\alpha}_{t}$ from (7). Hence,

\sigma^{2}_{q}(t)=\frac{(1-\alpha_{t})(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}}.

(15)
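
As a quick numerical sanity check of (15), the sketch below compares the formula with the reciprocal of twice the coefficient computed above. The values of $\alpha_{t}$ and $\overline{\alpha}_{t-1}$ are arbitrary illustrative choices, and the function name `posterior_variance` is hypothetical.

```python
def posterior_variance(alpha_t, alpha_bar_prev):
    """sigma_q^2(t) from (15); alpha_bar_prev stands for alpha_bar_{t-1}."""
    alpha_bar_t = alpha_t * alpha_bar_prev
    return (1.0 - alpha_t) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)

# Illustrative values (assumed, not from the text).
alpha_t, alpha_bar_prev = 0.98, 0.5

# Coefficient of -x_{t-1}^T x_{t-1} in the exponent of (13), which equals 1 / (2 sigma_q^2).
coefficient = 0.5 * alpha_t / (1.0 - alpha_t) + 0.5 / (1.0 - alpha_bar_prev)

sigma2 = posterior_variance(alpha_t, alpha_bar_prev)
print(sigma2, 1.0 / (2.0 * coefficient))    # the two numbers should agree

# Taking alpha_bar_0 = 1 (an empty product in (7)), the variance vanishes at t = 1.
print(posterior_variance(0.9999, 1.0))
```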

Using the functional form (1) again, we can find $\mu_{q}(\mathbf{x}_{t},\mathbf{x}_{0})$ by considering the vector, say $\mathbf{v}$, such that $\mathbf{x}_{t-1}^{T}\mathbf{v}$ is the cross term, that is, the part that is linear in $\mathbf{x}_{t-1}$, in the exponent of (13). We see that

\frac{\mu_{q}(\mathbf{x}_{t},\mathbf{x}_{0})}{\sigma^{2}_{q}(t)}=\mathbf{v}=\frac{\sqrt{\alpha_{t}}\,\mathbf{x}_{t}}{1-\alpha_{t}}+\frac{\sqrt{\overline{\alpha}_{t-1}}\,\mathbf{x}_{0}}{1-\overline{\alpha}_{t-1}}.

Hence, using (15),

\mu_{q}(\mathbf{x}_{t},\mathbf{x}_{0})=\frac{\sqrt{\alpha_{t}}\,(1-\overline{\alpha}_{t-1})\mathbf{x}_{t}+\sqrt{\overline{\alpha}_{t-1}}(1-\alpha_{t})\mathbf{x}_{0}}{1-\overline{\alpha}_{t}}.

(16)

We wish to compute a sample from the distribution in (14). This will allow us to perform the required transition along the backwards process. Our approach is to estimate the mean in (14) and then shift with an appropriate Gaussian in order to match the required variance.

If we know $\mathbf{x}_{t}$ and $\bar{\boldsymbol{\epsilon}}_{t}$ in (8) then we may write

\mathbf{x}_{0}=\frac{\mathbf{x}_{t}-\sqrt{1-\overline{\alpha}_{t}}\,\bar{\boldsymbol{\epsilon}}_{t}}{\sqrt{\overline{\alpha}_{t}}}.

Substituting this expression for $\mathbf{x}_{0}$ into (16) we see that the mean of $\mathbf{x}_{t-1}$ , given $\mathbf{x}_{t}$ and $\mathbf{x}_{0}$ , takes the form

\mu_{q}(\mathbf{x}_{t},\mathbf{x}_{0})=\frac{\sqrt{\alpha_{t}}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}}\mathbf{x}_{t}+\frac{\sqrt{\overline{\alpha}_{t-1}}(1-\alpha_{t})}{(1-\overline{\alpha}_{t})\sqrt{\overline{\alpha}_{t}}}\mathbf{x}_{t}-\frac{\sqrt{\overline{\alpha}_{t-1}}(1-\alpha_{t})\sqrt{1-\overline{\alpha}_{t}}}{(1-\overline{\alpha}_{t})\sqrt{\overline{\alpha}_{t}}}\,\bar{\boldsymbol{\epsilon}}_{t}.

(17)

Noting from (7) that $\overline{\alpha}_{t-1}/\overline{\alpha}_{t}=1/\alpha_{t}$ and $\alpha_{t}\times\overline{\alpha}_{t-1}=\overline{\alpha}_{t}$ , we find that in (17) the coefficient of $\mathbf{x}_{t}$ simplifies as follows:

\frac{\sqrt{\alpha_{t}}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}}+\frac{\sqrt{\overline{\alpha}_{t-1}}(1-\alpha_{t})}{(1-\overline{\alpha}_{t})\sqrt{\overline{\alpha}_{t}}}=\frac{1}{\sqrt{\alpha_{t}}(1-\overline{\alpha}_{t})}\left(\alpha_{t}(1-\overline{\alpha}_{t-1})+1-\alpha_{t}\right)=\frac{1}{\sqrt{\alpha_{t}}}.

Similarly, the coefficient of $\bar{\boldsymbol{\epsilon}}_{t}$ in (17) simplifies to

-\frac{\sqrt{\overline{\alpha}_{t-1}}(1-\alpha_{t})\sqrt{1-\overline{\alpha}_{t}}}{(1-\overline{\alpha}_{t})\sqrt{\overline{\alpha}_{t}}}=-\frac{1-\alpha_{t}}{\sqrt{\alpha_{t}}\sqrt{1-\overline{\alpha}_{t}}}.

Hence, (17) may be written

\mu_{q}(\mathbf{x}_{t},\mathbf{x}_{0})=\frac{1}{\sqrt{\alpha_{t}}}\left(\mathbf{x}_{t}-\frac{1-\alpha_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\,\bar{\boldsymbol{\epsilon}}_{t}\right).

(18)
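
The algebra leading from (16) to (18) can be checked numerically. In the sketch below the numerical values and the function names `mean_from_x0` and `mean_from_noise` are illustrative assumptions; the point is only that the two expressions for the mean agree when $\mathbf{x}_{t}$ is generated from $\mathbf{x}_{0}$ and $\bar{\boldsymbol{\epsilon}}_{t}$ via (8).

```python
import math
import random

def mean_from_x0(x_t, x0, alpha_t, alpha_bar_prev):
    """Posterior mean (16), written in terms of x_t and x_0."""
    alpha_bar_t = alpha_t * alpha_bar_prev
    c_t = math.sqrt(alpha_t) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)
    c_0 = math.sqrt(alpha_bar_prev) * (1.0 - alpha_t) / (1.0 - alpha_bar_t)
    return [c_t * xt + c_0 * x for xt, x in zip(x_t, x0)]

def mean_from_noise(x_t, eps_bar, alpha_t, alpha_bar_prev):
    """Posterior mean (18), written in terms of x_t and the noise appearing in (8)."""
    alpha_bar_t = alpha_t * alpha_bar_prev
    c = (1.0 - alpha_t) / math.sqrt(1.0 - alpha_bar_t)
    return [(xt - c * e) / math.sqrt(alpha_t) for xt, e in zip(x_t, eps_bar)]

rng = random.Random(3)
alpha_t, alpha_bar_prev = 0.98, 0.5           # illustrative values (assumed)
alpha_bar_t = alpha_t * alpha_bar_prev

x0 = [1.0, -1.0, 0.5, 0.0]                    # toy "image" with d = 4
eps_bar = [rng.gauss(0.0, 1.0) for _ in x0]
x_t = [math.sqrt(alpha_bar_t) * x + math.sqrt(1.0 - alpha_bar_t) * e
       for x, e in zip(x0, eps_bar)]          # leap (8)

m16 = mean_from_x0(x_t, x0, alpha_t, alpha_bar_prev)
m18 = mean_from_noise(x_t, eps_bar, alpha_t, alpha_bar_prev)
max_gap = max(abs(u - v) for u, v in zip(m16, m18))
print("largest componentwise gap:", max_gap)  # should be at rounding-error level
```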

The missing ingredient here is $\bar{\boldsymbol{\epsilon}}_{t}$ —the noise that drove the transition from $\mathbf{x}_{0}$ to $\mathbf{x}_{t}$ . To deal with this we will train a neural network to predict $\bar{\boldsymbol{\epsilon}}_{t}$ . After training, the network will be a black box which takes as input

•

a value of $t$ and a noisy image $\mathbf{x}_{t}$

and returns

•

a prediction of $\bar{\boldsymbol{\epsilon}}_{t}$ .

We will denote the prediction by the function $\boldsymbol{\epsilon}_{\theta}$ , where $\theta$ represents the parameters in the neural network—these will be learned during the training phase. In each training step, we select an image $\mathbf{x}_{0}$ from the training set, take a Gaussian $\bar{\boldsymbol{\epsilon}}_{t}$ and form a sample of $\mathbf{x}_{t}$ using (8). The job of the network is to make the output $\boldsymbol{\epsilon}_{\theta}$ as close as possible to $\bar{\boldsymbol{\epsilon}}_{t}$ .

Recalling the expression (14) for the required transition density, using the neural network prediction $\boldsymbol{\epsilon}_{\theta}$ in the expression (18) for the mean, and adjusting the variance using (15), we will obtain $\mathbf{x}_{t-1}$ from

\mathbf{x}_{t-1}=\frac{1}{\sqrt{\alpha_{t}}}\left(\mathbf{x}_{t}-\frac{1-\alpha_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\,\boldsymbol{\epsilon}_{\theta}\right)+\sigma_{q}(t)\,\mathbf{z},

(19)

where $\mathbf{z}$ is a standard Gaussian. This allows us to run the denoising process from $t=T$ to $t=0$ .

Having set up the required expressions, in the next section we outline the resulting training and sampling algorithms.

5 Algorithms

The training process is summarized in Algorithm 1. Here we are applying a basic stochastic gradient method [13]; in step 5 the network parameters are updated using a least-squares loss function applied to a single, randomly chosen training image. This simple least-squares formulation can be justified from a likelihood perspective [4, 15, 16, 34]. The network architecture used for the experiments in section 2 combines residual and attention blocks in a U-Net [33] type structure, motivated by the choice in [15]. Overall, that network has 12.9 Million parameters across 205 layers.

Algorithm 1 Training with the forward process [15]

1: repeat
2:   $\mathbf{x}_{0}\sim q(\mathbf{x}_{0})$   $\triangleright$ choose an image from training set
3:   $t\sim\mathrm{Uniform}(\{1,2,\ldots,T\})$
4:   $\boldsymbol{\epsilon}\sim\mathrm{N}(\mathbf{0},\mathbf{I})$   $\triangleright$ standard Gaussian sample
5:   Take gradient step w.r.t. $\theta$ on $\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_{\theta}(\sqrt{\overline{\alpha}_{t}}\,\mathbf{x}_{0}+\sqrt{1-\overline{\alpha}_{t}}\,\boldsymbol{\epsilon},t)\|_{2}^{2}$
6: until converged
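
To make the steps of Algorithm 1 concrete without a deep learning library, the following toy sketch replaces the neural network by a one-parameter-per-time-step linear model, $\boldsymbol{\epsilon}_{\theta}(x,t)=\theta_{t}\,x$, and uses scalar Gaussian data in place of images. Everything here (the data distribution, $T=10$, the schedule, the learning rate and the iteration count) is an assumption chosen for illustration; it is not the U-Net setup used for the experiments. For this toy problem the least-squares optimum is available in closed form, which gives something to compare against.

```python
import math
import random

rng = random.Random(4)

# Toy setting (all values assumed): scalar "images" x_0 ~ N(0, s^2), T = 10 steps.
s, T = 2.0, 10
betas = [0.05 + (0.7 - 0.05) * i / (T - 1) for i in range(T)]
alpha_bar, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bar.append(running)

# Stand-in for the neural network: eps_theta(x, t) = theta[t-1] * x.
theta = [0.0] * T
learning_rate = 5e-4

for _ in range(200_000):                      # fixed budget in place of "until converged"
    x0 = rng.gauss(0.0, s)                    # step 2: draw a training example
    t = rng.randint(1, T)                     # step 3: uniform time index in {1,...,T}
    eps = rng.gauss(0.0, 1.0)                 # step 4: standard Gaussian sample
    ab = alpha_bar[t - 1]
    x_t = math.sqrt(ab) * x0 + math.sqrt(1.0 - ab) * eps   # leap (8)
    residual = eps - theta[t - 1] * x_t
    # step 5: the gradient of residual^2 w.r.t. theta[t-1] is -2 * residual * x_t
    theta[t - 1] += learning_rate * 2.0 * residual * x_t

# For Gaussian data the least-squares optimum is E[eps * x_t] / E[x_t^2].
theta_star = [math.sqrt(1.0 - ab) / (ab * s**2 + 1.0 - ab) for ab in alpha_bar]
for t in (1, T // 2, T):
    print(f"t={t:2d}  learned={theta[t - 1]:.3f}  optimum={theta_star[t - 1]:.3f}")
```

Because a constant step size is used, the learned coefficients fluctuate around the optimum rather than converging exactly; a smaller learning rate reduces the fluctuation at the cost of slower progress.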

Algorithm 2 summarizes the sampling process. Here we define $\sigma_{q}(1)=0$ , so that only the mean estimate based on (18) is used at $t=1$ .

Algorithm 2 Sampling with the backward process [15]

1: $\mathbf{x}_{T}\sim\mathrm{N}(\mathbf{0},\mathbf{I})$   $\triangleright$ standard Gaussian sample
2: for $t=T,T-1,\ldots,1$ do
3:   $\mathbf{z}\sim\mathrm{N}(\mathbf{0},\mathbf{I})$   $\triangleright$ standard Gaussian sample
4:   $\mathbf{x}_{t-1}=\frac{1}{\sqrt{\alpha_{t}}}\left(\mathbf{x}_{t}-\frac{1-\alpha_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\,\boldsymbol{\epsilon}_{\theta}\right)+\sigma_{q}(t)\,\mathbf{z}$
5: end for
6: return $\mathbf{x}_{0}$
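
Algorithm 2 can be exercised on a toy problem where a perfect noise predictor is known. The sketch below assumes scalar data taking the values $-1$ and $+1$ with equal probability; for that distribution the least-squares minimizer is the conditional expectation of the noise given $x_{t}$, which has a closed form involving $\tanh$. The data distribution, $T=50$, the schedule and the function names are all assumptions made for illustration, and the closed-form predictor stands in for a trained network.

```python
import math
import random

rng = random.Random(5)

# Toy setting (assumed): scalar data with x_0 = -1 or +1, each with probability 1/2.
T = 50
betas = [1e-3 + (0.3 - 1e-3) * i / (T - 1) for i in range(T)]
alphas = [1.0 - b for b in betas]
alpha_bar, running = [], 1.0
for a in alphas:
    running *= a
    alpha_bar.append(running)

def eps_theta(x, t):
    """Stand-in for a trained network: the exact least-squares minimizer E[eps | x_t]."""
    ab = alpha_bar[t - 1]
    x0_hat = math.tanh(math.sqrt(ab) * x / (1.0 - ab))     # E[x_0 | x_t] for +/-1 data
    return (x - math.sqrt(ab) * x0_hat) / math.sqrt(1.0 - ab)

def sigma_q(t):
    """Square root of (15), taking alpha_bar_0 = 1 so that sigma_q(1) = 0."""
    ab_prev = alpha_bar[t - 2] if t > 1 else 1.0
    return math.sqrt((1.0 - alphas[t - 1]) * (1.0 - ab_prev) / (1.0 - alpha_bar[t - 1]))

def sample():
    x = rng.gauss(0.0, 1.0)                              # step 1: x_T
    for t in range(T, 0, -1):                            # step 2: t = T, T-1, ..., 1
        z = rng.gauss(0.0, 1.0)                          # step 3
        a, ab = alphas[t - 1], alpha_bar[t - 1]
        drift = (1.0 - a) / math.sqrt(1.0 - ab) * eps_theta(x, t)
        x = (x - drift) / math.sqrt(a) + sigma_q(t) * z  # step 4, update rule (19)
    return x                                             # step 6: x_0

samples = [sample() for _ in range(2000)]
frac_positive = sum(x > 0 for x in samples) / len(samples)
frac_near_data = sum(abs(abs(x) - 1.0) < 0.05 for x in samples) / len(samples)
print(f"fraction positive = {frac_positive:.3f}, "
      f"fraction within 0.05 of +/-1 = {frac_near_data:.3f}")
```

In this idealized setting the generated values cluster at the two data points, with roughly half of the runs landing on each, even though every run starts from a single standard Gaussian sample.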

6 Furthermore

In this section we touch upon some issues that may have occurred to the reader, and provide references where further information may be found.

How do we judge the performance of generative AI? A generative model must balance the contradictory aims of producing outputs that are plausible (like the training data) and novel (not like the training data). Any attempt to quantify performance must involve somewhat arbitrary choices that allow this circle to be squared. A popular quantitative measure, which focuses on the plausibility aspect, is Fréchet Inception Distance [12]. This measure approximates and compares the probability distributions of the real and synthetic image spaces, under a Gaussian assumption. Some studies also make use of subjective human opinions, which raises new issues, including reproducibility and representativeness.

What are useful applications of diffusion models? Given that the internet already stores a bewildering array of real images, it is reasonable to ask whether the world needs synthetic examples, however realistic. However, in some domains representative artificial data is valuable. In medical imaging, for example, synthetically generated data may help address scarcity, class imbalance and privacy concerns in educational settings [21]. Perhaps the biggest attraction of diffusion models lies in their use within larger systems. A diffusion model for image generation may be viewed as a representation of the hidden, or latent, distribution of real-world images. By conditioning or guiding the image generation according to user-specified requirements, it is then possible to tailor the output to meet certain goals [2, 10, 17, 45]. For example, diffusion forms part of several systems with text-to-image capabilities, including Open AI’s DALL-E 2 [31], Stability.ai’s Dreamstudio [32] and Google’s Imagen [34]. In-painting and overwriting unwanted pixels is also possible [44, 29].

Stable diffusion may also be exploited within ChatGPT-style large language models; an example is Stability.ai’s StableLM-3B-4E1T [39].

How computationally expensive is it to train and employ a diffusion model? For the simple low-resolution examples in section 2, using a pretrained network to produce new images is feasible on a standard desktop machine. However, high resolution image generation with a state-of-the-art pretrained diffusion model is a “high resource intensive and slow task that prohibits interactive experience for the users and results in huge computational cost on expensive GPUs” [1]. The size of many diffusion based models also raises storage issues: “generating high-resolution images with diffusion models is often infeasible on consumer-grade GPUs due to the excessive memory requirements” [29].

Training is a greater challenge. For the examples in section 2 we trained the network for 500 epochs in under 35 minutes on a single NVIDIA GeForce RTX 3090 GPU. It is reported in [40] that training the model in [7] consumes 150-1000 days of NVIDIA V100 GPU time. StableLM-3B-4E1T [39] is a 3 Billion parameter language model trained on 1 Trillion tokens of diverse English and code datasets; a 7 Billion parameter version was later released. Developing smaller-scale versions of such models, or applying the models to compressed latent spaces, is therefore an active area [32, 43].

In terms of power usage when a trained model is deployed, Luccioni et al. [24] estimated that “the most carbon-intensive image generation model (stable-diffusion-xl-base-1.0) generates 1,594 grams of CO2 for 1,000 inferences, which is roughly the equivalent to 4.1 miles driven by an average gasoline-powered passenger vehicle.”

Is it a coincidence that (4) and (19) look similar to a numerical discretization of a stochastic differential equation? It is natural to compare (4) and (19) with the Euler–Maruyama method [14], and indeed there are variations of the forward diffusion model that have a direct correspondence with stochastic differential equations [15, 25, 29, 37]. The reverse process may also be associated with backward stochastic differential equations [40].
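
The resemblance can be made concrete with a short heuristic calculation. The sketch below is not taken from the references above; it introduces two assumed symbols, a step size $\Delta s$ and a function $b(s)$ with $\beta_{t}=b(s_{t})\,\Delta s$ small, purely for illustration.

```latex
% Heuristic sketch: \Delta s and b(s) are assumed notation, with \beta_t = b(s_t)\,\Delta s small.
\sqrt{1-\beta_{t}} = 1 - \tfrac{1}{2}\beta_{t} + O(\beta_{t}^{2}),
\qquad\text{so (2) gives}\qquad
\mathbf{x}_{t} \approx \mathbf{x}_{t-1} - \tfrac{1}{2}\,b(s_{t})\,\mathbf{x}_{t-1}\,\Delta s
  + \sqrt{b(s_{t})}\,\sqrt{\Delta s}\;\boldsymbol{\epsilon}_{t}.
```

Since $\sqrt{\Delta s}\,\boldsymbol{\epsilon}_{t}$ has the distribution of a Brownian increment over a step of length $\Delta s$, this has the form of an Euler–Maruyama step for the linear stochastic differential equation $d\mathbf{X}=-\tfrac{1}{2}b(s)\,\mathbf{X}\,ds+\sqrt{b(s)}\,d\mathbf{W}$.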

What about the dark side: ethics, privacy, bias and related concerns? Carlini et al. [5] showed that diffusion models have a tendency to memorize and reproduce training images. For tests on Stable Diffusion [32] and Imagen [34] they were able to “extract over a hundred near-identical replicas of training images that range from personally identifiable photos to trademarked logos.” Somepalli et al. [36] also found examples where a diffusion model “blatantly copies” from training data. The resulting harms to professional artists are considered in [20]; these include “reputational damage, economic loss, plagiarism and copyright infringement.” When we move into the realm of text-to-image algorithms there are many further issues to consider, including fairness, toxicity and trust [11].

The figures in section 2 indicate that the output from a simple diffusion model is difficult to predict and hence to interpret. In particular, very different results can be generated from the same input. Explainable AI is a serious challenge in this setting.

On a more general note, any machine learning algorithm is likely to reflect the biases, omissions and errors in the training set [28]. See [18] for a proposed framework for data transparency.

We also mention that discussions around ethics in this field often assume that AI is, or will become, all-powerful, thereby overlooking empirical observations that these systems may fail to operate as intended—the so-called fallacy of AI functionality [30]. So, as well as the important question of what tasks should AI be used for, we must also ask what tasks can AI reliably perform. This latter issue is ripe for mathematical and statistical contributions.

Using generative AI to create content (text, images, music, videos, and so on) that is difficult or impossible to discriminate from human generated content may allow fakery and conspiracy theories to undermine societal safety and benefits. This begets novel risks that are already upon us, identified in part by the inaugural AI Safety Summit which met at Bletchley Park in November 2023 (see https://www.gov.uk/government/publications/ai-safety-summit-2023-chairs-statement-2-november/chairs-summary-of-the-ai-safety-summit-2023-bletchley-park). Arguably, some of the decadal data science focus on ethics and privacy should have been redirected towards the societal risk of fake truths and the widespread inability to discriminate between content; and the introduction of bias. These risks now require an in-depth consideration, as we seek to uncover and tackle the full range of possibilities. An understanding of the mathematical foundations of generative AI methods will be a key to ensuring transparency.

7 PDEs

For many applied mathematicians, diffusion is synonymous with certain parabolic PDEs. Here we present some speculative material that aims to draw a PDE connection with the process described in section 3. The notion of continuously re-normalizing a diffusion process takes us outside the realm of standard textbook analysis, and opens up some issues that may be of independent interest. Depending on our choice of basic PDE there are several ways to ensure that the norm of some derivative of the solution remains unchanged over time. Here we illustrate this general idea by continuously re-scaling to preserve the norm of the gradient of the solution, that is, the total variation, over the domain.

Consider a real-valued field $u(x,t)$, where $x\in\Omega$, a bounded domain in $\mathbb{R}^{d}$ with a piecewise smooth boundary, $\partial\Omega$, and time $t\geq 0$, satisfying

$$u_{t}=\Delta u+r(t)u,\quad x\in\Omega,\quad\nabla u.{\bf n}=0,\quad x\in\partial\Omega, \tag{20}$$

for a suitable given initial condition $u(x,0)=u_{0}(x)$. Here $r(t)>0$ is a shadow time-dependent variable (akin to a Lagrange multiplier), which continuously rescales $u$ so that the $L^{2}$ norm of the gradient of $u$ is preserved. More explicitly, $r(t)$ must be such that

$$\int_{\Omega}||\nabla u||^{2}\,dx:=\int_{\Omega}\nabla u.\nabla u\,dx\equiv R\quad\mathrm{(constant)},\ \ t\geq 0. \tag{21}$$

Here $R>0$ is determined from the initial condition.

Taking the gradient of (20), then forming the scalar product with $\nabla u$, and integrating over $\Omega$, we obtain

$$\frac{1}{2}\,\frac{d\ }{dt}\int_{\Omega}\nabla u.\nabla u\,dx=\int_{\Omega}{\nabla u.\nabla(\Delta u)}\,dx+r(t)\int_{\Omega}{\nabla u.\nabla u}\,dx.$$

But, by direct calculation,

$$\nabla.(\nabla u\,\Delta u)=(\Delta u)^{2}+\nabla u.\nabla(\Delta u).$$

So, using the divergence theorem [26] together with the no-flux boundary condition, in order to ensure that $R$ is constant we must set

$$R\,r(t)=\int_{\Omega}{(\Delta u)^{2}}\,dx\geq 0.$$
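
For readers who wish to see the intermediate step, the argument may be spelled out as in the following sketch. The symbol $dS$ for the surface element on $\partial\Omega$ is notation introduced here for illustration; no assumption is used beyond the no-flux condition in (20) and the constraint (21).

```latex
\begin{align*}
\int_{\Omega}\nabla u.\nabla(\Delta u)\,dx
  &= \int_{\Omega}\nabla.(\nabla u\,\Delta u)\,dx-\int_{\Omega}(\Delta u)^{2}\,dx \\
  &= \int_{\partial\Omega}\Delta u\,\nabla u.{\bf n}\,dS-\int_{\Omega}(\Delta u)^{2}\,dx
   = -\int_{\Omega}(\Delta u)^{2}\,dx, \\
\frac{1}{2}\,\frac{d\ }{dt}\int_{\Omega}\nabla u.\nabla u\,dx
  &= -\int_{\Omega}(\Delta u)^{2}\,dx+r(t)\,R .
\end{align*}
```

Requiring the left-hand side of the final line to vanish gives the expression for $R\,r(t)$ stated above.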

Thus we may write (20) and (21) as the nonlinear integro-differential equation

$$u_{t}=\Delta u+\frac{u}{R}\int_{\Omega}{(\Delta u)^{2}}\,dx,\quad x\in\Omega,\quad\nabla u.{\bf n}=0,\quad x\in\partial\Omega. \tag{22}$$

Here, as before, $R$ in (22) is set by the initial condition: $R=\int_{\Omega}||\nabla u_{0}(x)||^{2}\,dx$.
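
The conservation property behind (22) can be checked on a grid. The NumPy sketch below is a minimal illustration under assumptions that are not part of the text: a one-dimensional domain $[0,1]$, a cell-centred grid, a standard symmetric second-difference matrix with no-flux ends (the helper `neumann_laplacian` is hypothetical), and an arbitrary sample field `u`. It evaluates the discrete analogue of $r$ and confirms that the resulting time derivative of the discrete gradient norm vanishes up to rounding error.

```python
import numpy as np

def neumann_laplacian(n, h):
    """Symmetric second-difference matrix with no-flux ends (hypothetical helper)."""
    L = np.zeros((n, n))
    idx = np.arange(n)
    L[idx, idx] = -2.0
    L[idx[:-1], idx[:-1] + 1] = 1.0   # super-diagonal
    L[idx[1:], idx[1:] - 1] = 1.0     # sub-diagonal
    L[0, 0] = L[-1, -1] = -1.0        # reflecting (no-flux) end cells
    return L / h**2

n = 64
h = 1.0 / n
x = (np.arange(n) + 0.5) * h          # cell centres in [0, 1]

# Arbitrary sample field, chosen only for illustration.
u = np.cos(np.pi * x) + 0.3 * np.cos(3 * np.pi * x) + 0.1 * x**2

L = neumann_laplacian(n, h)
Lu = L @ u

grad_norm_sq = -h * u @ Lu            # discrete version of int |u_x|^2 dx
lap_norm_sq = h * Lu @ Lu             # discrete version of int (u_xx)^2 dx
r = lap_norm_sq / grad_norm_sq        # discrete version of R r = int (Laplacian u)^2 dx

u_t = Lu + r * u                      # right-hand side of the rescaled equation
dR_dt = -2.0 * h * u @ (L @ u_t)      # time derivative of grad_norm_sq (L is symmetric)

print("r =", r)
print("d/dt of discrete gradient norm =", dR_dt)
```

Because the matrix is symmetric, the cancellation is an algebraic identity of the discretization rather than an approximation, which mirrors the integration-by-parts argument in the continuous setting.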

For any $R>0$, the constrained equation, (20) and (21), has infinitely many possible steady states, each of which is of the form

$$u=\mu_{k}\phi_{k}(x)\quad{\rm and}\quad r=\lambda_{k},$$

where $(\phi_{k}(x),\lambda_{k})$ is the $k$th ($k=0,1,2,\ldots$) eigenfunction-eigenvalue pair for the Laplacian on $\Omega$ with no-flux boundary conditions, with the sign convention $-\Delta\phi_{k}=\lambda_{k}\phi_{k}$, so that $\lambda_{k}\geq 0$. However, $\mu_{k}$ must satisfy

$$\mu_{k}^{2}=\frac{R}{\int_{\Omega}||\nabla\phi_{k}||^{2}\,dx},$$

and so $k\geq 1$, since the simplest eigenfunction satisfies $||\nabla\phi_{0}||\equiv 0$.

Now consider small perturbations around the $k$th steady state. We set

$$\begin{aligned} u &= \mu_{k}\left(\phi_{k}(x)+\epsilon e^{\sigma t}v(x)\right)+O(\epsilon^{2}),\\ r(t) &= \lambda_{k}+\epsilon\beta e^{\sigma t}+O(\epsilon^{2}), \end{aligned}$$

for some $v(x)$ and constants $(\sigma,\beta)$ to be determined. Substituting these expressions into (20), to $O(\epsilon)$ we obtain

$$0=\Delta v+(\lambda_{k}-\sigma)v+\beta\phi_{k},\quad x\in\Omega,\quad\nabla v.{\bf n}=0,\quad x\in\partial\Omega. \tag{23}$$

Now setting $v=\phi_{\tilde{k}}$, for $k\neq\tilde{k}$ (since the eigenfunctions form an orthonormal basis for the solution space), (23) yields

$$0=(\lambda_{k}-\lambda_{\tilde{k}}-\sigma)\phi_{\tilde{k}}+\beta\phi_{k},\quad x\in\Omega,$$

so that

$$\sigma=\lambda_{k}-\lambda_{\tilde{k}}\ {\rm and}\ \beta=0.$$

The condition on $\sigma$ implies that each steady state is stable with respect to perturbations in all higher eigenmodes, where $k<\tilde{k}$ (so that $\sigma<0$), yet is unstable with respect to any perturbations in lower eigenmodes, $k>\tilde{k}$ (so that $\sigma>0$). Thus over a long time the solution must decay to the first eigenmode, $\mu_{1}\phi_{1}(x)$.

This is borne out in the numerical experiment shown in Figure 7. Here, for a one-dimensional domain $\Omega=[0,1]$, we have $\phi_{k}(x)=\sqrt{2}\cos k\pi x$ and $\lambda_{k}=k^{2}\pi^{2}$.
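
The long-time behaviour can also be explored through a modal reduction. Writing $u(x,t)=\sum_{k\geq 1}a_{k}(t)\phi_{k}(x)$ and using orthonormality, (20) and (21) reduce to $a_{k}'=(r-\lambda_{k})a_{k}$ with $r=\sum_{k}\lambda_{k}^{2}a_{k}^{2}/\sum_{k}\lambda_{k}a_{k}^{2}$. The Python sketch below integrates this system; it is illustrative only and is not necessarily the method behind Figure 7. The choices of six modes, the initial amplitudes, the step size, the classical Runge-Kutta integrator and the function name `simulate_modes` are all assumptions made here. The initial data has zero mean, so the constant mode $\phi_{0}$ is not excited, and $r$ is evaluated from the current state, which agrees with (22) whenever $R$ is conserved.

```python
import numpy as np

def simulate_modes(a0, lam, dt=2e-4, n_steps=5000):
    """Integrate a_k' = (r - lam_k) a_k with classical RK4 (illustrative sketch)."""
    lam = np.asarray(lam, dtype=float)
    a = np.asarray(a0, dtype=float).copy()

    def rhs(a):
        grad_norm_sq = np.sum(lam * a**2)      # int |grad u|^2 dx in modal form
        lap_norm_sq = np.sum(lam**2 * a**2)    # int (Laplacian u)^2 dx in modal form
        r = lap_norm_sq / grad_norm_sq         # rescaling factor from the current state
        return (r - lam) * a

    for _ in range(n_steps):
        k1 = rhs(a)
        k2 = rhs(a + 0.5 * dt * k1)
        k3 = rhs(a + 0.5 * dt * k2)
        k4 = rhs(a + dt * k3)
        a = a + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return a

K = 6                                           # number of retained modes (assumption)
lam = (np.arange(1, K + 1) * np.pi) ** 2        # lambda_k = k^2 pi^2 on [0, 1]
a0 = np.array([0.3, -0.5, 0.4, 0.2, -0.1, 0.05])  # arbitrary initial amplitudes

R0 = np.sum(lam * a0**2)                        # R fixed by the initial condition
a_final = simulate_modes(a0, lam)               # state at time n_steps * dt = 1
R_final = np.sum(lam * a_final**2)
mu1 = np.sqrt(R0 / lam[0])                      # predicted amplitude of the first mode

print("relative drift in R:", abs(R_final - R0) / R0)
print("final amplitudes:", a_final)
print("predicted mu_1:", mu1)
```

In this reduced system the ratio $a_{\tilde{k}}/a_{k}$ evolves exactly like $e^{-(\lambda_{\tilde{k}}-\lambda_{k})t}$, which is consistent with the growth rate $\sigma$ found in the linear stability analysis.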

This framework and others in physics (see for example [41, 42]) suggest a number of ways that generative AI might exploit, or be explained by, PDE theory. The central challenge is being able to run the processes in reverse in order to generate plausible content from randomness.

Of course, in the example above the parabolicity of the forward evolution, (20) and (21), means that, formally, the backward equation is ill-posed: no global solution is guaranteed, and both discontinuities and point masses may occur as time moves backwards. Hence any backward approximation would require some mollification, perhaps via a numerical solution that leverages finite resolution and norm-preserving properties.

More generally, one would also wish to consider (i) how these norm-preserving (re-scaling) diffusion systems behave when they are subject to stochastic forcing over a finite time, and (ii) how such processes might be run backwards in seeking to recover or approximate the initial conditions under various Markovian assumptions.

Data Statement

Code for the experiments presented here will be made available upon publication.

References

[1] S. Agarwal, S. Mitra, S. Chakraborty, S. Karanam, K. Mukherjee, and S. Saini, Approximate caching for efficiently serving diffusion models, arXiv:2312.04429, (2023).
[2] A. Bansal, H.-M. Chu, A. Schwarzschild, S. Sengupta, M. Goldblum, J. Geiping, and T. Goldstein, Universal guidance for diffusion models, in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2023, pp. 843–852.
[3] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), Springer, Berlin, 2007.
[4] Y. Cao, S. Li, Y. Liu, Z. Yan, Y. Dai, P. S. Yu, and L. Sun, A comprehensive survey of AI-generated content (AIGC): A history of generative AI from GAN to ChatGPT, arXiv:2303.04226, (2023).
[5] N. Carlini, J. Hayes, M. Nasr, M. Jagielski, V. Sehwag, F. Tramèr, B. Balle, D. Ippolito, and E. Wallace, Extracting training data from diffusion models, in Proceedings of the 32nd USENIX Conference on Security Symposium, SEC ’23, USA, 2023, USENIX Association.
[6] F.-A. Croitoru, V. Hondru, R. T. Ionescu, and M. Shah, Diffusion models in vision: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence, 45 (2023), pp. 10850–10869.
[7] P. Dhariwal and A. Nichol, Diffusion models beat GANs on image synthesis, in Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, eds., vol. 34, Curran Associates, Inc., 2021, pp. 8780–8794.
[8] S. Feuerriegel, J. Hartmann, C. Janiesch, and P. Zschech, Generative AI, Business and Information Systems Engineering, (2023).
[9] L. Girin, S. Leglaive, X. Bie, J. Diard, T. Hueber, and X. Alameda-Pineda, Dynamical variational autoencoders: A comprehensive review, Foundations and Trends in Machine Learning, 15 (2021), pp. 1–175.
[10] R. Gozalo-Brizuela and E. C. Garrido-Merchán, A survey of generative AI applications, arXiv:2306.02781, (2023).
[11] S. Hao, P. Kumar, S. Laszlo, S. Poddar, B. Radharapu, and R. Shelby, Safety and fairness for content moderation in generative models, in CVPR Workshop on Ethical Considerations in Creative applications of Computer Vision, 2023.
[12] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, GANs trained by a two time-scale update rule converge to a local Nash equilibrium, in Advances in Neural Information Processing Systems, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, eds., Long Beach, CA, USA, 2017, pp. 6626–6637.
[13] C. F. Higham and D. J. Higham, Deep learning: An introduction for applied mathematicians, SIAM Review, 61 (2019), pp. 860–891.
[14] D. J. Higham and P. E. Kloeden, An introduction to the numerical simulation of stochastic differential equations, SIAM, Philadelphia, 2021.
[15] J. Ho, A. Jain, and P. Abbeel, Denoising diffusion probabilistic models, in Proceedings of the 34th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2020, Curran Associates Inc.
[16] J. Ho, C. Saharia, W. Chan, D. J. Fleet, M. Norouzi, and T. Salimans, Cascaded diffusion models for high fidelity image generation, J. Mach. Learn. Res., 23 (2022).
[17] J. Ho and T. Salimans, Classifier-free diffusion guidance, in NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications.
[18] B. Hutchinson, A. Smart, A. Hanna, E. Denton, C. Greer, O. Kjartansson, P. Barnes, and M. Mitchell, Towards accountability for machine learning datasets: Practices from software engineering and infrastructure, in Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, New York, NY, USA, 2021, Association for Computing Machinery, pp. 560–575.
[19] G. Iglesias, E. Talavera, and A. Díaz-Álvarez, A survey on GANs for computer vision: Recent research, analysis and taxonomy, Computer Science Review, 48 (2023), p. 100553.
[20] H. H. Jiang, L. Brown, J. Cheng, M. Khan, A. Gupta, D. Workman, A. Hanna, J. Flowers, and T. Gebru, AI art and its impact on artists, in Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’23, New York, NY, USA, 2023, Association for Computing Machinery, pp. 363–374.
[21] A. Kazerouni, E. K. Aghdam, M. Heidari, R. Azad, M. Fayyaz, I. Hacihaliloglu, and D. Merhof, Diffusion models in medical imaging: A comprehensive survey, Medical Image Analysis, 88 (2023), p. 102846.
[22] D. P. Kingma and M. Welling, Auto-encoding variational Bayes, in 2nd International Conference on Learning Representations, 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.
[23] Y. LeCun, C. Cortes, and C. J. C. Burges, The MNIST database of handwritten digits.
[24] A. S. Luccioni, Y. Jernite, and E. Strubell, Power hungry processing: Watts driving the cost of AI deployment?, arXiv:2311.16863, (2023).
[25] C. Luo, Understanding diffusion models: A unified perspective, arXiv:2208.11970, (2022).
[26] P. C. Matthews, Vector Calculus, Springer, Berlin, 1998.
[27] K. P. Murphy, Probabilistic Machine Learning: An introduction, MIT Press, Boston, 2022.
[28] A. Paullada, I. D. Raji, E. M. Bender, E. Denton, and A. Hanna, Data and its (dis) contents: A survey of dataset development and use in machine learning research, Patterns, 2 (2021).
[29] R. Po, W. Yifan, V. Golyanik, K. Aberman, J. T. Barron, A. H. Bermano, E. R. Chan, T. Dekel, A. Holynski, A. Kanazawa, C. K. Liu, L. Liu, B. Mildenhall, M. Nießner, B. Ommer, C. Theobalt, P. Wonka, and G. Wetzstein, State of the art on diffusion models for visual computing, arXiv:2310.07204, (2023).
[30] I. D. Raji, I. E. Kumar, A. Horowitz, and A. Selbst, The fallacy of AI functionality, in Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’22, New York, NY, USA, 2022, Association for Computing Machinery, pp. 959–972.
[31] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, Hierarchical text-conditional image generation with CLIP latents, arXiv:2204.06125, (2022).
[32] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, High-resolution image synthesis with latent diffusion models, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10684–10695.
[33] O. Ronneberger, P. Fischer, and T. Brox, U-Net: Convolutional networks for biomedical image segmentation, in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi, eds., Cham, 2015, Springer International Publishing, pp. 234–241.
[34] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans, J. Ho, D. J. Fleet, and M. Norouzi, Photorealistic text-to-image diffusion models with deep language understanding, in Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, eds., vol. 35, Curran Associates, Inc., 2022, pp. 36479–36494.
[35] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, Deep unsupervised learning using nonequilibrium thermodynamics, in Proceedings of the 32nd International Conference on Machine Learning, F. Bach and D. Blei, eds., vol. 37 of Proceedings of Machine Learning Research, Lille, France, 2015, PMLR, pp. 2256–2265.
[36] G. Somepalli, V. Singla, M. Goldblum, J. Geiping, and T. Goldstein, Diffusion art or digital forgery? Investigating data replication in diffusion models, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023, pp. 6048–6058.
[37] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, Score-based generative modeling through stochastic differential equations, in International Conference on Learning Representations, 2021.
[38] Y. Tay, M. Dehghani, D. Bahri, and D. Metzler, Efficient transformers: A survey, ACM Comput. Surv., 55 (2022).
[39] J. Tow, M. Bellagente, D. Mahan, and C. R. Ruiz, StableLM-3B-4E1T, tech. rep., Stability-AI, 2023.
[40] Z. Wang, Score-based generative modeling through backward stochastic differential equations: Inversion and generation, arXiv:2311.16863, (2023).
[41] Y. Xu, Z. Liu, M. Tegmark, and T. Jaakkola, Poisson flow generative models, in Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, eds., vol. 35, Curran Associates, Inc., 2022, pp. 16782–16795.
[42] Y. Xu, Z. Liu, Y. Tian, S. Tong, M. Tegmark, and T. Jaakkola, PFGM++: Unlocking the potential of physics-inspired generative models, in Proceedings of the 40th International Conference on Machine Learning, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, eds., vol. 202 of Proceedings of Machine Learning Research, 2023, pp. 38566–38591.
[43] X. Yang, D. Zhou, J. Feng, and X. Wang, Diffusion probabilistic model made slim, in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 22552–22562.
[44] A. B. Yildirim, V. Baday, E. Erdem, A. Erdem, and A. Dundar, Inst-Inpaint: Instructing to remove objects with diffusion models, arXiv:2304.03246, (2023).
[45] L. Zhang, A. Rao, and M. Agrawala, Adding conditional control to text-to-image diffusion models, in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3836–3847.