Diffusion Models and Representation Learning: A Survey

Michael Fuest, **chuan Ma, Ming Gui, Johannes S. Fischer, Vincent Tao Hu, Björn Ommer

Michael Fuest is a Master’s student at the Technical University of Munich. **chuan Ma, Ming Gui, and Johannes S. Fischer are PhD students at LMU Munich. Vincent Tao Hu, a PostDoc at LMU Munich, is also the corresponding author. E-mail: [email protected]
Björn Ommer is a full professor at LMU, where he heads the Computer Vision & Learning Group (previously Computer Vision Group Heidelberg).
Abstract

Diffusion models are popular generative modeling methods for a wide range of vision tasks and have attracted significant attention. They can be considered a unique instance of self-supervised learning methods due to their independence from label annotation. This survey explores the interplay between diffusion models and representation learning. It provides an overview of diffusion models’ essential aspects, including mathematical foundations, popular denoising network architectures, and guidance methods. Various approaches related to diffusion models and representation learning are detailed. These include frameworks that leverage representations learned by pre-trained diffusion models for subsequent recognition tasks, and methods that utilize advancements in representation and self-supervised learning to enhance diffusion models. This survey aims to offer a comprehensive overview and taxonomy of the interplay between diffusion models and representation learning, identifying key open concerns and areas for potential exploration. GitHub link: https://github.com/dongzhuoyao/Diffusion-Representation-Learning-Survey-Taxonomy.

Index Terms:
deep generative modeling, diffusion models, denoising diffusion models, score-based models, image generation, representation learning.

1 Introduction

Diffusion models [151, 154, 68] have recently emerged as the state of the art in generative modeling, demonstrating remarkable results in image synthesis [68, 67, 43, 141] and across other modalities including natural language [9, 70, 101, 77], computational chemistry [6, 71] and audio synthesis [92, 109, 80]. These generative capabilities suggest that diffusion models learn both low- and high-level features of their input data, potentially making them well-suited for general representation learning. Unlike other generative models such as Generative Adversarial Networks (GANs) [53, 84, 22] and Variational Autoencoders (VAEs) [88, 137], diffusion models do not contain fixed architectural components that capture data representations [124]. This makes diffusion model-based representation learning challenging. Nevertheless, approaches leveraging diffusion models for representation learning have seen increasing interest, driven at the same time by advancements in the training and sampling of diffusion models.

Figure 1: Shows yearly numbers of both published and preprint papers on diffusion models and representation learning. For 2024, the green bar indicates the number of papers collected up to and including June 2024, and the dashed grey bar indicates the projected number for the whole year.

Current state-of-the-art self-supervised representation learning approaches [33, 55, 24, 8] have demonstrated great scalability. It is thus likely that diffusion models exhibit similar scaling properties [159]. Controlled generation approaches like Classifier Guidance [43] and Classifier-free Guidance [67] used to obtain state-of-the-art generation results rely on annotated data, which represents a bottleneck for scaling up diffusion models. Guidance approaches that leverage representation learning and that are thus annotation-free offer a solution, potentially enabling diffusion models to train on much larger, annotation-free datasets.

This survey paper aims to elucidate the relationship and interplay between diffusion models and representation learning. We highlight two central perspectives: Using diffusion models themselves for representation learning and using representation learning for improving diffusion models. We introduce a taxonomy of current approaches and derive generalized frameworks that demonstrate commonalities among current approaches.

Interest in exploring the representation learning capabilities of diffusion models has been growing since the original formulation of diffusion models by Sohl-Dickstein et al. [151], Ho et al. [68], Song et al. [154]. As demonstrated in Fig. 1, we expect this trend to continue this year. The increased volume of published works on diffusion models and representation learning makes it more difficult for researchers to identify state-of-the-art approaches and stay on top of current developments. This can hinder progress in the space, which is why we feel a comprehensive overview and categorization is required.

Research on representation learning and diffusion models is in its infancy. Many of the current approaches rely on using diffusion models solely trained for generative synthesis for representation learning. We therefore hypothesize that there are significant opportunities for further progress in this area in the future and that diffusion models can increasingly challenge the current state-of-the-art in representation learning. Fig. 2 shows qualitative results from existing methods. We hope that this survey can contribute to advances in diffusion-based representation learning, by clarifying commonalities and differences among current approaches. In summary, the main contributions of this paper are the following:

  • Comprehensive Overview: We offer a thorough survey of the interplay between diffusion models and representation learning, clarifying how diffusion models can be used for representation learning and vice versa.

  • Taxonomy of Approaches: We introduce a taxonomy of current approaches in diffusion-based representation learning, categorizing them and highlighting their commonalities and differences.

  • Generalized Frameworks: We derive generalized frameworks for both diffusion model feature extraction and assignment-based guidance, offering a structured view of a large number of works on diffusion models and representation learning.

  • Future Directions: We identify key opportunities for further progress in the field, encouraging the exploration of diffusion models and flow matching as a new state-of-the-art in representation learning.

Figure 2: Left: Shows qualitative generation results from diffusion models conditioned using self-supervised guidance signals. Right: Shows qualitative results of downstream image tasks that leverage representations learned in training diffusion models. Adapted from Li et al. [100], Hu et al. [73], Pan et al. [130], Baranchuk et al. [15], Yang and Wang [173].

2 Background

The following section outlines the required mathematical foundations of diffusion models. We also highlight current architecture backbones of diffusion models and provide a brief overview of sampling methods and conditional generation approaches.

2.1 Mathematical Foundations

Consider a set of training examples drawn from an underlying probability distribution $p(\mathbf{x})$. The idea behind generative diffusion models is to learn a denoising process that maps samples of random noise to novel images sampled from $p(\mathbf{x})$ [133]. To achieve this, images are corrupted by gradually adding different levels of Gaussian noise. Given an uncorrupted training sample $\mathbf{x}_0 \sim p(\mathbf{x})$, where the index $0$ denotes that the sample is not corrupted, the corrupted samples $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T$ are generated according to a Markovian process. One common choice for the transition kernel $p(\mathbf{x}_t \mid \mathbf{x}_{t-1})$ is the following:

$$p(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\big(\mathbf{x}_t; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t \mathbf{I}\big), \quad \forall t \in \{1, \ldots, T\}, \tag{1}$$

where $T$ denotes the number of diffusion timesteps, $\beta_t$ is a time-dependent variance schedule and $\mathbf{I}$ is an identity matrix with dimensionality equal to that of $\mathbf{x}_0$ [37]. Note that other parametrizations of the transition kernel $p(\mathbf{x}_t \mid \mathbf{x}_{t-1})$ are also applicable in the same manner [87, 188]. We proceed with the parametrization used in DDPMs [68] to simplify the discussion moving forward. A noisy image $\mathbf{x}_t$ can be sampled directly from $\mathbf{x}_0$ with the help of a reparametrization trick [151] as follows:

$$p(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_t; \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0, (1-\bar{\alpha}_t)\mathbf{I}\big), \tag{2}$$

where $\alpha_t := 1-\beta_t$ and $\bar{\alpha}_t := \prod_{i=1}^{t} \alpha_i$. Given the original input image $\mathbf{x}_0$, we can now obtain $\mathbf{x}_t$ in one step by sampling a Gaussian vector $\bm{\epsilon}_t \sim \mathcal{N}(0, \mathbf{I})$ and applying:

$$\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\bm{\epsilon}_t. \tag{3}$$
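The one-step sampling in Equation 3 can be sketched in a few lines. The following is a minimal illustration rather than a reference implementation: the linear schedule and its endpoints (`beta_start`, `beta_end`), the helper names, and the three-pixel toy input are assumptions made for this example, and plain Python lists stand in for image tensors.

```python
import math
import random


def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Return [beta_1, ..., beta_T], linearly spaced (endpoint values are an assumption)."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]


def cumulative_alphas(betas):
    """Return [alpha_bar_1, ..., alpha_bar_T] with alpha_bar_t = prod_{i<=t} (1 - beta_i)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ p(x_t | x_0) in one step (Eq. 3); t is 1-indexed."""
    a_bar = alpha_bars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
          for x, e in zip(x0, eps)]
    return xt, eps


rng = random.Random(0)
betas = linear_beta_schedule(T=1000)
alpha_bars = cumulative_alphas(betas)
x0 = [1.0, -0.5, 0.25]                             # toy 3-pixel "image"
x_early, _ = q_sample(x0, 1, alpha_bars, rng)      # barely perturbed
x_late, _ = q_sample(x0, 1000, alpha_bars, rng)    # dominated by noise
```

With this schedule, $\bar{\alpha}_1$ is close to one, so `x_early` stays near `x0`, while $\bar{\alpha}_T$ is close to zero, so `x_late` is almost pure Gaussian noise.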

We can generate novel samples from $p(\mathbf{x}_0)$ by starting from a pure noise image $\mathbf{x}_T \sim \pi(\mathbf{x}_T) = \mathcal{N}(0, \mathbf{I})$ with dimensionality equivalent to the data and sequentially denoising it, where every step samples from $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \mu_\theta(\mathbf{x}_t, t), \Sigma_\theta(\mathbf{x}_t, t))$.
In practice, this requires training a neural network $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ that predicts the mean $\mu_\theta(\mathbf{x}_t, t)$ and the covariance $\Sigma_\theta(\mathbf{x}_t, t)$ given a diffusion timestep $t$ and the noisy input image $\mathbf{x}_t$ [172]. Training this neural network with a maximum likelihood objective is intractable [37], so the objective is amended to minimize a variational bound on the negative log-likelihood instead [151, 68]:

$$\mathcal{L}_{vlb} = -\log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1) + D_{KL}\big(p(\mathbf{x}_T \mid \mathbf{x}_0) \,\|\, \pi(\mathbf{x}_T)\big) + \sum_{t>1} D_{KL}\big(p(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\big), \tag{4}$$

where $D_{KL}$ is the Kullback-Leibler divergence. This objective ensures that the neural network is trained to minimize the distance between $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ and the true posterior of the forward process when conditioned on $\mathbf{x}_0$. The denoising network is generally applied to parametrize the reverse mean $\mu_\theta(\mathbf{x}_t, t)$ of the distribution of the reverse transition $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) := \mathcal{N}(\mathbf{x}_{t-1}; \mu_\theta(\mathbf{x}_t, t), \Sigma_\theta(\mathbf{x}_t, t))$ [27].
The true value of the reverse mean is a function of $\mathbf{x}_0$, which is unknown in the reverse process and must therefore be estimated using the input timestep $t$ and the noisy data $\mathbf{x}_t$. Specifically, the reverse mean is formulated as the following:

$$\mu(\mathbf{x}_t, t) := \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})\,\mathbf{x}_t + \sqrt{\bar{\alpha}_{t-1}}\,(1-\alpha_t)\,\mathbf{x}_0}{1-\bar{\alpha}_t}, \tag{5}$$

where the original data $\mathbf{x}_0$ is unavailable in the reverse process and must therefore be estimated. We denote the denoising network’s prediction of the original data as $\hat{\mathbf{x}}_0$. This prediction $\hat{\mathbf{x}}_0$ can then be used to obtain $\mu_\theta(\mathbf{x}_t, t)$ using Equation 5. Parametrizing with $\hat{\mathbf{x}}_0$ directly is beneficial at the beginning of sampling, since predicting $\hat{\mathbf{x}}_0$ directly helps the denoising network to learn higher-level structural features [115].

Ho et al. [68] suggest fixing the covariance $\Sigma_\theta(\mathbf{x}_t, t)$ to a constant value, which enables rewriting the parametrized reverse mean as a function of the added noise $\bm{\epsilon}(\mathbf{x}_t, t)$ instead of $\mathbf{x}_0$:

$$\mu_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\bm{\epsilon}_\theta(\mathbf{x}_t, t)\right). \tag{6}$$
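For readers who want to see why the noise parametrization follows from the posterior mean, the short derivation below substitutes Equation 3 into the posterior mean of the forward process. It is a sketch that treats the covariance as fixed, uses the standard DDPM form of the posterior mean, and relies only on the identity $\bar{\alpha}_t = \alpha_t \bar{\alpha}_{t-1}$; no new quantities are introduced.

```latex
\begin{align*}
\mathbf{x}_0 &= \frac{\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\bm{\epsilon}_t}{\sqrt{\bar{\alpha}_t}}
  && \text{(Equation 3 solved for } \mathbf{x}_0\text{)} \\
\mu(\mathbf{x}_t, t)
  &= \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})\,\mathbf{x}_t
     + \sqrt{\bar{\alpha}_{t-1}}\,(1-\alpha_t)\,\mathbf{x}_0}{1-\bar{\alpha}_t} \\
  &= \frac{\alpha_t (1-\bar{\alpha}_{t-1}) + (1-\alpha_t)}{\sqrt{\alpha_t}\,(1-\bar{\alpha}_t)}\,\mathbf{x}_t
     - \frac{(1-\alpha_t)\sqrt{1-\bar{\alpha}_t}}{\sqrt{\alpha_t}\,(1-\bar{\alpha}_t)}\,\bm{\epsilon}_t
  && \text{(using } \sqrt{\bar{\alpha}_{t-1}} / \sqrt{\bar{\alpha}_t} = 1/\sqrt{\alpha_t}\text{)} \\
  &= \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t
     - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\bm{\epsilon}_t \right)
  && \text{(since } \alpha_t (1-\bar{\alpha}_{t-1}) + 1 - \alpha_t = 1 - \bar{\alpha}_t\text{)}
\end{align*}
```

Replacing the true noise $\bm{\epsilon}_t$ with the network prediction $\bm{\epsilon}_\theta(\mathbf{x}_t, t)$ gives the noise-based form of the reverse mean in Equation 6.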

This reparametrization allows for the derivation of a simplification of the objective $\mathcal{L}_{vlb}$, which we denote $\mathcal{L}_{simple}$, that measures the distance between the predicted noise $\bm{\epsilon}_\theta(\mathbf{x}_t, t)$ and the actual noise $\bm{\epsilon}_t$ as follows:

$$\mathcal{L}_{simple} = \mathbb{E}_{t \sim [1,T]}\,\mathbb{E}_{\mathbf{x}_0 \sim p(\mathbf{x}_0)}\,\mathbb{E}_{\bm{\epsilon}_t \sim \mathcal{N}(0, \mathbf{I})} \left\| \bm{\epsilon}_t - \bm{\epsilon}_\theta(\mathbf{x}_t, t) \right\|^2. \tag{7}$$

Instead of predicting the mean and covariance directly, the network is now parametrized to predict the added noise for a diffusion timestep and noisy image input. The reverse mean is obtained using Equation 6, and the covariance is fixed. Noise prediction networks have the benefit of being able to recover $\mathbf{x}_{t-1}$ from $\mathbf{x}_t$ in the final sampling stages by predicting zero noise [79]. This is more difficult for direct parametrizations of $\hat{\mathbf{x}}_0$. There is therefore a tradeoff between the two, where direct parametrizations can be more beneficial for very noisy inputs in the initial sampling stages, and noise prediction parametrization can be beneficial in the latter sampling stages [27].
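To make the objective in Equation 7 concrete, the sketch below estimates $\mathcal{L}_{simple}$ by Monte Carlo sampling. Everything beyond the equation itself is an assumption for illustration: the function names, the short linear schedule, and a toy dataset consisting of a single point, chosen because the ideal noise predictor is then available in closed form by inverting Equation 3. A real denoising network would take the place of `oracle_eps`.

```python
import math
import random


def simple_loss(eps_model, data, alpha_bars, rng, num_samples=2000):
    """Monte Carlo estimate of L_simple (Eq. 7) for a callable eps_model(x_t, t)."""
    total = 0.0
    for _ in range(num_samples):
        x0 = rng.choice(data)                      # x_0 ~ p(x_0)
        t = rng.randint(1, len(alpha_bars))        # t ~ Uniform{1, ..., T}
        eps = [rng.gauss(0.0, 1.0) for _ in x0]    # eps_t ~ N(0, I)
        a_bar = alpha_bars[t - 1]
        xt = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
              for x, e in zip(x0, eps)]            # forward sample, Eq. 3
        pred = eps_model(xt, t)
        total += sum((e - p) ** 2 for e, p in zip(eps, pred))
    return total / num_samples


# Hypothetical toy setup: T = 100 steps, linear betas, one 2-D data point.
T = 100
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bars, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)
data = [[0.5, -1.0]]


def oracle_eps(xt, t):
    """Exact noise for the single-point dataset, obtained by inverting Eq. 3."""
    a_bar = alpha_bars[t - 1]
    return [(x - math.sqrt(a_bar) * x0) / math.sqrt(1.0 - a_bar)
            for x, x0 in zip(xt, data[0])]


def zero_eps(xt, t):
    """Baseline that always predicts zero noise."""
    return [0.0 for _ in xt]


loss_oracle = simple_loss(oracle_eps, data, alpha_bars, random.Random(0))
loss_zero = simple_loss(zero_eps, data, alpha_bars, random.Random(0))
```

The oracle attains a loss of zero up to floating-point error, whereas the zero predictor incurs a loss close to the data dimensionality, since $\mathbb{E}\|\bm{\epsilon}_t\|^2$ equals the number of dimensions.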

In an effort to improve sampling efficiency, Salimans and Ho [143] introduce velocity prediction as a further alternative parametrization. Velocity is a linear combination of the denoised input and the added noise, commonly defined as:

$$\mathbf{v} = \sqrt{\bar{\alpha}_t}\,\bm{\epsilon} - \sqrt{1-\bar{\alpha}_t}\,\mathbf{x}_0. \tag{8}$$

This parametrization combines benefits of both data and noise parametrizations, allowing the denoising network to flexibly learn noise prediction as well as reconstruction dynamics based on the signal-to-noise ratio. This parametrization has led to stable results in diffusion distillation approaches [143], and can speed up generation [19].
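The relationship between velocity, data, and noise can be checked numerically. The sketch below assumes the convention of Salimans and Ho [143] written in the notation of Equation 3, that is $\mathbf{v} = \sqrt{\bar{\alpha}_t}\,\bm{\epsilon} - \sqrt{1-\bar{\alpha}_t}\,\mathbf{x}_0$; the function names and toy values are hypothetical. Under this convention, both the clean data and the noise can be recovered from $\mathbf{x}_t$ and $\mathbf{v}$ by a rotation, which is one way to see why the parametrization interpolates between data and noise prediction.

```python
import math


def to_velocity(x0, eps, a_bar):
    """v = sqrt(a_bar) * eps - sqrt(1 - a_bar) * x0 (assumed convention)."""
    s, n = math.sqrt(a_bar), math.sqrt(1.0 - a_bar)
    return [s * e - n * x for x, e in zip(x0, eps)]


def from_velocity(xt, v, a_bar):
    """Recover (x0_hat, eps_hat) from the noisy sample x_t and the velocity v."""
    s, n = math.sqrt(a_bar), math.sqrt(1.0 - a_bar)
    x0_hat = [s * x - n * vi for x, vi in zip(xt, v)]
    eps_hat = [n * x + s * vi for x, vi in zip(xt, v)]
    return x0_hat, eps_hat


# Toy values (assumptions): one noise level and a 2-D sample.
a_bar = 0.3
x0 = [0.2, -1.5]
eps = [1.0, 0.4]
xt = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
      for x, e in zip(x0, eps)]                    # forward sample, Eq. 3
v = to_velocity(x0, eps, a_bar)
x0_hat, eps_hat = from_velocity(xt, v, a_bar)      # matches x0 and eps
```

A network trained to predict `v` therefore yields an estimate of $\mathbf{x}_0$ that is weighted towards `xt` at low noise levels and towards `v` at high noise levels.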

Recently, several works [153, 154, 133, 32] further propose to think of the noise in terms of continuous instead of discrete timesteps. Here, the diffusion process is expressed as a continuous time-dependent function $\sigma(t)$. Noise is gradually added whenever a sample $\mathbf{x}$ moves forward in time, and gradually removed if the image follows the reverse trajectory. More specifically, the diffusion process can be expressed using an Itô Stochastic Differential Equation (SDE) [83], where the vector-valued drift coefficient $\mathbf{f}(\cdot, t): \mathbb{R}^d \to \mathbb{R}^d$ and the scalar-valued diffusion coefficient $g(\cdot): \mathbb{R} \to \mathbb{R}$ need to be selected when implementing a diffusion model:

$$d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\,dt + g(t)\,d\mathbf{w}, \tag{9}$$

where $\mathbf{w}$ is the standard Wiener process. There are two widely used choices of the SDE formulation used to model the diffusion process. The first is the Variance-Preserving (VP) SDE, used in the work of Ho et al. [68], which is given by $\mathbf{f}(\mathbf{x}, t) = -\frac{1}{2}\beta(t)\mathbf{x}$ and $g(t) = \sqrt{\beta(t)}$, where $\beta(t) = \beta_t$ as $T$ goes to infinity. Note that this is equivalent to the continuous formulation of the DDPM parametrization in Equation 1. The second is the Variance-Exploding (VE) SDE [153], resulting from a choice of $\mathbf{f}(\mathbf{x}, t) = 0$ and $g(t) = \sqrt{2\sigma(t)\frac{d\sigma(t)}{dt}}$. The VE SDE gets its name since the variance continually increases with increasing $t$, whereas the variance in the VP SDE is bounded [154]. Anderson [7] derives an SDE that reverses a diffusion process, which results in the following when applied to the Variance-Exploding SDE:

$$d\mathbf{x} = -2\sigma(t)\frac{d\sigma(t)}{dt}\nabla_{\mathbf{x}} \log p(\mathbf{x}; \sigma(t))\,dt + \sqrt{2\sigma(t)\frac{d\sigma(t)}{dt}}\,d\mathbf{w}. \tag{10}$$

$\nabla_{\mathbf{x}} \log p(\mathbf{x}; \sigma(t))$ is known as the score function. This score function is generally not known, so it needs to be approximated using a neural network. A neural network $D(\mathbf{x}; \sigma)$ that minimizes the L2-denoising error can be used to extract the score function since $\nabla_{\mathbf{x}} \log p(\mathbf{x}; \sigma(t)) = \frac{D(\mathbf{x}; \sigma) - \mathbf{x}}{\sigma^2}$. This idea is known as Denoising Score Matching [161].
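The identity between the denoiser and the score can be verified in a case where both are known in closed form. The sketch below assumes one-dimensional, zero-mean Gaussian data with standard deviation `data_std`, for which the noisy marginal is $\mathcal{N}(0, \texttt{data\_std}^2 + \sigma^2)$ and the L2-optimal denoiser is the posterior mean. The function names and numeric values are illustrative assumptions; in practice a trained network would stand in for `optimal_denoiser`.

```python
def optimal_denoiser(x, sigma, data_std):
    """Posterior mean E[x0 | x] for x0 ~ N(0, data_std^2) and x = x0 + sigma * noise."""
    return (data_std ** 2 / (data_std ** 2 + sigma ** 2)) * x


def score_from_denoiser(denoised, x, sigma):
    """Score estimate (D(x; sigma) - x) / sigma^2."""
    return (denoised - x) / sigma ** 2


def gaussian_score(x, sigma, data_std):
    """Exact score of the noisy marginal N(0, data_std^2 + sigma^2)."""
    return -x / (data_std ** 2 + sigma ** 2)


# Toy values (assumptions): a single noisy observation at one noise level.
x, sigma, data_std = 1.7, 0.8, 2.0
estimated = score_from_denoiser(optimal_denoiser(x, sigma, data_std), x, sigma)
exact = gaussian_score(x, sigma, data_std)
```

Both quantities agree up to floating-point error, which illustrates why training a denoiser is sufficient to obtain the score required by Equation 10.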

2.2 Backbone Architectures

Figure 3: Left: An exemplary visualization of the U-Net architecture [140]. Consists of an encoder and a decoder, with residual connections that preserve gradient flow and low-level input details. Adapted from [135]. Right: An exemplary visualization of the DiT architecture. Shows the high-level architecture, as well as a breakdown of the adaLN-Zero DiT block. Adapted from Peebles and Xie [132].

We outline the mathematical foundations of diffusion models in Section 2.1. Since denoising prediction networks are generally parametrized by parameters $\theta$, we discuss how $\theta$ is realized by several neural network architectures in the following section. All of these network architectures map from the same input space to the same output space.

Ho et al. [68] use a U-Net backbone similar to an unmasked PixelCNN++ [144] to approximate the score function. This U-Net architecture, originally used in semantic segmentation approaches [140, 113, 30, 31], is based on a Wide ResNet [182] and takes a noisy image and the diffusion timestep $t$ as input, encodes the image to a lower-dimensional representation, and outputs the noise prediction for that image and noise level. The U-Net consists of an encoder and a decoder with residual connections between blocks that preserve gradient flow and help recover fine-grained details lost in the compressed representation. The encoder consists of a series of residual and self-attention blocks and downsamples the input image to a low-dimensional representation. The decoder mirrors this structure, gradually upsampling the low-dimensional representation to match the input dimensionality. The diffusion timestep $t$ is specified by adding a sinusoidal positional embedding in each residual block [68] that scales and shifts the input features, enhancing the network’s ability to capture temporal dependencies.
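As an illustration of how a scalar timestep can be turned into a vector that a residual block can consume, the sketch below computes a sinusoidal embedding. The sine-then-cosine layout and the `max_period` constant follow the common Transformer-style positional encoding and are assumptions here; implementations differ in such details, and the learned projection that produces scale and shift parameters is omitted.

```python
import math


def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of a scalar timestep t into a vector of even length dim."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to roughly 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


emb_start = timestep_embedding(0, 8)      # sines are 0, cosines are 1
emb_late = timestep_embedding(500, 8)     # a different point on each sinusoid
```

Because the frequencies span several orders of magnitude, nearby timesteps receive similar embeddings while distant timesteps remain distinguishable.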

DDPMs operate in the pixel space, making their training and inference computationally expensive. Rombach et al. [138] address this by proposing Latent Diffusion Models (LDMs), which operate in the latent space of a pre-trained variational autoencoder. The diffusion process is applied to the generated representation as opposed to the image directly, leading to computational benefits without sacrificing generation quality. While the authors introduce additional cross-attention mechanisms to allow for more flexible conditioned generation, the denoising network backbone remains very close to the DDPM U-Net architecture.

Recent advances in the use of transformer architectures for vision tasks like ViT [45] have led to the adoption of transformer-based architectures for diffusion models. Peebles and Xie [132] propose Diffusion Transformers (DiT), a diffusion model backbone architecture that is largely inspired by ViTs and demonstrates state-of-the-art generation performance on ImageNet when combined with the LDM framework. Following ViT, DiTs work by transforming input images into a sequence of patches, which are converted into a sequence of tokens using a "patchify" layer. After adding ViT-style positional embeddings to all input tokens, the tokens are fed through a series of transformer blocks. These blocks are equivalent to standard ViT blocks that take additional conditional information such as the diffusion timestep $t$ and a conditioning signal $\mathbf{c}$ as inputs. A detailed overview of their structure can be seen in Fig. 3.
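The patch-to-token step can be sketched as follows. This is a simplified illustration rather than the DiT code: it only performs the spatial rearrangement, and the learned linear projection and positional embeddings that would follow are omitted. The function name and the `(C, H, W)` layout are assumptions.

```python
import numpy as np

def patchify(x: np.ndarray, p: int) -> np.ndarray:
    """Split a (C, H, W) array into (H/p * W/p) flattened patches of length C*p*p."""
    c, h, w = x.shape
    if h % p != 0 or w % p != 0:
        raise ValueError("height and width must be divisible by the patch size")
    x = x.reshape(c, h // p, p, w // p, p)
    x = x.transpose(1, 3, 0, 2, 4)  # (H/p, W/p, C, p, p)
    return x.reshape((h // p) * (w // p), c * p * p)
```

Tokens are ordered row by row over the patch grid, so halving the patch size `p` quadruples the sequence length.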

U-ViTs [12] combine the U-Net and ViT backbones into a unified backbone. U-ViTs follow the design methodology of transformers in tokenizing time, conditioning and image inputs, but additionally employ long skip connections between shallow and deep layers. These skip connections provide shortcuts for low-level features and therefore stabilize training of the denoising network [12]. Works utilizing U-ViT-based backbones [72, 13] achieve results on par with U-Net CNN-based architectures, demonstrating their potential as a viable alternative to other denoising network backbones.

2.3 Diffusion Model Guidance

TABLE I: An overview of different diffusion model guidance approaches. Self-guidance [75, 100] and online guidance [73] are both classifier-free and annotation-free, and online guidance additionally facilitates online learning.
Approach | Classifier-Free | Annotation-Free | Online Learning
Classifier Guidance [42] | No | No | No
Classifier-free Guidance [67] | Yes | No | No
Self-guidance [75, 100] | Yes | Yes | No
Online guidance [73] | Yes | Yes | Yes

Recent improvements in image generation results have largely been driven by improved guidance approaches. The ability to control generation by passing user-defined conditions is an important property of generative models, and guidance describes the modulation of the strength of the conditioning signal within the model. Conditioning signals can come in a wide range of modalities, from class labels to text embeddings to other images. A simple method to pass spatial conditioning signals to diffusion models is to concatenate the conditioning signal with the denoising targets and then pass the signal through the denoising network [75, 12]. Another effective approach uses cross-attention mechanisms, where a conditioning signal $\mathbf{c}$ is preprocessed by an encoder to an intermediate projection $E(\mathbf{c})$ and then injected into the intermediate layers of the denoising network using cross-attention [142, 76]. These conditioning approaches alone do not make it possible to regulate the strength of the conditioning signal within the model. Diffusion model guidance has recently emerged as an approach to more precisely trade off generation quality and diversity.
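As an illustration of cross-attention conditioning, the following sketch computes single-head attention in which image tokens act as queries and the encoded condition $E(\mathbf{c})$ supplies keys and values. The single-head form and the projection matrices `wq`, `wk`, `wv` are assumptions made for brevity, not the configuration of any specific model.

```python
import numpy as np

def cross_attention(x, cond, wq, wk, wv):
    """Single-head cross-attention.

    x:    (N, d_x) image tokens, used as queries.
    cond: (M, d_c) encoded conditioning tokens E(c), used as keys and values.
    wq, wk, wv: hypothetical projection matrices of shapes (d_x, d), (d_c, d), (d_c, d_v).
    Returns the attended output (N, d_v) and the attention map (N, M).
    """
    q, k, v = x @ wq, cond @ wk, cond @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return attn @ v, attn
```

Each row of the attention map is a distribution over conditioning tokens, which is why these maps can later be read as a spatial layout of the prompt.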

Dhariwal and Nichol [42] use classifier guidance, a compute-efficient method leveraging a pre-trained noise-robust classifier to improve sample quality. Classifier guidance is based on the observation that a pre-trained diffusion model can be conditioned using the gradients of a classifier parametrized by $\phi$ that outputs $p_\phi(\mathbf{c}|\mathbf{x}_t, t)$. The gradients of the log-likelihood of this classifier, $\nabla_{\mathbf{x}_t} \log p_\phi(\mathbf{c}|\mathbf{x}_t, t)$, can be used to guide the diffusion process towards generating an image belonging to class label $\mathbf{c}$. The score estimator for $p(\mathbf{x}|\mathbf{c})$ can be written as

$$\nabla_{\mathbf{x}_t} \log\left(p_\theta(\mathbf{x}_t)\, p_\phi(\mathbf{c}|\mathbf{x}_t)\right) = \nabla_{\mathbf{x}_t} \log p_\theta(\mathbf{x}_t) + \nabla_{\mathbf{x}_t} \log p_\phi(\mathbf{c}|\mathbf{x}_t). \tag{11}$$

By using Bayes’ theorem, the noise prediction network can then be rewritten to estimate:

$$\hat{\epsilon}_\theta(\mathbf{x}_t, \mathbf{c}) = \epsilon_\theta(\mathbf{x}_t, \mathbf{c}) - w \sigma_t \nabla_{\mathbf{x}_t} \log p_\phi(\mathbf{c}|\mathbf{x}_t), \tag{12}$$

where the parameter $w$ modulates the strength of the conditioning signal. Classifier guidance is a versatile approach that increases sample quality, but it relies heavily on a noise-robust pre-trained classifier, which in turn requires annotated data that is unavailable in many applications.
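The guided noise estimate of Eq. (12) can be sketched with a toy classifier whose gradient is available in closed form. The linear softmax classifier below is a stand-in chosen for illustration: it ignores the timestep and is not noise-robust, unlike the classifier used in [42]. All function names are hypothetical.

```python
import numpy as np

def log_p(x, W, c):
    """log p(c | x) for a toy linear softmax classifier with logits W @ x."""
    z = W @ x
    z = z - z.max()
    return z[c] - np.log(np.exp(z).sum())

def grad_log_p(x, W, c):
    """Closed-form gradient of log p(c | x) with respect to x."""
    z = W @ x
    probs = np.exp(z - z.max())
    probs = probs / probs.sum()
    return W[c] - probs @ W

def classifier_guided_eps(eps, grad, sigma_t, w):
    """Eq. (12): shift the noise prediction against the classifier gradient."""
    return eps - w * sigma_t * grad
```

Setting `w = 0` recovers the unguided prediction, while larger `w` pushes samples more strongly towards regions the classifier assigns to class `c`.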

To address this limitation, classifier-free guidance (CFG) [67] eliminates the need for a pre-trained classifier. CFG works by training an unconditional diffusion model $\epsilon_\theta(\mathbf{x}_t, t, \phi)$ together with a conditional model $\epsilon_\theta(\mathbf{x}_t, t, \mathbf{c})$. For the unconditional model, a null input token $\phi$ (unrelated to the classifier parameters above) is used in place of the conditioning signal $\mathbf{c}$. The network is trained by randomly dropping out the conditioning signal with probability $p_{\text{uncond}}$. Sampling is then performed using a weighted combination of conditional and unconditional score estimates:

$$\tilde{\epsilon}_\theta(\mathbf{x}_t, \mathbf{c}) = (1 + w)\,\epsilon_\theta(\mathbf{x}_t, \mathbf{c}) - w\,\epsilon_\theta(\mathbf{x}_t, \phi). \tag{13}$$

This sampling method does not rely on the gradients of a pre-trained classifier but still requires an annotated dataset to train the conditional denoising network. Fully unconditional approaches have yet to match classifier-free guidance, though recent works using diffusion model representations for self-supervised guidance show promise [100, 73]. These methods do not need annotated data, allowing the use of larger unlabelled datasets.
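The two ingredients of classifier-free guidance, conditioning dropout during training and the weighted combination of Eq. (13) at sampling time, can be sketched as below. The integer `NULL_TOKEN` and the function names are assumptions for illustration; the noise predictions are passed in as arrays rather than computed by a real network.

```python
import numpy as np

NULL_TOKEN = -1  # hypothetical id standing in for the null conditioning token

def drop_condition(labels: np.ndarray, p_uncond: float, rng: np.random.Generator) -> np.ndarray:
    """Replace each label by the null token with probability p_uncond (training time)."""
    mask = rng.random(labels.shape) < p_uncond
    return np.where(mask, NULL_TOKEN, labels)

def cfg_eps(eps_cond: np.ndarray, eps_uncond: np.ndarray, w: float) -> np.ndarray:
    """Eq. (13): extrapolate from the unconditional towards the conditional prediction."""
    return (1.0 + w) * eps_cond - w * eps_uncond
```

With `w = 0` the combination reduces to the conditional prediction, and each sampling step needs two evaluations of the same network, one with the condition and one with the null token.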

Table I shows the requirements of current guidance methods. While classifier and classifier-free guidance improve generation results, they require annotated training data. Self-guidance and online guidance are fully self-supervised alternatives that achieve competitive performance without annotations.

Classifier and classifier-free guidance are controlled generation methods that rely on conditional training. Training-free approaches modify the generation process of a pre-trained model by binding multiple diffusion processes [14] or using time-independent energy functions [179]. Other controlled generation methods take a variational perspective [54, 119, 164, 146], treating controlled generation as a source point optimization problem [17]. The goal is to find samples $\mathbf{x}$ that minimize a loss function $\mathcal{L}(\mathbf{x})$ and are likely under the model’s distribution $p$. The optimization is formulated as $\min_{\mathbf{x}_0} \mathcal{L}(\mathbf{x})$, where $\mathbf{x}_0$ is the source noise point. The loss function $\mathcal{L}(\mathbf{x})$ can be modified for conditional sampling to generate a sample belonging to a particular class $\mathbf{y}$.

3 Methods

TABLE II: Summary of the methods using diffusion models for representation learning.
Paradigm | Downstream Task | Method
Generative Augmentation | Classification | Generative Augmentation [10], MA-ZSC [150]
Generative Augmentation | Semantic Segmentation | ScribbleGen [148]
Leveraging Intermediate Activations | Classification | GDC [125], DifFormer [126], DDAE [169]
Leveraging Intermediate Activations | Semantic Segmentation | DDPM-Seg [15], VDM [187]
Leveraging Intermediate Activations | Panoptic Segmentation | ODISE [170]
Leveraging Intermediate Activations | Semantic Correspondence | DIFT [157], SD+DINO [183], Diffusion Hyperfeatures [116], SD4Match [103], USCSD [62]
Leveraging Intermediate Activations | Depth Estimation | VDM [187]
Leveraging Intermediate Activations | Image Editing | P2PCAC [65], Plug-and-Play Diffusion Features [160]
Diffusion Model Reconstruction | Classification | SODA [82], l-DAE [35], DiffMAE [166]
Diffusion Model Reconstruction | Semantic Segmentation | MDM [130]
Diffusion Model Reconstruction | Image Editing | DiffAE [134], PDAE [186]
Diffusion Model Reconstruction | Image Interpolation | InfoDiffusion [165], SmoothDiffusion [58]
Diffusion Model Knowledge Transfer | Classification | DiffusionClassifier [95], RepFusion [173], DreamTeacher [96]
Joint Diffusion Models | Classification | JDM [40], HybViT [174]
Joint Diffusion Models | Semantic Segmentation | ADDP [158]

Having covered the main preliminaries for diffusion models, we outline a series of methods related to diffusion models and representation learning in the following section. In subsection 3.1 we describe and categorize current frameworks utilizing representations learned by pre-trained diffusion models for downstream recognition tasks. In subsection 3.2, we describe methods that leverage advances in representation learning to improve diffusion models themselves.

3.1 Diffusion Models for Representation Learning

Learning useful representations is one of the main motivations for designing architectures like VAEs [88, 89] and GANs [84, 22]. Contrastive learning approaches, where the goal is to learn a feature space in which representations of similar images are very close together, and vice versa for dissimilar images (e.g. SimCLR [34], MoCo [60]), have also led to significant advances in representation learning. These contrastive methods are not fully self-supervised however, since they require supervision in the form of augmentations that preserve the original content of the image.

Diffusion models offer a promising alternative to these approaches. While diffusion models are primarily designed for generation tasks, the denoising process encourages the learning of semantic image representations [15] that can be used for downstream recognition tasks. The diffusion model learning process is similar to the learning process of Denoising Autoencoders (DAE) [162, 18], which are trained to reconstruct images corrupted by adding noise. The main difference is that diffusion models additionally take the diffusion timestep $t$ as input, and can thus be viewed as multi-level DAEs with different noise scales [169]. Since DAEs learn meaningful representations in the compressed latent space, it is intuitive that diffusion models exhibit similar representation learning capabilities. We outline and discuss current approaches in the following section.

3.1.1 Leveraging intermediate activations

Baranchuk et al. [15] investigate the intermediate activations from the U-Net network that approximates the Markov step of the reverse diffusion process in DDPMs [42]. They show that for certain diffusion timesteps, these intermediate activations capture semantic information that can be used for downstream semantic segmentation. The authors take a noise-predictor network $\epsilon_\theta(\mathbf{x}_t, t)$ trained on the LSUN-Horse [177] and FFHQ-256 [84] datasets and extract feature maps produced by one of the network’s 18 decoder blocks for label-efficient downstream segmentation tasks. Selecting the ideal diffusion timestep and decoder block activation to extract is non-trivial. To understand the efficacy of pixel-level representations of different decoder blocks, the authors train a multi-layer perceptron (MLP) to predict the semantic label from features produced by different decoder blocks on a specific diffusion step $t$. The representations from a fixed set of blocks $B$ of the pre-trained U-Net decoder and higher diffusion timesteps are upsampled to the image size using bilinear interpolation and concatenated. The obtained feature vectors are then used to train an ensemble of independent MLPs which predict a semantic label for each pixel. The final prediction is obtained by majority voting. This method, denoted DDPM-Seg, outperforms baselines that exploit alternative generative models and achieves segmentation results competitive with MAE [61], illustrating that intermediate denoising network activations contain semantic image features.
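The feature assembly and voting steps of this pipeline can be sketched as follows. Two simplifications are assumptions of this sketch rather than part of DDPM-Seg: nearest-neighbour upsampling is swapped in for the bilinear interpolation used by the authors, and the per-pixel predictions of the MLP ensemble are passed in as an integer array instead of being produced by trained networks.

```python
import numpy as np

def upsample_nearest(fmap: np.ndarray, size: int) -> np.ndarray:
    """Upsample a (C, h, w) map to (C, size, size) by pixel repetition."""
    _, h, w = fmap.shape
    if size % h != 0 or size % w != 0:
        raise ValueError("target size must be a multiple of the map size")
    return fmap.repeat(size // h, axis=1).repeat(size // w, axis=2)

def pixel_features(fmaps: list, size: int) -> np.ndarray:
    """Concatenate upsampled maps along channels; one feature vector per pixel."""
    stacked = np.concatenate([upsample_nearest(f, size) for f in fmaps], axis=0)
    return stacked.reshape(stacked.shape[0], -1).T  # (size*size, total channels)

def majority_vote(preds: np.ndarray, n_classes: int) -> np.ndarray:
    """preds: (n_models, n_pixels) integer labels; ties go to the lowest class id."""
    counts = np.stack([(preds == k).sum(axis=0) for k in range(n_classes)])
    return counts.argmax(axis=0)
```

Each row returned by `pixel_features` would be the input to every MLP in the ensemble, and `majority_vote` merges their per-pixel outputs into the final segmentation.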

Xiang et al. [169] extend this approach to further architectures and image recognition on CIFAR-10 and Tiny-ImageNet. They investigate the discriminative efficacy of extracted features for different backbones (U-Net and DiT [132]) under different frameworks (DDPM and EDM [85]). The relationship between feature quality and layer-noise combinations is evaluated through grid search, where the quality of feature representations is determined using linear probing. The best-performing features lie in the middle of the up-sampling path at relatively small noise levels, which is in line with conclusions drawn in DDPM-Seg [15]. Benchmark comparisons against diffusion-based methods like HybViT [174] and SBGC [190] on CIFAR-10 and Tiny-ImageNet [41] show that EDM-based Denoising Diffusion Autoencoders (DDAEs) outperform previous supervised and unsupervised diffusion-based methods on both generation and recognition, especially after fine-tuning. Benchmarking against contrastive learning methods shows that the EDM-based DDAE is comparable with SimCLR when accounting for model size, and outperforms SimCLR models with comparable parameter counts on CIFAR-10 and Tiny-ImageNet.
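A grid search over layer and noise-level combinations scored by linear probing can be sketched as below. This is an illustrative skeleton, not the evaluation code of [169]: the probe is a least-squares linear classifier on one-hot targets (a stand-in for a trained linear layer), and `extract` is a hypothetical callable that returns an `(N, D)` feature matrix for a given layer and timestep.

```python
import numpy as np

def linear_probe_accuracy(train_x, train_y, test_x, test_y, n_classes):
    """Fit a least-squares linear classifier on one-hot targets; return test accuracy."""
    def with_bias(x):
        return np.hstack([x, np.ones((len(x), 1))])
    targets = np.eye(n_classes)[train_y]
    weights, *_ = np.linalg.lstsq(with_bias(train_x), targets, rcond=None)
    preds = (with_bias(test_x) @ weights).argmax(axis=1)
    return float((preds == test_y).mean())

def grid_search(extract, layers, timesteps, labels, n_classes, n_train):
    """Return the (layer, timestep) pair with the highest probe accuracy."""
    best_combo, best_acc = None, -1.0
    for layer in layers:
        for t in timesteps:
            feats = extract(layer, t)  # (N, D) features for the whole dataset
            acc = linear_probe_accuracy(
                feats[:n_train], labels[:n_train],
                feats[n_train:], labels[n_train:], n_classes)
            if acc > best_acc:  # strict comparison keeps the first best combination
                best_combo, best_acc = (layer, t), acc
    return best_combo, best_acc
```

The cost grows with the product of the number of layers and timesteps, since features must be re-extracted for every combination.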

ODISE [170] is a related approach that unites text-to-image diffusion models with discriminative models to perform panoptic segmentation [90, 91], a segmentation approach unifying instance and semantic segmentation into a common framework for comprehensive scene understanding. ODISE extracts the internal features of a pre-trained text-to-image diffusion model. These features are input to a mask generator trained on annotated masks. A mask classification module then categorizes each generated binary mask into an open vocabulary category by relating the predicted mask’s diffusion features with text embeddings of object category names. The authors use the Stable Diffusion U-Net DDPM backbone and extract features by computing a single forward pass and extracting the intermediate activations $f = \text{UNet}(\mathbf{x}_t, \tau(s), t)$, where $\tau(s)$ is an encoded representation of the image caption $s$ obtained using a pre-trained text encoder $\tau$. Interestingly, the authors obtain the best results using $t = 0$, whereas previous methods obtain better results using higher diffusion timesteps. To overcome reliance on available image captions, Xu et al. [170] additionally train an MLP-based implicit captioner that computes an implicit text embedding from the image itself. ODISE establishes a new state-of-the-art in open-vocabulary segmentation and is a further example of the rich semantic representations learned by denoising diffusion models.
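The idea of relating a mask's diffusion features to category text embeddings can be illustrated with a small sketch. It is a simplification under stated assumptions and not the ODISE architecture: the mask descriptor is obtained by plain average pooling, the features and text embeddings are assumed to already live in a shared space, and the category is chosen by cosine similarity.

```python
import numpy as np

def masked_mean(features: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average (C, H, W) features over the pixels selected by a boolean (H, W) mask."""
    return features[:, mask].mean(axis=1)

def classify_mask(mask_feat: np.ndarray, text_embs: np.ndarray) -> int:
    """Index of the category embedding (K, C) most similar to the mask descriptor."""
    a = mask_feat / np.linalg.norm(mask_feat)
    b = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return int((b @ a).argmax())
```

Because the category set enters only through `text_embs`, new category names can be added at test time by encoding them, which is what makes the vocabulary open.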

Mukhopadhyay et al. [125] also propose leveraging intermediate activations from the unconditional ADM U-Net architecture [42] for ImageNet classification. The methodology for layer and timestep selection is similar to previous approaches. Additionally, the impact of different sizes for feature map pooling is evaluated and several different lightweight architectures for classification (including linear, MLP, CNN, and attention-based classification heads) are used. Feature quality is found to be mostly insensitive to pooling size, and is mostly dependent on time steps and the selected block number. Their approach, which we term guided diffusion classification (GDC), achieves competitive performance against other unified models, namely BigBiGAN [44] and MAGE [99]. The attention-based classification heads perform best on ImageNet-50, but perform poorly on Fine-Grained Visual Classification datasets, indicating their reliance on a large amount of available data.

In a continuation of their previous work, Mukhopadhyay et al. [126] extend this approach by introducing two methods for more fine-grained block and denoising timestep selection. The first is DifFormer [126], an attention mechanism replacing the fixed pooling and linear classification head from [125] with an attention-based feature fusion head. This fusion head is designed to replace the fixed flattening and pooling operation required to generate vector feature representations from the U-Net CNN used in the GDC approach with a learnable pooling mechanism. The second mechanism is DifFeed [126], a dynamic feedback mechanism that decouples the feature extraction process into two forward passes. In the first forward pass, only the selected decoder feature maps are stored. These are fed to an auxiliary feedback network that learns to map decoder features to a feature space suitable for adding them to the corresponding encoder blocks. In the second forward pass, the feedback features are added to the encoder features, and the DifFeed attention head is used on top of those second forward pass features. These additional improvements further increase the quality of learned representations and improve ImageNet and fine-grained visual classification performance.

The previously described diffusion representation learning methods focus on segmentation and classification, which are only a subset of downstream recognition tasks. Correspondence tasks are another subset that generally involves identifying and matching points or features between different images. The problem setting is as follows: consider two images $\mathbf{I}_1$ and $\mathbf{I}_2$ and a pixel location $p_1$ in $\mathbf{I}_1$. A correspondence task involves finding the corresponding pixel location $p_2$ in $\mathbf{I}_2$. The relationship between $p_1$ and $p_2$ can be semantic (pixels that contain similar semantics), geometrical (pixels that contain different views of an object) or temporal (pixels that contain the same object deforming over time). DIFT (Diffusion Features) [157] is an approach leveraging pre-trained diffusion model representations for correspondence tasks. DIFT also relies on extracting diffusion model features. Similarly to previous approaches, the diffusion timestep and network layer used for extraction are important considerations. The authors observe more semantically meaningful features for combinations of large diffusion timesteps and earlier network layers, whereas lower-level features are captured at smaller diffusion timesteps and later denoising network layers. DIFT is shown to outperform other self-supervised and weakly-supervised methods across a range of correspondence tasks, showing on-par performance with state-of-the-art methods on semantic correspondence specifically.
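Given feature maps for the two images, the correspondence problem stated above reduces to a nearest-neighbour lookup. The sketch below uses cosine similarity as the matching criterion, which is an assumption of this illustration; feature maps are assumed to be `(C, H, W)` arrays at the same resolution with non-zero feature vectors.

```python
import numpy as np

def find_correspondence(f1: np.ndarray, f2: np.ndarray, p1: tuple) -> tuple:
    """Return the (row, col) in f2 whose feature is most similar to f1 at p1."""
    c, _, w = f2.shape
    query = f1[:, p1[0], p1[1]]
    query = query / np.linalg.norm(query)
    flat = f2.reshape(c, -1)
    flat = flat / np.linalg.norm(flat, axis=0, keepdims=True)
    idx = int((query @ flat).argmax())
    return divmod(idx, w)  # convert the flat index back to (row, col)
```

No training is involved at this stage, so the quality of the match depends entirely on the choice of timestep and layer used to extract `f1` and `f2`.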

Zhang et al. [183] evaluate how learned diffusion features relate across multiple images, instead of focusing on downstream tasks for single images. To investigate this, they employ Stable Diffusion features for semantic correspondence as well. The authors observe that Stable Diffusion features have a strong sense of spatial layout, but sometimes provide inaccurate semantic matches. DINOv2 [128], a method for self-supervised representation learning using knowledge distillation and vision transformers, produces sparser features that provide more accurate matches. Zhang et al. [183] therefore propose to combine the two feature types and apply zero-shot nearest neighbor search on the combined features, achieving state-of-the-art performance on several semantic correspondence datasets like SPair-71k and TSS.
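One simple way to combine two feature types before nearest neighbor search is to normalize each per pixel and concatenate them along the channel axis. The sketch below follows that idea; the weighting parameter `alpha` and its default value are assumptions of this illustration, and both inputs are assumed to be `(C, H, W)` maps at the same spatial resolution.

```python
import numpy as np

def fuse_features(sd: np.ndarray, dino: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Concatenate per-pixel L2-normalized feature maps with relative weight alpha."""
    def unit(f):
        return f / np.linalg.norm(f, axis=0, keepdims=True)
    return np.concatenate([alpha * unit(sd), (1.0 - alpha) * unit(dino)], axis=0)
```

Normalizing first keeps either feature type from dominating the similarity purely because of its magnitude or channel count.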

SD4Match [103] builds on this approach by using various prompt tuning and conditioning techniques. One method, SD4Match-Class, fine-tunes a prompt embedding $\bm{\Theta}$ for each semantic class using a semantic matching loss [102]. Given images $\mathbf{I}_t^A$ and $\mathbf{I}_t^B$, the Stable Diffusion U-Net $f(\cdot)$ extracts feature maps $\mathbf{F}_t^A$ and $\mathbf{F}_t^B$ by $\mathbf{F}_t = f(\mathbf{I}_t, t, \bm{\Theta})$. Correspondence points are predicted by normalizing feature maps and computing a correlation map, which is converted to a probability distribution using a softmax operation. Additionally, Li et al. [103] propose conditioning prompts on input images using a Conditional Prompting Module (CPM), which includes a DINOv2 feature extractor, linear layers, and an adaptive MaxPooling layer. The conditioning embedding $\bm{\Theta}_{\text{cond}}$ is formed by concatenating feature representations and projecting them to the prompt embedding dimension. The final prompt $\bm{\Theta}_{\text{AB}}$ is obtained by appending $\bm{\Theta}_{\text{cond}}$ to a global prompt $\bm{\Theta}_{\text{global}}$. This method sets new benchmark accuracies on SPair-71k [122], PF-Willow, and PF-Pascal [59], surpassing methods like DIFT [157] and SD+DINO [183].
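The step from normalized feature maps to a probability distribution over target locations can be sketched as follows. The temperature value and the function name are assumptions made for illustration; prompt tuning itself is not shown, and the feature maps are assumed to be `(C, H, W)` arrays.

```python
import numpy as np

def match_distribution(fa: np.ndarray, fb: np.ndarray, pa: tuple,
                       temperature: float = 0.05) -> np.ndarray:
    """Softmax over the correlation between fa at pa and every location of fb."""
    c, h, w = fb.shape
    query = fa[:, pa[0], pa[1]]
    query = query / np.linalg.norm(query)
    flat = fb.reshape(c, -1)
    flat = flat / np.linalg.norm(flat, axis=0, keepdims=True)
    logits = (query @ flat) / temperature
    logits = logits - logits.max()  # numerical stability
    probs = np.exp(logits)
    return (probs / probs.sum()).reshape(h, w)
```

Unlike a hard argmax, this distribution is differentiable with respect to the features, which is what allows a matching loss to be backpropagated into a prompt embedding.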

Luo et al. [116] introduce Diffusion Hyperfeatures, a framework designed to consolidate multiple intermediate activation maps across diffusion timesteps for downstream recognition. Activations are consolidated using an interpretable aggregation network that takes the collection of intermediate feature maps as input and produces a single descriptive feature map as output. While other approaches manually select fixed diffusion timesteps and activations from a pre-determined number of intermediate network layers, Diffusion Hyperfeatures caches all feature maps across all layers and timesteps in the diffusion process to generate a dense set of activations. This high-dimensional set of activations is upsampled, passed through a bottleneck layer $B_l$ and weighted with a unique learnable mixing weight $w_{l,s}$ for each layer and timestep combination. The final diffusion hyperfeatures take on the form

$$\sum_{s=0}^{S} \sum_{l=1}^{L} w_{l,s}\, B_l(\mathbf{r}_{l,s}), \tag{14}$$

where $L$ is the number of layers, $S$ is the number of subsampled diffusion timesteps and $\mathbf{r}_{l,s}$ is the activation feature map of layer $l$ at timestep $s$. Bottleneck layers and mixing weights are finetuned on the specific downstream task. Similar to previous approaches, Diffusion Hyperfeatures is used for semantic correspondence. The authors extract activations from Stable Diffusion and tune the aggregation network on a subset of SPair-71k. Diffusion Hyperfeatures outperforms models that use self-supervised descriptors or supervised hypercolumns on the SPair-71k and CUB datasets.
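The aggregation in Eq. (14) can be written as a short weighted sum. In this sketch each bottleneck is reduced to a single linear projection over channels (equivalent to a 1x1 convolution), which is an assumption made for brevity; activations are assumed to be already upsampled to a common resolution, and indices are zero-based in code.

```python
import numpy as np

def hyperfeatures(acts: list, bottlenecks: list, weights: np.ndarray):
    """Weighted sum of projected activations over timesteps s and layers l.

    acts[s][l]:     (C_l, H, W) activation of layer l at timestep s.
    bottlenecks[l]: (D, C_l) hypothetical linear stand-in for the bottleneck B_l.
    weights:        (S, L) mixing weights w_{l,s}, indexed here as weights[s, l].
    """
    out = 0.0
    for s, per_step in enumerate(acts):
        for l, r in enumerate(per_step):
            projected = np.einsum("dc,chw->dhw", bottlenecks[l], r)
            out = out + weights[s, l] * projected
    return out
```

Since the mixing weights are scalars per layer and timestep, inspecting them after tuning indicates which combinations the downstream task relies on.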

Hedlin et al. [62] focus on optimizing the prompt embeddings by exploiting intermediate attention maps specifically. Given a certain input text prompt, these attention activation maps correspond to the semantics of the prompt. Instead of optimizing a global or a class-dependent prompt embedding $\bm{\Theta}$ using the semantic loss, Hedlin et al. [62] optimize the embedding to maximize the cross-attention at the location of interest. Locating the corresponding point in a second image then comes down to conditioning on the optimized prompt and selecting the pixel that attains the maximum attention map value within the target image. Note that this approach does not utilize supervised training specific to semantic correspondence. However, it requires test-time optimization, which is costly. Text prompts are optimized using an off-the-shelf diffusion model without fine-tuning. Several further works building on the aforementioned approaches [120, 184] exist, showing that exploiting pre-trained diffusion models for semantic correspondence remains a promising application of diffusion models.

Zhao et al. [187] propose Visual Perception with a pre-trained Diffusion Model (VDM), a framework closely related to USCSD [62] that employs a text feature refinement network as well as an additional recognition encoder for semantic segmentation and depth estimation. Here, the denoising network is fed with refined text representations as well as an input image, and the resulting feature maps as well as the cross-attention maps between the text and image features are used to provide guidance for a decoder. To achieve this, the prediction model is written as $p_\phi(\mathbf{y}|\mathbf{x}, \mathcal{S})$, where $\mathcal{S}$ represents the set of all category labels of the downstream task. The prediction model is implemented as follows:

$$p_\phi(\mathbf{y}|\mathbf{x}, \mathcal{S}) = p_{\phi_3}(\mathbf{y}|\mathcal{F})\, p_{\phi_2}(\mathcal{F}|\mathbf{x}, \mathcal{C})\, p_{\phi_1}(\mathcal{C}|\mathcal{S}), \tag{15}$$

where $\mathcal{F}$ denotes the set of feature maps and $\mathcal{C}$ denotes the text features. Here, $p_{\phi_1}(\mathcal{C}|\mathcal{S})$ denotes a text adapter consisting of a two-layer MLP that refines the text features obtained by applying the CLIP text encoder to a text template of "a photo of a [CLS]". $p_{\phi_2}(\mathcal{F}|\mathbf{x}, \mathcal{C})$ extracts the feature maps from the denoising network given the input image $\mathbf{x}$ and the set of refined text features $\mathcal{C}$. The authors use $t = 0$ when feeding the denoising network the latent representation of the input image generated by the VQGAN encoder [47] to obtain feature maps $\mathcal{F}$. Finally, $p_{\phi_3}(\mathbf{y}|\mathcal{F})$ serves as a light-weight prediction head implemented as a semantic feature pyramid network [90] that is adapted to the downstream task. VDM is evaluated on semantic segmentation and depth estimation, and achieves highly competitive performance and fast convergence compared to methods with other pre-training paradigms.

A more indirect application of text-to-image diffusion model representations is instructional image editing [23, 51, 98], where the desired image edit is described by a natural language instruction rather than a description of the desired new image [81]. Prompt-based image editing is challenging since small changes in the textual prompt can lead to vastly different generation outcomes. Hertz et al. [65] propose a textual editing method for pre-trained text-conditioned diffusion models that leverages the semantic strength of the intermediate cross-attention layers in the denoising backbone. This approach is based on a key observation also employed in [62, 187]: cross-attention maps contain rich information on the spatial layout and geometry of the generated image. Injecting the cross-attention maps obtained when generating an image $\mathcal{I}$ into the generation process of the edited image $\mathcal{I}^{*}$ ensures that the edited image preserves the original spatial layout. Hertz et al. [65] use Imagen [141] to conduct experiments and demonstrate promising results on text-only localized editing, global editing, and real image editing. Follow-up works like Plug-and-play Diffusion Features [160] further improve upon this by leveraging all intermediate activation maps to enable instructional image editing. Other techniques like TokenFlow [52] and work by Yatim et al. [175] have extended this idea to the video space, using diffusion features to enable prompt-based video editing and text-driven motion transfer.

3.1.2 A general representation extraction framework

Figure 4: A high-level overview of a framework for extracting representations from pre-trained diffusion models for downstream tasks.

Many of the methods outlined in the previous section follow a similar procedure in leveraging learned representations of pre-trained diffusion models for downstream vision tasks. In this section, we consolidate these approaches into a common three-step framework to clarify the relationship between diffusion models and their use for downstream predictive tasks. To leverage intermediate activations for downstream tasks, one must first apply a selection methodology that outputs the ideal diffusion timestep as well as the intermediate layer number(s) whose activation maps achieve the highest predictive performance when upsampled and linearly probed. This can be a trainable model [116], a grid search procedure [169] or a learning agent [173]. The goal of this methodology is generally to select a timestep $t \in T$ and a set of decoder block numbers $B$ that maximize predictive performance on a downstream task. Given a set of possible timesteps $T$ and a set of decoder blocks $\mathcal{B}$, the goal is to find:

$$(t^{*},B^{*})=\arg\min_{t\in T,\,B\subseteq\mathcal{B}}\mathcal{L}_{\text{discr}}(t,B) \tag{16}$$

where $\mathcal{L}_{\text{discr}}(t,B)$ represents the discriminative loss at timestep $t$ when the blocks in $B$ are used for downstream prediction. Generally, discriminative tasks will require more high-level features corresponding to structural elements and shapes, whereas generative tasks mapping random noise to images will require the computation of lower-level features. The ideal intermediate layer number as well as the optimal diffusion timestep will largely depend on the exact downstream prediction task, the dataset, and the architecture of the diffusion model used.
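
The selection problem in Eq. (16) can be illustrated with an exhaustive search. The following is only a minimal sketch and does not reproduce any of the cited selection methods: `probe_loss` is a hypothetical callable standing in for "train a probe on the features from timestep `t` and block subset `B`, then return its validation loss", and the empty block set is excluded by assumption.

```python
from itertools import combinations


def select_timestep_and_blocks(timesteps, blocks, probe_loss, max_blocks=None):
    """Exhaustive search for (t*, B*) = argmin L_discr(t, B).

    `probe_loss(t, B)` is a hypothetical user-supplied function that returns
    the discriminative loss for timestep t and block subset B (a tuple).
    """
    max_blocks = len(blocks) if max_blocks is None else max_blocks
    best = None  # (loss, t, B)
    for t in timesteps:
        for k in range(1, max_blocks + 1):      # skip the empty block set
            for subset in combinations(blocks, k):
                loss = probe_loss(t, subset)
                if best is None or loss < best[0]:
                    best = (loss, t, subset)
    return best[1], best[2], best[0]
```

The cost grows exponentially in the number of blocks, so `max_blocks` caps the subset size; this cost is one reason why learned or agent-based selection can be attractive.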

Once the ideal timestep and layer number are determined, an input image and the selected diffusion timestep are passed to the diffusion model. The intermediate activations computed in the selected decoder blocks during the forward pass are then extracted and generally concatenated and pre-processed depending on the downstream task (e.g. through upsampling or pooling). Finally, a classification head is trained on the annotated dataset, taking the preprocessed features extracted from the diffusion model as input. This classification head can be an MLP, a CNN, or an attention-based network depending on the availability of labeled data and predictive performance on the dataset. The diffusion model weights are usually frozen in this probing process, but additional fine-tuning regimes can increase discriminative performance for certain datasets and architectures (see e.g., Xiang et al. [169]). Fig. 4 shows an overview of the generalized framework.
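
The extraction step can be made concrete with a small sketch. It assumes that each selected block yields an activation map stored as nested lists of shape `[channels][height][width]`, and it uses nearest-neighbour upsampling purely for simplicity; the function names are illustrative and not taken from any cited implementation.

```python
def upsample_nearest(fmap, out_h, out_w):
    """Nearest-neighbour upsampling of a [channels][height][width] feature map."""
    in_h, in_w = len(fmap[0]), len(fmap[0][0])
    return [
        [[channel[(i * in_h) // out_h][(j * in_w) // out_w] for j in range(out_w)]
         for i in range(out_h)]
        for channel in fmap
    ]


def pixel_features(activations, out_h, out_w):
    """Upsample the activations of every selected block to (out_h, out_w)
    and concatenate them along the channel axis.

    Returns an [out_h][out_w] grid of per-pixel feature vectors, which can
    serve as input to a pixel-wise prediction head.
    """
    stacked = []
    for fmap in activations:
        stacked.extend(upsample_nearest(fmap, out_h, out_w))
    return [[[ch[i][j] for ch in stacked] for j in range(out_w)]
            for i in range(out_h)]
```

For image-level tasks, the same concatenated maps would typically be pooled into a single vector instead of being kept per pixel.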

3.1.3 Knowledge transfer

Aside from leveraging intermediate activations from pre-trained diffusion models directly as inputs to a recognition network, several recent approaches propose a more indirect method of reusing learned representations for downstream tasks. We summarize these under the term knowledge transfer methods. This reflects the common idea of distilling representations from pre-trained diffusion models and then transferring them to auxiliary networks in a way that is distinct from simply providing aggregated feature activation maps as input. Several of these approaches are discussed in the following section.

Yang and Wang [173] propose RepFusion, a knowledge distillation approach that dynamically extracts intermediate representations at different time steps using a reinforcement learning framework, and uses the extracted representations as auxiliary supervision for student networks. Given an input $\mathbf{x}$ with label $\mathbf{y}$, the authors extract a pair of features, one from the diffusion probabilistic model (DPM) and one from the student model, where $\mathbf{z}^{(t)}$ is the diffusion model representation and $\mathbf{z}$ is the student model representation. The distance between the two is minimized during training using a loss function $\mathcal{L}_{kd}$. Previous approaches for using diffusion model representations rely on grid search to determine which diffusion timestep to use for feature extraction. Here, the authors formulate a reinforcement learning environment where the action space is the set of all possible timesteps $t$ available for selection, and the reward function is the negative task loss $-\mathcal{L}_{task}(\mathbf{y},g(\mathbf{z}^{(t)};\theta_{g}))$. Given the input $\mathbf{x}$, a policy network $\pi_{\theta_{\pi}}(t\mid\mathbf{x})$ is trained to determine which timestep $t$ to use for representation extraction. Once the timestep is selected, the authors use the feature representations in the mid-block of the DPM for the selected timestep $t^{*}$ to obtain $\mathbf{z}^{(t^{*})}$. After the distillation phase, the student network is used as a feature extractor and subsequently fine-tuned on the task label $\mathbf{y}$.
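
To illustrate the reinforcement learning formulation, the sketch below trains a categorical policy over timesteps with a REINFORCE-style update, using the negative task loss as the reward. It is a deliberately simplified assumption-laden example: the policy here is independent of the input $\mathbf{x}$ (unlike the policy network described above), `task_loss` is a hypothetical function returning the task loss for a given timestep, and the running-mean baseline is a standard variance-reduction choice rather than a detail taken from [173].

```python
import math
import random


def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train_timestep_policy(task_loss, num_timesteps, steps=3000, lr=0.1, seed=0):
    """REINFORCE over a categorical timestep policy; reward = -task_loss(t)."""
    rng = random.Random(seed)
    prefs = [0.0] * num_timesteps   # policy logits, one per candidate timestep
    baseline, n = 0.0, 0            # running mean of observed rewards
    for _ in range(steps):
        probs = softmax(prefs)
        t = rng.choices(range(num_timesteps), weights=probs)[0]  # action
        reward = -task_loss(t)
        n += 1
        baseline += (reward - baseline) / n
        advantage = reward - baseline
        for a in range(num_timesteps):
            # gradient of log pi(t) with respect to logit a
            grad_log_pi = (1.0 if a == t else 0.0) - probs[a]
            prefs[a] += lr * advantage * grad_log_pi
    return softmax(prefs)
```

After training, the timestep with the highest policy probability would be used for feature extraction.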

Li et al. [96] introduce DreamTeacher, a knowledge distillation method using a feature regressor module that distills the learned representations of a generative model $G$ into a target image recognition backbone $f$. Given a feature dataset $D=\{\mathbf{x}_{i},\mathbf{f}_{i}^{g}\}_{i=1}^{N}$ consisting of images $\mathbf{x}_{i}$ and extracted features $\mathbf{f}_{i}^{g}$, $f$ is trained by distilling $\mathbf{f}_{i}^{g}$ into the intermediate features of $f(\mathbf{x}_{i})$. The features are extracted from $G$ by running a forward diffusion process for $T$ timesteps and conducting a single denoising step to extract $\mathbf{f}_{i}^{g}$ from the intermediate layers of the U-Net backbone. The extracted features are distilled using a feature regressor module with a top-down architecture containing lateral skip connections that aligns the image backbone features with the generative features. Intermediate CNN encoder features $\mathbf{f}^{e}_{l}$ at layers $l$ and regressor outputs $\mathbf{f}^{r}_{l}$ are used to compute an MSE feature regression loss inspired by FitNet [139]:

$$\mathcal{L}_{\text{MSE}}=\frac{1}{L}\sum_{l=1}^{L}\left\|\mathbf{f}_{l}^{r}-\mathcal{W}(\mathbf{f}_{l}^{e})\right\|_{2}^{2} \tag{17}$$

$\mathcal{W}$ is a non-learnable operator implemented as LayerNorm [11]. This loss is combined with the activation-based Attention Transfer (AT) objective [181], which distills a one-dimensional "attention map" for each spatial feature. DreamTeacher is evaluated on a range of downstream recognition tasks by fine-tuning the pre-trained backbone with additional classification heads for each task. DreamTeacher outperforms existing contrastive and masking-based self-supervised methods on the COCO [106], ADE20k [189] and BDD100K [178] benchmarks.
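
As a minimal sketch of Eq. (17), the code below computes the feature regression loss for features that have been flattened into plain vectors, one per layer. The flattening and the normalization over the whole vector are simplifying assumptions made for illustration; in practice LayerNorm would typically be applied to feature maps along a chosen axis. Argument names follow the notation of Eq. (17).

```python
import math


def layer_norm(vec, eps=1e-5):
    """Non-learnable layer normalization of a single feature vector."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def feature_regression_loss(regressor_feats, encoder_feats):
    """Mean over layers of the squared L2 distance between f_l^r and W(f_l^e)."""
    if len(regressor_feats) != len(encoder_feats):
        raise ValueError("both inputs need one feature vector per layer")
    total = 0.0
    for f_r, f_e in zip(regressor_feats, encoder_feats):
        target = layer_norm(f_e)  # W(.) in Eq. (17)
        total += sum((a - b) ** 2 for a, b in zip(f_r, target))
    return total / len(regressor_feats)
```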

Both RepFusion and DreamTeacher are inspired by earlier works on knowledge distillation [66, 139]. Li et al. [95] propose a slightly different knowledge transfer approach: Diffusion Classifier, a method for zero-shot classification that leverages conditional density estimates from text-to-image diffusion models. This method converts the diffusion model into a classifier by computing class-conditional likelihoods $p_{\theta}(\mathbf{x}\mid\mathbf{c}_{i})$ and using Bayes' theorem to obtain predicted class probabilities $p(\mathbf{c}_{i}\mid\mathbf{x})$. Since direct computation of $p_{\theta}(\mathbf{x}\mid\mathbf{c}_{i})$ is intractable, they use the Evidence Lower Bound (ELBO) in its place. The classifier is derived by adding noise repeatedly and estimating noise reconstruction losses for each class using Monte Carlo methods. While Diffusion Classifier suffers from high inference time, it generally outperforms DDPM-Seg by Baranchuk et al. [15] on most datasets and is competitive with CLIP ResNet-50 [136] and OpenCLIP ViT-H/14 [36].
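
The Bayes step can be sketched as follows, assuming a uniform prior over classes and assuming that the ELBO for class $\mathbf{c}_{i}$ is approximated, up to a class-independent constant, by the negative mean noise reconstruction error. The input `errors_per_class` is a hypothetical container holding Monte Carlo samples of that error for each class; producing these samples requires the diffusion model and is outside the scope of this sketch.

```python
import math


def class_posterior(errors_per_class):
    """Approximate p(c_i | x) from per-class noise reconstruction errors.

    errors_per_class[i] holds Monte Carlo samples of the squared noise
    prediction error obtained when conditioning on class c_i. With a
    uniform prior, Bayes' rule reduces to a softmax over the negative
    mean errors.
    """
    neg_means = [-sum(errs) / len(errs) for errs in errors_per_class]
    m = max(neg_means)                       # subtract max for stability
    weights = [math.exp(v - m) for v in neg_means]
    z = sum(weights)
    return [w / z for w in weights]
```

The predicted class is the index with the largest posterior, which is the class whose conditioning yields the lowest average reconstruction error.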

3.1.4 Reconstructing diffusion models

Previous diffusion representation learning techniques do not propose making fundamental modifications to diffusion model architectures and training methodologies. While these techniques often show encouraging performance for downstream tasks, they fail to generate deep insights into the architectural components and techniques required to learn useful representations. It remains largely unclear for example whether the representation learning abilities of diffusion models are driven by the diffusion process, or by the model’s denoising capabilities. It is also unclear what architectural and optimization choices can improve diffusion models’ representation learning capabilities.

Chen et al. [35] investigate these questions by deconstructing a denoising diffusion model (DDM), modifying individual model components to turn a DDM into a Denoising Autoencoder. The deconstruction process consists of three stages. In the first stage, the DDM is reoriented for self-supervised learning. This entails the removal of class conditioning and a reconstruction of the VQGAN tokenizer [47] used in the DiT baseline. Both the perceptual and adversarial loss terms rely on annotated data and are thus removed, which essentially converts the VQGAN to a VAE. The second stage consists of simplifying the VAE tokenizer even further, replacing it with different autoencoder variants. Surprisingly, the authors find that using simpler autoencoder variants, like patch-wise PCA, does not degrade performance substantially. The authors conclude that the dimensionality per token of the latent space has a much larger impact on probing accuracy than the chosen autoencoder. The final deconstruction stage converts the DDM to predict the denoised input instead of the added noise, removes input scaling, and changes the diffusion model to operate directly in the pixel space. This final stage results in what the authors call the latent Denoising Autoencoder (l-DAE). They conclude that representation learning abilities are largely driven by the denoising process rather than the diffusion process.

l-DAE is inspired by the observation that diffusion models resemble hierarchical autoencoders with varying noise scales. This insight is also applied in DiffAE [134], which uses diffusion models for representation learning via autoencoding. Preechakul et al. [134] separate latent representations into a compact semantic representation and a stochastic representation. DiffAE consists of a semantic encoder, which generates a semantic representation $\mathbf{z}_{\text{sem}}$, as well as a conditional DDIM [152]. This DDIM acts both as the stochastic encoder, which maps $\mathbf{x}_{0}$ to $\mathbf{x}_{T}$, and as the decoder, which maps $\mathbf{x}_{T}$ to $\mathbf{x}_{0}$. $\mathbf{x}_{T}$ represents the stochastic representation and captures low-level variation, whereas $\mathbf{z}_{\text{sem}}$ encodes higher-level semantics. To facilitate unconditional sampling, Preechakul et al. [134] fit a second latent DDIM to $\mathbf{z}_{\text{sem}}$ and sample from this latent DDIM and $\mathbf{x}_{T}$ during inference. Variations in $\mathbf{x}_{T}$ with fixed $\mathbf{z}_{\text{sem}}$ result in minor changes in generated images, while varying $\mathbf{z}_{\text{sem}}$ leads to different reconstructions, showing DiffAE's efficiency in generating semantically meaningful and decodable representations. InfoDiffusion [165] extends DiffAE, supporting custom priors and improving latent representations $\mathbf{z}_{\text{sem}}$ via mutual information regularization.

Zhang et al. [186] observe that there is a gap between the true and the predicted posterior mean of $\mathbf{x}_{t-1}$ when predicting from $\mathbf{x}_{t}$ in the diffusion reverse process. Classifier guidance can be viewed as reconstructing information lost in the diffusion forward process by shifting the posterior mean to fill that gap. They propose Pre-trained DPM AutoEncoding (PDAE), a method for adapting DPMs to decoders for image reconstruction. Instead of using a class label $\mathbf{y}$ to fill this information gap, PDAE employs a model to predict the mean shift according to encoded representations $\mathbf{z}$, ensuring that $\mathbf{z}$ contains as much information as possible from $\mathbf{x}_{0}$. Specifically, Zhang et al. [186] employ an encoder $E_{\phi}(\mathbf{x}_{0})=\mathbf{z}$ along with a gradient estimator $G_{\psi}(\mathbf{x}_{t},\mathbf{z},t)$ that simulates $\nabla_{\mathbf{x}_{t}}\log p(\mathbf{z}\mid\mathbf{x}_{t})$ to modify the conditional DPM training objective. This modified objective forces the predicted mean shift to fill the aforementioned posterior mean gap. With a trained $G_{\psi}(\mathbf{x}_{t},\mathbf{z},t)$, the score of the implicit classifier $p(\mathbf{z}\mid\mathbf{x}_{t})$ can be used analogously to classifier-guided sampling. PDAE is evaluated using experiments similar to those in [134] and exhibits improved training efficiency and performance.

Pan et al. [130] propose a different method for DDM reconstruction. They introduce a masked diffusion model (MDM), designed for self-supervised semantic segmentation. MDM substitutes the conventional diffusion process with a masking mechanism inspired by the masked autoencoder [61]. The representations learned by the pre-trained MDM are extracted following Baranchuk et al. [15]. The proposed MDM is a variant of a time-dependent denoising autoencoder that takes a masked input image and subsequently reconstructs the uncorrupted image. While other DDMs and MAE use an MSE reconstruction loss, Pan et al. [130] propose using the structural similarity index (SSIM) loss. This is done to narrow the gap between reconstruction and subsequent segmentation tasks. MDM is pre-trained on a set of unlabeled images using the described self-supervised approach. The learned representations are then extracted to train an MLP-based classification head on a smaller labeled dataset. Features based on a specific block setting $\mathcal{B}$ are extracted by selecting the activation maps from each of the specified blocks, upsampling activation maps to match the image size, and concatenating the activations. The method achieves state-of-the-art results against existing supervised segmentation methods on multiple benchmark datasets even when only 10% of labels are available. DiffMAE [166] is a similar approach that uses a conditional generative objective, where the distribution of the masked pixels $\mathbf{x}_{0}^{m}$ conditioned on the visible pixels $\mathbf{x}_{0}^{v}$ is modeled, and diffusion is only applied to masked regions.
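
The idea of applying diffusion only to masked regions, as in DiffMAE, can be sketched on a flattened image. The example assumes the standard variance-preserving forward process, in which a noised value is $\sqrt{\bar{\alpha}_{t}}\,x_{0}+\sqrt{1-\bar{\alpha}_{t}}\,\epsilon$; the function name and the flat list layout are illustrative choices rather than details of [166].

```python
import math
import random


def noise_masked_only(x0, mask, alpha_bar_t, rng):
    """Diffuse only the masked positions of a flattened image.

    mask[i] is True where a position is masked (and therefore noised);
    visible positions are returned unchanged so they can act as clean
    conditioning for the denoiser.
    """
    noised = []
    for value, is_masked in zip(x0, mask):
        if is_masked:
            eps = rng.gauss(0.0, 1.0)
            noised.append(math.sqrt(alpha_bar_t) * value
                          + math.sqrt(1.0 - alpha_bar_t) * eps)
        else:
            noised.append(value)
    return noised
```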

Hudson et al. [82] introduce a novel view generation learning goal as well as a bottleneck layer to aid representation learning. They present SODA, a self-supervised diffusion model that consists of an encoder and a denoising decoder. The encoder produces a concise latent representation, which is used for denoising decoder guidance by modulation of the decoder activations. The encoder $E(\mathbf{x})$ converts an input view $\mathbf{x}$ into a compressed latent representation $\mathbf{z}$, which is used to generate a novel output view $\mathbf{x}^{\prime}$ relating to the input $\mathbf{x}$. $\mathbf{x}^{\prime}$ is created through a diffusion process conditioned on the latent representation $\mathbf{z}$ via feature modulation. In addition to this, the authors use layer modulation, where the latent representation is partitioned, with each partition $\mathbf{z}_{i}$ modulating a specific pair of layer activations. This enables further specialization among the latent subvectors, where some are optimized to capture finer levels of granularity than others. During training, Hudson et al. [82] opt to randomly zero out a subset of the latent subvectors, effectively implementing a layer-wise generalization of classifier-free guidance. This further increases control over the generative process since the trained model can then be conditioned using a curated subset of latent subvectors.
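
A minimal sketch of the partitioning and random masking of latent subvectors is given below. It assumes equally sized partitions and an independent drop probability per subvector; both are illustrative assumptions, and the way each subvector modulates its layers is intentionally left out.

```python
import random


def partition_latent(z, num_parts):
    """Split latent z into num_parts equally sized subvectors z_i."""
    if len(z) % num_parts != 0:
        raise ValueError("latent size must be divisible by num_parts")
    size = len(z) // num_parts
    return [z[i * size:(i + 1) * size] for i in range(num_parts)]


def mask_subvectors(subvectors, drop_prob, rng):
    """Zero out each subvector independently with probability drop_prob.

    Applied during training, so the decoder also learns to denoise when
    only some of the layer-specific latents are available.
    """
    return [[0.0] * len(s) if rng.random() < drop_prob else list(s)
            for s in subvectors]
```

At sampling time, the same masking function could be used deterministically to condition on a curated subset of subvectors.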

SmoothDiffusion [58] is a work focusing on improving the smoothness of the latent space of diffusion models, which refers to the consistency of perturbations in the latent and the image space. SmoothDiffusion enforces smoothness over its latent space by proposing a novel step-wise variation regularization method in training. The resulting smoothed latents benefit a wide range of image interpolation, image inversion and image editing tasks.

3.1.5 Joint diffusion models

Many current diffusion-based representation learning methods focus on using the diffusion model's latent variables to benefit the training of a separate recognition network. These frameworks are conceptually equivalent to constructing hybrid models that solely concentrate on synthesis in the pre-training stage, and on downstream recognition in the post-training/fine-tuning phase. The recognition head and the diffusion denoising network do not share a parametrization, and the recognition head is often trained separately while keeping the weights of the denoising network frozen. A natural question that arises is whether this separation is necessary and whether approaches that optimize a generative and a discriminative objective simultaneously in a shared parametrization can improve representation learning.

HybViT [174] is an approach that establishes a direct connection between diffusion models and vision transformers by training a single hybrid model for both image classification and image generation. This hybrid model uses a shared parametrization for image classification and reconstruction. The authors use a ViT backbone to train a model with a combined loss $\mathcal{L}$ consisting of a standard cross-entropy loss to train $p(y\mid\mathbf{x})$ and the simple denoising loss to train $p(\mathbf{x})$. HybViT provides stable training and outperforms previous hybrid models on both generative and discriminative tasks, but lags behind generative-only models in generation quality. HybViT also requires more training iterations to achieve high classification performance, and the sampling speed during inference is slow.

Joint Diffusion Models (JDM) [40] is a related work that produces meaningful representations across generative and discriminative tasks. Using a U-Net backbone, JDM consists of an encoder $e_{\nu}$, a decoder $d_{\psi}$, and a classifier $g_{\omega}$. The encoder maps an input $\mathbf{x}_{t}$ to feature vectors $\mathbf{Z}_{t}=e_{\nu}(\mathbf{x}_{t})$. The decoder reconstructs these into a denoised sample $\mathbf{x}_{t-1}=d_{\psi}(\mathbf{Z}_{t})$, and the classifier predicts the target class $\hat{y}=g_{\omega}(\mathbf{Z}_{t})$. The combined training objective includes the cross-entropy loss $L_{\text{class}}$ and the noise prediction network's simplified objective $L_{t,\text{diff}}(\nu,\psi)$, resulting in the following loss:

$$L(\nu,\psi,\omega)=L_{\text{class}}(\nu,\omega)-L_{0}(\nu,\psi)-\sum_{t=2}^{T}L_{t,\text{diff}}(\nu,\psi)-L_{T}(\nu,\psi).$$

JDM also enables a simplification of classifier guidance. By applying the classifier to noisy images $\mathbf{x}_{t}$, the classifier is effectively augmented to be robust to noise. To guide the generated sample towards a target label, representations $\mathbf{Z}_{t}$ are optimized according to the classifier gradient, giving $\mathbf{Z}_{t}^{\prime}=\mathbf{Z}_{t}-\alpha\nabla_{\mathbf{Z}_{t}}\log g_{\omega}(\mathbf{y}\mid\mathbf{Z}_{t})$. JDM achieves state-of-the-art performance for joint models on CIFAR and CelebA datasets, outperforming HybViT.
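
The guidance step on the representation can be illustrated with a toy classifier. The sketch assumes a linear-softmax classifier with weight rows `weights[k]` in place of $g_{\omega}$, and it moves $\mathbf{Z}_{t}$ in the direction that raises the log-probability of the target class. The sign in front of $\alpha$ depends on whether the classifier term is expressed as a log-probability or as a loss, so the direction used here should be read as an assumption of this example.

```python
import math


def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def target_log_prob(z, weights, target):
    """log g(y = target | z) for a linear-softmax classifier."""
    logits = [sum(w * v for w, v in zip(row, z)) for row in weights]
    return log_softmax(logits)[target]


def guide_representation(z, weights, target, alpha):
    """One guidance step that increases log g(target | z)."""
    logits = [sum(w * v for w, v in zip(row, z)) for row in weights]
    probs = [math.exp(lp) for lp in log_softmax(logits)]
    # d/dz log p(target | z) = W[target] - sum_k p_k * W[k]
    grad = [weights[target][d] - sum(p * row[d] for p, row in zip(probs, weights))
            for d in range(len(z))]
    return [v + alpha * g for v, g in zip(z, grad)]
```

The guided representation would then be passed to the decoder in place of the original one to produce the next denoised sample.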

Tian et al. [158] propose the Alternating Denoising Diffusion Process (ADDP). ADDP alternately denoises pixels and VQ tokens. Given an image $\mathbf{x}_{0}$, a pre-trained VQ Encoder [26] maps the image to VQ tokens $\mathbf{z}_{0}$. The alternating diffusion process masks regions of $\mathbf{z}_{0}$ with a Markov chain according to diffusion timestep $t$, producing $\mathbf{z}_{t}$. Unreliable tokens $\bar{\mathbf{z}}_{t}$ are generated by a token predictor and fed into a VQ Decoder to synthesize $\mathbf{x}_{t}$, replacing the masked regions of $\mathbf{z}_{0}$. A pixel-to-token generation network is then trained to approximate the distribution of $\bar{\mathbf{z}}_{t-1}$. During sampling, ADDP starts with a representation of pure unreliable tokens $\bar{\mathbf{z}}_{T}$ and iteratively denoises the token sequence by predicting $\bar{\mathbf{z}}_{t-1}$. For recognition, the representations learned by the pixel-to-token generation network can be forwarded to different task-specific recognition heads. ADDP, with the VQGAN tokenizer [47], the MAGE-Large [99] token predictor, and the ViT-Large [45] pixel-to-token encoder, outperforms previous unified models in image classification, object detection, semantic segmentation, and unconditional generation.

3.1.6 Generative augmentation

Many state-of-the-art representation learning methods [60, 55, 33] rely on a fixed set of data augmentations to define positive labels for learning representations. This approach encourages encoders to learn to map the original and the augmented image to similar embedding space representations [10]. These augmentations should not alter the semantics of the image, and they should not render the image unrealistic in a real-world setting. A set of standard transformations might not adequately capture the distribution of real-world data, raising the question of how to design transformations that create diverse images and improve the generalization of learned representations.

Ayromlou et al. [10] propose using latent diffusion models [138] to generate novel views of the original image that preserve the semantic content, while closely following the distribution of real images. This augmentation method is denoted by:

$$T_0(\mathbf{x}) = \begin{cases} G(\mathbf{z}; \phi(\mathbf{x})) & \text{if } p \leq p_0 \\ \mathbf{x} & \text{otherwise,} \end{cases} \tag{18}$$

where $G$ denotes a conditional generative model taking noise vector $\mathbf{z} \sim \mathcal{N}(0, \mathbf{I})$ and condition vector $\phi(\mathbf{x})$ as inputs. $\phi$ is a pre-trained image encoder such as CLIP [136], $p \in [0, 1]$ is a random number and $p_0$ is a hyperparameter specifying the probability of applying the augmentation. Ayromlou et al. [10] show that generative augmentation leads to consistent improvements in learned representations over standard transformations across several representation learning techniques.
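The augmentation rule in Eq. (18) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of Ayromlou et al. [10]: `generate` and `encode` are hypothetical stand-ins for the conditional generator $G$ and the encoder $\phi$, images are plain lists of floats, and the noise vector is assumed to share the dimension of the condition vector.

```python
import random
from typing import Callable, List, Optional

Image = List[float]


def generative_augment(
    x: Image,
    generate: Callable[[List[float], List[float]], Image],
    encode: Callable[[Image], List[float]],
    p0: float,
    rng: Optional[random.Random] = None,
) -> Image:
    """Apply T_0 from Eq. (18): return G(z; phi(x)) if p <= p0, otherwise x."""
    rng = rng or random.Random()
    p = rng.random()  # random number p drawn uniformly from [0, 1)
    if p <= p0:
        cond = encode(x)                          # condition vector phi(x)
        z = [rng.gauss(0.0, 1.0) for _ in cond]   # z ~ N(0, I); dimension is an assumption
        return generate(z, cond)
    return x


# Hypothetical toy components, used only to make the sketch executable.
def toy_encode(x: Image) -> List[float]:
    return list(x)


def toy_generate(z: List[float], cond: List[float]) -> Image:
    # Perturbs the condition slightly, mimicking a "novel view" of the input.
    return [c + 0.1 * n for c, n in zip(cond, z)]
```

With `p0 = 0` the input is always returned unchanged, and with `p0 = 1` every sample is replaced by a generated view, so `p0` directly trades off real against synthetic views.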

Shipard et al. [150] take this approach one step further, using Stable Diffusion to generate a fully synthetic dataset to improve model-agnostic zero-shot classification (MA-ZSC). Several prompt variations are employed to increase the diversity of the synthetic dataset. An image classifier is subsequently trained on this synthetic dataset and evaluated on zero-shot classification on CIFAR10, CIFAR100, and EuroSAT [64]. Shipard et al. [150] observe substantial improvements on these datasets regardless of the classifier architecture, achieving performance comparable to state-of-the-art zero-shot classification methods like CLIP.

Moving beyond classification, Schnell et al. [148] apply similar ideas to scribble-supervised segmentation [104, 129], a weakly-supervised form of semantic segmentation that uses sparse annotations in the form of scribbles drawn over the images. They introduce ScribbleGen, a diffusion model conditioned on semantic scribbles that generates synthetic training images for data augmentation. ScribbleGen utilizes a ControlNet [185] denoising diffusion model for noise prediction given $\mathbf{x}_t$ and conditioning signal $\mathbf{c}$. Classes are denoted by different scribble colors in RGB images, and the conditioning signal $\mathbf{c}$ is supplemented by a text prompt stating all classes in the image. Schnell et al. [148] trade off photorealism and image diversity by introducing an encode ratio $\lambda \in [0, 1]$. This diffusion parameter controls the number of noise-adding forward diffusion steps: $\lambda = 1$ leaves the forward process unchanged, whereas $\lambda < 1$ shortens it to $\lambda \cdot T$ steps, meaning less noise is added to the input image. The authors evaluate both a fixed and an adaptive $\lambda$, where the encode ratio is gradually increased to provide increasingly diverse synthetic images during training. ScribbleGen achieves state-of-the-art performance on the PASCAL VOC12 segmentation dataset [48] using scribbles from ScribbleSup [104].
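The effect of the encode ratio can be illustrated with the standard DDPM forward process, $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon$. The sketch below is an assumption-laden illustration rather than the ScribbleGen code: the linear noise schedule, its endpoints, and the function names `alpha_bar` and `encode_with_ratio` are hypothetical choices.

```python
import math
import random
from typing import List, Tuple


def alpha_bar(t: int, T: int, beta_min: float = 1e-4, beta_max: float = 0.02) -> float:
    """Cumulative product of (1 - beta_s) for s = 1..t under a linear schedule (assumed)."""
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_min + (beta_max - beta_min) * (s - 1) / max(T - 1, 1)
        prod *= 1.0 - beta
    return prod


def encode_with_ratio(
    x0: List[float], lam: float, T: int, rng: random.Random
) -> Tuple[List[float], int]:
    """Noise x0 for round(lam * T) forward steps and return (x_t, t)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("encode ratio must lie in [0, 1]")
    t = round(lam * T)
    a = alpha_bar(t, T)
    # Closed-form forward diffusion: less noise is mixed in for smaller lam.
    xt = [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0]
    return xt, t
```

A small `lam` keeps the noised sample close to the input image, so the synthetic image stays faithful to it, while a larger `lam` destroys more of the input and leaves more freedom to the generator.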

DiffuMask [167] is another generative augmentation method designed to improve downstream semantic segmentation tasks. The idea here is to exploit cross-attention maps between text prompts and generated images to extend image synthesis to semantic mask generation. Synthetically generated masks are used for data augmentation to improve downstream segmentation performance. Individual token attention maps of all layers are averaged and converted to binary masks using an adaptive threshold mechanism based on an AffinityNet [4]. Additionally, a noise-learning module prunes low-quality segmentation masks, and the authors employ several prompt engineering and static image transformations to further enhance the diversity of the generated images and corresponding segmentation masks.

3.2 Representation Learning for Diffusion Model Guidance

Figure 5: A hierarchical overview of current diffusion model training frameworks that leverage representation learning techniques for conditional generation and guidance.

Despite the remarkable performance of generative models, there exists a gap in quality between conditional and unconditional image generation approaches [25]. This is especially the case for GANs [53], which suffer from mode collapse when trained in a fully unsupervised setting [110]. Unconditional GANs often fail to accurately model multi-modal distributions, e.g. not being able to generate all digits for MNIST [110]. Class-conditional GANs [22, 123] mitigate this issue, but require labeled data. Recent approaches like self-conditioned GANs [110] and instance-conditioned GANs [25] attempt to train conditional GANs without requiring labeled data, and are able to achieve competitive generation results.

Diffusion models have since surpassed the image generation capabilities of GANs [42], but suffer from a similar performance discrepancy between conditional and fully self-supervised approaches. Current state-of-the-art diffusion models are conditional models that rely on guidance approaches that also require annotated data. Self-supervised guidance approaches can leverage much larger unlabeled datasets for pre-training, and thus have the potential to transcend current image generation approaches. One intuitive approach for leveraging representation learning to facilitate these guidance methods is to explore methods that assign labels to unlabeled data, e.g. through clustering and classification approaches. We introduce several approaches in the following section. Fig. 5 shows a proposed taxonomy of representation learning techniques for diffusion guidance.

3.2.1 Assignment-based guidance

Sheynin et al. [149] propose kNN-Diffusion, an efficient text-to-image diffusion model trained without large-scale image-text pairings. To facilitate text-guided image generation without paired text-image data, a shared text-image encoder mapping text-image pairs into the same latent space is required. The authors use CLIP for this purpose, a pre-trained encoder trained with a contrastive loss on a large-scale text-image pair dataset. kNN-Diffusion leverages $k$-Nearest-Neighbors search to generate $k$ embeddings from a retrieval model. The retrieval model uses the input image representation during training, and the text prompt representation during inference. This approach eliminates the need for annotated data but still requires a pre-trained encoder like CLIP, which in turn requires a large-scale dataset of text-image pairs for pre-training.

Blattmann et al. [20] propose retrieval-augmented diffusion models (RDM), which equip diffusion models with an image database for composing new scenes based on retrieved images. Inspired by advances in retrieval-augmented NLP [21, 168], RDM enhances performance with fewer parameters and computational resources. Despite being trained only on images, RDM allows conditional synthesis due to the shared image-text feature space of CLIP [136]. RDM includes a trainable conditional latent diffusion model $p_\theta$, an external image database $\mathcal{D}$, and a fixed sampling strategy $\xi_k$ that selects a subset $\mathcal{M}_{\mathcal{D}}^{(k)}$ of $\mathcal{D}$ based on a query $\mathbf{x}$. One strategy $\xi_k(\mathbf{x}, \mathcal{D})$ is to retrieve the $k$ nearest neighbors using a distance function $d(\cdot, \mathbf{x})$. The retrieved data is processed through a frozen image encoder $\phi$ and used to condition $p_\theta$. During training, $\xi_k$ retrieves $k$ nearest neighbors for a query image $\mathbf{x}$ using cosine similarity in CLIP’s image feature space as the distance function $d(\mathbf{x}, \mathbf{y})$. This approach ensures that retrieved image representations are useful for generation tasks and allows for text conditioning due to CLIP’s shared feature space. The dataset $\mathcal{D}$ and retrieval strategy $\xi_k$ can be changed at test time, adding flexibility for different conditioning modalities and adaptability to other data distributions.
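The nearest-neighbor retrieval step shared by kNN-Diffusion and RDM can be sketched as follows. This is a minimal illustration that assumes the embeddings have already been computed by some encoder such as CLIP; the function names `cosine_similarity` and `retrieve_knn` are hypothetical, and a real system would use an approximate nearest-neighbor index rather than a full sort.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def retrieve_knn(
    query: Sequence[float], database: List[Sequence[float]], k: int
) -> List[int]:
    """Return the indices of the k database embeddings most similar to the query."""
    ranked = sorted(
        range(len(database)),
        key=lambda i: cosine_similarity(query, database[i]),
        reverse=True,
    )
    return ranked[:k]
```

Because only embeddings are compared, the same routine serves an image query during training and a text query during inference, provided both modalities are embedded in a shared space.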

Hu et al. [75] propose a method also motivated by eliminating the need for annotated data. Self-guided diffusion is a framework encompassing a feature extraction function $g_\phi$ and a self-annotation function $f_\psi$. The feature extraction function is a self-supervised feature extractor that maps the input data $\mathbf{x} \in \mathcal{D}$ to a feature space $\mathcal{H}$, where $\mathcal{D}$ denotes the dataset. This feature representation is the input of $f_\psi$, which maps $g_\phi(\mathbf{x}; \mathcal{D})$ to a guidance signal $k$. This framework can be applied to achieve self-labeled guidance, where $k$ is a one-hot embedding derived using $k$-means clustering as the self-annotation function $f$ on compacted features generated by $g_\phi$. More fine-grained spatial guidance is achieved by self-boxed guidance, which uses a mapping from feature space $\mathcal{H}$ to a bounding box as the self-annotation function $f$, as well as self-segmented guidance, which uses a mapping to a segmentation mask to generate guidance signals by clustering. Self-guidance significantly outperforms unconditional diffusion models, and even outperforms classifier-free guided diffusion models that use ground-truth annotations on image generation. This suggests that the clusters are potentially more aligned with the visual similarity of the images, and are better guidance signals than ground-truth labels alone. While this approach is self-supervised, it still relies on an external pre-trained feature extractor to generate feature representations for clustering.
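Self-labeled guidance can be sketched as clustering followed by one-hot encoding. The code below is a toy illustration, assuming features have already been extracted by a self-supervised encoder; the plain Lloyd-style $k$-means, the fixed iteration count, and the function names are assumptions rather than the implementation of Hu et al. [75].

```python
import random
from typing import List, Sequence


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(features: List[List[float]], k: int, iters: int = 20, seed: int = 0) -> List[int]:
    """Plain k-means; returns one cluster index per feature vector."""
    rng = random.Random(seed)
    centroids = [list(f) for f in rng.sample(features, k)]
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: sq_dist(f, centroids[c])) for f in features]
        for c in range(k):
            members = [f for f, a in zip(features, assign) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return [min(range(k), key=lambda c: sq_dist(f, centroids[c])) for f in features]


def one_hot(index: int, k: int) -> List[float]:
    return [1.0 if i == index else 0.0 for i in range(k)]


def self_labeled_guidance(features: List[List[float]], k: int) -> List[List[float]]:
    """Self-annotation function: cluster the features and emit one-hot guidance signals."""
    return [one_hot(a, k) for a in kmeans(features, k)]
```

The resulting one-hot vectors play the role that class labels play in class-conditional training, which is why no human annotation is needed.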

For this reason, Hu et al. [73] extend their work to propose an online feature clustering method using the Sinkhorn-Knopp algorithm. This is challenging since the idea requires obtaining conditioning signals for clustering during training from a diffusion model that is itself dependent on this conditioning. This issue is solved by introducing a zero vector into the conditional diffusion model for the signals used to identify the clustering. For each image, the conditional diffusion model conditioned on this zero vector is followed by a fully-connected feature prediction head that computes features, which are mapped to a set of learnable prototypes denoted $M$. This method uses a combination of the diffusion training loss and a Sinkhorn-Knopp loss to obtain guidance signals $\mathbf{c}$ that are based on clustering features using $M$. The method is promising: it outperforms related unconditional generation baselines on ImageNet256 and LSUN-Churches while being competitive with class guidance methods that rely on ground truth labels. The online approach specifically does not rely on ground truth labels or any external pre-trained models. Adaloglou et al. [2] build on the aforementioned cluster-based guidance approaches by utilizing EDM [85], TEMI clustering [1] and a method for deriving an upper cluster bound for feature-based clustering.
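The Sinkhorn-Knopp iteration itself is short. The sketch below shows how it is commonly used for balanced cluster assignment in self-supervised learning: sample-to-prototype scores are turned into soft assignments in which every prototype receives equal total mass. How Hu et al. [73] integrate it into their loss is not reproduced here; the temperature `eps`, the iteration count, and the function name are assumptions.

```python
import math
from typing import List


def sinkhorn_knopp(
    scores: List[List[float]], eps: float = 1.0, iters: int = 50
) -> List[List[float]]:
    """Map an n x k score matrix to soft assignments with balanced prototype usage."""
    n, k = len(scores), len(scores[0])
    m = max(max(row) for row in scores)  # subtract the max for numerical stability
    # Note: a very small eps can underflow entries to zero; the value here is arbitrary.
    q = [[math.exp((s - m) / eps) for s in row] for row in scores]
    for _ in range(iters):
        # Column step: each prototype should receive total mass n / k.
        col_sums = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] * (n / k) / col_sums[j] for j in range(k)] for i in range(n)]
        # Row step: each sample distributes a total mass of 1.
        q = [[v / sum(row) for v in row] for row in q]
    return q
```

Without the column step, all samples in the usage below would prefer the first prototype; the balancing constraint is what prevents such a collapse onto a single cluster.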

Other approaches to diffusion model guidance rely on generating pseudo-labels for unlabeled data. You et al. [176] propose dual pseudo training (DPT), which uses a classifier trained on limited labeled data to generate pseudo-labels. These are then used to condition a diffusion model to generate pseudo images, which are in turn used as data augmentation to retrain a classifier on a mix of pseudo and real images. DPT involves three stages. First, a semi-supervised classifier is trained on partially labeled data to predict pseudo-labels $\hat{\mathbf{y}}$ for all images $\mathbf{x} \in \mathcal{X}$. Second, a conditional generative model is trained on the pseudo-labeled dataset $S_1 = \{(\mathbf{x}, \hat{\mathbf{y}}) \mid \mathbf{x} \in \mathcal{X}\}$. Finally, the classifier is retrained on real data that is augmented by the generated data. DPT achieves highly competitive performance on ImageNet classification and generation with as few as five labels per class, outperforming several supervised diffusion model benchmarks like ADM [42] and LDM [138].
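The three DPT stages can be traced end to end with toy components. In the sketch below, a nearest-class-mean classifier stands in for the semi-supervised classifier and "class mean plus Gaussian noise" stands in for the conditional diffusion model; both substitutions, the noise scale, and all function names are hypothetical simplifications chosen only to keep the example executable.

```python
import random
from typing import Dict, List, Tuple

Point = List[float]


def fit_class_means(data: List[Tuple[Point, int]]) -> Dict[int, Point]:
    """Toy 'training': store the mean feature vector of every class."""
    groups: Dict[int, List[Point]] = {}
    for x, y in data:
        groups.setdefault(y, []).append(x)
    return {y: [sum(col) / len(xs) for col in zip(*xs)] for y, xs in groups.items()}


def predict(means: Dict[int, Point], x: Point) -> int:
    """Assign x to the class with the nearest mean."""
    return min(means, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, means[y])))


def dual_pseudo_training(
    labeled: List[Tuple[Point, int]],
    unlabeled: List[Point],
    n_generated: int,
    rng: random.Random,
) -> Dict[int, Point]:
    # Stage 1: a classifier trained on the labeled subset predicts pseudo-labels.
    clf = fit_class_means(labeled)
    all_x = [x for x, _ in labeled] + unlabeled
    pseudo = [(x, predict(clf, x)) for x in all_x]
    # Stage 2: a conditional generator is fitted on the pseudo-labeled data
    # (toy stand-in: per-class mean plus Gaussian noise).
    gen_means = fit_class_means(pseudo)
    generated = [
        ([m + rng.gauss(0.0, 0.1) for m in gen_means[y]], y)
        for y in gen_means
        for _ in range(n_generated)
    ]
    # Stage 3: the classifier is retrained on real data augmented by generated data.
    return fit_class_means(labeled + generated)
```

The key property the sketch preserves is the direction of information flow: the classifier labels data for the generator, and the generator then supplies additional training data for the classifier.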

3.2.2 A generalized framework for assignment-based guidance

Figure 6: A generalization of assignment-based guidance training and sampling pipelines. Samples are conditioned on annotations generated by a self-annotation function $f$, using features extracted by a pre-trained image encoder (e.g., CLIP [136]).

Assignment-based guidance approaches all rely on assigning annotation to inputs during training, which enables controlled generation during inference when conditioning on this annotation. We therefore propose to formulate a generalized framework that encapsulates all assignment-based guidance approaches discussed here. This framework consists of three main components. The first is a self-supervised image encoder $\mathcal{E}(\mathbf{x})$, that maps inputs to a low-dimensional feature representation $\mathbf{z}$. Using a multi-modal feature extractor like CLIP has the advantage of enabling text-based as well as image-based conditioning, but other feature extractors can be used, provided they generate semantically meaningful image representations.

The second is a self-annotation function $f(\mathbf{z})$, which uses the image representation to produce annotation $\mathbf{c}$ for input image $\mathbf{x}$. In the simplest case, this self-annotation function is an external pre-trained image classifier that generates pseudo-class labels from image representations, similar to the approach employed in DPT [176], where the external classifier is subsequently re-trained on the conditionally generated images. In other cases, the self-annotation function is a retrieval model, which uses a distance function $d$ to retrieve images similar to the training image, and uses representations of the retrieved images for generating the guidance signal $\mathbf{c}$.

The final component is a denoising network $\mathcal{D}_\theta(\mathbf{x}_t, \mathbf{c}, t)$, which takes the noisy image $\mathbf{x}_t$, the diffusion timestep $t$ and the guidance signal $\mathbf{c}$ as input, and denoises the image. During inference, controlled generation is enabled by passing an initial guidance signal $\mathbf{k}$ (which can be multi-modal as long as the embedding space of the encoder $\mathcal{E}$ is shared between modalities) through the encoder to generate representation $\mathbf{z} = \mathcal{E}(\mathbf{k})$. The conditioning signal $\mathbf{c}$ is then generated by passing $\mathbf{z}$ to the self-annotation function $f$ where $\mathbf{c} = f(\mathbf{z})$. Passing $\mathbf{x}_t$, $\mathbf{c}$ and $t$ to the denoising network $\mathcal{D}_\theta$ now enables synthesis of novel images semantically similar to the initial guidance signal $\mathbf{k}$.
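The three components of the generalized framework compose as plain functions, which the following sketch makes explicit. It is an interface illustration only: the callables `encoder`, `annotate`, and `denoiser` are hypothetical placeholders for the encoder, the self-annotation function, and the denoising network, and the denoiser is assumed to return the next, less noisy sample directly, which hides the details of any particular sampler.

```python
from typing import Callable, List

Vector = List[float]


def make_conditioner(
    encoder: Callable[[object], Vector],
    annotate: Callable[[Vector], Vector],
) -> Callable[[object], Vector]:
    """Compose c = f(E(k)) from an encoder E and a self-annotation function f."""
    def conditioner(k: object) -> Vector:
        z = encoder(k)      # representation z = E(k)
        return annotate(z)  # conditioning signal c = f(z)
    return conditioner


def sample(
    denoiser: Callable[[Vector, Vector, int], Vector],
    conditioner: Callable[[object], Vector],
    guidance_input: object,
    x_T: Vector,
    timesteps: int,
) -> Vector:
    """Run the reverse process from x_T, conditioning every step on c."""
    c = conditioner(guidance_input)
    x = x_T
    for t in reversed(range(1, timesteps + 1)):
        x = denoiser(x, c, t)  # assumed to map x_t to x_{t-1}
    return x


# Hypothetical toy denoiser that moves halfway toward the condition at each step.
def toy_denoiser(x: Vector, c: Vector, t: int) -> Vector:
    return [xi + 0.5 * (ci - xi) for xi, ci in zip(x, c)]
```

Swapping the `annotate` callable between a classifier, a clustering routine, or a retrieval model recovers the different assignment-based methods without touching the sampling loop.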

One of the main motivations behind the design of assignment-based guidance methods is the reliance of existing methods on labeled data. While it could be argued that the aforementioned assignment-based guidance approaches are indirectly reliant on annotated data through the pre-trained image encoder, it is important to note that this encoder can be replaced with a fully self-supervised encoder as well. CLIP relies on the availability of a large-scale dataset of image-caption pairs and is thus not fully self-supervised, but other representation learning methods are also able to generate semantic representations. CLIP is used in many approaches to facilitate both text prompt-based and image conditioning during inference, which may no longer be possible when using primarily image-based feature extractors. A summary of the training and inference methodology can be found in Fig. 6.

3.2.3 Representation-based guidance

Li et al. [100] present Representation-Conditioned Image Generation (RCG), a framework conditioning diffusion models on a self-supervised representation distribution mapped from the image distribution using a pre-trained encoder. The idea is to train a Representation Diffusion Model (RDM) on the representations generated by a pre-trained encoder to generate low-dimensional image representations. After this, a pixel generator conditioned on the representation is trained to map noise distributions to image distributions. RCG consists of three main components. The first is a pre-trained image encoder, which converts the original image distribution into a representation distribution. The authors propose using self-supervised contrastive learning methods (e.g. MoCo v3) for generating this representation distribution. The second is a representation generator in the form of an RDM, which learns to generate representations from Gaussian noise following the DDIM [152] sampling process. The final component is a pixel generator that crafts image pixels conditioned on image representations. RCG can easily incorporate classifier-free guidance for unconditional generation tasks, since the pixel generator is conditioned on self-supervised representations. RCG emerges as a highly promising method for bridging the gap between conditional and unconditional image generation, outperforming pre-existing unconditional generation approaches on ImageNet, and exhibiting competitive performance with current state-of-the-art class-conditional approaches.

Readout Guidance (RG) [117] makes use of auxiliary readout heads trained on top of a frozen diffusion model to extract properties of the generated image that can be used for guidance. These properties can include human pose, depth maps, edges, and even higher-order properties like similarity to another image. During sampling, the properties extracted by the readout heads can be compared to user-defined control targets, and used in a methodology similar to classifier guidance [43] to guide generation.

Lin and Yang [105] identified a novel self-perceptual objective that enhances diffusion models, enabling them to generate more realistic samples. Contrary to the conventional approach of training or employing an image encoder, the authors demonstrate that a pre-trained diffusion model inherently functions as a perceptual network and can be used to generate perceptual representations. The perceptual loss facilitates the model’s ability to generate more realistic images even with unconditional synthesis.

Also inspired by the downsides of classifier guidance and classifier-free guidance, Hong et al. [69] introduce Self-Attention Guidance (SAG). SAG adversarially blurs regions that contain salient information by leveraging intermediate self-attention activation maps, using the residual information as guidance. This increases the generation quality without requiring external information or additional training. The self-attention mechanism, contained in both U-Net and DiT diffusion backbones, allows the noise predictor to attend to the most informative features of the input. The self-attention maps $\mathbf{A}_t^S \in \mathbb{R}^{N \times (HW) \times (HW)}$ are aggregated and reshaped to dimension $\mathbb{R}^{H \times W}$ using global average pooling and nearest-neighbor upsampling to match the resolution of $\mathbf{x}_t$. The difference between the blurred image $\tilde{\mathbf{x}}_t$ and $\mathbf{x}_t$ is used as conditioning, thereby retaining the information masked in this process.
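The attention aggregation and selective blurring in SAG can be sketched with nested lists. The sketch is a rough illustration, not the reference implementation: pooling over heads and queries, the externally supplied blurred image, and the scalar threshold are all assumptions made to keep the example small.

```python
from typing import List

Grid = List[List[float]]


def aggregate_attention(attn: List[List[List[float]]], h: int, w: int) -> Grid:
    """Average attn[head][query][key] over heads and queries, then reshape to h x w."""
    n_heads, hw = len(attn), h * w
    pooled = [
        sum(attn[n][q][k] for n in range(n_heads) for q in range(hw)) / (n_heads * hw)
        for k in range(hw)
    ]
    return [pooled[r * w:(r + 1) * w] for r in range(h)]


def upsample_nearest(grid: Grid, factor: int) -> Grid:
    """Nearest-neighbor upsampling by an integer factor."""
    return [
        [grid[r // factor][c // factor] for c in range(len(grid[0]) * factor)]
        for r in range(len(grid) * factor)
    ]


def masked_blend(x: Grid, x_blur: Grid, saliency: Grid, threshold: float) -> Grid:
    """Use the blurred pixel where saliency exceeds the threshold, otherwise keep x."""
    return [
        [xb if s > threshold else xv for xv, xb, s in zip(row_x, row_b, row_s)]
        for row_x, row_b, row_s in zip(x, x_blur, saliency)
    ]
```

Only the salient regions are degraded, so the gap between the predictions for the original and the partially blurred input isolates the information the model attends to most.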

3.2.4 Objective-based guidance

Many of the previously outlined approaches focus on eliminating the need for pre-trained classifiers, encoders and dataset annotations for training conditional diffusion models. Other recent works [86, 46] have demonstrated that internal diffusion model representations can be used to improve control over the structural and semantic composition of generated images.

One such approach is Self-guidance for Controllable Image Generation [46] (which we denote SGCIG to distinguish it from [75]). SGCIG is a zero-shot method designed to increase user control over structural and semantic elements of objects in images generated by text-to-image diffusion models. Incorporating similar ideas as [65], the authors of SGCIG leverage representations from intermediate activations and attention maps to steer the generation process. SGCIG works by adding a series of guidance terms to the objective of the denoising network that each define a series of properties that can be used to perform image manipulations. Image edits can then be carried out by guiding properties to change in the pixel generation process. While the method is limited to the manipulation of objects explicitly stated in the conditioning text prompt, it represents a promising first step towards increased control over generated images. Diffusion Handles [131] extend this to 3D object editing, using manipulated diffusion model activations to produce plausible edits.

Depth-aware guidance (DAG) [86] is a related method that uses semantic information from intermediate denoising network layers for improved depth-aware image synthesis. Kim et al. [86] propose training depth predictors with limited depth-labeled data using internal U-Net backbone representations, similar to DDPM-Seg [15]. The depth predictors are pixel-wise shallow MLP regressors estimating depth values from intermediate U-Net features $\mathbf{f}_t$ at timestep $t$. Features are concatenated across layers to form $\mathbf{g}_t$, with depth maps $\mathbf{d}_t = \text{MLP}(\mathbf{g}_t, t)$ generated using an appended time-embedding block. To guide the diffusion process toward depth-aware generation, two guidance strategies are introduced. Depth consistency guidance uses pseudo-labels with a consistency loss $\mathcal{L}_{\text{dc}}$ between weak and strong depth predictors, guiding the generation process using the gradient of $\mathcal{L}_{\text{dc}}$ with respect to $\mathbf{x}_t$ in a methodology similar to [42]. Depth prior guidance employs an additional small-resolution diffusion U-Net on the depth domain, adding noise to depth predictions and using a denoising objective $\mathcal{L}_{\text{dp}}$. The gradient of $\mathcal{L}_{\text{dp}}$ is treated like an external classifier gradient and added to the image generation objective. Combining both methods during training results in enhanced depth semantics in generated images.

Perturbed Attention Guidance (PAG) [3] is a sampling guidance method that improves generation quality for both conditional and unconditional settings. PAG does not require additional training or external pre-trained models. Instead, Ahn et al. [3] introduce an implicit discriminator $\mathcal{D}$ that differentiates between desirable and undesirable samples during the diffusion process, where $\mathbf{y}$ is a desirable and $\hat{\mathbf{y}}$ is an undesirable sample. The diffusion sampling process is then redefined to incorporate the derivative of the discriminator loss $\mathcal{L}_{\mathcal{D}}$. The score with undesirable label $\hat{\mathbf{y}}$ cannot be approximated using the existing denoising network $\epsilon_\theta(\mathbf{x}_t)$. Thus the score is estimated by perturbing the forward pass of a pre-trained denoising network, denoted by $\hat{\epsilon}_\theta$. PAG works by perturbing the self-attention maps in the diffusion U-Net, replacing them with an identity matrix to guide the sampling process away from degraded samples. The final noise prediction $\tilde{\epsilon}_\theta$ is obtained by feeding $\mathbf{x}_t$ into both $\epsilon_\theta(\cdot)$ and $\hat{\epsilon}_\theta(\cdot)$. PAG can also be combined with existing guidance methods like classifier guidance.
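The two ingredients of PAG, an attention map replaced by the identity and a combination of the two noise predictions, can be sketched as follows. This is a simplified illustration: the single-head attention on nested lists, the guidance scale `scale`, and the extrapolation rule `eps + scale * (eps - eps_perturbed)` (written here by analogy with classifier-free guidance) are assumptions, not a transcription of the method of Ahn et al. [3].

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(q: Matrix, k: Matrix, v: Matrix, perturb: bool = False) -> Matrix:
    """Scaled dot-product self-attention; with perturb=True the map is the identity."""
    n, d = len(q), len(q[0])
    if perturb:
        # Perturbed branch: every token attends only to itself, so the output equals v.
        weights = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    else:
        weights = [
            softmax([sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d) for j in range(n)])
            for i in range(n)
        ]
    return [
        [sum(weights[i][j] * v[j][c] for j in range(n)) for c in range(len(v[0]))]
        for i in range(n)
    ]


def pag_combine(eps: List[float], eps_perturbed: List[float], scale: float) -> List[float]:
    """Push the prediction away from the perturbed (degraded) branch."""
    return [e + scale * (e - ep) for e, ep in zip(eps, eps_perturbed)]
```

Setting `scale = 0` recovers the unguided prediction, while larger values move the sample further from what the structurally degraded network would produce.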

4 Challenges & Future Directions

4.1 General Challenges

Diffusion model-based representation learning is a novel research field with considerable potential for theoretical and practical improvements. Improving synergies between representation learning and generative models is akin to a chicken-and-egg problem, where better diffusion models lead to higher-quality image representations, and better representation learning methods in turn improve the generative quality of diffusion models when applied to self-supervised guidance methods. Improved online bootstrapping methods that provide guidance to diffusion models during training can be beneficial here.

To conserve computation in diffusion models [114], the sampling process has been significantly reduced to just a few steps [143, 145] or even a single step [155, 63, 118]. However, maintaining the potential of representation learning with few sampling steps presents a challenge.

4.2 Potential Future Directions

In many works discussed, the quality of representations learned by diffusion models is evaluated indirectly using task-specific metrics from auxiliary models. Interpretability and disentanglement are other important ways to evaluate representation efficacy and are currently underexplored. Methods enhancing the interpretability of the latent space can improve generation control and benefit a wide range of recognition tasks. We look towards methods on interpretable direction discovery as have been proposed for GANs in [163] for inspiration, and see similar approaches for diffusion models as promising. While there are some recent works focusing on disentangled and interpretable representation learning in diffusion models (e.g., [93, 180, 28]), we feel that this area remains underserved.

Current diffusion-based representation learning frameworks use U-Net and DiT backbones, which were primarily designed for generative tasks. Developing novel architectures tailored for representation learning is a promising area of research. Current transformer-based backbones are popular due to their scalability and performance, but their inability to perform parallel inference and the quadratic complexity of the attention mechanism are significant downsides, limiting their use for high-resolution images and long videos [76]. Techniques like windowing [112], sliding [16], and ring attention [108] help mitigate these issues, but complexity limitations remain. Recent works [171, 49, 76] have begun to utilize state-space diffusion models [56, 127], which offer linear complexity with respect to token sequence length, and are thus well suited to long token sequence modeling for both text [121] and images/video [97, 29]. The representation-learning capabilities of these models are yet to be fully analyzed, but we expect that conclusions drawn from diffusion models can also be applied to state-space models and their representation learning capabilities.

We also see significant room for further research in using other generative models for representation learning. Flow Matching models [107, 111, 5] have recently gained prominence for their ability to maintain straight trajectories during generation. This characteristic results in faster inference, making Flow Matching a suitable alternative for addressing trajectory issues encountered in diffusion models. Their versatility has been demonstrated across various applications, including image [78, 38], video [39], depth [57], human motion [74], audio [94], boosting diffusion models [50, 147, 156], and even text generation [77]. The close relationship between Diffusion and Flow Matching models suggests that many of the diffusion representation learning frameworks can also be applied to Flow Matching models.

Acknowledgments

We would like to thank Yuki Asano, Stefan Andreas Baumann, Timy Phan, and Frank Fundel for providing additional related literature.

References

  • Adaloglou et al. [2023] N. Adaloglou, F. Michels, H. Kalisch, and M. Kollmann, “Exploring the limits of deep image clustering using pretrained models,” in BMVC.   BMVA, 2023.
  • Adaloglou et al. [2024] N. Adaloglou, T. Kaiser, F. Michels, and M. Kollmann, “Rethinking cluster-conditioned diffusion models,” arXiv, 2024.
  • Ahn et al. [2024] D. Ahn, H. Cho, J. Min, W. Jang, J. Kim, S. Kim, H. H. Park, K. H. Jin, and S. Kim, “Self-Rectifying Diffusion Sampling with Perturbed-Attention Guidance,” arXiv, 2024.
  • Ahn and Kwak [2018] J. Ahn and S. Kwak, “Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation,” in CVPR, 2018, pp. 4981–4990.
  • Albergo and Vanden-Eijnden [2023] M. S. Albergo and E. Vanden-Eijnden, “Building normalizing flows with stochastic interpolants,” in ICLR, 2023.
  • Anand and Achim [2022] N. Anand and T. Achim, “Protein Structure and Sequence Generation with Equivariant Denoising Diffusion Probabilistic Models,” arXiv, 2022.
  • Anderson [1982] B. D. O. Anderson, “Reverse-time diffusion equation models,” Stochastic Processes and their Applications, vol. 12, no. 3, pp. 313–326, 1982.
  • Asano et al. [2019] Y. M. Asano, C. Rupprecht, and A. Vedaldi, “Self-labelling via simultaneous clustering and representation learning,” arXiv, 2019.
  • Austin et al. [2021] J. Austin, D. D. Johnson, J. Ho, D. Tarlow, and R. van den Berg, “Structured Denoising Diffusion Models in Discrete State-Spaces,” in NeurIPS, vol. 34, 2021.
  • Ayromlou et al. [2024] S. Ayromlou, A. Afkanpour, V. R. Khazaie, and F. Forghani, “Can Generative Models Improve Self-Supervised Representation Learning?” arXiv, 2024.
  • Ba et al. [2016] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer Normalization,” arXiv, 2016.
  • Bao et al. [2023a] F. Bao, S. Nie, K. Xue, Y. Cao, C. Li, H. Su, and J. Zhu, “All are worth words: A vit backbone for diffusion models,” in CVPR, 2023, pp. 22 669–22 679.
  • Bao et al. [2023b] F. Bao, S. Nie, K. Xue, C. Li, S. Pu, Y. Wang, G. Yue, Y. Cao, H. Su, and J. Zhu, “One transformer fits all distributions in multi-modal diffusion at scale,” in ICML.   PMLR, 2023, pp. 1692–1717.
  • Bar-Tal et al. [2023] O. Bar-Tal, L. Yariv, Y. Lipman, and T. Dekel, “MultiDiffusion: Fusing diffusion paths for controlled image generation,” in ICML, vol. 202.   PMLR, 2023, pp. 1737–1752.
  • Baranchuk et al. [2022] D. Baranchuk, A. Voynov, I. Rubachev, V. Khrulkov, and A. Babenko, “Label-efficient semantic segmentation with diffusion models,” in ICLR, 2022.
  • Beltagy et al. [2020] I. Beltagy, M. E. Peters, and A. Cohan, “Longformer: The Long-Document Transformer,” arXiv, 2020.
  • Ben-Hamu et al. [2024] H. Ben-Hamu, O. Puny, I. Gat, B. Karrer, U. Singer, and Y. Lipman, “D-Flow: Differentiating through Flows for Controlled Generation,” arXiv, 2024.
  • Bengio et al. [2013] Y. Bengio, L. Yao, G. Alain, and P. Vincent, “Generalized denoising auto-encoders as generative models,” in NeurIPS, vol. 26, 2013.
  • Benny and Wolf [2022] Y. Benny and L. Wolf, “Dynamic dual-output diffusion models,” in CVPR, 2022, pp. 11 482–11 491.
  • Blattmann et al. [2022] A. Blattmann, R. Rombach, K. Oktay, and B. Ommer, “Retrieval-augmented diffusion models,” arXiv, 2022.
  • Borgeaud et al. [2022] S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai, E. Rutherford, K. Millican, G. B. Van Den Driessche, J.-B. Lespiau, B. Damoc, A. Clark et al., “Improving language models by retrieving from trillions of tokens,” in ICML.   PMLR, 2022, pp. 2206–2240.
  • Brock et al. [2019] A. Brock, J. Donahue, and K. Simonyan, “Large scale gan training for high fidelity natural image synthesis,” in ICLR, 2019.
  • Brooks et al. [2023] T. Brooks, A. Holynski, and A. A. Efros, “Instructpix2pix: Learning to follow image editing instructions,” in CVPR, 2023, pp. 18 392–18 402.
  • Caron et al. [2021] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” in ICCV, 2021, pp. 9650–9660.
  • Casanova et al. [2021] A. Casanova, M. Careil, J. Verbeek, M. Drozdzal, and A. Romero Soriano, “Instance-conditioned gan,” in NeurIPS, 2021.
  • Chang et al. [2022] H. Chang, H. Zhang, L. Jiang, C. Liu, and W. T. Freeman, “Maskgit: Masked generative image transformer,” in CVPR, 2022.
  • Chang et al. [2023] Z. Chang, G. A. Koulieris, and H. P. H. Shum, “On the Design Fundamentals of Diffusion Models: A Survey,” arXiv, 2023.
  • Chefer et al. [2024] H. Chefer, O. Lang, M. Geva, V. Polosukhin, A. Shocher, M. Irani, I. Mosseri, and L. Wolf, “The Hidden Language of Diffusion Models,” in ICLR, 2024.
  • Chen et al. [2024b] G. Chen, Y. Huang, J. Xu, B. Pei, Z. Chen, Z. Li, J. Wang, K. Li, T. Lu, and L. Wang, “Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding,” arXiv, 2024.
  • Chen et al. [2016] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs,” arXiv, 2016.
  • Chen et al. [2017] ——, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2017.
  • Chen et al. [2021] N. Chen, Y. Zhang, H. Zen, R. J. Weiss, M. Norouzi, and W. Chan, “WaveGrad: Estimating Gradients for Waveform Generation,” in ICLR, 2021.
  • Chen et al. [2020a] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in ICML.   PMLR, 2020, pp. 1597–1607.
  • Chen et al. [2020b] T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. E. Hinton, “Big self-supervised models are strong semi-supervised learners,” in NeurIPS, 2020.
  • Chen et al. [2024a] X. Chen, Z. Liu, S. Xie, and K. He, “Deconstructing Denoising Diffusion Models for Self-Supervised Learning,” arXiv, 2024.
  • Cherti et al. [2023] M. Cherti, R. Beaumont, R. Wightman, M. Wortsman, G. Ilharco, C. Gordon, C. Schuhmann, L. Schmidt, and J. Jitsev, “Reproducible scaling laws for contrastive language-image learning,” in CVPR, 2023, pp. 2818–2829.
  • Croitoru et al. [2023] F. Croitoru, V. Hondru, R. Ionescu, and M. Shah, “Diffusion models in vision: A survey,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 45, no. 09, pp. 10 850–10 869, 2023.
  • Dao et al. [2023] Q. Dao, H. Phung, B. Nguyen, and A. Tran, “Flow matching in latent space,” arXiv, 2023.
  • Davtyan et al. [2023] A. Davtyan, S. Sameni, and P. Favaro, “Efficient video prediction via sparsely conditioned flow matching,” in ICCV, 2023, pp. 23 263–23 274.
  • Deja et al. [2023] K. Deja, T. Trzciński, and J. M. Tomczak, “Learning data representations with joint diffusion models,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases.   Springer, 2023, pp. 543–559.
  • Deng et al. [2009] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR, 2009.
  • Dhariwal and Nichol [2021b] P. Dhariwal and A. Nichol, “Diffusion models beat gans on image synthesis,” in NeurIPS, 2021.
  • Dhariwal and Nichol [2021a] ——, “Diffusion Models Beat GANs on Image Synthesis,” in NeurIPS, vol. 34, 2021.
  • Donahue and Simonyan [2019] J. Donahue and K. Simonyan, “Large scale adversarial representation learning,” in NeurIPS, 2019.
  • Dosovitskiy et al. [2021] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in ICLR, 2021.
  • Epstein et al. [2023] D. Epstein, A. Jabri, B. Poole, A. Efros, and A. Holynski, “Diffusion self-guidance for controllable image generation,” in NeurIPS, vol. 36, 2023, pp. 16 222–16 239.
  • Esser et al. [2021] P. Esser, R. Rombach, and B. Ommer, “Taming transformers for high-resolution image synthesis,” in CVPR, 2021, pp. 12 873–12 883.
  • Everingham et al. [2010] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International journal of computer vision, vol. 88, pp. 303–338, 2010.
  • Fei et al. [2024] Z. Fei, M. Fan, C. Yu, and J. Huang, “Scalable Diffusion Models with State Space Backbone,” arXiv, 2024.
  • Fischer et al. [2023] J. S. Fischer, M. Gui, P. Ma, N. Stracke, S. A. Baumann, and B. Ommer, “Boosting latent diffusion with flow matching,” arXiv, 2023.
  • Geng et al. [2024] Z. Geng, B. Yang, T. Hang, C. Li, S. Gu, T. Zhang, J. Bao, Z. Zhang, H. Li, H. Hu et al., “Instructdiffusion: A generalist modeling interface for vision tasks,” in CVPR, 2024, pp. 12 709–12 720.
  • Geyer et al. [2024] M. Geyer, O. Bar-Tal, S. Bagon, and T. Dekel, “Tokenflow: Consistent diffusion features for consistent video editing,” in ICLR, 2024.
  • Goodfellow et al. [2014] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in NeurIPS, vol. 27, 2014.
  • Graikos et al. [2022] A. Graikos, N. Malkin, N. Jojic, and D. Samaras, “Diffusion models as plug-and-play priors,” in NeurIPS, 2022.
  • Grill et al. [2020] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar, B. Piot, K. Kavukcuoglu, R. Munos, and M. Valko, “Bootstrap Your Own Latent - A New Approach to Self-Supervised Learning,” in NeurIPS, vol. 33, 2020, pp. 21 271–21 284.
  • Gu and Dao [2024] A. Gu and T. Dao, “Mamba: Linear-Time Sequence Modeling with Selective State Spaces,” arXiv, 2024.
  • Gui et al. [2024] M. Gui, J. S. Fischer, U. Prestel, P. Ma, D. Kotovenko, O. Grebenkova, S. A. Baumann, V. T. Hu, and B. Ommer, “Depthfm: Fast monocular depth estimation with flow matching,” arXiv, 2024.
  • Guo et al. [2024] J. Guo, X. Xu, Y. Pu, Z. Ni, C. Wang, M. Vasu, S. Song, G. Huang, and H. Shi, “Smooth diffusion: Crafting smooth latent spaces in diffusion models,” in CVPR, 2024.
  • Ham et al. [2017] B. Ham, M. Cho, C. Schmid, and J. Ponce, “Proposal flow: Semantic correspondences from object proposals,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 40, no. 7, pp. 1711–1725, 2017.
  • He et al. [2020] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in CVPR, 2020.
  • He et al. [2022] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in CVPR, 2022.
  • Hedlin et al. [2023] E. Hedlin, G. Sharma, S. Mahajan, H. Isack, A. Kar, A. Tagliasacchi, and K. M. Yi, “Unsupervised semantic correspondence using stable diffusion,” in NeurIPS, 2023.
  • Heek et al. [2024] J. Heek, E. Hoogeboom, and T. Salimans, “Multistep consistency models,” arXiv, 2024.
  • Helber et al. [2019] P. Helber, B. Bischke, A. Dengel, and D. Borth, “Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 12, no. 7, pp. 2217–2226, 2019.
  • Hertz et al. [2022] A. Hertz, R. Mokady, J. Tenenbaum, K. Aberman, Y. Pritch, and D. Cohen-Or, “Prompt-to-Prompt Image Editing with Cross Attention Control,” arXiv, 2022.
  • Hinton et al. [2015] G. Hinton, O. Vinyals, and J. Dean, “Distilling the Knowledge in a Neural Network,” arXiv, 2015.
  • Ho and Salimans [2021] J. Ho and T. Salimans, “Classifier-free diffusion guidance,” in NeurIPS Workshop, 2021.
  • Ho et al. [2020] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in NeurIPS, 2020.
  • Hong et al. [2023] S. Hong, G. Lee, W. Jang, and S. Kim, “Improving Sample Quality of Diffusion Models Using Self-Attention Guidance,” in ICCV, 2023.
  • Hoogeboom et al. [2021] E. Hoogeboom, D. Nielsen, P. Jaini, P. Forré, and M. Welling, “Argmax flows and multinomial diffusion: Learning categorical distributions,” in NeurIPS, 2021.
  • Hoogeboom et al. [2022] E. Hoogeboom, V. G. Satorras, C. Vignac, and M. Welling, “Equivariant Diffusion for Molecule Generation in 3D,” in Proceedings of the 39th International Conference on Machine Learning.   PMLR, 2022, pp. 8867–8887.
  • Hoogeboom et al. [2023] E. Hoogeboom, J. Heek, and T. Salimans, “simple diffusion: End-to-end diffusion for high resolution images,” in Proceedings of the 40th International Conference on Machine Learning.   PMLR, 2023, pp. 13 213–13 232.
  • Hu et al. [2023a] V. T. Hu, Y. Chen, M. Caron, Y. M. Asano, C. G. M. Snoek, and B. Ommer, “Guided Diffusion from Self-Supervised Diffusion Features,” arXiv, 2023.
  • Hu et al. [2023c] V. T. Hu, W. Yin, P. Ma, Y. Chen, B. Fernando, Y. M. Asano, E. Gavves, P. Mettes, B. Ommer, and C. G. M. Snoek, “Motion flow matching for human motion synthesis and editing,” arXiv, 2023.
  • Hu et al. [2023b] V. T. Hu, D. W. Zhang, Y. M. Asano, G. J. Burghouts, and C. G. Snoek, “Self-guided diffusion models,” in CVPR, 2023, pp. 18 413–18 422.
  • Hu et al. [2024b] V. T. Hu, S. A. Baumann, M. Gui, O. Grebenkova, P. Ma, J. Fischer, and B. Ommer, “ZigMa: A DiT-style Zigzag Mamba Diffusion Model,” arXiv, 2024.
  • Hu et al. [2024a] V. T. Hu, D. Wu, Y. M. Asano, P. Mettes, B. Fernando, B. Ommer, and C. G. M. Snoek, “Flow matching for conditional text generation in a few sampling steps,” in EACL, 2024.
  • Hu et al. [2024c] V. T. Hu, W. Zhang, M. Tang, P. Mettes, D. Zhao, and C. Snoek, “Latent space editing in transformer-based flow matching,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 3, 2024, pp. 2247–2255.
  • Huang et al. [2021] C.-W. Huang, J. H. Lim, and A. Courville, “A variational perspective on diffusion-based generative models and score matching,” in NeurIPS, 2021.
  • Huang et al. [2023] R. Huang, J. Huang, D. Yang, Y. Ren, L. Liu, M. Li, Z. Ye, J. Liu, X. Yin, and Z. Zhao, “Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models,” in Proceedings of the 40th International Conference on Machine Learning.   PMLR, 2023, pp. 13 916–13 932.
  • Huang et al. [2024] Y. Huang, J. Huang, Y. Liu, M. Yan, J. Lv, J. Liu, W. Xiong, H. Zhang, S. Chen, and L. Cao, “Diffusion Model-Based Image Editing: A Survey,” arXiv, 2024.
  • Hudson et al. [2024] D. A. Hudson, D. Zoran, M. Malinowski, A. K. Lampinen, A. Jaegle, J. L. McClelland, L. Matthey, F. Hill, and A. Lerchner, “Soda: Bottleneck diffusion models for representation learning,” in CVPR, 2024.
  • Itô [1950] K. Itô, “Stochastic differential equations in a differentiable manifold,” Nagoya Mathematical Journal, vol. 1, pp. 35–47, 1950.
  • Karras et al. [2019] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in CVPR, 2019, pp. 4401–4410.
  • Karras et al. [2022] T. Karras, M. Aittala, T. Aila, and S. Laine, “Elucidating the design space of diffusion-based generative models,” in NeurIPS, 2022.
  • Kim et al. [2024] G. Kim, W. Jang, G. Lee, S. Hong, J. Seo, and S. Kim, “Depth-aware guidance with self-estimated depth representations of diffusion models,” Pattern Recognition, vol. 153, p. 110474, 2024.
  • Kingma et al. [2021] D. Kingma, T. Salimans, B. Poole, and J. Ho, “Variational Diffusion Models,” in NeurIPS, vol. 34, 2021.
  • Kingma and Welling [2014] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in ICLR, 2014.
  • Kingma and Welling [2019] ——, “An Introduction to Variational Autoencoders,” Foundations and Trends® in Machine Learning, vol. 12, no. 4, pp. 307–392, 2019.
  • Kirillov et al. [2019a] A. Kirillov, R. Girshick, K. He, and P. Dollár, “Panoptic feature pyramid networks,” in CVPR, 2019, pp. 6399–6408.
  • Kirillov et al. [2019b] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár, “Panoptic segmentation,” in CVPR, 2019, pp. 9404–9413.
  • Kong et al. [2021] Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro, “DiffWave: A Versatile Diffusion Model for Audio Synthesis,” arXiv, 2021.
  • Kwon et al. [2023] M. Kwon, J. Jeong, and Y. Uh, “Diffusion models already have a semantic latent space,” in ICLR, 2023.
  • Le et al. [2023] M. Le, A. Vyas, B. Shi, B. Karrer, L. Sari, R. Moritz, M. Williamson, V. Manohar, Y. Adi, J. Mahadeokar et al., “Voicebox: Text-guided multilingual universal speech generation at scale,” arXiv, 2023.
  • Li et al. [2023a] A. C. Li, M. Prabhudesai, S. Duggal, E. Brown, and D. Pathak, “Your diffusion model is secretly a zero-shot classifier,” in ICCV, 2023.
  • Li et al. [2023b] D. Li, H. Ling, A. Kar, D. Acuna, S. W. Kim, K. Kreis, A. Torralba, and S. Fidler, “Dreamteacher: Pretraining image backbones with deep generative models,” in ICCV, 2023, pp. 16 698–16 708.
  • Li et al. [2024d] K. Li, X. Li, Y. Wang, Y. He, Y. Wang, L. Wang, and Y. Qiao, “VideoMamba: State Space Model for Efficient Video Understanding,” arXiv, 2024.
  • Li et al. [2024c] S. Li, C. Chen, and H. Lu, “MoEController: Instruction-based Arbitrary Image Manipulation with Mixture-of-Expert Controllers,” arXiv, 2024.
  • Li et al. [2023c] T. Li, H. Chang, S. Mishra, H. Zhang, D. Katabi, and D. Krishnan, “Mage: Masked generative encoder to unify representation learning and image synthesis,” in CVPR, 2023, pp. 2142–2152.
  • Li et al. [2024a] T. Li, D. Katabi, and K. He, “Return of Unconditional Generation: A Self-supervised Representation Generation Method,” arXiv, 2024.
  • Li et al. [2022] X. L. Li, J. Thickstun, I. Gulrajani, P. Liang, and T. B. Hashimoto, “Diffusion-LM Improves Controllable Text Generation,” arXiv, 2022.
  • Li et al. [2023d] X. Li, K. Han, X. Wan, and V. A. Prisacariu, “SimSC: A Simple Framework for Semantic Correspondence with Temperature Learning,” arXiv, 2023.
  • Li et al. [2024b] X. Li, J. Lu, K. Han, and V. A. Prisacariu, “Sd4match: Learning to prompt stable diffusion model for semantic matching,” in CVPR, 2024, pp. 27 558–27 568.
  • Lin et al. [2016] D. Lin, J. Dai, J. Jia, K. He, and J. Sun, “Scribblesup: Scribble-supervised convolutional networks for semantic segmentation,” in CVPR, 2016, pp. 3159–3167.
  • Lin and Yang [2024] S. Lin and X. Yang, “Diffusion Model with Perceptual Loss,” arXiv, 2024.
  • Lin et al. [2014] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in ECCV, 2014.
  • Lipman et al. [2023] Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generative modeling,” in ICLR, 2023.
  • Liu et al. [2024] H. Liu, M. Zaharia, and P. Abbeel, “Ringattention with blockwise transformers for near-infinite context,” in ICLR, 2024.
  • Liu et al. [2022] J. Liu, C. Li, Y. Ren, F. Chen, and Z. Zhao, “DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism,” in AAAI, vol. 36, no. 10, 2022, pp. 11 020–11 028.
  • Liu et al. [2020] S. Liu, T. Wang, D. Bau, J.-Y. Zhu, and A. Torralba, “Diverse Image Generation via Self-Conditioned GANs,” in CVPR, 2020, pp. 14 274–14 283.
  • Liu et al. [2023] X. Liu, C. Gong, and Q. Liu, “Flow straight and fast: Learning to generate and transfer data with rectified flow,” in ICLR, 2023.
  • Liu et al. [2021] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows,” in ICCV, 2021, pp. 9992–10 002.
  • Long et al. [2015] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in CVPR, 2015, pp. 3431–3440.
  • Luccioni et al. [2023] A. S. Luccioni, Y. Jernite, and E. Strubell, “Power hungry processing: Watts driving the cost of ai deployment?” arXiv, 2023.
  • Luo [2022] C. Luo, “Understanding Diffusion Models: A Unified Perspective,” arXiv, 2022.
  • Luo et al. [2023] G. Luo, L. Dunlap, D. H. Park, A. Holynski, and T. Darrell, “Diffusion hyperfeatures: Searching through time and space for semantic correspondence,” in NeurIPS, 2023.
  • Luo et al. [2024a] G. Luo, T. Darrell, O. Wang, D. B. Goldman, and A. Holynski, “Readout Guidance: Learning Control from Diffusion Features,” in CVPR, 2024.
  • Luo et al. [2024b] W. Luo, T. Hu, S. Zhang, J. Sun, Z. Li, and Z. Zhang, “Diff-instruct: A universal approach for transferring knowledge from pre-trained diffusion models,” in NeurIPS, vol. 36, 2024.
  • Mardani et al. [2024] M. Mardani, J. Song, J. Kautz, and A. Vahdat, “A variational perspective on solving inverse problems with diffusion models,” in ICLR, 2024.
  • Mariotti et al. [2024] O. Mariotti, O. Mac Aodha, and H. Bilen, “Improving semantic correspondence with viewpoint-guided spherical maps,” in CVPR, 2024, pp. 19 521–19 530.
  • Mehta et al. [2023] H. Mehta, A. Gupta, A. Cutkosky, and B. Neyshabur, “Long range language modeling via gated state spaces,” in ICLR, 2023.
  • Min et al. [2019] J. Min, J. Lee, J. Ponce, and M. Cho, “SPair-71k: A Large-scale Benchmark for Semantic Correspondence,” arXiv, 2019.
  • Mirza and Osindero [2014] M. Mirza and S. Osindero, “Conditional generative adversarial nets,” arXiv, 2014.
  • Mittal et al. [2023] S. Mittal, K. Abstreiter, S. Bauer, B. Schölkopf, and A. Mehrjou, “Diffusion based representation learning,” in ICML.   PMLR, 2023.
  • Mukhopadhyay et al. [2023a] S. Mukhopadhyay, M. Gwilliam, V. Agarwal, N. Padmanabhan, A. Swaminathan, S. Hegde, T. Zhou, and A. Shrivastava, “Diffusion Models Beat GANs on Image Classification,” arXiv, 2023.
  • Mukhopadhyay et al. [2023b] S. Mukhopadhyay, M. Gwilliam, Y. Yamaguchi, V. Agarwal, N. Padmanabhan, A. Swaminathan, T. Zhou, and A. Shrivastava, “Do text-free diffusion models learn discriminative visual representations?” arXiv, 2023.
  • Nguyen et al. [2022] E. Nguyen, K. Goel, A. Gu, G. Downs, P. Shah, T. Dao, S. Baccus, and C. Ré, “S4ND: Modeling images and videos as multidimensional signals with state spaces,” in NeurIPS, 2022.
  • Oquab et al. [2024] M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y. Huang, S.-W. Li, I. Misra, M. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski, “DINOv2: Learning robust visual features without supervision,” Transactions on Machine Learning Research, 2024.
  • Pan et al. [2021] Z. Pan, P. Jiang, Y. Wang, C. Tu, and A. G. Cohn, “Scribble-supervised semantic segmentation by uncertainty reduction on neural representation and self-supervision on neural eigenspace,” in ICCV, 2021, pp. 7416–7425.
  • Pan et al. [2024] Z. Pan, J. Chen, and Y. Shi, “Masked Diffusion as Self-supervised Representation Learner,” arXiv, 2024.
  • Pandey et al. [2024] K. Pandey, P. Guerrero, M. Gadelha, Y. Hold-Geoffroy, K. Singh, and N. J. Mitra, “Diffusion handles enabling 3d edits for diffusion models by lifting activations to 3d,” in CVPR, 2024.
  • Peebles and Xie [2023] W. Peebles and S. Xie, “Scalable diffusion models with transformers,” in ICCV, 2023.
  • Po et al. [2023] R. Po, W. Yifan, V. Golyanik, K. Aberman, J. T. Barron, A. H. Bermano, E. R. Chan, T. Dekel, A. Holynski, A. Kanazawa, C. K. Liu, L. Liu, B. Mildenhall, M. Nießner, B. Ommer, C. Theobalt, P. Wonka, and G. Wetzstein, “State of the Art on Diffusion Models for Visual Computing,” arXiv, 2023.
  • Preechakul et al. [2022] K. Preechakul, N. Chatthee, S. Wizadwongsa, and S. Suwajanakorn, “Diffusion autoencoders: Toward a meaningful and decodable representation,” in CVPR, 2022.
  • Prince [2023] S. J. Prince, Understanding Deep Learning.   The MIT Press, 2023.
  • Radford et al. [2021] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021.
  • Rezende et al. [2014] D. J. Rezende, S. Mohamed, and D. Wierstra, “Stochastic backpropagation and approximate inference in deep generative models,” in ICML, 2014.
  • Rombach et al. [2022] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in CVPR, 2022.
  • Romero et al. [2015] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “FitNets: Hints for Thin Deep Nets,” arXiv, 2015.
  • Ronneberger et al. [2015] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in MICCAI.   Springer, 2015, pp. 234–241.
  • Saharia et al. [2022a] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans, J. Ho, D. J. Fleet, and M. Norouzi, “Photorealistic text-to-image diffusion models with deep language understanding,” in NeurIPS, vol. 35, 2022, pp. 36 479–36 494.
  • Saharia et al. [2022b] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans et al., “Photorealistic text-to-image diffusion models with deep language understanding,” in NeurIPS, 2022.
  • Salimans and Ho [2022] T. Salimans and J. Ho, “Progressive distillation for fast sampling of diffusion models,” in ICLR, 2022.
  • Salimans et al. [2017] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma, “Pixelcnn++: A pixelcnn implementation with discretized logistic mixture likelihood and other modifications,” in ICLR, 2017.
  • Salimans et al. [2024] T. Salimans, T. Mensink, J. Heek, and E. Hoogeboom, “Multistep distillation of diffusion models via moment matching,” arXiv, 2024.
  • Samuel et al. [2024] D. Samuel, R. Ben-Ari, S. Raviv, N. Darshan, and G. Chechik, “Generating images of rare concepts using pre-trained diffusion models,” in AAAI, vol. 38, no. 5, 2024, pp. 4695–4703.
  • Sauer et al. [2024] A. Sauer, F. Boesel, T. Dockhorn, A. Blattmann, P. Esser, and R. Rombach, “Fast high-resolution image synthesis with latent adversarial diffusion distillation,” arXiv, 2024.
  • Schnell et al. [2024] J. Schnell, J. Wang, L. Qi, V. T. Hu, and M. Tang, “ScribbleGen: Generative Data Augmentation Improves Scribble-supervised Semantic Segmentation,” arXiv, 2024.
  • Sheynin et al. [2023] S. Sheynin, O. Ashual, A. Polyak, U. Singer, O. Gafni, E. Nachmani, and Y. Taigman, “kNN-diffusion: Image generation via large-scale retrieval,” in ICLR, 2023.
  • Shipard et al. [2023] J. Shipard, A. Wiliem, K. N. Thanh, W. Xiang, and C. Fookes, “Diversity is definitely needed: Improving model-agnostic zero-shot classification via stable diffusion,” in CVPR, 2023, pp. 769–778.
  • Sohl-Dickstein et al. [2015] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” in ICML, 2015.
  • Song et al. [2021b] J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” in ICLR, 2021.
  • Song and Ermon [2019] Y. Song and S. Ermon, “Generative Modeling by Estimating Gradients of the Data Distribution,” in NeurIPS, vol. 32, 2019.
  • Song et al. [2021a] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” in ICLR, 2021.
  • Song et al. [2023] Y. Song, P. Dhariwal, M. Chen, and I. Sutskever, “Consistency models,” arXiv, 2023.
  • Song et al. [2024] Y. Song, A. Keller, N. Sebe, and M. Welling, “Flow factorized representation learning,” in NeurIPS, vol. 36, 2024.
  • Tang et al. [2023] L. Tang, M. Jia, Q. Wang, C. P. Phoo, and B. Hariharan, “Emergent correspondence from image diffusion,” in NeurIPS, vol. 36, 2023.
  • Tian et al. [2024b] C. Tian, C. Tao, J. Dai, H. Li, Z. Li, L. Lu, X. Wang, H. Li, G. Huang, and X. Zhu, “ADDP: Learning general representations for image recognition and generation with alternating denoising diffusion process,” in ICLR, 2024.
  • Tian et al. [2024a] K. Tian, Y. Jiang, Z. Yuan, B. Peng, and L. Wang, “Visual autoregressive modeling: Scalable image generation via next-scale prediction,” arXiv, 2024.
  • Tumanyan et al. [2023] N. Tumanyan, M. Geyer, S. Bagon, and T. Dekel, “Plug-and-play diffusion features for text-driven image-to-image translation,” in CVPR, 2023, pp. 1921–1930.
  • Vincent [2011] P. Vincent, “A connection between score matching and denoising autoencoders,” Neural computation, 2011.
  • Vincent et al. [2008] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in ICML, 2008, pp. 1096–1103.
  • Voynov and Babenko [2020] A. Voynov and A. Babenko, “Unsupervised discovery of interpretable directions in the gan latent space,” in ICML.   PMLR, 2020, pp. 9786–9796.
  • Wallace et al. [2023] B. Wallace, A. Gokul, S. Ermon, and N. Naik, “End-to-end diffusion latent optimization improves classifier guidance,” in ICCV, 2023, pp. 7246–7256.
  • Wang et al. [2023] Y. Wang, Y. Schiff, A. Gokaslan, W. Pan, F. Wang, C. De Sa, and V. Kuleshov, “Infodiffusion: Representation learning using information maximizing diffusion models,” in ICML.   PMLR, 2023.
  • Wei et al. [2023] C. Wei, K. Mangalam, P.-Y. Huang, Y. Li, H. Fan, H. Xu, H. Wang, C. Xie, A. Yuille, and C. Feichtenhofer, “Diffusion models as masked autoencoders,” in ICCV, 2023.
  • Wu et al. [2023] W. Wu, Y. Zhao, M. Z. Shou, H. Zhou, and C. Shen, “Diffumask: Synthesizing images with pixel-level annotations for semantic segmentation using diffusion models,” in ICCV, 2023, pp. 1206–1217.
  • Wu et al. [2022] Y. Wu, M. N. Rabe, D. Hutchins, and C. Szegedy, “Memorizing transformers,” in ICLR, 2022.
  • Xiang et al. [2023] W. Xiang, H. Yang, D. Huang, and Y. Wang, “Denoising diffusion autoencoders are unified self-supervised learners,” in ICCV, 2023.
  • Xu et al. [2023] J. Xu, S. Liu, A. Vahdat, W. Byeon, X. Wang, and S. De Mello, “Open-vocabulary panoptic segmentation with text-to-image diffusion models,” in CVPR, 2023, pp. 2955–2966.
  • Yan et al. [2024] J. N. Yan, J. Gu, and A. M. Rush, “Diffusion models without attention,” in CVPR, 2024, pp. 8239–8249.
  • Yang et al. [2023] L. Yang, Z. Zhang, Y. Song, S. Hong, R. Xu, Y. Zhao, W. Zhang, B. Cui, and M.-H. Yang, “Diffusion models: A comprehensive survey of methods and applications,” ACM Computing Surveys, vol. 56, no. 4, pp. 1–39, 2023.
  • Yang and Wang [2023] X. Yang and X. Wang, “Diffusion model as representation learner,” in ICCV, 2023.
  • Yang et al. [2022] X. Yang, S.-M. Shih, Y. Fu, X. Zhao, and S. Ji, “Your ViT is Secretly a Hybrid Discriminative-Generative Diffusion Model,” arXiv, 2022.
  • Yatim et al. [2024] D. Yatim, R. Fridman, O. Bar-Tal, Y. Kasten, and T. Dekel, “Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer,” in CVPR, 2024.
  • You et al. [2023] Z. You, Y. Zhong, F. Bao, J. Sun, C. Li, and J. Zhu, “Diffusion models and semi-supervised learners benefit mutually with few labels,” in NeurIPS, 2023.
  • Yu et al. [2015] F. Yu, Y. Zhang, S. Song, A. Seff, and J. Xiao, “Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop,” arXiv, 2015.
  • Yu et al. [2020] F. Yu, H. Chen, X. Wang, W. Xian, Y. Chen, F. Liu, V. Madhavan, and T. Darrell, “Bdd100k: A diverse driving dataset for heterogeneous multitask learning,” in CVPR, 2020, pp. 2636–2645.
  • Yu et al. [2023] J. Yu, Y. Wang, C. Zhao, B. Ghanem, and J. Zhang, “Freedom: Training-free energy-guided conditional diffusion model,” in ICCV, 2023.
  • Yue et al. [2024] Z. Yue, J. Wang, Q. Sun, L. Ji, E. I. Chang, and H. Zhang, “Exploring diffusion time-steps for unsupervised representation learning,” in ICLR, 2024.
  • Zagoruyko and Komodakis [2017] S. Zagoruyko and N. Komodakis, “Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer,” in ICLR, 2017.
  • Zagoruyko and Komodakis [2016] ——, “Wide residual networks,” in BMVC, 2016.
  • Zhang et al. [2023a] J. Zhang, C. Herrmann, J. Hur, L. P. Cabrera, V. Jampani, D. Sun, and M.-H. Yang, “A tale of two features: Stable diffusion complements DINO for zero-shot semantic correspondence,” in NeurIPS, 2023.
  • Zhang et al. [2024] J. Zhang, C. Herrmann, J. Hur, E. Chen, V. Jampani, D. Sun, and M.-H. Yang, “Telling left from right: Identifying geometry-aware semantic correspondence,” in CVPR, 2024, pp. 3076–3085.
  • Zhang et al. [2023b] L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in ICCV, 2023, pp. 3836–3847.
  • Zhang et al. [2022] Z. Zhang, Z. Zhao, and Z. Lin, “Unsupervised representation learning from pre-trained diffusion probabilistic models,” in NeurIPS, 2022.
  • Zhao et al. [2023] W. Zhao, Y. Rao, Z. Liu, B. Liu, J. Zhou, and J. Lu, “Unleashing text-to-image diffusion models for visual perception,” in ICCV, 2023, pp. 5729–5739.
  • Zheng et al. [2023] K. Zheng, C. Lu, J. Chen, and J. Zhu, “Improved techniques for maximum likelihood estimation for diffusion ODEs,” in ICML. PMLR, 2023.
  • Zhou et al. [2019] B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, and A. Torralba, “Semantic understanding of scenes through the ADE20K dataset,” IJCV, vol. 127, pp. 302–321, 2019.
  • Zimmermann et al. [2021] R. S. Zimmermann, L. Schott, Y. Song, B. A. Dunn, and D. A. Klindt, “Score-based generative classifiers,” in NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
Michael Fuest is a research intern in the Computer Vision & Learning Group at Ludwig-Maximilians-Universität München (LMU). He recently received his master’s degree in Management & Technology with a major in Computer Science from the Technical University of Munich, and is currently a visiting researcher at the MIT Laboratory for Information & Decision Systems.
**chuan Ma is a Ph.D. student in the Computer Vision & Learning Group at Ludwig Maximilian University of Munich (LMU) and a Munich Center for Machine Learning (MCML) member. He previously received his master’s degree in Applied Computer Science from Heidelberg University, where he developed an interest in deep metric learning and style transfer. He served as a reviewer for CVPR 2024 and NeurIPS 2024. His current research focuses on leveraging generative models for tasks beyond generation and exploring multi-modality representation learning.
Ming Gui is currently a PhD researcher at the Computer Vision & Learning Group at Ludwig Maximilian University of Munich (LMU). He received his bachelor’s and master’s degrees in electrical and computational engineering from the Technical University of Munich, where he developed an interest in deep learning and computer vision. His research is presently centered on the development and enhancement of scalable generative models.
Johannes S. Fischer is a PhD student at the University of Munich’s Computer Vision & Learning Group. He did his undergraduate studies in Psychology and Computer Science, both at the University of Munich, followed by a Master’s degree in Intelligent Interactive Systems from Pompeu Fabra University in Barcelona. Besides his work in the field of generative modeling, Johannes’ research focuses on the adaptability of Diffusion and Flow-based models for general image understanding.
Tao Hu is a postdoctoral research fellow at Ludwig Maximilian University of Munich (LMU), where he is developing the next generation of Stable Diffusion models. He obtained his Ph.D. degree in computer science from the University of Amsterdam in 2023 under the supervision of Cees Snoek. He was an intern at Megvii and Amazon AWS in 2017 and 2020. His Ph.D. research was selected for the CVPR 2023 and ICCV 2023 Doctoral Consortia. He co-organized the ECCV 2024 Workshop on Audio-Visual Generative Learning. He has also served as an Area Chair for the CVPR AI4CC workshop and the CVPR 2024 Efficient Large Vision Model workshop. He has been a reviewer for top-tier computer vision and machine learning conferences for several years. His current research interests lie in scalable, flexible, and efficient generative models.
Björn Ommer is a professor at LMU and leads the Computer Vision & Learning Group. He was previously with Heidelberg University’s Department of Mathematics and Computer Science, IWR, and HCI. He studied computer science and physics at the University of Bonn, completed his Ph.D. at ETH Zurich where his dissertation received the ETH Medal, and held a post-doc position with Jitendra Malik at UC Berkeley. He is a member of the Bavarian AI Council, an editor for IEEE T-PAMI, an ELLIS Fellow, faculty of ELLIS unit Munich, a PI at the Munich Center for Machine Learning (MCML), and has held various roles at numerous CVPR, ICCV, ECCV, and NeurIPS conferences. He delivered the opening keynote at NeurIPS’23.