Autoregressive Image Diffusion: Generation of Image Sequence and Application in MRI

Guanxiong Luo Institute for Diagnostic and Interventional Radiology, University Medical Center Göttingen, Germany Shou** Huang Shenzhen Technology University, Shenzhen, China Martin Uecker Institute for Diagnostic and Interventional Radiology, University Medical Center Göttingen, Germany Institute of Biomedical Imaging, Graz University of Technology, Graz, Austria

Abstract

Magnetic resonance imaging (MRI) is a widely used non-invasive imaging modality. However, a persistent challenge lies in balancing image quality with imaging speed. This trade-off is primarily constrained by k-space measurements, which traverse specific trajectories in the spatial Fourier domain (k-space). These measurements are often undersampled to shorten acquisition times, resulting in image artifacts and compromised quality. Generative models learn image distributions and can be used to reconstruct high-quality images from undersampled k-space data. In this work, we present the autoregressive image diffusion (AID) model for image sequences and use it to sample the posterior for accelerated MRI reconstruction. The algorithm incorporates both undersampled k-space data and pre-existing information. Models trained on the fastMRI dataset are evaluated comprehensively. The results show that the AID model can robustly generate sequentially coherent image sequences. In 3D and dynamic MRI, the AID model can outperform the standard diffusion model and reduce hallucinations, owing to the learned inter-image dependencies.

1 Introduction

Magnetic resonance imaging (MRI) is a non-invasive imaging modality widely used in clinical practice to visualize soft tissue. Despite its utility, a persistent challenge in MRI is the trade-off between image quality and imaging speed. The trade-off is influenced by the k-space (spatial Fourier domain) measurements, which traverse spatial frequency data points along given sampling trajectories. To reduce acquisition time, the k-space measurements are often undersampled, resulting in image artifacts and reduced image quality.

In recent years, deep learning-based methods have emerged to improve image reconstruction in MRI. These methods are formulated as an inverse problem building upon compressed sensing techniques [1, 2] and benefit from the learned prior information instead of hand-crafted priors [3, 4, 5]. Another successful approach involves learning an image prior parameterized by a generative neural network [6, 7], which is used as regularization on the image. Generative priors offer flexibility in handling changes in the forward model and perform well in reconstructing high-quality images from undersampled data.

Diffusion models [8, 9, 10], a class of generative models, have gained attention in recent years and are making an impact in many fields, including MRI reconstruction [11, 12]. These models learn to reverse a diffusion process that transforms random noise into structured images, producing high-quality, detailed images. Various approaches, including denoising diffusion probabilistic models (DDPMs) [10], denoising score matching [9], and continuous formulations based on stochastic differential equations (SDEs) [13], have been proposed for deriving diffusion models.

Recent studies demonstrate the effectiveness of diffusion models in accelerated MRI and their flexibility in handling various sampling patterns [11, 14, 15, 16, 17]. For example, training score-based generative models using Langevin dynamics yields competitive reconstruction results for both in-distribution and out-of-distribution data [11]. Additionally, score-based diffusion models trained solely on magnitude images can reconstruct complex-valued data [15]. Comprehensive approaches using data-driven Markov chains facilitate efficient MRI reconstruction across variable sampling schemes and enable the generation of uncertainty maps [16].

Autoregressive models are statistical models that predict the current value of a variable based on its past values, capturing temporal dependencies and patterns within the data. They are widely used in various fields such as time series analysis, signal processing, and sequence modeling. In natural language processing, autoregressive models like the generative pre-trained transformer (GPT) [18, 19] predict each token in a sequence based on previously generated tokens, enabling the generation of coherent and contextually relevant text. Similarly, in image modeling, autoregressive models like PixelCNN [20] and ImageGPT [21] generate images by predicting each pixel value based on previously generated pixel values, often in a left-to-right, top-to-bottom order. Instead of directly modeling pixels, which can be computationally expensive for high-resolution images, the study [22] proposes to first compress the image into a smaller representation using a vector quantized variational autoencoder (VQVAE). This VQVAE learns a codebook of visually meaningful image components. Then, a transformer is applied to model the autoregressive relationship between these components, effectively capturing the global structure of the image. By predicting each image component based on previous ones, the model generates high-resolution images in a sequential manner, maintaining consistency and coherence across the entire image.

In clinical MRI practice, volumetric image sequences are often acquired to monitor disease progression and treatment response. Modeling these image sequences and generating realistic sequences is a challenging problem. Autoregressive models can be employed to model the joint distribution of image sequences and extract the dependencies between images. The diffusion process is effective in modelling images while treating each image independently. Therefore, we aim to combine these two models and propose the autoregressive image diffusion (AID) model to generate sequences of images.

The contributions of this work are as follows. We show how to derive the autoregressive image diffusion training loss starting from a common diffusion loss and how to optimize the loss in parallel for efficient training. We present an algorithm to sample the posterior for accelerated MRI reconstruction when using AID, which facilitates the incorporation of pre-existing information. We performed experiments to evaluate its ability to generate images when different amounts of initial information are given and to validate its effectiveness in MRI reconstruction. The results show that the AID model can stably generate highly coherent image sequences even without any pre-existing information. When used as a generative prior in MRI reconstruction, the AID model outperforms the standard diffusion model and reduces the hallucinations in the reconstructed images, benefiting from the learned prior knowledge about the relationship between images and from pre-existing information.

Figure 1: The interaction between the images in the conditioning sequence occurs in the DiTBlock, which has a causal attention module to ensure that $x_{n}$ is conditioned on the previous images $x_{<n}$. During training, the net predicts, in parallel, the noise for each noisy image that is sampled from the target sequence given the conditioning sequence. During generation, the net iteratively refines the noisy input to produce a clean image, which is then appended to the conditioning sequence.

2 Methods

2.1 Autoregressive image diffusion

Given a dataset $X$ consisting of multiple sequences of images, each sequence represented as $\mathbf{x}=\{x_{1},x_{2},\ldots,x_{N}\}$ , our goal is to model the joint distribution of these images. This joint distribution is autoregressively factorized into the product of conditional probabilities:

p(\mathbf{x})=q(x_{1}|x_{0})\prod_{n=2}^{N}q(x_{n}|x_{<n}),

(1)

where $x_{<n}=\{x_{1},x_{2},\ldots,x_{n-1}\}$ and the image $x_{0}$ is known. The model parameterized by $\theta$ is trained by minimizing the negative log-likelihood of the data:

\mathcal{L}_{AID}=\mathbb{E}_{X}\left[-\log p_{\theta}(\mathbf{x})\right]=\mathbb{E}_{X}\left[-\log p_{\theta}(x_{1}|x_{0})-\sum_{n=2}^{N}\log p_{\theta}(x_{n}|x_{<n})\right].

(2)

Sohl-Dickstein et al. [8] and Ho et al. [10] introduced the denoising diffusion probabilistic model (DDPM). This model gradually adds Gaussian noise to an observed data point $x^{0}$ using known scales $\beta_{t}$, generating a series of progressively noisier values $x^{1},x^{2},\ldots,x^{T}$. The final noisy output $x^{T}$ follows a Gaussian distribution with zero mean and identity covariance matrix $I$, containing no information about the original data point. The series of positive noise scales $\beta_{1},\ldots,\beta_{T}$ must be increasing, ensuring that the first noisy output $x^{1}$ closely resembles the original data $x^{0}$, while the final value $x^{T}$ represents pure noise. We apply this process to the conditional probability $q(x_{n}|x_{<n})$ in Equation 2 by adding the noise to the image independently of its position in the sequence, i.e., $x_{n}^{t}$ and $x_{<n}^{0}$ are conditionally independent given $x_{n}^{t-1}$. Then the transition from $x_{n}^{t-1}$ to $x_{n}^{t}$ is defined as:

q(x_{n}^{t}|x_{n}^{t-1},x_{<n}^{0})=q(x_{n}^{t}|x_{n}^{t-1})=\mathcal{N}(x_{n}^{t};\sqrt{1-\beta_{t}}x_{n}^{t-1},\beta_{t}\mathbf{I}).

(3)

Here, $x_{n}^{t}$ represents the image $x_{n}$ at time $t$ , $x_{n}^{t-1}$ is the image at the previous time step, and $x_{<n}^{0}$ denotes all images preceding $x_{n}$ at the initial time step. The parameter $\beta_{t}$ controls the drift and diffusion of this process. The objective is to learn to reverse this process. The reverse process is defined as:

p_{\theta}(x_{n}^{t-1}|x_{n}^{t},x_{<n}^{0})=\mathcal{N}(x_{n}^{t-1};\mu_{\theta}(x_{n}^{t},x_{<n}^{0},t),\Sigma_{\theta}(x_{n}^{t},x_{<n}^{0},t)),

(4)

where $\mu_{\theta}$ and $\Sigma_{\theta}$ are parameterized by a neural network $\theta$ , taking $x_{n}^{t}$ , $x_{<n}^{0}$ , and $t$ as inputs. Using the variational lower bound, the reverse process can be learned by minimizing the negative log-likelihood of the data:

\mathbb{E}[-\log p_{\theta}(x_{n}|x_{<n}^{0})]\leq\mathbb{E}\left[-\log p(x_{n}^{T})-\sum_{t\geq 1}\log\frac{p_{\theta}(x_{n}^{t-1}|x_{n}^{t},x_{<n}^{0})}{q(x_{n}^{t}|x_{n}^{t-1},x_{<n}^{0})}\right]:=L_{D_{n}}.

(5)

Given the initial image $x_{n}^{0}$ and that $x_{n}^{t}$ and $x_{<n}^{0}$ are conditionally independent given $x_{n}^{0}$ , $x_{n}^{t}$ at an arbitrary time step $t$ is sampled from a Gaussian distribution:

q(x_{n}^{t}|x_{n}^{0},x_{<n}^{0})=\mathcal{N}(x_{n}^{t};\sqrt{\bar{\alpha}_{t}}x_{n}^{0},(1-\bar{\alpha}_{t})\mathbf{I}),

(6)

using $\alpha_{t}=1-\beta_{t}$ and $\bar{\alpha}_{t}=\prod_{s=1}^{t}\alpha_{s}$. The posterior distribution of $x_{n}^{t-1}$ given $x_{n}^{0}$ and $x_{n}^{t}$ is then calculated as:

q(x_{n}^{t-1}|x_{n}^{t},x_{n}^{0},x_{<n}^{0})=\mathcal{N}(x_{n}^{t-1};\tilde{\mu}_{t}(x_{n}^{t},x_{n}^{0}),\tilde{\beta}_{t}\mathbf{I}),

(7)

where $\tilde{\mu}_{t}(x_{n}^{t},x_{n}^{0}):=\frac{\sqrt{\bar{\alpha}_{t-1}}\beta_{t}}{1-\bar{\alpha}_{t}}x_{n}^{0}+\frac{\sqrt{\alpha_{t}}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_{t}}x_{n}^{t}$ and $\tilde{\beta}_{t}:=\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_{t}}\beta_{t}$.
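As a minimal numerical sketch of Equations 6 and 7, the following NumPy code builds a noise schedule, draws $x_{n}^{t}$ in closed form, and evaluates the posterior mean and variance. The linear schedule, its endpoints, $T=1000$, and all function names are illustrative assumptions and are not taken from the authors' implementation.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=2e-2):
    """Linear beta schedule (an assumed choice) and the derived alpha terms."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x^t ~ q(x^t | x^0) as in Equation 6; t is 1-based."""
    eps = rng.standard_normal(x0.shape)
    a_bar = alpha_bars[t - 1]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps, eps

def q_posterior(xt, x0, t, betas, alphas, alpha_bars):
    """Mean and variance of q(x^{t-1} | x^t, x^0) as in Equation 7; needs t >= 2."""
    a_bar_t = alpha_bars[t - 1]
    a_bar_prev = alpha_bars[t - 2]
    beta_t = betas[t - 1]
    coef_x0 = np.sqrt(a_bar_prev) * beta_t / (1.0 - a_bar_t)
    coef_xt = np.sqrt(alphas[t - 1]) * (1.0 - a_bar_prev) / (1.0 - a_bar_t)
    mean = coef_x0 * x0 + coef_xt * xt
    var = (1.0 - a_bar_prev) / (1.0 - a_bar_t) * beta_t
    return mean, var

rng = np.random.default_rng(0)
betas, alphas, alpha_bars = make_schedule()
x0 = rng.standard_normal((2, 8, 8))        # toy image with real/imaginary channels
xt, eps = q_sample(x0, 500, alpha_bars, rng)
mean, var = q_posterior(xt, x0, 500, betas, alphas, alpha_bars)
```

A quick sanity check of the coefficients: for a noise-free $x_{n}^{t}=\sqrt{\bar{\alpha}_{t}}x_{n}^{0}$, the posterior mean reduces to $\sqrt{\bar{\alpha}_{t-1}}x_{n}^{0}$, and $\tilde{\beta}_{t}$ is always smaller than $\beta_{t}$.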

The training objective Equation 5 is further written as minimizing the Kullback-Leibler (KL) divergence between the forward and reverse processes in Equation 4 and Equation 7, as proposed by Sohl-Dickstein et al. [8]. (See Appendix A for details.)

In practice, the approach proposed by Ho et al. [10] involves reparameterizing $\mu_{\theta}$ and predicting the noise $\epsilon$ for $x_{n}^{t}$. The expression for $x_{n}^{t}$ is given by $x_{n}^{t}(x_{n}^{0},\epsilon)=\sqrt{\bar{\alpha}_{t}}x_{n}^{0}+\sqrt{1-\bar{\alpha}_{t}}\epsilon$, with $\Sigma_{\theta}(x_{n}^{t},x_{<n}^{0},t)=\beta_{t}\mathbf{I}$ fixed. We realize this with a neural network $\epsilon_{\theta}(x_{n}^{t},x_{<n}^{0},t)$, shown in Figure 1, which predicts the noise for $x_{n}^{t}$ at each time step given $x_{<n}^{0}$. In the end, the objective function in Equation 2 for training autoregressive image diffusion is written as

\mathcal{L}_{AID}\leq\sum_{n=1}^{N}L_{D_{n}}=\sum_{n=1}^{N}\mathbb{E}_{t,\epsilon|x_{n}^{0},x_{<n}^{0}}\left[\left\|\epsilon_{\theta}(\sqrt{\bar{\alpha}_{t}}x_{n}^{0}+\sqrt{1-\bar{\alpha}_{t}}\epsilon,x_{<n}^{0},t)-\epsilon\right\|_{2}^{2}\right],

(8)

where the expectation is taken over the noise $\epsilon\sim\mathcal{N}(0,\mathbf{I})$ and the time step $t\sim\mathcal{U}(\{1,\ldots,T\})$. To generate an image sequence, we begin with the noise $x_{1}^{T}$ and update it iteratively using Equation 4 with the given $x_{0}^{0}$, following the sequence $(x_{1}^{T}\rightarrow x_{1}^{T-1}\rightarrow\ldots\rightarrow x_{1}^{0})$. This process yields a clean sample $x_{1}^{0}$. Subsequently, we can sample $x_{2}^{0}$ in the same manner using the generated images $x_{<2}^{0}$, and continue this process iteratively to generate the entire sequence of images.
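The objective in Equation 8 can be illustrated with a one-sample Monte Carlo estimate for a single sequence. In this sketch, `eps_model` is a hypothetical stand-in for the network $\epsilon_{\theta}$, the schedule is an assumed linear one, and the loop over positions is written sequentially for readability, whereas the paper evaluates all positions in parallel.

```python
import numpy as np

def aid_loss(seq, eps_model, alpha_bars, rng):
    """One-sample Monte Carlo estimate of Equation 8 for one sequence.

    seq has shape (N + 1, C, H, W) and holds x_0, ..., x_N.
    eps_model(x_t, cond, t) is a placeholder for the noise predictor.
    """
    T = len(alpha_bars)
    total = 0.0
    for n in range(1, seq.shape[0]):
        cond = seq[:n]                        # clean images x_0, ..., x_{n-1}
        t = int(rng.integers(1, T + 1))       # t ~ U({1, ..., T})
        eps = rng.standard_normal(seq[n].shape)
        a_bar = alpha_bars[t - 1]
        x_t = np.sqrt(a_bar) * seq[n] + np.sqrt(1.0 - a_bar) * eps
        total += np.sum((eps_model(x_t, cond, t) - eps) ** 2)
    return total

def zero_model(x_t, cond, t):
    """Dummy predictor that always outputs zero noise."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 2e-2, 1000))
seq = rng.standard_normal((5, 2, 8, 8))
loss = aid_loss(seq, zero_model, alpha_bars, rng)
```

With the dummy predictor, the loss is simply the squared norm of the drawn noise, which gives a baseline that a trained network should fall below.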

2.2 Architecture

To optimize the objective function in Equation 8 efficiently, ordered images are loaded as sequences of length $N+1$ during the training phase. We take the first $N$ images $\mathbf{x}_{con}=\{x_{0},x_{1},\ldots,x_{N-1}\}$ as the conditioning sequence and the last $N$ images $\mathbf{x}_{target}=\{x_{1},\ldots,x_{N}\}$ as the target sequence, as shown in Figure 1. We adopt an architecture built on a Unet [23] with temporal-spatial conditioning (TSC) capabilities, designed to process the conditioning sequence and predict the noise for the target sequence. The term "temporal" refers to conditioning on previous frames along the $N$ dimension, while "spatial" refers to conditioning on the previous frame across the $H\times W$ dimensions. Additionally, the TSC block is conditioned on the time steps $t$ of the diffusion process.
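The split into conditioning and target sequences amounts to two shifted views of the same array. The following sketch uses a toy volume; the shapes and variable names are illustrative assumptions.

```python
import numpy as np

# Toy volume: N + 1 = 5 ordered images, 2 channels (real, imaginary), 4 x 4 pixels.
N = 4
seq = np.arange((N + 1) * 2 * 4 * 4, dtype=np.float32).reshape(N + 1, 2, 4, 4)

x_con = seq[:-1]      # x_0, ..., x_{N-1}: conditioning sequence
x_target = seq[1:]    # x_1, ..., x_N: target sequence

# Target position n is paired with the conditioning images up to index n.
pairs = [(x_con[: n + 1], x_target[n]) for n in range(N)]
```

Because the two views overlap by $N-1$ images, one loaded sequence provides $N$ training pairs at once.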

The only interaction between images in the conditioning sequence occurs during the attention operation. To maintain proper conditioning with the autoregressive property, we implemented a standard upper triangular mask on the $n\times n$ matrix of attention logits. This causal attention module is used in the DiTBlock [18, 24]. The modified DiTBlock is followed by a ResNet block [25], which is a standard building block in the Unet architecture. The features output by the TSC block are then passed to the corresponding encoder block in the Unet, which processes the target sequence. The change in tensor dimensions inside the TSC block is handled by the einops library (https://github.com/arogozhnikov/einops) and illustrated in Figure 1.
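The effect of the upper triangular mask can be sketched with a single-head attention over the sequence axis. This is a simplified illustration without learned projections, normalization layers, or time-step conditioning, so it is not the DiTBlock itself; the function name and shapes are assumptions.

```python
import numpy as np

def causal_attention(q, k, v):
    """Scaled dot-product attention over the sequence axis with a causal mask.

    q, k, v have shape (n, d), one row per image position.
    """
    n, d = q.shape
    logits = q @ k.T / np.sqrt(d)
    # Entries strictly above the diagonal correspond to future positions.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    logits = np.where(future, -np.inf, logits)
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
out, weights = causal_attention(q, k, v)
```

Since the masked logits receive zero weight, changing the last image in the sequence leaves the outputs at all earlier positions unchanged, which is the property the autoregressive factorization relies on.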

During training, the net predicts the noise in parallel for each noisy image that is sampled from the target sequence, given the conditioning sequence. During sequence generation, the net iteratively refines the noisy input to produce a clean image, which is then appended to the conditioning sequence.

2.3 Application in MRI inverse problem

Image reconstruction is formulated as a Bayesian problem where the posterior of the image $p(x|y)$ is expressed as

p(x|y)=\frac{p(y|x)\cdot p(x)}{p(y)}.

(9)

Here, $y$ represents the measured k-space data, $x$ denotes the image, and $p(x)$ is a generative prior. The minimum mean square error (MMSE) estimator for the posterior minimizes the mean square error, given by:

x_{\mathrm{MMSE}}=\arg\min_{\tilde{x}}\int\|\tilde{x}-x\|^{2}p(x|y)dx=\mathbb{E}[x|y].

(10)
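The second equality in Equation 10 follows from a standard completion-of-squares argument, sketched here for completeness. The symbol $J$ is introduced only for this derivation, and $\langle\cdot,\cdot\rangle$ denotes the complex inner product.

```latex
J(\tilde{x}) = \int \|\tilde{x}-x\|^{2}\, p(x|y)\, dx
  = \|\tilde{x}\|^{2} - 2\,\mathrm{Re}\,\langle \tilde{x}, \mathbb{E}[x|y] \rangle + \mathbb{E}\left[\|x\|^{2}\,|\,y\right]
  = \left\|\tilde{x}-\mathbb{E}[x|y]\right\|^{2} + \mathbb{E}\left[\|x\|^{2}\,|\,y\right] - \left\|\mathbb{E}[x|y]\right\|^{2}
```

Only the first term of the last line depends on $\tilde{x}$, and it vanishes exactly at $\tilde{x}=\mathbb{E}[x|y]$, so the posterior mean minimizes the mean square error.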

2.4 Likelihood function for k-space

The image $x\in\mathbb{C}^{n\times n}$ is represented as a complex matrix, where $n\times n$ is the image size, and $y\in\mathbb{C}^{m\times m_{C}}$ contains the complex-valued k-space samples from $m_{C}$ receive coils. Assuming circularly-symmetric normal noise $\eta$ with zero mean and covariance matrix $\sigma^{2}_{\eta}\mathbf{I}$, the likelihood $p(y|x)$ of observing $y$ given $x$ is formulated as a complex normal distribution:

p(y|x)=\mathcal{CN}(y;\mathcal{A}x,\sigma^{2}_{\eta}\mathbf{I})=(\sigma_{\eta}^{2}\pi)^{-N_{p}}e^{-\|\sigma_{\eta}^{-1}\cdot(y-\mathcal{A}x)\|_{2}^{2}},

(11)

where $\mathbf{I}$ is the identity matrix, $\sigma_{\eta}$ is the standard deviation of the noise, $\mathcal{A}x$ represents the mean, and $N_{p}$ is the length of the k-space data vector. The operator $\mathcal{A}:\mathbb{C}^{n\times n}\rightarrow\mathbb{C}^{m\times m_{C}}$ maps the image $x$ to k-space and is composed of the coil sensitivity maps $\mathcal{S}$, the two-dimensional Fourier transform $\mathcal{F}$, and the k-space sampling mask $\mathcal{P}$, defined as $\mathcal{A}=\mathcal{PFS}$. For more details and a visual explanation of the forward operator, please refer to Appendix C.
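A minimal NumPy sketch of $\mathcal{A}=\mathcal{PFS}$ and its adjoint is given below. It assumes Cartesian sampling on the full grid, so unsampled k-space locations are stored as zeros instead of being dropped, and it uses an orthonormal FFT without any shift of the k-space center. The random sensitivities and the function names are purely illustrative.

```python
import numpy as np

def forward(x, sens, mask):
    """A x = P F S x: weight by coil sensitivities, 2D FFT, then mask.

    x: (n, n) complex image; sens: (C, n, n) complex; mask: (n, n) of 0/1.
    """
    coil_images = sens * x[None]
    kspace = np.fft.fft2(coil_images, norm="ortho")
    return mask[None] * kspace

def adjoint(y, sens, mask):
    """A^H y = S^H F^H P^H y."""
    coil_images = np.fft.ifft2(mask[None] * y, norm="ortho")
    return np.sum(np.conj(sens) * coil_images, axis=0)

rng = np.random.default_rng(0)
n, C = 16, 4
x = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
sens = rng.standard_normal((C, n, n)) + 1j * rng.standard_normal((C, n, n))
mask = np.zeros((n, n))
mask[::2, :] = 1.0                      # keep every other phase-encoding line
y = forward(x, sens, mask)

# Dot-product test: <z, A x> should equal <A^H z, x> for any z.
z = rng.standard_normal(y.shape) + 1j * rng.standard_normal(y.shape)
lhs = np.vdot(z, forward(x, sens, mask))
rhs = np.vdot(adjoint(z, sens, mask), x)
```

The dot-product test at the end is a common way to check that the adjoint is implemented consistently with the forward operator, which matters because the data fidelity gradient uses $\mathcal{A}^{H}$.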

2.5 Sampling the posterior

Given a sequence of k-space data $\mathbf{y}=\{y_{1},\ldots,y_{N}\}$, each posterior in $\{p_{\theta}(x_{n}|y_{n},x_{<n}^{0})|1\leq n\leq N\}$ is expressed as

p_{\theta}(x_{n}|y_{n},x_{<n}^{0})=\frac{p(y_{n}|x_{n},x_{<n}^{0})p_{\theta}(x_{n}|x_{<n}^{0})}{p(y_{n}|x_{<n}^{0})}=\frac{p(y_{n}|x_{n})p_{\theta}(x_{n}|x_{<n}^{0})}{p(y_{n})}\propto p(y_{n}|x_{n})p_{\theta}(x_{n}|x_{<n}^{0}),

(12)

where the second equality holds when the acquisition of $y_{n}$ is independent of the images $x_{<n}^{0}$, i.e., $y_{n}$ and $x_{<n}^{0}$ are conditionally independent given $x_{n}$. Following Sohl-Dickstein et al. [8], we have

p_{\theta}(x_{n}^{t-1}|x_{n}^{t},y_{n},x_{<n}^{0})\propto p(y_{n}|x_{n}^{t})p_{\theta}(x_{n}^{t-1}|x_{n}^{t},x_{<n}^{0}).

(13)

The details for Equation 13 are given in Appendix B. To sample the above posterior, the learned reverse process in Equation 4 is used, and the algorithm is constructed with two gradient updates, one for the log of the prior and one for the k-space likelihood: the DDIM (Denoising Diffusion Implicit Model) reverse step proposed by Song et al. [26], and a data fidelity step derived from the likelihood function in Equation 11, which are described as follows:

\tilde{x}_{n}^{t-1}\leftarrow\sqrt{\bar{\alpha}_{t-1}}\left(\frac{x_{n}^{t}-\sqrt{1-\bar{\alpha}_{t}}\epsilon_{\theta}(x_{n}^{t},x_{<n}^{0},t)}{\sqrt{\bar{\alpha}_{t}}}\right)+\sqrt{1-\bar{\alpha}_{t-1}}\epsilon_{\theta}(x_{n}^{t},x_{<n}^{0},t),

(14)

x_{n}^{t-1}\leftarrow\tilde{x}_{n}^{t-1}+\lambda\cdot\nabla_{x_{n}^{t-1}}\log p(y_{n}|\tilde{x}_{n}^{t-1}),

(15)

where $\lambda$ is the step size, and $\nabla_{x_{n}^{t-1}}\log p(y_{n}|x_{n}^{t-1})$ is the gradient of the log-likelihood of Equation 11. Then, the reconstruction of a sequence of images from the undersampled k-space data is achieved by sequentially sampling the posteriors in $\{p(x_{n}|y_{n},x_{<n}^{0})|1\leq n\leq N\}$ using the autoregressive diffusion model. The algorithm is summarized in Algorithm 1.
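For reference, the gradient used in the data fidelity step can be written out from Equation 11. The factor of 2 below assumes that the gradient is taken with respect to the real and imaginary parts of $x$ and recombined into a complex array; other conventions differ by a constant, which can be absorbed into the step size. The symbol $\lambda'$ is introduced only for this sketch.

```latex
\log p(y_{n}|x) = -N_{p}\log(\sigma_{\eta}^{2}\pi) - \sigma_{\eta}^{-2}\,\|y_{n}-\mathcal{A}x\|_{2}^{2}
\nabla_{x}\log p(y_{n}|x) = 2\,\sigma_{\eta}^{-2}\,\mathcal{A}^{H}(y_{n}-\mathcal{A}x)
x_{n}^{t-1} \leftarrow \tilde{x}_{n}^{t-1} + \lambda'\,\mathcal{A}^{H}(y_{n}-\mathcal{A}\tilde{x}_{n}^{t-1}), \qquad \lambda' := 2\lambda\sigma_{\eta}^{-2}
```

In this form, the data fidelity step is a gradient step on the least-squares data consistency term, applied to the output of the DDIM reverse step.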

Algorithm 1 Sample the posterior in $\{p(x_{n}|y_{n},x_{<n}^{0})|1\leq n\leq N\}$ using the autoregressive diffusion model.

1: Initial image sequence: $x_{<n}^{0}=x_{0}$; Time steps: $T$; Step size: $\lambda$; Iterations for data fidelity step: $K$; Number of samples: $S$;
2: for $y_{n}\in\mathbf{y}=\{y_{1},y_{2},\ldots,y_{N}\}$ do
3:   Initialize $x_{n}^{T}$ with Gaussian noise.
4:   Construct the forward operator $\mathcal{A}$ with sampling pattern $\mathcal{P}$ and coil sensitivities $\mathcal{S}$
5:   for $t\in\{T,\ldots,1\}$ do
6:     Run the DDIM reverse step in Equation 14 to get $x_{n}^{t-1}$ given $x_{n}^{t}$ and $x_{<n}^{0}$
7:     Run the data fidelity step in Equation 15 to update $x_{n}^{t-1}$ for $K$ steps.
8:     Add Gaussian noise scaled by $\sqrt{1-\bar{\alpha}_{t-1}}$ to $x_{n}^{t-1}$
9:   end for
10:  Update $x_{<n}^{0}\leftarrow\{x_{n}^{0},\ldots,x_{0}^{0}\}$
11: end for
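The control flow of Algorithm 1 can be sketched in a few lines of NumPy. To keep the example short, it assumes a single-coil forward operator $\mathcal{A}=\mathcal{PF}$ on the full grid, folds the constants of the likelihood gradient into the step size `lam`, uses $\bar{\alpha}_{0}=1$ for the last step, and replaces the trained network by a placeholder `eps_model`; all names are hypothetical and drawing $S$ samples would simply repeat the call.

```python
import numpy as np

def sample_posterior(y, mask, cond, eps_model, alpha_bars, lam=1.0, K=5, rng=None):
    """Inner loop of Algorithm 1 for one k-space frame (single-coil A = P F assumed).

    eps_model(x_t, cond, t) is a placeholder for the trained noise predictor.
    """
    rng = np.random.default_rng() if rng is None else rng
    A = lambda x: mask * np.fft.fft2(x, norm="ortho")
    AH = lambda k: np.fft.ifft2(mask * k, norm="ortho")

    def cnoise():
        return rng.standard_normal(y.shape) + 1j * rng.standard_normal(y.shape)

    x = cnoise()                                   # step 3: initialize x^T
    for t in range(len(alpha_bars), 0, -1):
        a_t = alpha_bars[t - 1]
        a_prev = alpha_bars[t - 2] if t > 1 else 1.0
        eps = eps_model(x, cond, t)
        x0_hat = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        x = np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps   # Equation 14
        for _ in range(K):
            x = x + lam * AH(y - A(x))             # Equation 15, constants folded into lam
        x = x + np.sqrt(1.0 - a_prev) * cnoise()   # step 8; the scale is zero at t = 1
    return x

def reconstruct_sequence(ys, mask, x0, eps_model, alpha_bars, **kw):
    """Outer loop of Algorithm 1: reconstruct frames in order, growing the context."""
    cond = [x0]
    for y in ys:
        cond.append(sample_posterior(y, mask, np.stack(cond), eps_model, alpha_bars, **kw))
    return np.stack(cond[1:])

rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 2e-2, 20))    # short toy schedule
mask = np.zeros((8, 8))
mask[::2, :] = 1.0
truth = rng.standard_normal((3, 8, 8)) + 1j * rng.standard_normal((3, 8, 8))
ys = mask * np.fft.fft2(truth, norm="ortho")
zero_eps = lambda x_t, cond, t: np.zeros_like(x_t)            # dummy noise predictor
recon = reconstruct_sequence(ys, mask, np.zeros((8, 8), dtype=complex),
                             zero_eps, alpha_bars, rng=rng)
```

With a binary mask, an orthonormal FFT, and `lam=1`, each data fidelity step replaces the sampled k-space entries by the measured ones, so the final output is exactly consistent with `ys`; the prior only determines the unsampled entries.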

3 Experiments and Results

3.1 Model training

Two autoregressive diffusion models were trained on separate datasets: one in image space and the other in latent space. The image space model was trained on brain images that are from the fastMRI training dataset, which includes T1-weighted (some with post-contrast), T2-weighted, and FLAIR images [27]. These complex images were reconstructed from fully sampled multi-channel k-space volumes, with coil sensitivity maps computed using the BART toolbox [28]. The images were then normalized to a maximum magnitude of 1, and the real and imaginary parts were treated as separate channels when input into the neural network. The number of images in each volume ranged from 12 to 16. Images were loaded without reordering and resized to 320 $\times$ 320 pixels if they were not already that size.
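The normalization and channel split described above can be sketched as follows. Whether the maximum is taken per image or per volume is not stated, so per-image scaling is assumed here, and the function names are illustrative.

```python
import numpy as np

def preprocess(img):
    """Scale a complex image to unit maximum magnitude and split into channels.

    Returns an array of shape (2, H, W): real part, then imaginary part.
    """
    scaled = img / np.max(np.abs(img))
    return np.stack([scaled.real, scaled.imag]).astype(np.float32)

def to_complex(channels):
    """Inverse of the channel split."""
    return channels[0] + 1j * channels[1]

rng = np.random.default_rng(0)
img = rng.standard_normal((320, 320)) + 1j * rng.standard_normal((320, 320))
x = preprocess(img)
```

Keeping the real and imaginary parts as channels preserves the phase information that a magnitude-only representation would discard.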

The latent space model was trained on a cardiac dataset that contains cine images reconstructed by the SSA-FARY method [29]. First, a VQVAE was trained on the cine images, which were preprocessed similarly to the images in the fastMRI dataset. The cine images have a size of 256 $\times$ 256 pixels. The trained VQVAE then provides the latent space in which the AID model is trained (see Appendix F for the VQVAE configuration). All training was performed on 4 NVIDIA A100 GPUs with 80GB memory. The models were trained using the Adam optimizer with a learning rate of $10^{-4}$ and a batch size of 1 for the image space model and 4 for the latent space model. Both models were trained for 440,000 iterations. Training for 10k steps took around 2 hours for the brain model and 1.2 hours for the cardiac model. The length of the conditioning sequence $N$ is 10 for the brain model and 42 for the cardiac model. The network illustrated in Figure 1 was implemented based on OpenAI’s guided-diffusion codebase (https://github.com/openai/guided-diffusion). Our code will be released upon publication. We also trained a standard diffusion model, Guide, on the brain dataset for comparison. The Guide model was trained using the same hyperparameters as the AID model, except that the batch size was 10. The Guide model uses the same Unet blocks as AID.

3.2 Generating sequence of images

To test different aspects of the autoregressive diffusion models, we generate the sequence of images using the following two approaches.

Retrospective sampling: This method generates a new sequence of images $\{\tilde{x}_{1},\ldots,\tilde{x}_{N-1}\}$ based on the given sequence $\{x_{0},\ldots,x_{N}\}$ . $\tilde{x}_{n}$ is sampled from Equation 4 given $\{x_{0},\ldots,x_{n-1}\}$ .

Prospective sampling: A fixed-length sliding window is initialized with the given sequence $x_{<n}=\{x_{0},\ldots,x_{N-1}\}$ . $x_{N}$ is generated from Equation 4 with the current window as conditioning. Subsequently, the window is updated by adding the newly generated $x_{N}$ and removing the earliest image $x_{0}$ . This autoregressive sampling process is repeated until the stop condition is met. We refer to this process as a warm start. In a cold start, the window is initialized with zeros, and each element $x_{n}$ in it is updated with newly generated images from the beginning to the end, after which the generation is warmed up.
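The two sampling modes can be sketched with a generic `sample_next` callable that stands in for running the reverse process of Equation 4 on a conditioning window. The toy sampler below is deterministic and only serves to make the bookkeeping visible. The cold start is simplified to a zero-filled sliding window, which is an assumption and may differ from how the window is filled in the actual implementation.

```python
from collections import deque
import numpy as np

def retrospective(seq, sample_next):
    """Sample a new image at positions 1..N-1, each given the true x_0..x_{n-1}."""
    return [sample_next(seq[:n]) for n in range(1, len(seq) - 1)]

def prospective(window_init, sample_next, num_new):
    """Slide a fixed-length window over the generated images (warm start)."""
    window = deque(window_init, maxlen=len(window_init))
    generated = []
    for _ in range(num_new):
        x_new = sample_next(np.stack(list(window)))
        window.append(x_new)           # the deque drops the earliest image
        generated.append(x_new)
    return generated

def cold_start(length, shape, sample_next, num_new):
    """Start from an all-zero window; generated images gradually replace the zeros."""
    zeros = [np.zeros(shape) for _ in range(length)]
    return prospective(zeros, sample_next, num_new)

# Toy sampler: the "next image" is the last conditioning image plus one.
toy_sampler = lambda cond: cond[-1] + 1.0

seq = np.arange(5.0).reshape(5, 1, 1) * np.ones((5, 2, 2))   # images with values 0..4
retro = retrospective(seq, toy_sampler)
warm = prospective(seq[:3], toy_sampler, num_new=4)
cold = cold_start(3, (2, 2), toy_sampler, num_new=4)
```

In the retrospective case every output is conditioned on given images only, whereas in the prospective case the generated images feed back into the window, so errors can accumulate over long rollouts.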

In retrospective sampling, the model generates a sequence of images that are sequentially coherent and visually similar to the conditioning sequence, as shown in Figure 2 (a). Prospective sampling generates a sequence of images that extends the initial images in the sliding window and constitutes multiple volumes, as shown in Figure 2 (b). For a cold start, Figure 3 demonstrates the model’s ability to generate a sequence of images using a black background as the initial state. This shows the model’s generative capabilities from a minimal initial condition, indicating its robustness and flexibility. Due to space limitations, samples of similar quality from the model trained on the cardiac dataset are shown in Appendix D.

3.3 MRI reconstruction

The MMSE estimator in Equation 10 cannot be computed in closed form, and numerical approximations are typically required. Once samples from the posterior are obtained with Algorithm 1, a consistent estimate of ${x_{n}}_{\mathrm{MMSE}}$ can be computed by averaging those samples, i.e., the empirical mean of the samples converges in probability to ${x_{n}}_{\mathrm{MMSE}}$ by the weak law of large numbers. The variance of those samples provides a means of assessing the error in the reconstruction, assuming the trained model is trusted. To highlight the regions with large uncertainty, we compute pseudo-confidence intervals based on the assumption that each pixel’s intensity is normally distributed. This involves determining the standard error from the variance and then multiplying it by the t-score corresponding to a 95% confidence level.
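The sample statistics described above can be sketched as follows. The example assumes 10 samples, for which the two-sided 95% critical value of the t-distribution with 9 degrees of freedom is about 2.262; whether the interval is computed on complex values or on magnitudes is not stated, so the variance of the complex samples is used here as an assumption.

```python
import numpy as np

# Two-sided 95% t critical value for 9 degrees of freedom (10 samples).
T_CRIT_DF9 = 2.262

def summarize_samples(samples, t_crit=T_CRIT_DF9):
    """Empirical mean, variance, and pseudo-confidence half-width per pixel.

    samples: (S, H, W) complex posterior samples; t_crit must match S - 1.
    """
    S = samples.shape[0]
    mean = samples.mean(axis=0)                   # estimate of x_MMSE
    var = samples.var(axis=0, ddof=1)             # unbiased sample variance (real-valued)
    half_width = t_crit * np.sqrt(var / S)        # t-score times standard error
    return mean, var, half_width

rng = np.random.default_rng(0)
truth = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))
noise = rng.standard_normal((10, 8, 8)) + 1j * rng.standard_normal((10, 8, 8))
mean, var, half_width = summarize_samples(truth + 0.1 * noise)
```

Pixels with a large half-width are the ones that would be highlighted as uncertain in the mean image.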

Unfolding of aliased single-coil image: To investigate how the trained AID model reduces the folding artifacts in the reconstruction, we designed a single-coil unfolding experiment. The single-channel k-space is simulated from multichannel k-space data. Only the odd lines in k-space are retained, giving $y$. Ten samples were drawn from the posterior $p(x_{1}|y,x_{0})$ using Algorithm 1 with parameters $T=1000,\lambda=1,K=5$. The experiment was repeated using a standard diffusion model, Guide. The results are shown in Figure 4. The AID model significantly reduces the errors in the region of folding artifacts compared to the Guide model. The mean over samples, $x_{\mathrm{MMSE}}$, is highlighted with a confidence interval computed from the variance of the samples. The highlighted mean image shows that the reconstruction by AID is more trustworthy in the folding region. In general, the highlighted region lies in the folding region, where large errors remain, as expected.
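The folding artifact caused by retaining every other k-space line can be reproduced in a few lines. The sketch assumes an unshifted FFT with zero-based line indices, so "odd lines" means indices 1, 3, 5, and so on along the first axis; the random test image is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16
x = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

kspace = np.fft.fft2(x)
mask = np.zeros((n, 1))
mask[1::2] = 1.0                        # retain odd-indexed lines along axis 0
aliased = np.fft.ifft2(mask * kspace)   # zero-filled reconstruction

# Each pixel is superimposed with the pixel half a field of view away.
expected = 0.5 * (x - np.roll(x, n // 2, axis=0))
```

The zero-filled image is an equal-weight combination of the true image and a copy shifted by half the field of view, and with a single coil the data alone cannot separate the two, so the unfolding has to come from the prior.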

Reconstruction from undersampled data: To further investigate the model’s performance in reconstruction, we conducted experiments on 20 volumes from the fastMRI validation dataset, where k-space data were retrospectively undersampled using various sampling masks. We created four types of sampling masks: random with autocalibration signal (ACS), random without ACS, equispaced with ACS, and equispaced without ACS. The undersampling factor is 12. With parameters $T=1000,\lambda=1,K=4$ for Algorithm 1, the images were reconstructed from the undersampled k-space data using AID and Guide as the prior, respectively. We used the peak signal-to-noise ratio (PSNR in dB) and the normalized root-mean-square error (NRMSE) to evaluate the reconstruction quality against the reference image reconstructed from fully sampled k-space. The comparison of metrics across experimental conditions is illustrated in Figure 5. The proposed AID model outperforms the Guide model in terms of PSNR and NRMSE, demonstrating its superior performance in image reconstruction from undersampled k-space data. The results are consistent across different undersampling factors and sampling masks, indicating the model’s robustness and flexibility in handling various types of undersampled k-space data.
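The four mask types and the two metrics can be sketched as follows. The number of ACS lines, the placement of the ACS block at the array center (which presumes a centered k-space), and the exact metric normalizations are assumptions, since the text does not specify them.

```python
import numpy as np

def make_mask(n, accel, kind="equispaced", acs=0, rng=None):
    """1D line-selection mask of length n (True = sampled).

    kind: "equispaced" or "random"; acs: number of fully sampled central lines.
    """
    mask = np.zeros(n, dtype=bool)
    if kind == "equispaced":
        mask[::accel] = True
    else:
        rng = np.random.default_rng() if rng is None else rng
        mask[rng.choice(n, size=n // accel, replace=False)] = True
    if acs > 0:
        start = (n - acs) // 2
        mask[start:start + acs] = True
    return mask

def nrmse(x, ref):
    """Root-mean-square error normalized by the norm of the reference."""
    return np.linalg.norm(x - ref) / np.linalg.norm(ref)

def psnr(x, ref):
    """Peak signal-to-noise ratio in dB, with the peak taken from the reference."""
    mse = np.mean(np.abs(x - ref) ** 2)
    return 10.0 * np.log10(np.max(np.abs(ref)) ** 2 / mse)

mask = make_mask(320, 12, "equispaced", acs=20)
ref = np.ones((4, 4))
x = ref + 0.1
scores = (nrmse(x, ref), psnr(x, ref))
```

Note that adding ACS lines lowers the effective acceleration below the nominal factor, which should be kept in mind when comparing masks with and without ACS.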

To give a visual impression of the improvement achieved by the AID model in reconstruction, we show reconstructed images in Figure 6 and more of them in Appendix E. The images reconstructed using AID are visually more similar to the reference images than those reconstructed using Guide, even though the latter also provides aliasing-free images. Furthermore, it is worth noting that the Guide model introduced more visually notable hallucinations than the AID model, which suggests that AID is more trustworthy.

4 Discussion

In this work, we propose an autoregressive image diffusion model for generating image sequences, with specific applications to accelerated MRI reconstruction. We conducted a comprehensive evaluation of its performance as an image prior in reconstruction algorithms, comparing it to a standard diffusion model. Due to the learned prior information on inter-image dependencies, the proposed model outperforms the standard diffusion model across various scenarios. Our model is particularly well-suited for medical applications where image sequences are often acquired (e.g., in volumetric format) from patients in clinical practice. For instance, when different contrast images are acquired during an examination session [levac2023conditional], our model is designed to capture the relationships between these images. This enables more accurate and coherent reconstructions from undersampled k-space data using the proposed Algorithm 1. Additionally, other medical imaging tasks like dynamic MRI, multi-contrast imaging, super-resolution, and denoising could benefit from our model’s ability to leverage inter-image dependencies [li2024rethinking]. Furthermore, the proposed algorithm holds promise for facilitating the incorporation of pre-existing information from other imaging modalities into MRI image reconstruction. This opens up a wide range of potential medical applications, with the potential to improve patient care and reduce healthcare costs by enabling faster and more accurate image acquisition and diagnosis.

Limitations and future work: We did not evaluate the model on a common image dataset such as ImageNet or CIFAR-10, nor did we compute metrics such as FID and Inception Score, which is a limitation of our work. We plan to address these limitations in future work by running the model on a large dataset and comparing it with other state-of-the-art models. Additionally, given the model’s suitability for modeling image sequences, it is worth exploring its potential for optimizing MRI k-space acquisition strategies, as the acquisition process constitutes a sequence of operations.

5 Conclusion

The proposed autoregressive image diffusion model offers an approach to generating image sequences, with significant potential as a trustworthy prior in accelerated MRI reconstruction. In various experiments, it outperforms the standard diffusion model in terms of both image quality and robustness by taking advantage of the prior information on inter-image dependencies.

References

Lustig et al. [2007] M. Lustig, D. Donoho, and J. M. Pauly. Sparse MRI: The application of compressed sensing for rapid MR imaging. Magn. Reson. Med., 58(6):1182–1195, 2007.
Block et al. [2007] K. T. Block, M. Uecker, and J. Frahm. Undersampled radial MRI with multiple coils. Iterative image reconstruction using a total variation constraint. Magn. Reson. Med., 57(6):1086–1098, 2007. ISSN 1522-2594. doi: 10.1002/mrm.21236.
Yang et al. [2016] Yan Yang, Jian Sun, Huibin Li, and Zongben Xu. Deep ADMM-Net for Compressive Sensing MRI. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.
Hammernik et al. [2017] Kerstin Hammernik, Teresa Klatzer, Erich Kobler, Michael P. Recht, Daniel K. Sodickson, Thomas Pock, and Florian Knoll. Learning a variational network for reconstruction of accelerated MRI data. Magn. Reson. Med., 79(6):3055–3071, 2017. ISSN 1522-2594. doi: 10.1002/mrm.26977.
Mardani et al. [2018] Morteza Mardani, Enhao Gong, Joseph Y Cheng, Shreyas S Vasanawala, Greg Zaharchuk, Lei Xing, and John M Pauly. Deep generative adversarial neural networks for compressive sensing MRI. IEEE transactions on medical imaging, 38(1):167–179, 2018.
Tezcan et al. [2019] Kerem C Tezcan, Christian F Baumgartner, Roger Luechinger, Klaas P Pruessmann, and Ender Konukoglu. MR image reconstruction using deep density priors. IEEE transactions on medical imaging, 38(7):1633–1642, 2019. doi: 10.1109/TMI.2018.2887072.
Luo et al. [2020] Guanxiong Luo, Na Zhao, Wenhao Jiang, Edward S. Hui, and Peng Cao. MRI reconstruction using deep Bayesian estimation. Magn. Reson. Med., 84(4):2246–2261, apr 2020. doi: 10.1002/mrm.28274.
Sohl-Dickstein et al. [2015] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015.
Song and Ermon [2019] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 11895–11907, 2019.
Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
Jalal et al. [2021] Ajil Jalal, Marius Arvinte, Giannis Daras, Eric Price, Alexandros G Dimakis, and Jon Tamir. Robust compressed sensing MRI with deep generative priors. Advances in Neural Information Processing Systems, 34:14938–14954, 2021.
Yang et al. [2023] Ling Yang, Zhilong Zhang, Yang Song, Shenda Hong, Runsheng Xu, Yue Zhao, Wentao Zhang, Bin Cui, and Ming-Hsuan Yang. Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys, 56(4):1–39, 2023.
Song et al. [2021] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.
Güngör et al. [2023] Alper Güngör, Salman UH Dar, Şaban Öztürk, Yilmaz Korkmaz, Hasan A Bedel, Gokberk Elmas, Muzaffer Ozbey, and Tolga Çukur. Adaptive diffusion priors for accelerated MRI reconstruction. Medical Image Analysis, 88:102872, 2023.
Chung and Ye [2022] Hyung** Chung and Jong Chul Ye. Score-based diffusion models for accelerated MRI. Medical Image Analysis, 80:102479, 2022.
Luo et al. [2023] Guanxiong Luo, Moritz Blumenthal, Martin Heide, and Martin Uecker. Bayesian MRI reconstruction with joint uncertainty estimation using diffusion models. Magnetic Resonance in Medicine, 90(1):295–311, 2023.
Zach et al. [2023] Martin Zach, Florian Knoll, and Thomas Pock. Stable deep MRI reconstruction using generative priors. IEEE Transactions on Medical Imaging, 2023.
Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
Radford et al. [2018] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018.
Van den Oord et al. [2016] Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with pixelcnn decoders. Advances in neural information processing systems, 29, 2016.
Chen et al. [2020] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In International conference on machine learning, pages 1691–1703. PMLR, 2020.
Esser et al. [2021] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021.
Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015.
Peebles and Xie [2023] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
Song et al. [2020] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
Knoll et al. [2020] Florian Knoll, Jure Zbontar, Anuroop Sriram, Matthew J Muckley, Mary Bruno, Aaron Defazio, Marc Parente, Krzysztof J Geras, Joe Katsnelson, Hersh Chandarana, et al. fastMRI: A publicly available raw k-space and DICOM dataset of knee images for accelerated MR image reconstruction using machine learning. Radiology: Artificial Intelligence, 2(1):e190007, 2020.
Blumenthal et al. [2023] Moritz Blumenthal, Martin Heide, Christian Holme, Martin Juschitz, Bernhard Rapp, Philip Schaten, Nick Scholand, Jon Tamir, Christian Tönnes, and Martin Uecker. mrirecon/bart: version 0.9.00, December 2023. URL https://doi.org/10.5281/zenodo.10277939.
Rosenzweig et al. [2020] Sebastian Rosenzweig, Nick Scholand, H Christian M Holme, and Martin Uecker. Cardiac and respiratory self-gating in radial MRI using an adapted singular spectrum analysis (SSA-FARY). IEEE transactions on medical imaging, 39(10):3029–3041, 2020.

Appendix A Loss function derivation

Below is a derivation of Equation 5, the reduced-variance variational bound for diffusion models. It is adapted from Sohl-Dickstein et al. [8] and Ho et al. [10] and is included here only for completeness. In the forward process, $x_{n}^{t}$ and $x_{<n}^{0}$ are conditionally independent given $x_{n}^{t-1}$.

\begin{align}
L &= \mathbb{E}_{q}\left[-\log p(x_{n}^{T} \mid x_{<n}^{0}) - \sum_{t>1}\log\frac{p_{\theta}(x_{n}^{t-1} \mid x_{n}^{t}, x_{<n}^{0})}{q(x_{n}^{t} \mid x_{n}^{t-1}, x_{<n}^{0})} - \log\frac{p_{\theta}(x_{n}^{0} \mid x_{n}^{1}, x_{<n}^{0})}{q(x_{n}^{1} \mid x_{n}^{0}, x_{<n}^{0})}\right] \tag{16} \\
&= \mathbb{E}_{q}\left[-\log p(x_{n}^{T} \mid x_{<n}^{0}) - \sum_{t>1}\log\left(\frac{p_{\theta}(x_{n}^{t-1} \mid x_{n}^{t}, x_{<n}^{0})}{q(x_{n}^{t-1} \mid x_{n}^{t}, x_{n}^{0}, x_{<n}^{0})}\cdot\frac{q(x_{n}^{t-1} \mid x_{n}^{0})}{q(x_{n}^{t} \mid x_{n}^{0})}\right) - \log\frac{p_{\theta}(x_{n}^{0} \mid x_{n}^{1}, x_{<n}^{0})}{q(x_{n}^{1} \mid x_{n}^{0}, x_{<n}^{0})}\right] \tag{17} \\
&= \mathbb{E}_{q}\left[-\log\frac{p(x_{n}^{T} \mid x_{<n}^{0})}{q(x_{n}^{T} \mid x_{n}^{0}, x_{<n}^{0})} - \sum_{t>1}\log\frac{p_{\theta}(x_{n}^{t-1} \mid x_{n}^{t}, x_{<n}^{0})}{q(x_{n}^{t-1} \mid x_{n}^{t}, x_{n}^{0}, x_{<n}^{0})} - \log p_{\theta}(x_{n}^{0} \mid x_{n}^{1}, x_{<n}^{0})\right] \tag{18} \\
&= \mathbb{E}_{q}\Bigg[D_{\mathrm{KL}}\big(q(x_{n}^{T} \mid x_{n}^{0}, x_{<n}^{0}) \parallel p(x_{n}^{T} \mid x_{<n}^{0})\big) + \sum_{t>1} D_{\mathrm{KL}}\big(q(x_{n}^{t-1} \mid x_{n}^{t}, x_{n}^{0}) \parallel p_{\theta}(x_{n}^{t-1} \mid x_{n}^{t}, x_{<n}^{0})\big) \tag{19} \\
&\phantom{=\mathbb{E}_{q}\Bigg[} - \log p_{\theta}(x_{n}^{0} \mid x_{n}^{1}, x_{<n}^{0})\Bigg] \tag{20}
\end{align}
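
As an illustrative check that is not part of the original derivation, the sketch below assumes, as in Ho et al. [10], that the two distributions in each KL term of Equation 19 are isotropic Gaussians with the same fixed variance. Under that assumption the KL divergence reduces to a scaled squared distance between the two means, which is why a simple regression loss can replace the KL terms. The function name `gaussian_kl` and all numerical values are hypothetical.

```python
import numpy as np


def gaussian_kl(mu_q, mu_p, sigma_q, sigma_p):
    """KL( N(mu_q, sigma_q^2 I) || N(mu_p, sigma_p^2 I) ) for isotropic Gaussians."""
    d = mu_q.size
    diff2 = float(np.sum((mu_q - mu_p) ** 2))
    return 0.5 * (
        d * (sigma_q / sigma_p) ** 2
        + diff2 / sigma_p ** 2
        - d
        + 2.0 * d * np.log(sigma_p / sigma_q)
    )


# Hypothetical means standing in for the forward-posterior mean and the
# mean predicted by the network; sigma is the shared, fixed standard deviation.
rng = np.random.default_rng(0)
mu_q = rng.standard_normal(16)
mu_p = rng.standard_normal(16)
sigma = 0.3

kl = gaussian_kl(mu_q, mu_p, sigma, sigma)
# With equal variances the KL collapses to ||mu_q - mu_p||^2 / (2 sigma^2).
mse_form = float(np.sum((mu_q - mu_p) ** 2)) / (2.0 * sigma ** 2)
print(kl, mse_form)
```

The two printed values agree up to floating-point error, so minimizing the KL terms is, under the stated assumption, equivalent to minimizing a weighted mean squared error between means.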

Appendix B Posterior derivation

When sampling from the posterior starts from standard Gaussian noise, Equation 12 gives

$$p(x_{n}^{t} \mid y_{n}, x_{<n}^{0}) \propto p(y_{n} \mid x_{n}^{t})\, p(x_{n}^{t} \mid x_{<n}^{0}) \tag{21}$$

for all the reverse time steps. Because

$$p(x_{n}^{t} \mid x_{<n}^{0}) = \int p(x_{n}^{t} \mid x_{n}^{t+1}, x_{<n}^{0})\, p(x_{n}^{t+1})\, dx_{n}^{t+1} \tag{22}$$

and

\begin{align}
\int p(x_{n}^{t} \mid x_{n}^{t+1}, y_{n}, x_{<n}^{0})\, p(x_{n}^{t+1})\, dx_{n}^{t+1} &= p(x_{n}^{t} \mid y_{n}, x_{<n}^{0}) \tag{23} \\
&= \frac{p(y_{n} \mid x_{n}^{t})\, p(x_{n}^{t} \mid x_{<n}^{0})}{p(y_{n})}~, \tag{24}
\end{align}

then we have

$$\int p(x_{n}^{t} \mid x_{n}^{t+1}, y_{n}, x_{<n}^{0})\, p(x_{n}^{t+1})\, dx_{n}^{t+1} = \frac{p(y_{n} \mid x_{n}^{t})}{p(y_{n})} \cdot \int p(x_{n}^{t} \mid x_{n}^{t+1}, x_{<n}^{0})\, p(x_{n}^{t+1})\, dx_{n}^{t+1}. \tag{25}$$

Therefore, we have

$$p(x_{n}^{t} \mid x_{n}^{t+1}, y_{n}, x_{<n}^{0}) = \frac{p(y_{n} \mid x_{n}^{t})\, p(x_{n}^{t} \mid x_{n}^{t+1}, x_{<n}^{0})}{p(y_{n})}~. \tag{26}$$

Here $p(y_{n})$ is the evidence, which is constant with respect to $x_{n}^{t}$. With a gradient-based method, samples from the posterior $p(x_{n}^{t}|x_{n}^{t+1},y_{n},x_{<n}^{0})$ can therefore be drawn using the likelihood $p(y_{n}|x_{n}^{t})$ and the reverse process $p(x_{n}^{t}|x_{n}^{t+1},x_{<n}^{0})$.
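
The following toy sketch illustrates one way such gradient-based sampling can work for a single reverse step; it is not the authors' implementation. It assumes a scalar variable, a Gaussian reverse kernel whose mean and standard deviation (`reverse_mean`, `reverse_std`) stand in for the network output, and an identity measurement model with Gaussian noise. It then runs unadjusted Langevin dynamics on the sum of the two log-density gradients. All names and values are hypothetical.

```python
import numpy as np


def log_posterior_grad(x, y, noise_std, reverse_mean, reverse_std):
    """Gradient of log p(y | x) + log p(x | x_next, context) for Gaussian factors."""
    grad_likelihood = (y - x) / noise_std ** 2          # from the likelihood
    grad_reverse = (reverse_mean - x) / reverse_std ** 2  # from the reverse kernel
    return grad_likelihood + grad_reverse


def langevin_sample(y, noise_std, reverse_mean, reverse_std,
                    n_chains=20000, n_steps=500, step=0.01, seed=0):
    """Unadjusted Langevin dynamics; all chains start from standard Gaussian noise."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_chains)
    for _ in range(n_steps):
        grad = log_posterior_grad(x, y, noise_std, reverse_mean, reverse_std)
        x = x + 0.5 * step * grad + np.sqrt(step) * rng.standard_normal(n_chains)
    return x


# Toy values (assumptions, not taken from the paper).
y, noise_std, reverse_mean, reverse_std = 1.0, 0.5, 0.2, 1.0
samples = langevin_sample(y, noise_std, reverse_mean, reverse_std)

# Closed-form posterior of the product of the two Gaussians, for comparison.
precision = 1.0 / noise_std ** 2 + 1.0 / reverse_std ** 2
exact_mean = (y / noise_std ** 2 + reverse_mean / reverse_std ** 2) / precision
exact_var = 1.0 / precision
print(samples.mean(), exact_mean, samples.var(), exact_var)
```

In this Gaussian toy case the empirical mean and variance of the chains can be compared with the closed-form posterior. The finite step size introduces a small bias in the variance, which shrinks as `step` decreases.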

Appendix C Likelihood function for k-space

The autocalibration signal (ACS) region consists of lines through the center of k-space which, unlike the rest of k-space, are fully sampled. The sensitivity of a coil is a spatial profile that describes the receive field that induces signals in the coil. Simultaneous data acquisition with multiple coils, each of whose sensitivities covers a different subregion, leads to a complete image without aliasing artifacts (see Figure 4).
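
A minimal sketch of a k-space likelihood consistent with this description is given below. It assumes the common multi-coil forward model in which the image is weighted by each coil sensitivity, Fourier transformed, and masked by the sampling pattern, with additive Gaussian noise of standard deviation `noise_std`. The function names, the random stand-in sensitivities, and the sampling pattern (every fourth line plus a fully sampled ACS block) are hypothetical choices for illustration.

```python
import numpy as np


def forward(x, sens, mask):
    """A x: weight by coil sensitivities, 2D FFT, keep only sampled k-space points."""
    return mask * np.fft.fft2(sens * x, norm="ortho")


def adjoint(k, sens, mask):
    """A^H k: zero-fill, inverse 2D FFT, combine coils with conjugate sensitivities."""
    return np.sum(np.conj(sens) * np.fft.ifft2(mask * k, norm="ortho"), axis=0)


def neg_log_likelihood_grad(x, y, sens, mask, noise_std):
    """A^H (A x - y) / sigma^2, the gradient of ||A x - y||^2 / (2 sigma^2)."""
    return adjoint(forward(x, sens, mask) - y, sens, mask) / noise_std ** 2


ny, nx, n_coils = 32, 32, 4
rng = np.random.default_rng(0)
x = rng.standard_normal((ny, nx)) + 1j * rng.standard_normal((ny, nx))
# Random complex maps stand in for measured coil sensitivity profiles.
sens = (rng.standard_normal((n_coils, ny, nx))
        + 1j * rng.standard_normal((n_coils, ny, nx)))

centered = np.zeros((ny, nx))
centered[::4, :] = 1.0                        # every 4th phase-encoding line
centered[ny // 2 - 4: ny // 2 + 4, :] = 1.0   # fully sampled ACS lines at the center
mask = np.fft.ifftshift(centered, axes=0)     # move the k-space center to index 0

y = forward(x, sens, mask)                    # noiseless undersampled measurements
grad = neg_log_likelihood_grad(np.zeros_like(x), y, sens, mask, noise_std=0.1)
print(y.shape, grad.shape)
```

The gradient returned by `neg_log_likelihood_grad` is the data-consistency term that a gradient-based sampler would combine with the learned prior. It vanishes when the current image reproduces the measured k-space samples exactly.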

Appendix D Cardiac samples

Appendix E Reconstruction from undersampled data

Appendix F VQVAE configuration

The VQVAE is trained on the cardiac dataset to generate the latent space for training the autoregressive diffusion model, using the official implementation (https://github.com/CompVis/taming-transformers.git). It is trained with the following configuration:

base_learning_rate: 4.5e-06
params:
  embed_dim: 3
  n_embed: 8192
  ddconfig:
    double_z: false
    z_channels: 3
    resolution: 256
    in_channels: 3
    out_ch: 3
    ch: 128
    ch_mult: [1, 2, 4]
    num_res_blocks: 2
    attn_resolutions: []
    dropout: 0.0
  lossconfig:
    target: losses.vqperceptual.VQLPIPSWithDiscriminator
    params:
      disc_conditional: false
      disc_in_channels: 3
      disc_start: 30001
      disc_weight: 0.8
      codebook_weight: 1.0
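
As a rough reading aid for this configuration, the sketch below estimates the shape of the latent representation. It assumes that the encoder halves the spatial resolution once between consecutive `ch_mult` levels, so three levels give a downsampling factor of 4; this is an assumption about the referenced implementation, not something stated in the configuration itself. The helper `latent_shape` is hypothetical.

```python
import math

# Values copied from the configuration above.
ddconfig = {"resolution": 256, "ch_mult": [1, 2, 4], "z_channels": 3}
n_embed = 8192


def latent_shape(cfg):
    """Latent (channels, height, width), assuming one 2x downsampling between levels."""
    downsamplings = len(cfg["ch_mult"]) - 1
    side = cfg["resolution"] // (2 ** downsamplings)
    return (cfg["z_channels"], side, side)


shape = latent_shape(ddconfig)
bits_per_code = math.log2(n_embed)  # information carried by one codebook index
print(shape, bits_per_code)
```

Under this assumption, each input image is mapped to a latent grid whose side length is a quarter of the input resolution, and this grid is what the autoregressive diffusion model operates on.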

\begin{align}
p_{\theta}(x_{n} \mid y_{n}, x_{<n}^{0}) &= \frac{p(y_{n} \mid x_{n}, x_{<n}^{0})\, p_{\theta}(x_{n} \mid x_{<n}^{0})}{p(y_{n} \mid x_{<n}^{0})} = \frac{p(y_{n} \mid x_{n})\, p_{\theta}(x_{n} \mid x_{<n}^{0})}{p(y_{n})} \notag \\
&\propto p(y_{n} \mid x_{n})\, p_{\theta}(x_{n} \mid x_{<n}^{0})~, \tag{12}
\end{align}