FORA: Fast-Forward Caching in Diffusion Transformer Acceleration

Pratheba Selvaraju1, Tianyu Ding2, Tianyi Chen2, Ilya Zharkov2, Luming Liang2
Corresponding author. Preprint version.
1UMass Amherst   2Microsoft
[email protected], {tianyuding,tiachen,zharkov,lulian}@microsoft.com
Abstract

Diffusion transformers (DiT) have become the de facto choice for generating high-quality images and videos, largely due to their scalability, which enables the construction of larger models for enhanced performance. However, the increased size of these models leads to higher inference costs, making them less attractive for real-time applications. We present Fast-FORward CAching (FORA), a simple yet effective approach designed to accelerate DiT by exploiting the repetitive nature of the diffusion process. FORA implements a caching mechanism that stores and reuses intermediate outputs from the attention and MLP layers across denoising steps, thereby reducing computational overhead. This approach does not require model retraining and seamlessly integrates with existing transformer-based diffusion models. Experiments show that FORA can speed up diffusion transformers several times over while only minimally affecting performance metrics such as the Inception Score (IS) and FID. By enabling faster processing with minimal trade-offs in quality, FORA represents a significant advancement in deploying diffusion transformers for real-time applications. Code will be made publicly available at: https://github.com/prathebaselva/FORA.

Figure 1: Image generations with FORA on DiT and PixArt-α. The image sizes are 512 × 512.

1 Introduction

Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020) have gained significant attention in generative tasks due to their exceptional ability to produce high-quality and diverse outputs. Initially conceptualized with the U-Net architecture, these models have demonstrated impressive performance in domains such as image and video generation (Rombach et al., 2022; Ho et al., 2022b; Saharia et al., 2022). However, the scalability of U-Net-based diffusion models is inherently limited, posing challenges for applications that require larger model capacities to achieve superior performance. To overcome these, diffusion transformers (DiT) (Peebles and Xie, 2023) have emerged as the preferred choice for generative tasks. By leveraging the inherently scalable architecture of transformers, DiT models facilitate effective scaling of model capacities. This scalability has enabled the development of larger models, resulting in improved performance in generating high-quality images and videos. Despite their advantages, the increased size of DiT models leads to higher inference costs, making them less suitable for real-time applications.

Recently, a new line of research has focused on accelerating diffusion models during inference by recognizing the iterative nature of the sampling process (Ma et al., 2024; Wimbauer et al., 2023; Agarwal et al., 2024; Zhang et al., 2024; Yu et al., 2024). These caching mechanisms exploit the repetitive nature of the diffusion process, storing and reusing intermediate outputs to reduce computational overhead. However, existing caching-based efficiency methods primarily target U-Net-based diffusion models and do not specifically address transformer-based models.

This paper aims to bridge this gap by focusing on transformer-based diffusion models. We introduce Fast-Forward Caching (FORA), a simple yet effective approach designed to accelerate DiT models by leveraging the repetitive nature of the diffusion process. FORA implements a caching mechanism that stores and reuses intermediate outputs from the attention and MLP layers across denoising steps, thereby reducing computational overhead. This approach does not require model retraining and seamlessly integrates with existing transformer-based diffusion models. By enabling faster processing with minimal trade-offs in quality, FORA represents a significant advancement in deploying diffusion transformers for real-time applications.

In summary, we make the following specific contributions:

  • We propose Fast-Forward Caching (FORA), a caching strategy tailored for transformer-based diffusion models. This mechanism capitalizes on the repetitive aspects of the diffusion process by preserving and reusing intermediate outputs from attention and MLP layers during inference.

  • FORA significantly cuts down computational overhead and integrates seamlessly with existing DiT models without necessitating retraining. As a result, it effectively reduces computational costs while maintaining output quality.

  • We perform experiments to assess the performance of FORA, demonstrating notable improvements in inference speed and computational efficiency. Our findings underscore FORA’s potential to make high-performance generative models more suitable for real-time use.

2 Related work

2.1 Diffusion model

In the realm of generative models, diffusion models (Ho et al., 2020; Sohl-Dickstein et al., 2015) have emerged as a cornerstone for their remarkable ability to generate high-quality and diverse outputs. Initially conceptualized with the U-Net architecture, these models have shown impressive performance in areas such as image and video generation (Rombach et al., 2022; Ho et al., 2022b; Saharia et al., 2022). Despite their success, the scalability of U-Net-based diffusion models is inherently limited, presenting challenges for applications that demand larger model capacities for improved performance. In addressing this, the advent of diffusion transformers (DiT) (Peebles and Xie, 2023) marks a significant step forward. Leveraging the inherently scalable architecture of transformers (Vaswani et al., 2017), DiT offers a pathway to effectively scale up model capacities. A notable achievement in this direction is the breakthrough in generating long videos through the large-scale training of Sora (Brooks et al., 2024), which employs this transformer-based diffusion architecture for general-purpose simulations of the physical world. This highlights the significant impact of scaling transformer-based diffusion models.

However, the escalation in scale and capacity brings forth a crucial challenge: the extensive computation necessitated by large-scale transformer-based diffusion models. Specifically, the computational needs of these models, particularly concerning inference speed and resource consumption, pose a bottleneck for real-time applications and efficient deployment. This paper aims to address this pivotal issue by improving the inference efficiency of transformer-based diffusion models in a training-free manner, all while preserving their generative capabilities.

2.2 Efficiency enhancements of diffusion model

Recent research has focused on various strategies to enhance the efficiency of diffusion models, addressing both training and inference costs.

Training Efficiency. Training efficiency refers to the ability to reduce the computational resources, time, and energy required to train models without compromising their performance. Efficient training methods are critical for making advanced diffusion models more accessible and environmentally sustainable. One notable advancement in this area is the PixArt series (Chen et al., 2023, 2024b, 2024a), which introduces a training strategy decomposition that significantly reduces training costs and CO2 emissions while achieving image generation quality competitive with existing models. Additionally, DREAM (Zhou et al., 2023) addresses the misalignment between training and sampling in diffusion models by proposing a training framework that achieves several times faster training convergence.

Inference Efficiency. Inference efficiency focuses on accelerating the inference phase of diffusion models without compromising generative quality. Efficient techniques primarily involve reducing sampling steps (Song et al., 2020; Bao et al., 2021; Liu et al., 2021; Lu et al., 2022; Zheng et al., 2024; Song et al., 2023; Salimans and Ho, 2021) and model compression (Fang et al., 2024; He et al., 2024; Shang et al., 2023). Recently, recognizing the iterative nature of the diffusion sampling process, a new line of research has emerged around caching mechanisms to accelerate diffusion models during inference. DeepCache (Ma et al., 2024) leverages temporal redundancy by caching and retrieving features across adjacent denoising stages, reducing redundant computations and speeding up inference. Block caching (Wimbauer et al., 2023) reuses outputs from layer blocks of previous steps to cut down on redundant computations, though it requires a lightweight training stage to address artifacts. Agarwal et al. (2024) employs an approximate-caching technique that reuses intermediate noise states from previous image generations to skip initial denoising steps, achieving significant GPU compute savings, latency reduction, and cost savings. Wang et al. (2024) prunes redundant tokens at runtime using attention maps. Yu et al. (2024) is a cache-enabled sparse diffusion inference engine designed for text-to-image editing, identifying and updating only the affected regions of an image based on textual modifications. Zhang et al. (2024) caches and reuses cross-attention outputs once they converge, reducing computational complexity while maintaining performance. However, all these caching-based efficiency methods focus on conventional U-Net-based diffusion models and do not specifically consider transformer-based models. 
In contrast, our approach addresses the acceleration of transformer-based diffusion models through a caching mechanism that is both training-free and plug-and-play, aligning with the growing popularity and recent advancements in these models.

3 Fast-Forward Caching for Accelerated Sampling

Figure 2: Feature similarity analysis across time steps in DiT for attention and MLP layers. This visualization is based on a 250-step DDIM sampling process. The heatmap illustrates the high degree of similarity between features of consecutive time steps, particularly in the later stages of the diffusion process. This observation motivates our caching strategy, highlighting potential areas for computational optimization without significant loss of information.

Feature caching is a powerful technique that enhances speed and efficiency by storing and reusing computed information. This approach can reduce computational overhead and minimize latency. In this paper, we propose FORA (Fast-Forward Caching), which employs this technique to accelerate the sampling process in Diffusion Transformer (DiT) models.

3.1 Motivations

The development of FORA is grounded in a series of critical observations for DiT models. Our first key insight stems from the nature of the denoising process itself. As seen in Fig. 2, we observe a striking visual similarity among the outputs of consecutive time steps during the sampling process. This similarity correlates strongly with a close feature similarity at the layer level across these consecutive time steps, opening up possibilities for computational efficiency.

Secondly, we recognize that the fundamental structure of the diffusion model remains constant throughout the sampling process. The noise removal objective and the overall sampling procedure maintain their integrity from the initial noisy state to the final generated image. This consistency in the model’s behavior provides a stable foundation upon which we can build our caching strategies.

Our third crucial observation relates to the computational architecture of DiT models. We identify that the bulk of the computational overhead is concentrated in specific components of the model, namely the self-attention and Multi-Layer Perceptron (MLP) layers. These layers, while essential for the model’s performance, represent significant bottlenecks in terms of computational efficiency.
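The feature-similarity observation above can be probed with a small script. The following is a minimal sketch, assuming cosine similarity as the metric (the text does not name the metric behind Fig. 2) and using synthetic, slowly drifting features in place of real DiT activations. The function name `cosine_similarity_matrix` and all data are hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(features):
    """Pairwise cosine similarity between per-step feature tensors.

    features: array whose first axis indexes time steps; remaining axes
    are flattened into one feature vector per step.
    """
    flat = features.reshape(features.shape[0], -1)
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    unit = flat / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T

# Synthetic stand-in for layer outputs: a fixed base vector plus a slow random drift.
rng = np.random.default_rng(0)
num_steps, dim = 6, 32
base = rng.standard_normal(dim)
drift = np.cumsum(0.1 * rng.standard_normal((num_steps, dim)), axis=0)
features = base + drift

sim = cosine_similarity_matrix(features)
print(np.round(sim, 3))
```

With real activations, `features` would hold the output of one attention or MLP layer recorded at each sampling step, and the resulting matrix could be rendered as a heatmap.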

3.2 FORA

In light of the above observations, we design FORA to leverage the inherent similarities between consecutive time steps without compromising the model’s underlying architecture. Our approach focuses on caching and reusing features from the computationally intensive layers mentioned above. By doing so, we can significantly reduce redundant calculations while maintaining the integrity of the generation process. This caching mechanism is informed by both qualitative assessments of visual similarity and quantitative measurements of feature similarity across time steps. In the following, we adopt DiT (Peebles and Xie, 2023) as a representative network and demonstrate the caching mechanism on it. However, our method generalizes: any similar attention layer or other computationally expensive layer in a transformer backbone can use our caching strategy.

Figure 3: Static Caching in FORA for DiT architecture. The hyper-parameter $N$ determines the frequency of layer recomputation. At time step $t$, when $t \bmod N = 0$, the model recomputes and caches features in cache[Attn][k] and cache[MLP][k] for each layer $k$. For the subsequent $N-1$ time steps ($t-1$ to $t-N+1$), the model retrieves and reuses these cached features, avoiding redundant computations. This cycle repeats throughout the reverse diffusion process, balancing computational efficiency with output quality.

FORA implements a static caching mechanism, a straightforward yet powerful approach, to accelerate the sampling process in diffusion models. This method operates on a simple principle: recompute and cache features at regular intervals, and reuse these cached features for a predetermined number of subsequent time steps. At the core of this mechanism is a single hyperparameter $N$, which we call the cache interval. This interval determines how frequently the model recomputes and caches new features. Specifically, $N$ is an integer that can range from 1 to $T-1$, where $T$ is the total number of sampling time steps in the diffusion process. The static caching process unfolds as follows:

  • Initialization: At the start of the sampling process ($t = T$), the model computes and caches the features for all layers.

  • Caching Condition: For any given time step $t$, the model checks if $t$ is divisible by $N$ (i.e., $t \bmod N = 0$). If this condition is met, it triggers a recomputation and caching event.

  • Recomputation and Caching: When the caching condition is met, the model performs a full forward pass through all layers of the DiT block. The resulting features from the attention (Attn) and point-wise feed-forward (MLP) layers are then stored in the cache dictionary. Specifically, for each layer $k$, we store computed attention features in cache[Attn][k] and computed MLP features in cache[MLP][k].

  • Feature Reuse: For the subsequent $N-1$ time steps (i.e., until the next caching event), the model reuses the cached features instead of recomputing them. This means that for any layer $k$ during these steps, the model retrieves the features from cache[Attn][k] and cache[MLP][k] rather than performing the computationally expensive forward pass.

  • Cycle Repetition: This process of recomputation, caching, and reuse continues cyclically until the sampling process completes at $t = 0$.
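The steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the released implementation: the attention and MLP layers are replaced by hypothetical `tanh(x @ W)` stand-ins, the residual structure omits DiT's normalization and conditioning, the update `x - 0.05 * eps` is a placeholder rather than a DDIM step, and time steps are assumed to run from $T$ down to 1. Only the cache layout (`cache["Attn"][k]`, `cache["MLP"][k]`) and the recompute condition follow the description.

```python
import numpy as np

def make_counting_layer(weight, counter, name):
    """Toy layer (tanh of a linear map) that counts how often it is evaluated."""
    def layer(x):
        counter[name] += 1
        return np.tanh(x @ weight)
    return layer

def fora_forward(blocks, x, t, T, N, cache):
    """One pass through all blocks at time step t with static caching."""
    # Recompute at initialization (t == T) and whenever t is divisible by N.
    recompute = (t == T) or (t % N == 0)
    for k, (attn, mlp) in enumerate(blocks):
        if recompute:
            cache["Attn"][k] = attn(x)
        x = x + cache["Attn"][k]   # reuse cached attention features otherwise
        if recompute:
            cache["MLP"][k] = mlp(x)
        x = x + cache["MLP"][k]    # reuse cached MLP features otherwise
    return x

def sample(T=10, N=3, depth=2, dim=4, seed=0):
    rng = np.random.default_rng(seed)
    counter = {"attn": 0, "mlp": 0}
    blocks = [
        (make_counting_layer(0.1 * rng.standard_normal((dim, dim)), counter, "attn"),
         make_counting_layer(0.1 * rng.standard_normal((dim, dim)), counter, "mlp"))
        for _ in range(depth)
    ]
    cache = {"Attn": {}, "MLP": {}}
    x = rng.standard_normal((1, dim))
    for t in range(T, 0, -1):
        eps = fora_forward(blocks, x, t, T, N, cache)
        x = x - 0.05 * eps  # placeholder update, NOT a real sampler step
    return x, counter

x_final, calls = sample()
print(calls)  # layer evaluations with caching; without caching it would be T * depth = 20 each
```

With $T = 10$ and $N = 3$, the full computation runs at $t = 10, 9, 6, 3$, so each of the two toy blocks evaluates its attention and MLP stand-ins four times instead of ten.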

3.3 Discussion

The effectiveness of static caching hinges on the choice of the cache interval $N$. A smaller $N$ leads to more frequent recomputations, potentially preserving more accuracy but offering less computational savings. Conversely, a larger $N$ increases computational efficiency but may impact the quality of the generated outputs. In our experiments, we found that the optimal value of $N$ depends on the specific requirements of the task and the desired trade-off between speed and quality. Through extensive testing, we determined that capping $N$ at 7 provides a good balance. Beyond this value, we observed a significant degradation in the Fréchet Inception Distance (FID) score, indicating a decline in the quality of generated images.
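To make the trade-off concrete, the sketch below lists which time steps trigger a full computation for a given $N$. It assumes time steps are indexed from $T$ down to 1 and that $t = T$ always recomputes, as in the initialization rule; the helper name `static_cache_schedule` is hypothetical.

```python
def static_cache_schedule(T, N):
    """Split time steps T..1 into recompute steps and cache-reuse steps."""
    if not 1 <= N <= T - 1:
        raise ValueError("N must satisfy 1 <= N <= T - 1")
    recompute, reuse = [], []
    for t in range(T, 0, -1):
        if t == T or t % N == 0:
            recompute.append(t)
        else:
            reuse.append(t)
    return recompute, reuse

for N in (1, 3, 5):
    recompute, reuse = static_cache_schedule(10, N)
    frac = len(recompute) / 10
    print(f"N={N}: recompute at {recompute}, fraction of full passes = {frac:.1f}")
```

For large $T$ the fraction of full passes approaches $1/N$, which is why larger intervals save more computation but rely on older cached features.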

It’s worth noting that while static caching is remarkably effective in reducing computational overhead, it does so in a uniform manner across all time steps. This approach, while simple to implement and tune, may not fully capitalize on the varying degrees of similarity between features at different stages of the denoising process. This observation led us to explore more dynamic caching strategies, which we leave for future work.

4 Experiments

In this section, we present a comprehensive overview of our experimental setup, including the datasets and evaluation models used to validate our method. We also delve into the details of our experimental configuration, evaluation metrics, and results, providing comparisons with baseline methods.

ImageNet 256 × 256
Method Speed ↑ Retrain FID ↓ sFID ↓ IS ↑ Precision ↑ Recall ↑
IDDPM - 12.26 5.42 - 0.70 0.62
Spectral DPM - 10.60 - - - -
Diff-Pruning* - 9.27(9.16) 10.59 214.42(201.81) 0.87 0.30
ADM (Dhariwal and Nichol, 2021) 10.94 6.02 100.98 0.69 0.63
ADM-U 7.49 5.13 127.49 0.72 0.63
ADM-G 4.59 5.25 186.70 0.82 0.52
ADM-G, ADM-U 3.94 6.14 215.84 0.83 0.53
CDM (Ho et al., 2022a) 4.88 - 158.71 - -
LDM-8 (Rombach et al., 2022) 15.51 - 79.03 0.65 0.63
LDM-8-G 7.76 - 209.52 0.84 0.35
LDM-4-G (cfg=1.25) 3.95 - 178.22 0.81 0.55
LDM-4-G (cfg=1.50) 3.60 - 247.67 0.87 0.48
DiT-XL/2 9.62 6.85 121.50 0.67 0.67
DiT-XL/2-G (cfg=1.25) 3.22 5.28 201.77 0.76 0.62
DiT-XL/2-G (cfg=1.50) 1× 2.27 4.60 278.24 0.83 0.57
DiT-XL/2-G (cfg=1.50)@ 1× 2.30 4.56 276.56 0.83 0.58
Static - N=2 2.08× 2.40 5.15 269.74 0.82 0.58
Static - N=3 2.80× 2.82 6.04 253.96 0.80 0.58
Static - N=5 4.57× 4.97 9.15 222.97 0.76 0.59
Static - N=7 5.73× 9.80 13.62 180.54 0.68 0.59
Static - N=10 8.07× 25.24 22.08 110.03 0.53 0.56
Table 1: Comparative analysis of class-conditional image generation on ImageNet. We use DiT-XL/2-G (cfg=1.50)@ as our baseline model. Both the baseline and our method utilize 250 DDIM sampling steps. Our findings indicate that a static cache threshold of 3 provides an optimal balance between FID score and computational speed-up. *Results reproduced using DeepCache. @Results reproduced using FORA.

MS-COCO 256 × 256
| Method | Sampler | FID-30K (Inception) ↓ | FID-30K (CLIP) ↓ | Speed-up ↑ |
|---|---|---|---|---|
| PixArt-α@ | DPM-Solver | 28.15 | 12.14 | 1× |
| PixArt-α@ | IDDPM | 43.54 | 12.35 | 1× |
| FORA (Static N = 2) | DPM-Solver | 38.20 | 13.91 | 1.54× |
| FORA (Static N = 2) | IDDPM | 55.30 | 15.44 | 1.86× |
Table 2: Comparative analysis of text-conditional image generation using T2I models. We evaluate our model with static caching ($N = 2$) against the baseline PixArt-α, using COCO FID-30K scores (zero-shot). Hyper-parameters include a positional embedding scale of 1 for the 256 × 256 image size. We employ two image feature extractors for metric evaluation: Inception V3 and CLIP-ViT-B-32. Our caching approach demonstrates accelerated sampling time while maintaining comparable FID scores. The code is sourced from GigaGAN (Kang et al., 2023). @Results reproduced using FORA. Note: While PixArt-α (Chen et al., 2023) reports a FID score of 7.32, we omit this from the table due to lack of specificity regarding the feature extractor used and absence of reproduction code.

4.1 Datasets and encoders

To thoroughly evaluate our method, we target two distinct types of image generation tasks. For class-conditional image generation, we utilized the widely recognized ImageNet dataset (Deng et al., 2009), which offers a diverse range of image classes. To assess text-conditional image generation capabilities, we employed the MSCOCO dataset (Lin et al., 2014), known for its rich textual descriptions paired with images. For the text-conditional model, we leveraged the T5 large language model (Raffel et al., 2020) as our text encoder, enabling robust extraction of conditional features from textual inputs.

4.2 Evaluation models

Our method is designed to be versatile, compatible with both unconditional and conditional models. To ensure a fair and comprehensive evaluation, we focus on conditional models, using fixed seeds for all experiments to enable equitable qualitative comparisons. Furthermore, we demonstrate that our method is agnostic to input conditions and can seamlessly integrate with state-of-the-art fast samplers. With these objectives in mind, we select two models for our evaluation: DiT (Peebles and Xie, 2023) and PixArt-α (Chen et al., 2023). Both of these models employ a Transformer backbone within the diffusion probabilistic model framework, representing the latest advancements in the field.

  • DiT primarily focuses on ImageNet class conditioning, showcasing the improvements achieved by using a Transformer backbone over the traditional U-Net architecture. It utilizes the DDIM fast sampler to enhance its sampling efficiency. In our experiments with DiT, we maintain the classifier-free guidance strength at 1.5, as per the original implementation.

  • PixArt-α, on the other hand, specializes in text-conditional image generation. This model is particularly noteworthy for its emphasis on reducing carbon footprint by lowering training costs while maintaining high image quality. Beyond training efficiency, PixArt-α incorporates efficient sampling algorithms such as IDDPM and DPM-Solver to further its goals. For our experiments with PixArt-α, we keep the classifier-free guidance strength at 4.5, adhering to the model’s original configuration.

By evaluating FORA on the two models, we show its broad applicability and effectiveness. Our results indicate that FORA can be successfully integrated with these fast samplers, potentially reducing the carbon footprint even further by decreasing inference costs while preserving image quality.

4.3 Experimental setup

For the DiT model, we conduct inference using 250 time steps with DDIM, employing various thresholds for static caching. Following DiT’s evaluation protocol, we sample 50,000 images from the ImageNet dataset for our assessment. For PixArt-α, we perform qualitative evaluation using sample texts provided by the model’s authors. Quantitative evaluation uses the MS-COCO-30K dataset. Our method for PixArt-α employs IDDPM sampling by default, utilizing 100 time steps, along with DPM-Solver for 20 time steps. All experiments were conducted on a single A100 GPU to ensure consistency in hardware performance.

ImageNet 256 × 256
| #TimeSteps | Caching interval | Recomp% | origcomp | FID ↓ | sFID ↓ | IS ↑ | Precision ↑ | Recall ↑ |
|---|---|---|---|---|---|---|---|---|
| 100 | 2 | 50.0 | 2800 | 4.28 | 8.61 | 231.81 | 0.77 | 0.58 |
| 100 | 3 | 33.3 | 2800 | 9.51 | 14.18 | 183.83 | 0.69 | 0.58 |
| 250 | 3 | 33.3 | 7000 | 2.82 | 6.04 | 253.96 | 0.80 | 0.58 |
| 500 | 3 | 33.3 | 14000 | 2.33 | 4.82 | 267.94 | 0.82 | 0.58 |
| 500 | 6 | 16.7 | 14000 | 2.96 | 6.20 | 250.83 | 0.80 | 0.59 |
Table 3: Ablation on the relationship between sampling time steps and caching interval for class-conditional image generation on ImageNet. Recomp% represents the ratio of reduced cached recomputations to the actual number of attention and MLP computations. Results demonstrate that as total sampling time steps increase, proportionally adjusting the caching interval yields comparable results while further reducing the number of recomputations. This study highlights the scalability and efficiency of our caching method across various sampling configurations.
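The Recomp% and origcomp columns can be reproduced with simple arithmetic. The sketch below rests on an inference from the numbers rather than an explicit statement in the text: origcomp equals the number of time steps times 28, which matches the 28 transformer blocks of DiT-XL/2, and Recomp% equals $100/N$. The constant and function names are hypothetical.

```python
DIT_XL_BLOCKS = 28  # depth of DiT-XL/2; assumed to be what origcomp counts per step

def table3_columns(num_steps, interval):
    """Return (Recomp%, origcomp) for a given step count and caching interval."""
    origcomp = num_steps * DIT_XL_BLOCKS      # block evaluations without caching
    recomp_pct = round(100.0 / interval, 1)   # share of steps that recompute
    return recomp_pct, origcomp

for T, N in [(100, 2), (100, 3), (250, 3), (500, 3), (500, 6)]:
    print(T, N, table3_columns(T, N))
```

Under this reading, 500 steps with $N = 6$ perform roughly $14000 / 6 \approx 2333$ block evaluations, the same as 250 steps with $N = 3$ ($7000 / 3 \approx 2333$).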

4.4 Results

The qualitative comparison with both DiT and PixArt-α is illustrated in Fig. 1. For class-conditional image generation, our method demonstrates good quality reconstruction with a 3 to 4× speed-up. An interesting observation is that as the caching interval increases, the background gains equal importance to the class subject. While the class object is generated effectively, this opens up avenues for further research, such as exploring the significance of background features on the FID score and investigating whether masking the background can improve the score. The quantitative evaluation using the ImageNet dataset is presented in Table 1. The results indicate that a cache interval of 3 strikes a good balance between the FID score and the speed-up.

For text-conditional image generation, qualitative results in Fig. 1 applying FORA on top of IDDPM sampling show an artistic effect without compromising image quality. Interestingly, the DPM-Solver achieves softer image generation, enhancing realism by reducing the vibrancy present in the baseline. The quantitative results comparing our method’s speed-up against the T2I model are shown in Table 2, highlighting the efficiency and potential aesthetic benefits of our approach.

4.5 Ablation studies

We conducted two ablation studies to further understand the nuances of our method.

Caching Interval. We investigate how adjusting the caching interval relative to the total sampling time steps minimizes recomputations while maintaining consistent evaluation metric results. The findings, presented in Table 3, show that increasing the cache interval as sampling steps increase does not negatively impact the FID or IS score. This suggests that our method can maintain image quality even with larger caching intervals, potentially allowing for even greater computational savings.

Figure 4: Visual quality ablation study on caching interval $N$ for ImageNet class-conditional image generation. This study demonstrates the progressive impact of increasing caching intervals on image quality. Results indicate that beyond an interval of 5, there is a noticeable degradation in image detail and realism. As noted in Section 4.4, we observe that as caching intervals increase, background elements gain prominence equal to the main subject, potentially affecting the overall image composition and quality.

Figure 5: Visual quality and composition ablation study examining the interplay between guidance strength and caching interval $N$ on ImageNet class-conditional image generation. It reveals that a guidance strength of 1.5 frequently results in cluttered compositions with poor object-background separation and diminished overall image quality. In contrast, higher guidance strengths demonstrate improved image clarity and composition.

Conditional Guidance Strength. We analyze the effect of guidance strength on image quality and diversity. Table 4 highlights the importance of guidance strength for better image quality. The qualitative results are shown in Fig. 5. Interestingly, we find that both the guidance strength and the cache interval affect the composition of the image. As these parameters change, we observe that the focus of the subject can become diluted, with equal importance given to the background, or the image may contain several superimposed objects.

ImageNet 256 × 256 (250 sampling time steps)
| cfgscale | FID ↓ | sFID ↓ | IS ↑ | Precision ↑ | Recall ↑ |
|---|---|---|---|---|---|
| 1.5 | 2.82 | 6.04 | 253.96 | 0.80 | 0.58 |
| 3.0 | 12.58 | 6.79 | 451.28 | 0.93 | 0.35 |
| 4.5 | 17.19 | 10.91 | 483.50 | 0.93 | 0.24 |
Table 4: Ablation on cfg-scale for class-conditional image generation on ImageNet, focusing on Inception Score (IS) improvements. Results demonstrate a positive correlation between increasing guidance scale and higher IS, indicating enhanced image quality and diversity.
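For readers unfamiliar with the cfg-scale parameter, the sketch below shows the commonly used classifier-free guidance combination of conditional and unconditional noise predictions. The text does not spell out the formula, so this standard form is an assumption about what cfgscale controls here; the function name and toy inputs are hypothetical.

```python
import numpy as np

def classifier_free_guidance(eps_cond, eps_uncond, scale):
    """Blend conditional and unconditional noise predictions.

    scale = 1 returns the conditional prediction unchanged; larger values
    push the result further away from the unconditional prediction.
    """
    return eps_uncond + scale * (eps_cond - eps_uncond)

eps_cond = np.array([1.0, 2.0])
eps_uncond = np.array([0.0, 1.0])
for scale in (1.5, 3.0, 4.5):
    print(scale, classifier_free_guidance(eps_cond, eps_uncond, scale))
```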

5 Conclusion

We present a simple yet effective caching mechanism applicable to any transformer-based diffusion model. Our approach significantly reduces computational overhead, thereby accelerating the sampling and inference process. The proposed method is a training-free, plug-and-play solution that can be seamlessly integrated with fast samplers. Through extensive experimentation and ablation studies, we have demonstrated the efficiency and versatility of our method.

One limitation of our current approach is that the static caching mechanism cannot fully utilize the complex similarity patterns inherent in transformer-based diffusion models. These models often exhibit dynamic and context-dependent similarities across different stages of the diffusion process, which static caching may not optimally capture. A more sophisticated dynamic caching mechanism, capable of adapting to these evolving patterns in real-time, is desirable for future development. Such an approach could potentially lead to even greater efficiency gains and better preservation of image quality across various sampling scenarios.

References

  • Agarwal et al. [2024] Shubham Agarwal, Subrata Mitra, Sarthak Chakraborty, Srikrishna Karanam, Koyel Mukherjee, and Shiv Kumar Saini. Approximate caching for efficiently serving Text-to-Image diffusion models. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), pages 1173–1189, 2024.
  • Bao et al. [2021] Fan Bao, Chongxuan Li, Jun Zhu, and Bo Zhang. Analytic-dpm: an analytic estimate of the optimal reverse variance in diffusion probabilistic models. In International Conference on Learning Representations, 2021.
  • Brooks et al. [2024] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024. URL https://openai.com/research/video-generation-models-as-world-simulators.
  • Chen et al. [2023] Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis, 2023.
  • Chen et al. [2024a] Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation. arXiv preprint arXiv:2403.04692, 2024a.
  • Chen et al. [2024b] Junsong Chen, Yue Wu, Simian Luo, Enze Xie, Sayak Paul, Ping Luo, Hang Zhao, and Zhenguo Li. Pixart-δ: Fast and controllable image generation with latent consistency models. arXiv preprint arXiv:2401.05252, 2024b.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009. doi: 10.1109/CVPR.2009.5206848.
  • Dhariwal and Nichol [2021] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
  • Fang et al. [2024] Gongfan Fang, Xinyin Ma, and Xinchao Wang. Structural pruning for diffusion models. Advances in neural information processing systems, 36, 2024.
  • He et al. [2024] Yefei He, Luping Liu, Jing Liu, Weijia Wu, Hong Zhou, and Bohan Zhuang. Ptqd: Accurate post-training quantization for diffusion models. Advances in Neural Information Processing Systems, 36, 2024.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • Ho et al. [2022a] Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. J. Mach. Learn. Res., 23(1), Jan. 2022a. ISSN 1532-4435.
  • Ho et al. [2022b] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. Advances in Neural Information Processing Systems, 35:8633–8646, 2022b.
  • Kang et al. [2023] Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, and Taesung Park. Scaling up gans for text-to-image synthesis. In CVPR, 2023.
  • Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. CoRR, abs/1405.0312, 2014. URL http://dblp.uni-trier.de/db/journals/corr/corr1405.html#LinMBHPRDZ14.
  • Liu et al. [2021] Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. Pseudo numerical methods for diffusion models on manifolds. In International Conference on Learning Representations, 2021.
  • Lu et al. [2022] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787, 2022.
  • Ma et al. [2024] Xinyin Ma, Gongfan Fang, and Xinchao Wang. Deepcache: Accelerating diffusion models for free. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
  • Peebles and Xie [2023] William Peebles and Saining Xie. Scalable diffusion models with transformers. In IEEE/CVF International Conference on Computer Vision (ICCV), pages 4172–4182, October 2023.
  • Raffel et al. [2020] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020. URL http://jmlr.org/papers/v21/20-074.html.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
  • Saharia et al. [2022] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022.
  • Salimans and Ho [2021] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2021.
  • Shang et al. [2023] Yuzhang Shang, Zhihang Yuan, Bin Xie, Bingzhe Wu, and Yan Yan. Post-training quantization on diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1972–1981, 2023.
  • Sohl-Dickstein et al. [2015] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015.
  • Song et al. [2020] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2020.
  • Song et al. [2023] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In International Conference on Machine Learning, pages 32211–32252. PMLR, 2023.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Wang et al. [2024] Hongjie Wang, Difan Liu, Yan Kang, Yijun Li, Zhe Lin, Niraj K Jha, and Yuchen Liu. Attention-driven training-free efficiency enhancement of diffusion models. arXiv preprint arXiv:2405.05252, 2024.
  • Wimbauer et al. [2023] Felix Wimbauer, Bichen Wu, Edgar Schoenfeld, Xiaoliang Dai, Ji Hou, Zijian He, Artsiom Sanakoyeu, Peizhao Zhang, Sam Tsai, Jonas Kohler, et al. Cache me if you can: Accelerating diffusion models through block caching. arXiv preprint arXiv:2312.03209, 2023.
  • Yu et al. [2024] Zihao Yu, Haoyang Li, Fangcheng Fu, Xupeng Miao, and Bin Cui. Accelerating text-to-image editing via cache-enabled sparse diffusion inference. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 16605–16613, 2024.
  • Zhang et al. [2024] Wentian Zhang, Haozhe Liu, Jinheng Xie, Francesco Faccio, Mike Zheng Shou, and Jürgen Schmidhuber. Cross-attention makes inference cumbersome in text-to-image diffusion models. arXiv preprint arXiv:2404.02747, 2024.
  • Zheng et al. [2024] Kaiwen Zheng, Cheng Lu, Jianfei Chen, and Jun Zhu. Dpm-solver-v3: Improved diffusion ode solver with empirical model statistics. Advances in Neural Information Processing Systems, 36, 2024.
  • Zhou et al. [2023] Jinxin Zhou, Tianyu Ding, Tianyi Chen, Jiachen Jiang, Ilya Zharkov, Zhihui Zhu, and Luming Liang. Dream: Diffusion rectification and estimation-adaptive models. arXiv preprint arXiv:2312.00210, 2023.