LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation

Mushui Liu (Zhejiang University), Yuhang Ma (Fuxi AI Lab, Netease Inc.), Xinfeng Zhang (Fuxi AI Lab, Netease Inc.), Zhen Yang (Zhejiang University), Zeng Zhao (Fuxi AI Lab, Netease Inc.), Zhipeng Hu (Fuxi AI Lab, Netease Inc.), Bai Liu (Fuxi AI Lab, Netease Inc.), and Changjie Fan (Fuxi AI Lab, Netease Inc.)
Abstract.

Diffusion models have exhibited substantial success in text-to-image generation. However, they often encounter challenges when dealing with complex and dense prompts that involve multiple objects, attribute binding, and long descriptions. This paper proposes a framework called LLM4GEN, which enhances the semantic understanding ability of text-to-image diffusion models by leveraging the semantic representation of Large Language Models (LLMs). Through a specially designed Cross-Adapter Module (CAM) that combines the original text features of text-to-image models with LLM features, LLM4GEN can be easily incorporated into various diffusion models as a plug-and-play component and enhances text-to-image generation. Additionally, to facilitate semantic understanding of complex and dense prompts, we develop a LAION-refined dataset, consisting of 1 million (M) text-image pairs with improved image descriptions. We also introduce DensePrompts, which contains 7,000 dense prompts, to provide a comprehensive evaluation for the text-to-image generation task. With just 10% of the training data required by the recent ELLA, LLM4GEN significantly improves the semantic alignment of SD1.5 and SDXL, demonstrating increases of 7.69% and 9.60% in color on T2I-CompBench, respectively. Extensive experiments on DensePrompts also demonstrate that LLM4GEN surpasses existing state-of-the-art models in terms of sample quality, image-text alignment, and human evaluation. The project website is at: https://xiaobul.github.io/LLM4GEN/

Text-to-Image Diffusion Models, Large Language Models, Text-Image Alignment
*Equal Contribution.
Figure 1. Image generation comparison using short and dense prompts across SDXL (Podell et al., 2023), Playground v2 (Li et al., [n. d.]), PixArt-α (Chen et al., 2023b), and our proposed LLM4GENSDXL. The colored text denotes critical entities or attributes.

1. Introduction

Recently, diffusion models (Sohl-Dickstein et al., 2015; Dhariwal and Nichol, 2021; Song et al., 2020b; Yang et al., 2023; Chen et al., 2023a; Huang et al., 2022) have made significant progress in text-to-image generation models, such as Imagen (Saharia et al., 2022), DALL-E 2/3 (Ramesh et al., 2022; Betker et al., 2023), and Stable Diffusion (Rombach et al., 2022; Podell et al., 2023). However, they often encounter challenges in generating images given complex and dense prompt descriptions, such as attribute binding, orientation descriptions, and multiple objects. These limitations may stem from restrictions within the parameters and structure of the text encoder (Huang et al., 2023).

With the emergence of powerful linguistic representations from Large Language Models (LLMs), there has been an increasing trend in leveraging LLMs to aid in text-to-image generation. Current methods fall mainly into two categories: LLM-guidance models (Yang et al., 2024; Feng et al., 2022) and LLM-alignment models (Saharia et al., 2022; Wu et al., 2023b; Hu et al., 2024). LLM-guidance models aim to harness the robust encoding and decoding capabilities of LLMs and external models, such as a layout model, to enhance the representation of text embeddings before generating images, as illustrated in Fig. 2 (a). However, these methods keep the LLMs and external models separate from the text-to-image model, resulting in redundancy in both inference time and framework. LLM-alignment models, in contrast, use LLMs in place of the vanilla text encoder to capitalize on their superior representational power, as depicted in Fig. 2 (b). This strategy requires a substantial amount of training data to align the representation of LLMs with the diffusion model, imposing a heavy cost in training resources and time.

Figure 2. Architecture comparison between (a) LLM-guidance models (Lian et al., 2024; Yang et al., 2024), (b) LLM-alignment models (Wu et al., 2023b; Hu et al., 2024), and (c) our proposed LLM4GEN.

To tackle the aforementioned challenges, we propose an end-to-end text-to-image framework named LLM4GEN, implicitly harnessing the powerful semantic representations of LLMs to bolster the representation of the original text encoder for text-to-image GENeration, as depicted in Fig. 2 (c). Specifically, we design an efficient Cross-Adapter Module (CAM) to implicitly integrate the semantic representation of LLMs with original text encoders that have limited representational capabilities, such as the CLIP text encoder (Radford et al., 2021). We apply cross-attention between the CLIP text embedding and the representation of either decoder-only or encoder-only LLMs, e.g., Llama (Zhang et al., 2024) and T5 (Raffel et al., 2020), respectively, and then concatenate the fused text embedding with the CLIP text embedding. The CAM substantially improves the performance of text-to-image diffusion models, regardless of whether encoder-only or decoder-only LLMs are used. Furthermore, it preserves the representations produced by the original text encoders, thereby reducing the need for extensive training data. LLM4GEN can be seamlessly integrated into existing diffusion models such as SD1.5 (Rombach et al., 2022) and SDXL (Podell et al., 2023). As evidenced in Fig. 1, our proposed method exhibits strong performance in text-to-image generation.

To comprehensively assess the image generation capabilities of text-to-image models, we develop a comprehensive benchmark named DensePrompts, an extension of T2I-CompBench (Huang et al., 2023), which incorporates over 7,000 compositional prompts. The construction of this benchmark involves leveraging LLMs for complex text descriptions, followed by manual refinement. Results from performance metrics and human evaluations consistently demonstrate that LLM4GEN’s representational capability surpasses other existing methods. Additionally, we introduce a LAION-refined dataset, which is a subset of the open-source LAION (Schuhmann et al., 2021) dataset, comprising 1M text-image pairs with refined image descriptions. Both the DensePrompts and LAION-refined dataset will be available to the community for further research.

Overall, our contributions are as follows:

  • We propose an end-to-end text-to-image framework called LLM4GEN to enhance the text-to-image alignment of diffusion models.

  • We introduce a simple but efficient Cross-Adapter Module that integrates robust representation of LLMs into original text encoders that have limited representational capabilities, thereby enhancing semantic understanding for diffusion models.

  • We introduce DensePrompts, a comprehensive evaluation benchmark for text-to-image generation, and LAION-refined dataset, a meticulously curated training dataset.

  • Experiments show that LLM4GEN exhibits superior performance in sample quality, image-text alignment, and human evaluation compared with existing state-of-the-art models.

2. Related Work

2.1. Large Language Models

Large language models (LLMs) (Chang et al., 2023) have shown powerful generalization ability in various NLP tasks, e.g., text generation, question answering. LLMs are built on the transformer (Vaswani et al., 2017) architecture and guided by the scaling law. Recent LLMs, e.g., GPTs (Brown et al., 2020), LLaMA (Touvron et al., 2023), OPT (Zhang et al., 2022), BLOOM (Scao et al., 2022), GLM (Zeng et al., 2022), PaLM (Chowdhery et al., 2022) are all equipped with billions of parameters, enabling the intriguing capability for in-context learning and demonstrating excellent zero-shot performance across various tasks. Certain Multi-modal LLMs (MLLMs) (Achiam et al., 2023; Zhu et al., 2023; Bai et al., 2023; Wu et al., 2023a) have effectively integrated LLMs with other modalities, e.g., visual and audio, facilitating more intelligent interactions through the assistance of LLMs. BLIPs (Li et al., 2022, 2023a), LLaVA (Liu et al., 2023) enhance the synergy of visual understanding and language processing by projecting the visual output to the input layer of LLMs. (Pang et al., 2024) shows that the frozen LLMs can further integrate visual understanding. For the text-to-image generation, recent works (Lian et al., 2024; Koh et al., 2023; Sun et al., 2024; Yang et al., 2024) utilize LLMs to generate the refined text prompts or bounding box layouts (Li et al., 2023b) for high-quality image generation. However, these existing works only consider LLMs as a simple condition generator, e.g., text prompts or layout planning. In this paper, we harness the representation capabilities of LLMs to enhance text-to-image generation, underscoring the significance of their representational power beyond simple text output.

Figure 3. The overview of LLM4GEN. (a) The inference pipeline. (b) Cross-Adapter Module.

2.2. Text-to-Image Diffusion Models

Text-to-image generation aims to create images based on given textual descriptions. The diffusion-based models (Sohl-Dickstein et al., 2015; Song and Ermon, 2019; Ho et al., 2020; Song and Ermon, 2020; Song et al., 2020b) have demonstrated remarkable performance in image generation, providing enhanced stability and controllability. These models employ a forward process by adding Gaussian noise to input images and can generate high-quality images with intricate details and diversity through an inverse process from random Gaussian noise. GLIDE (Nichol et al., 2022) and Imagen (Saharia et al., 2022) utilize CLIP (Radford et al., 2021) text encoder to enhance the image-text alignment. Latent Diffusion Models (LDMs) (Rombach et al., 2022) have been proposed to transfer the diffusion process from pixel to latent space, improving efficiency and image quality. Recent models such as SD-XL (Podell et al., 2023), DALL-E 3 (Betker et al., 2023), and Dreambooth (Ruiz et al., 2023) have significantly enhanced image quality and text-image alignment using various perspectives, such as training strategies and scaling training data. Despite these notable advancements, generating high-fidelity images aligned with complex and dense textual prompts remains challenging. In this paper, we propose LLM4GEN that implicitly leverages the robust representation capabilities of LLMs to facilitate image generation from textual descriptions.

3. Method

3.1. Preliminaries

Diffusion models convert standard Gaussian noise into realistic images through two processes: the forward process and the reverse process (Ho et al., 2020). In the forward process, Gaussian noise is gradually added to the data $x_0 \sim q(x_0)$ over $T$ steps to obtain a Gaussian distribution. The reverse process aims to generate the data from an initial Gaussian noise as follows (Ho et al., 2020):

(1) $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$

where $\mu_\theta$ can be obtained by predicting the Gaussian noise $\epsilon_\theta(x_t, t)$ as follows (Ho et al., 2020):

(2) $\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(x_t, t) \right)$

where $\beta_t$ is the noise variance at step $t$, $\alpha_t = 1 - \beta_t$, $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, and $\epsilon_\theta(x_t, t)$ is the neural network's prediction of the noise added by the forward process.
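To make Eqs. 1 and 2 concrete, the following minimal sketch computes the reverse-step mean from a noise prediction. It assumes the standard DDPM parameterization of Ho et al. (2020), with $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s \le t} \alpha_s$. The linear schedule values and all function names are illustrative assumptions, not taken from this paper.

```python
import math


def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear schedule beta_1..beta_T (endpoint values are assumed)."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]


def cumulative_alphas(betas):
    """Return alpha_t = 1 - beta_t and alpha_bar_t = prod_{s<=t} alpha_s."""
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)
    return alphas, alpha_bars


def forward_sample(x0, eps, alpha_bar_t):
    """Closed-form forward process: x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps."""
    return [
        math.sqrt(alpha_bar_t) * x + math.sqrt(1.0 - alpha_bar_t) * e
        for x, e in zip(x0, eps)
    ]


def posterior_mean(x_t, eps_pred, beta_t, alpha_t, alpha_bar_t):
    """Mean of the reverse step (Eq. 2) given the predicted noise eps_pred."""
    coef = beta_t / math.sqrt(1.0 - alpha_bar_t)
    return [(x - coef * e) / math.sqrt(alpha_t) for x, e in zip(x_t, eps_pred)]
```

A useful sanity check: at the first step, where $\bar{\alpha}_1 = \alpha_1$, feeding the true noise into `posterior_mean` recovers $x_0$ up to floating-point error.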

Conditional diffusion models (Ho and Salimans, 2022) can be steered with classifier-free guidance, which replaces the explicit classifier with an implicit one, so that neither a separate classifier nor its gradient has to be computed. With classifier-free guidance, the predicted noise can be described as (Ho et al., 2020):

(3) $\bar{\epsilon}_\theta(x_t, t, y) = (\omega + 1)\,\epsilon_\theta(x_t, t, y) - \omega\,\epsilon_\theta(x_t, t)$

where $\omega$ is the guidance scale and $y$ is the condition added to guide the diffusion model.
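As an illustration of Eq. 3, the sketch below combines a conditional and an unconditional noise prediction element-wise. The function name and the list-based representation are assumptions made for brevity; a real pipeline would operate on tensors.

```python
def classifier_free_guidance(eps_cond, eps_uncond, omega):
    """Eq. 3: (omega + 1) * eps(x_t, t, y) - omega * eps(x_t, t), element-wise."""
    return [(omega + 1.0) * c - omega * u for c, u in zip(eps_cond, eps_uncond)]
```

With `omega = 0` the result is the purely conditional prediction. Larger values move the prediction further away from the unconditional one.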

Our work utilizes latent diffusion models (LDMs) (Rombach et al., 2022), consisting of three components: a text encoder such as CLIP (Radford et al., 2021) for extracting text embeddings; a variational autoencoder (VAE) (Van Den Oord et al., 2017) with an encoder $\mathcal{E}$ that encodes images into a low-dimensional latent space and a decoder $\mathcal{D}$ that reconstructs images from the encoded latent vectors; and a UNet for predicting noise during the diffusion process. The encoder $\mathcal{E}$ maps the input image to the efficient, low-dimensional latent space, from which $z_t$ is obtained. The underlying UNet (Ronneberger et al., 2015) is constructed using 2D convolutional layers and is trained with a reweighted bound that focuses on the most perceptually relevant features, as follows:

(4) $L_{LDM} := \mathbb{E}_{\mathcal{E}(x),\, \epsilon \sim \mathcal{N}(0, 1),\, t} \left[ \left\| \epsilon - \epsilon_\theta(z_t, t) \right\|_2^2 \right]$

where $z_t$ can be obtained from the encoder $\mathcal{E}$, and latent vectors from $p(z)$ can be decoded to images through the decoder $\mathcal{D}$. In this paper, we address the limited representation of CLIP as a text encoder by leveraging the capabilities of large language models (LLMs) to enhance the text encoder of the LDMs.
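The objective in Eq. 4 can be approximated by Monte Carlo sampling over latents, timesteps, and noise. The sketch below is a toy illustration under stated assumptions: latents are plain lists already produced by the encoder, `alpha_bars` is a hypothetical cumulative noise schedule, and `eps_model` stands in for the UNet. None of these names come from the paper.

```python
import math
import random


def squared_l2(a, b):
    """Squared L2 distance between two vectors given as lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ldm_loss_estimate(latents, alpha_bars, eps_model, rng, num_samples=64):
    """Monte Carlo estimate of Eq. 4 over random latents, timesteps, and noise."""
    total = 0.0
    for _ in range(num_samples):
        z0 = rng.choice(latents)                  # latent z_0 = E(x)
        t = rng.randrange(len(alpha_bars))        # uniformly sampled timestep
        eps = [rng.gauss(0.0, 1.0) for _ in z0]   # eps ~ N(0, 1)
        a_bar = alpha_bars[t]
        z_t = [
            math.sqrt(a_bar) * z + math.sqrt(1.0 - a_bar) * e
            for z, e in zip(z0, eps)
        ]
        total += squared_l2(eps, eps_model(z_t, t))
    return total / num_samples


# Toy usage: a predictor that always outputs zeros (assumed for illustration).
toy_alpha_bars = [0.9, 0.5, 0.1]
toy_loss = ldm_loss_estimate(
    [[0.2, -0.4], [1.0, 0.3]],
    toy_alpha_bars,
    lambda z_t, t: [0.0 for _ in z_t],
    random.Random(0),
)
```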

3.2. LLM4GEN

3.2.1. Framework

The proposed LLM4GEN, which contains a Cross-Adapter Module (CAM) and the UNet, is illustrated in Fig. 3 (a). In this paper, we explore stable diffusion (Rombach et al., 2022; Podell et al., 2023) as the base text-to-image diffusion model, and the vanilla text encoder is from CLIP (Radford et al., 2021). LLM4GEN leverages the strong capability of LLMs to assist in text-to-image generation. The CAM extracts the representation of a given prompt via the combination of LLM and CLIP text encoder. The fused text embedding is enhanced by leveraging the pre-trained knowledge of LLMs through the simple yet effective CAM. By feeding the fused text embedding, LLM4GEN iteratively denoises the latent vectors with the UNet and decodes the final vector into an image with the VAE.

3.2.2. Cross-Adapter Module

The CAM connects the LLMs and the CLIP text encoder using a cross-attention layer, followed by concatenation with the representation of the CLIP text encoder. The last hidden state of the LLMs is extracted as the LLM feature $c_l$. The feature of the CLIP text encoder is denoted as $c_t$, and we perform cross-attention to fuse them:

(5) $Q = W_q(c_l), \quad K = W_k(c_t), \quad V = W_v(c_t)$
(6) $c_l' = \operatorname{CrossAttention}(Q, K, V) = \operatorname{softmax}\left(Q \cdot K^T\right) \cdot V$

where $W_q$, $W_k$, $W_v$ are the trainable linear projection layers. The output embedding dimension is the same as that of the CLIP text encoder. Then the final fused text embedding of the CAM is:

(7) $c_t' = \operatorname{Concat}(\lambda * c_l', c_t)$

where Concat denotes concatenation in the sequence dimension, and $\lambda$ is the balance factor.
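The data flow of Eqs. 5 to 7 can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not the authors' implementation: the toy dimensions, the random initialization, and the value of the balance factor are assumptions, and the attention follows Eq. 6 as written (no scaling term).

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(a):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def cross_adapter(c_l, c_t, W_q, W_k, W_v, lam):
    """Fuse LLM features c_l (N_l x d_l) with CLIP features c_t (N_t x d_t)."""
    Q = matmul(c_l, W_q)                           # Eq. 5: queries from the LLM
    K = matmul(c_t, W_k)                           # Eq. 5: keys from CLIP
    V = matmul(c_t, W_v)                           # Eq. 5: values from CLIP
    attn = softmax_rows(matmul(Q, transpose(K)))   # (N_l x N_t) attention weights
    c_l_fused = matmul(attn, V)                    # Eq. 6: (N_l x d_t)
    scaled = [[lam * v for v in row] for row in c_l_fused]
    return scaled + [list(row) for row in c_t]     # Eq. 7: concat along the sequence


# Toy usage with assumed sizes: 4 LLM tokens (dim 6) and 3 CLIP tokens (dim 5).
rng = random.Random(0)
toy_c_l, toy_c_t = random_matrix(4, 6, rng), random_matrix(3, 5, rng)
toy_W_q = random_matrix(6, 5, rng)
toy_W_k, toy_W_v = random_matrix(5, 5, rng), random_matrix(5, 5, rng)
fused = cross_adapter(toy_c_l, toy_c_t, toy_W_q, toy_W_k, toy_W_v, lam=0.5)
```

The fused sequence has $N_l + N_t$ tokens in the CLIP embedding dimension, and its last $N_t$ rows are the unmodified CLIP features. This mirrors how the module keeps the original text encoder's representation intact.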

Algorithm 1 LLM4GEN Pipeline
Input: Pre-trained UNet $\epsilon$, pre-trained text encoder $T_\phi$, pre-trained LLM $T_L$, Cross-Adapter Module $M$, LLaVA-7B $A$, training image-text pairs $\mathbb{S} = \{\mathbb{I}, \mathbb{P}\}$.
Offline Process: Apply $A$ to enrich the captions of $\mathbb{S}$, get $\hat{\mathbb{P}} = A(\mathbb{I})$ to replace the original $\mathbb{P}$.
→ Begin Training.
Freeze $T_\phi$ and $T_L$, training $M$ and $\epsilon$.
while $L_{simple}$ not converged do
    Sample $p$ from $\mathbb{P}$; $t = T$
    $c_t' = M(T_\phi(p), T_L(p))$ using Eq. 7.
    Calculate $L_{simple}$ using Eq. 9.
    Backward $L_{simple}$ and update $\epsilon$, $M$.
end while
→ End Training.

Then, we feed $c_t'$ into the UNet:

(8) $x = \operatorname{CA}(x, c_t') = \lambda * \operatorname{CA}(x, c_l') + \operatorname{CA}(x, c_t)$

where $x$ denotes the latent noise and CA is the cross-attention module within the UNet, which receives $x$ as the query and $c_t'$ as the key and value.

Overall, our designed Cross-Adapter Module implicitly injects the strong representation of LLMs in a residual fusion manner, without requiring extensive training data and resources to condition the latent vectors on text embeddings. Notably, our LLM4GEN is compatible with both decoder-only and encoder-only LLMs, and we evaluate on Llama-2 7B/13B (Touvron et al., 2023) and T5-XL (Raffel et al., 2020) in further experiments.

3.2.3. Training Objectives

Based on the framework described above, the training loss of LLM4GEN is formulated as:

(9) $L_{\text{simple}} = \mathbb{E}_{\boldsymbol{x}_0, \boldsymbol{\epsilon}, \boldsymbol{c}_t, \boldsymbol{c}_l, t} \left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\left(\boldsymbol{x}_t, \boldsymbol{c}_t, \boldsymbol{c}_l, t\right) \right\|^2$

Besides, we randomly drop text embedding conditions in the training stage to enable classifier-free guidance in the inference stage:

(10) $\hat{\boldsymbol{\epsilon}}_\theta\left(\boldsymbol{x}_t, \boldsymbol{c}_t, \boldsymbol{c}_l, t\right) = \omega\, \boldsymbol{\epsilon}_\theta\left(\boldsymbol{x}_t, \boldsymbol{c}_t, \boldsymbol{c}_l, t\right) + (1 - \omega)\, \boldsymbol{\epsilon}_\theta\left(\boldsymbol{x}_t, t\right)$
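Eq. 10 and Eq. 3 express the same guidance rule under two conventions for the scale. The short derivation below uses $\epsilon_c$ and $\epsilon_u$ as shorthand for the conditional and unconditional predictions, and writes the scale of Eq. 3 as $\omega'$ to keep the two apart. This notation is introduced here only for illustration.

```latex
\begin{align*}
\hat{\epsilon}_\theta &= \omega\,\epsilon_c + (1-\omega)\,\epsilon_u
  = \epsilon_u + \omega\,(\epsilon_c - \epsilon_u)
  && \text{(Eq. 10)} \\
\bar{\epsilon}_\theta &= (\omega'+1)\,\epsilon_c - \omega'\,\epsilon_u
  = \epsilon_u + (\omega'+1)\,(\epsilon_c - \epsilon_u)
  && \text{(Eq. 3)}
\end{align*}
```

The two forms coincide when $\omega = \omega' + 1$. In particular, $\omega = 1$ in Eq. 10 (equivalently $\omega' = 0$ in Eq. 3) recovers the purely conditional prediction.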

The proposed LLM4GEN is further illustrated in Algorithm 1.

Figure 4. Pipeline of data construction. (a) The construction of DensePrompts benchmark. (b) The construction of the training dataset. We use GPT-4 (Achiam et al., 2023) and LLaVA (Liu et al., 2023) as the caption models, respectively.
Figure 5. Statistics of DensePrompts benchmark.

4. Dataset Construction

4.1. DensePrompts Benchmark

A comprehensive benchmark is crucial for evaluating the image-text alignment of generated images. Current benchmarks, e.g., MSCOCO (Lin et al., 2014) and T2I-CompBench (Huang et al., 2023), primarily consist of concise textual descriptions and are not comprehensive enough to describe a diverse range of objects. Thus, we introduce a new comprehensive and complex benchmark called DensePrompts, comprising lengthy textual descriptions.

Initially, we collect 100 images from the Internet, comprising 50 real and 50 generated images, each with intricate details. Leveraging the robust image comprehension capabilities of GPT-4V (OpenAI, 2023), we use it to provide detailed descriptions for these 100 images, encompassing object attributes and their relationships, thereby generating comprehensive prompts rich in semantic details. As depicted in Fig. 4 (a), we then employ GPT-4 (Achiam et al., 2023) to produce a large number of long texts based on the generated prompts mentioned above. DensePrompts provides more than 7,000 extensive prompts with an average length of more than 40 words. Word statistics of DensePrompts are outlined in Fig. 5. To assess performance, the DensePrompts benchmark incorporates CLIP Score (Radford et al., 2021) and Aesthetic Score (Schuhmann, [n. d.]). Combining our proposed DensePrompts with T2I-CompBench, we establish a comprehensive evaluation for text-to-image generation.

Table 1. Evaluation results (%) on T2I-CompBench (Huang et al., 2023). The higher is better, and the best results are highlighted in bold.

| Model | Attribute Binding: Color ↑ | Attribute Binding: Shape ↑ | Attribute Binding: Texture ↑ | Object Relationship: Spatial ↑ | Object Relationship: Non-Spatial ↑ | Complex ↑ |
|---|---|---|---|---|---|---|
| Composable Diffusion (Liu et al., 2022) | 40.63 | 32.99 | 36.45 | 8.00 | 29.80 | 28.98 |
| Structured Diffusion (Feng et al., 2022) | 49.90 | 42.18 | 49.00 | 13.86 | 31.11 | 33.55 |
| Attn-Exct v2 (Chefer et al., 2023) | 64.00 | 45.17 | 59.63 | 14.55 | 31.09 | 34.01 |
| GORS (Huang et al., 2023) | 66.03 | 47.85 | 62.87 | 18.15 | 31.93 | 33.28 |
| DALL-E 2 (Ramesh et al., 2022) | 57.50 | 54.64 | 63.74 | 12.83 | 30.43 | 36.96 |
| PixArt-α (Chen et al., 2023b) | 68.86 | 55.82 | 70.44 | 20.82 | 31.79 | 41.17 |
| ELLASDXL (Hu et al., 2024) | 72.60 | 56.34 | 66.86 | 22.14 | 30.69 | - |
| SD1.5 (Rombach et al., 2022) | 37.65 | 35.76 | 41.56 | 12.46 | 30.79 | 30.80 |
| LLM4GENSD1.5 | 45.34 | 43.28 | 51.52 | 13.12 | 30.94 | 32.33 |
| Δ (Margin) | +7.69 | +7.52 | +9.94 | +0.66 | +0.15 | +1.53 |
| SDXL (Podell et al., 2023) | 63.69 | 54.08 | 56.37 | 20.32 | 31.10 | 40.91 |
| LLM4GENSDXL | 73.29 | 57.34 | 67.86 | 22.59 | 31.94 | 41.23 |
| Δ (Margin) | +9.60 | +3.26 | +11.52 | +2.27 | +0.84 | +0.33 |

Table 2. Quantitative comparison on text-to-image generation models on the subset of MSCOCO (Lin et al., 2014) dataset.

| Method | FID ↓ | IS ↑ | CLIP Score (%) ↑ |
|---|---|---|---|
| SD1.5 (Rombach et al., 2022) | 26.89 | 32.24 | 28.66 |
| SD1.5 (ft) | 25.48 | 33.53 | 29.10 |
| LLM4GENSD1.5 | 25.20 | 34.24 | 29.45 |
| SDXL (Podell et al., 2023) | 24.75 | 34.91 | 30.10 |
| LLM4GENSDXL | 24.21 | 35.10 | 30.91 |

4.2. Training Dataset Construction

Current text-to-image diffusion models use text-image pairs collected from the Internet, where the text descriptions are often brief and disconnected. This leads to a weak correlation between images and text, limited dense semantic information, and necessitates a large number of image-text pairs for model convergence.

In this paper, we propose a 1M LAION-refined dataset, including the refined image descriptions of the open-source LAION (Schuhmann et al., 2021) dataset. We select 1M images from LAION (Schuhmann et al., 2021) with an aesthetic score (Schuhmann, [n. d.]) exceeding 6.5, a minimum short edge resolution of 512 pixels, and a maximum aspect ratio of 1.5. We utilize LLaVA-7B (Liu et al., 2023) to generate dense and highly descriptive captions for each image. The process of dataset construction is illustrated in Fig. 4 (b).
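The selection criteria above translate into a simple per-image filter. The sketch below is illustrative only: the `ImageRecord` fields are hypothetical and do not reflect the actual LAION metadata schema, and the aspect ratio is assumed to mean long edge divided by short edge.

```python
from dataclasses import dataclass


@dataclass
class ImageRecord:
    """Hypothetical metadata for one candidate image (field names are assumed)."""
    url: str
    width: int
    height: int
    aesthetic_score: float


def keep_for_training(rec, min_aesthetic=6.5, min_short_edge=512, max_aspect_ratio=1.5):
    """Apply the three criteria: aesthetic score, short-edge resolution, aspect ratio."""
    short_edge = min(rec.width, rec.height)
    long_edge = max(rec.width, rec.height)
    if short_edge < min_short_edge:
        return False                      # short edge must be at least 512 pixels
    if rec.aesthetic_score <= min_aesthetic:
        return False                      # score must exceed 6.5
    return long_edge / short_edge <= max_aspect_ratio
```

Images that pass the filter would then be re-captioned by the caption model, as in Fig. 4 (b).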

5. Experiments

Figure 6. Aesthetic Score and CLIP Score (%) on DensePrompts benchmark.

5.1. Experimental Details

Framework. In this paper, we explore LLM4GEN based on SD1.5 (Rombach et al., 2022) and SDXL (Podell et al., 2023), denoted as LLM4GENSD1.5 and LLM4GENSDXL. We utilize T5-XL (Raffel et al., 2020) and CLIP text encoder as the text tower. The sequence length of the LLMs is set to 128, enhancing the representation of conditional text embedding.

Implementation Details. We employ the AdamW (Loshchilov and Hutter, 2017) optimizer to train our models. The learning rates are set to 2e-5 and 1e-5 for LLM4GENSD1.5 and LLM4GENSDXL, respectively. The batch size is set to 256. The training steps are set to 20k and 40k, respectively. Additionally, we further train LLM4GENSDXL using 20K high-quality data samples at 1024 resolution. The final LLM4GENSD1.5 model is trained on 8 A100 (80G) GPUs for 2 days, and the LLM4GENSDXL model for 4 days. During inference, we utilize the DDIM sampler (Song et al., 2020a), setting the number of time steps to 50 and the classifier-free guidance scale to 7.5.

Evaluation Benchmarks. We comprehensively evaluate the proposed LLM4GEN on four primary benchmarks:

  1. MSCOCO Dataset (Lin et al., 2014). We randomly select 30k images from the MSCOCO dataset and assess both the sample quality and the image-text alignment of the generated images. The Fréchet Inception Distance (FID) (Heusel et al., 2017), Inception Score (IS) (Salimans et al., 2016), and CLIP Score (Radford et al., 2021) are used for evaluation.

  2. T2I-CompBench (Huang et al., 2023). We employ various compositional prompts to assess textual attributes, including color, shape, and texture, as well as attribute binding.

  3. DensePrompts. Our proposed benchmark involves extensive textual descriptions comprising over 7,000 dense prompts. The CLIP Score and Aesthetic Score are used for evaluation.

  4. User Study. We randomly select 100 prompts from our proposed DensePrompts benchmark and 100 prompts from T2I-CompBench, and enlist 20 participants for the user study.
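To make the CLIP Score used in benchmarks (1) and (3) concrete, the sketch below computes it from embeddings that are assumed to be already extracted by a CLIP model. It assumes the common definition as the cosine similarity between image and text embeddings, clipped at zero and reported on a 0 to 100 scale; the exact scaling used in the paper is not stated here, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


def clip_score_percent(image_emb, text_emb):
    """Assumed CLIP Score: 100 * max(cos(image, text), 0)."""
    return 100.0 * max(cosine_similarity(image_emb, text_emb), 0.0)


# Toy 3-dimensional "embeddings" (real CLIP embeddings are much wider).
print(clip_score_percent([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))
```

A benchmark-level score would then be the average of this value over all prompt and image pairs.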

5.2. Performance Comparisons and Analysis

5.2.1. Fidelity assessment on MSCOCO benchmark

Experimental results on MSCOCO benchmark are shown in Tab. 2. LLM4GEN notably enhances the sample quality and image-text alignment, resulting in improvements of 1.79 and 0.54 on FID compared to SD1.5 and SDXL, respectively. Furthermore, we assess the performance of SD1.5 after extensive fine-tuning with the same training dataset. This modified version, SD1.5 (ft), surpasses the original SD1.5, yet LLM4GENSD1.5 still exhibits superior performance over SD1.5 (ft). This underscores the potent representation of our proposed LLM4GEN and its contribution to text-to-image generation.

Figure 7. Results on user study regarding the sample quality and image-text alignment of different models.

5.2.2. Evaluation on T2I-CompBench

For the T2I-CompBench comparison, we select recent text-to-image generative models, e.g., Composable Diffusion (Liu et al., 2022), Structured Diffusion (Feng et al., 2022), Attn-Exct v2 (Chefer et al., 2023), GORS (Huang et al., 2023), DALLE 2 (Ramesh et al., 2022), PixArt-α (Chen et al., 2023b), ELLASDXL (Hu et al., 2024), SD1.5 (Rombach et al., 2022), and SDXL (Podell et al., 2023). The experimental results in Tab. 1 demonstrate the strong performance of LLM4GENSDXL on T2I-CompBench, underlining its advances in attribute binding, object relationships, and the rendering of complex compositions. LLM4GEN shows considerable improvement over the SDXL baseline in color, shape, and texture, with gains of up to +9.60% in color, +11.52% in texture, and +3.26% in shape. LLM4GENSDXL also makes considerable progress in both spatial and non-spatial evaluations, with lifts of 2.27% and 0.84%, respectively. Furthermore, compared with the recent PixArt-α, which employs T5-XL as its text encoder, LLM4GENSDXL is ahead in several aspects, such as a notable 3.43% lead in the color metric. Moreover, LLM4GENSDXL outperforms ELLASDXL. These results verify that LLM representations improve the sample quality and image-text alignment of diffusion models.

Figure 8. A comparative analysis of LLM4GEN and other state-of-the-art diffusion models using PartiPrompts (Yu et al., 2022) and our proposed DensePrompts as prompts. The last row represents the prompts used.

5.2.3. Evaluation on DensePrompts

We compare our LLM4GEN with PixArt-α (Chen et al., 2023b), Playground v2 (Li et al., [n. d.]), SD1.5 (Rombach et al., 2022), and SDXL (Podell et al., 2023) on our DensePrompts benchmark. As illustrated in Fig. 6, LLM4GENSDXL achieves the highest Aesthetic Score and CLIP Score among these models. PixArt-α outperforms SDXL, which we attribute to its use of the T5-XL text encoder for processing dense prompts. LLM4GEN demonstrates a strong ability to understand and interpret dense prompts, leading to generated images with high sample quality and image-text alignment. We attribute this improvement to the powerful representation of LLMs, which effectively adapts the original CLIP text encoder through the Cross-Adapter Module.

To thoroughly evaluate our proposed LLM4GEN framework, we present the qualitative results on the short prompts provided by PartiPrompts (Yu et al., 2022) in the first 4 columns and on the dense prompts provided by DensePrompts in the last 3 columns in Fig. 8. The results indicate that our proposed LLM4GENSD1.5 and LLM4GENSDXL exhibit strong text-image alignment and superior dense prompt generation compared to the recent PixArt-α and Playground v2, especially in handling multiple objects and attribute binding.

5.2.4. User Study

We conduct the user study on various combinations of existing methods and LLM4GENSDXL. For each pairing, we assess two criteria: sample quality and image-text alignment. Users are tasked with evaluating the aesthetic appeal and semantic understanding of images with identical text to determine the superior one based on these assessment criteria. Subsequently, we compute the percentage scores for each model, as shown in Fig. 7. The results showcase our LLM4GENSDXL exhibits comparative advantages over both SD1.5 and SDXL. Specifically, LLM4GENSDXL achieves 60.3% and 66.5% higher voting preferences compared to SDXL in terms of Aesthetic and Semantic, respectively. Notably, LLM4GENSDXL also delivers competitive results when compared to DALL-E 3.

Table 3. Impact (%) of the designed Cross-Adapter Module on attribute binding. The symbol ✗ denotes cases where the model fails to generate images without the original text encoder.

| Module | Color ↑ | Shape ↑ | Texture ↑ |
|---|---|---|---|
| (1) SD1.5 (Rombach et al., 2022) | 37.65 | 35.76 | 41.56 |
| (2) MLP or CrossAttention | ✗ | ✗ | ✗ |
| (3) MLP + Concat | 40.24 | 38.23 | 44.39 |
| (4) CrossAttention + Concat | 45.34 | 43.28 | 51.52 |

5.3. Ablation Studies

Table 4. Impact (%) of different LLMs on attribute binding, based on SD1.5.

| LLMs | Color ↑ | Shape ↑ | Texture ↑ |
|---|---|---|---|
| SD1.5 (Rombach et al., 2022) | 37.65 | 35.76 | 41.56 |
| Llama-2/7B (Touvron et al., 2023) | 43.21 | 40.12 | 48.91 |
| Llama-2/13B (Touvron et al., 2023) | 43.98 | 41.03 | 49.21 |
| T5-XL (Raffel et al., 2020) | 45.34 | 43.28 | 51.52 |

Impact of Cross-Adapter Module. Due to limited computing resources, we evaluate the impact of various architectural choices on SD1.5, as outlined in Tab. 3. Our configurations explore different methods for integrating LLM embeddings: (1) the baseline SD1.5 model; (2) MLP or CrossAttention, which uses a simple linear layer or a cross-attention layer to transform the LLM embeddings; (3) MLP + Concat, in which the LLM embeddings are projected to the same dimension as the original text embeddings before concatenation; and (4) CrossAttention + Concat, our approach detailed in Sec. 3.2. The results show that configuration (2) is incapable of generating images from text, likely due to a misalignment between the LLM embeddings and the latent vector, which would require substantial resources to align, as mentioned in (Hu et al., 2024) and (Wu et al., 2023b). Interestingly, simply concatenating the projected LLM embeddings with the original text embeddings (configuration 3) provides a significant boost over base SD1.5, with a 2.59% improvement in color. This suggests that direct representation alignment between LLMs and the latent vector is challenging, whereas enhancing the original text embeddings with LLM embeddings is sufficient to improve image-text alignment. Furthermore, our Cross-Adapter Module (configuration 4) achieves a further 5.10% increase in color over configuration 3. This highlights the benefit of the Cross-Adapter Module in enriching the representation of the original text encoder and improving the image-text alignment of generated images.
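Configuration (3) can be sketched in a few lines: project the LLM token embeddings to the width of the original text embeddings, then concatenate along the sequence axis. The sketch is a toy under stated assumptions: the widths and lengths are made up, the projection is a single bias-free linear layer with random weights, and the scaling by `lam` and the ordering (LLM tokens first) are borrowed from the concatenation form that appears in Eq. (11) rather than stated for this configuration. The cross-attention variant of configuration (4) is not reproduced here.

```python
import random


def linear(tokens, weight):
    """Apply a bias-free linear layer to every token; weight is (d_in, d_out)."""
    d_out = len(weight[0])
    return [
        [sum(x * weight[i][j] for i, x in enumerate(tok)) for j in range(d_out)]
        for tok in tokens
    ]


def mlp_concat(c_l, c_t, weight, lam=1.0):
    """Project LLM tokens to the text width, scale by lam, then concatenate."""
    c_l_proj = linear(c_l, weight)
    scaled = [[lam * x for x in tok] for tok in c_l_proj]
    return scaled + c_t  # concatenation along the sequence (token) axis


rng = random.Random(0)
n_llm, d_llm = 5, 6   # toy LLM sequence length and width
n_txt, d_txt = 3, 4   # toy text-encoder sequence length and width
c_l = [[rng.gauss(0.0, 1.0) for _ in range(d_llm)] for _ in range(n_llm)]
c_t = [[rng.gauss(0.0, 1.0) for _ in range(d_txt)] for _ in range(n_txt)]
weight = [[rng.gauss(0.0, 0.1) for _ in range(d_txt)] for _ in range(d_llm)]

cond = mlp_concat(c_l, c_t, weight, lam=1.0)
print(len(cond), len(cond[0]))  # 8 tokens, each of width 4
```

Because the original text tokens are kept unchanged at the end of the sequence, the UNet cross-attention still sees the conditioning it was pretrained on, which is one plausible reading of why configurations (3) and (4) train successfully while configuration (2) does not.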

Table 5. Training resources comparison, including the scale of training data (#Images) and computing cost (#GPU days). The performance (%) is evaluated on the Color metric in T2I-CompBench.

| Method | #Images (M) ↓ | #GPU days ↓ | Performance ↑ |
|---|---|---|---|
| PixArt-α (Chen et al., 2023b) | 25 | 753 | 68.86 |
| ParaDiffusion (Wu et al., 2023b) | 500 | 392 | - |
| ELLASDXL (Hu et al., 2024) | 30 | 112 | 72.60 |
| LLM4GENSDXL | 2 | 32 | 73.29 |
Figure 9. Performance metrics of LLM4GEN based on different LLMs across λ values from 0 to 2.

Impact of Different LLMs. We compare base SD1.5 with the variants obtained by integrating Llama-2/7B, Llama-2/13B, and T5-XL. As shown in Tab. 4, including any of these LLMs improves upon the performance of SD1.5. Importantly, Llama-2/13B outperforms Llama-2/7B, suggesting that LLMs with greater capacity extract more nuanced semantic embeddings. Furthermore, compared to the decoder-only LLMs, the T5-XL encoder shows advantages in semantic comprehension, indicating that it is better suited to enhancing text-to-image generation.

Impact of hyperparameter λ. As depicted in Eq. 7, the hyperparameter λ regulates the weight of the LLM embedding injected into the original text embedding. We evaluate the impact of λ in Fig. 9 with FID and CLIP Score for LLM4GENSD1.5 on the MSCOCO dataset across different LLMs. λ varies from 0 to 2 in increments of 0.2. As λ increases, we observe an initial improvement in the model's performance, followed by a slight decline. This pattern indicates that integrating the LLM representation into that of the original text encoder can consistently improve the image-text alignment beyond the capabilities of the original SD1.5, highlighting the beneficial impact of LLMs on semantic enrichment. However, higher values of λ do not lead to optimal performance, which we suspect stems from a misalignment between the LLM representation and the diffusion model. The best performance is achieved when λ is set to 1.0.

5.4. Further Analysis

Training Efficiency. Among approaches that integrate LLMs into text-to-image generation models, LLM4GENSDXL stands out for its efficiency. As shown in Tab. 5, LLM4GEN achieves significant reductions in both training data requirements and computational costs. It uses only 2M training images, over 90% fewer than the 30M used by ELLA (Hu et al., 2024), and requires only 32 GPU days for training, far fewer than PixArt-α (25M images, 753 GPU days) and ParaDiffusion (500M images, 392 GPU days). Despite this, LLM4GENSDXL achieves the best color metric of 73.29%. This shows that LLM4GEN can substantially reduce both training data and computational cost while maintaining strong performance.
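The resource comparison can be checked with a few lines of arithmetic. The figures below are copied from Tab. 5; the dictionary keys and the helper name are hypothetical, and the GPU-day check simply assumes that the reported 32 GPU days correspond to the 8 GPUs and 4 days quoted in the implementation details.

```python
# Figures copied from Table 5 (#Images in millions, #GPU days).
resources = {
    "PixArt-alpha": {"images_m": 25, "gpu_days": 753},
    "ParaDiffusion": {"images_m": 500, "gpu_days": 392},
    "ELLA-SDXL": {"images_m": 30, "gpu_days": 112},
    "LLM4GEN-SDXL": {"images_m": 2, "gpu_days": 32},
}


def reduction_percent(ours, theirs):
    """Relative saving of `ours` with respect to `theirs`, in percent."""
    return 100.0 * (1.0 - ours / theirs)


ours = resources["LLM4GEN-SDXL"]
for name, other in resources.items():
    if name == "LLM4GEN-SDXL":
        continue
    data_saving = reduction_percent(ours["images_m"], other["images_m"])
    compute_saving = reduction_percent(ours["gpu_days"], other["gpu_days"])
    print(f"{name}: {data_saving:.1f}% fewer images, {compute_saving:.1f}% fewer GPU days")

# 8 GPUs for 4 days, as stated for the SDXL variant.
sdxl_gpu_days = 8 * 4
print(sdxl_gpu_days)
```

Against ELLA, for example, this gives a data saving of roughly 93% and a compute saving of roughly 71%.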

Figure 10. Cross-attention visualization (Tang et al., 2022) for two generated images. The two rows are SD1.5 and LLM4GENSD1.5, respectively.

Cross-attention Visualization. Fig. 10 shows the cross-attention visualization (Tang et al., 2022) of SD1.5 and LLM4GENSD1.5. The heatmaps indicate that LLM4GEN better captures relationships between attributes and objects, such as "blue" and "sheep", as illustrated in Fig. 10 (a). We attribute this capability to the richer semantics provided by the LLM representation.

More visualization and experimental results are shown in the Supplementary Materials.

6. Conclusion

In this paper, we propose LLM4GEN, an end-to-end text-to-image generation framework. Specifically, we design an efficient Cross-Adapter Module that leverages the powerful representation of LLMs to enhance the original text representation of diffusion models. Despite using less training data and fewer computational resources, LLM4GEN outperforms current state-of-the-art text-to-image diffusion models in sample quality and image-text alignment. Additionally, we introduce the LAION-refined dataset and the DensePrompts benchmark, which promote generating images with dense information and establish a comprehensive evaluation, respectively. In future work, we aim to explore the potential of LLM4GEN within transformer-based diffusion models. We hope this work will pave the way for new research directions in text-to-image generation and provide insights into how LLMs can improve the performance of text-to-image generation models.

References

  • Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
  • Bai et al. (2023) Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966 (2023).
  • Betker et al. (2023) James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. 2023. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf (2023).
  • Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. NeurIPS.
  • Chang et al. (2023) Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. 2023. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology (2023).
  • Chefer et al. (2023) Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. 2023. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. ACM TOG 42, 4 (2023), 1–10.
  • Chen et al. (2023a) Jingwen Chen, Yingwei Pan, Ting Yao, and Tao Mei. 2023a. Controlstyle: Text-driven stylized image generation using diffusion priors. In ACM MM. 7540–7548.
  • Chen et al. (2023b) Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. 2023b. PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis. arXiv preprint arXiv:2310.00426 (2023).
  • Chowdhery et al. (2022) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311 (2022).
  • Dhariwal and Nichol (2021) Prafulla Dhariwal and Alexander Nichol. 2021. Diffusion models beat gans on image synthesis. In NeurIPS.
  • Feng et al. (2022) Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Reddy Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, and William Yang Wang. 2022. Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis. In ICLR.
  • Heusel et al. (2017) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS.
  • Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. NeurIPS.
  • Ho and Salimans (2022) Jonathan Ho and Tim Salimans. 2022. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022).
  • Hu et al. (2024) Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. 2024. ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment. arXiv preprint arXiv:2403.05135 (2024).
  • Huang et al. (2023) Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. 2023. T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation. In ICCV.
  • Huang et al. (2022) Rongjie Huang, Zhou Zhao, Huadai Liu, Jinglin Liu, Chenye Cui, and Yi Ren. 2022. Prodiff: Progressive fast diffusion model for high-quality text-to-speech. In ACM MM. 2595–2605.
  • Koh et al. (2023) Jing Yu Koh, Daniel Fried, and Russ R Salakhutdinov. 2023. Generating images with multimodal language models. In NeurIPS.
  • Li et al. ([n. d.]) Daiqing Li, Aleks Kamko, Ali Sabet, Ehsan Akhgari, Linmiao Xu, and Suhail Doshi. [n. d.]. Playground v2. https://huggingface.co/playgroundai/playground-v2-1024px-aesthetic
  • Li et al. (2023a) Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023a. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML.
  • Li et al. (2022) Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML. 12888–12900.
  • Li et al. (2023b) Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. 2023b. Gligen: Open-set grounded text-to-image generation. In CVPR. 22511–22521.
  • Lian et al. (2024) Long Lian, Boyi Li, Adam Yala, and Trevor Darrell. 2024. LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models. Transactions on Machine Learning Research (2024).
  • Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In ECCV.
  • Liu et al. (2023) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning. In NeurIPS.
  • Liu et al. (2022) Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B Tenenbaum. 2022. Compositional visual generation with composable diffusion models. In ECCV. 423–439.
  • Loshchilov and Hutter (2017) Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017).
  • Nichol et al. (2022) Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. 2022. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. In ICML.
  • OpenAI (2023) OpenAI. 2023. GPT-4V(ision) System Card. https://api.semanticscholar.org/CorpusID:263218031
  • Pang et al. (2024) Ziqi Pang, Ziyang Xie, Yunze Man, and Yu-Xiong Wang. 2024. Frozen transformers in language models are effective visual encoder layers. In ICLR.
  • Podell et al. (2023) Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. 2023. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023).
  • Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In ICML. 8748–8763.
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR 21, 1 (2020), 5485–5551.
  • Ramesh et al. (2022) Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 (2022).
  • Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In CVPR. 10684–10695.
  • Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In MICCAI. 234–241.
  • Ruiz et al. (2023) Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. 2023. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR. 22500–22510.
  • Saharia et al. (2022) Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. 2022. Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS.
  • Salimans et al. (2016) Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. 2016. Improved techniques for training gans. In NeurIPS.
  • Scao et al. (2022) Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. 2022. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100 (2022).
  • Schuhmann ([n. d.]) Christoph Schuhmann. [n. d.]. CLIP+MLP Aesthetic Score Predictor. https://github.com/christophschuhmann/improved-aesthetic-predictor.
  • Schuhmann et al. (2021) Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. 2021. LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs. CoRR abs/2111.02114 (2021). arXiv:2111.02114 https://arxiv.longhoe.net/abs/2111.02114
  • Sohl-Dickstein et al. (2015) Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML. 2256–2265.
  • Song et al. (2020a) Jiaming Song, Chenlin Meng, and Stefano Ermon. 2020a. Denoising Diffusion Implicit Models. CoRR abs/2010.02502 (2020). arXiv:2010.02502 https://arxiv.longhoe.net/abs/2010.02502
  • Song and Ermon (2019) Yang Song and Stefano Ermon. 2019. Generative modeling by estimating gradients of the data distribution. NeurIPS.
  • Song and Ermon (2020) Yang Song and Stefano Ermon. 2020. Improved techniques for training score-based generative models. NeurIPS.
  • Song et al. (2020b) Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. 2020b. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020).
  • Sun et al. (2024) Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. 2024. Emu: Generative pretraining in multimodality. In ICLR.
  • Tang et al. (2022) Raphael Tang, Linqing Liu, Akshat Pandey, Zhiying Jiang, Gefei Yang, Karun Kumar, Pontus Stenetorp, Jimmy Lin, and Ferhan Ture. 2022. What the daam: Interpreting stable diffusion using cross attention. arXiv preprint arXiv:2210.04885 (2022).
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
  • Van Den Oord et al. (2017) Aaron Van Den Oord, Oriol Vinyals, et al. 2017. Neural discrete representation learning. In NeurIPS.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. NeurIPS.
  • Wu et al. (2023a) Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. 2023a. Next-gpt: Any-to-any multimodal llm. arXiv preprint arXiv:2309.05519 (2023).
  • Wu et al. (2023b) Weijia Wu, Zhuang Li, Yefei He, Mike Zheng Shou, Chunhua Shen, Lele Cheng, Yan Li, Tingting Gao, Di Zhang, and Zhongyuan Wang. 2023b. Paragraph-to-image generation with information-enriched diffusion model. arXiv preprint arXiv:2311.14284 (2023).
  • Yang et al. (2024) Ling Yang, Zhaochen Yu, Chenlin Meng, Minkai Xu, Stefano Ermon, and Bin Cui. 2024. Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal llms. arXiv preprint arXiv:2401.11708 (2024).
  • Yang et al. (2023) Ling Yang, Zhilong Zhang, Yang Song, Shenda Hong, Runsheng Xu, Yue Zhao, Wentao Zhang, Bin Cui, and Ming-Hsuan Yang. 2023. Diffusion models: A comprehensive survey of methods and applications. Comput. Surveys 56, 4 (2023), 1–39.
  • Yu et al. (2022) Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. 2022. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789 (2022).
  • Zeng et al. (2022) Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, et al. 2022. Glm-130b: An open bilingual pre-trained model. In ICLR.
  • Zhang et al. (2023) Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. 2023. Adding conditional control to text-to-image diffusion models. In ICCV.
  • Zhang et al. (2024) Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. 2024. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. In ICLR.
  • Zhang et al. (2022) Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, et al. 2022. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068 (2022).
  • Zhu et al. (2023) Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 2023. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023).

Supplementary Materials: LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation

7. Additional Analysis

We provide the additional analysis and experimental results on our proposed LLM4GEN in the Supplementary Materials.

Formulation Derivation. We provide a derivation of Eq. (8):

\begin{equation}
\begin{aligned}
x &= \operatorname{CA}(x, c_t') \\
  &= \operatorname{softmax}\left(Q' \cdot {K'}^{T}\right) \cdot V' \\
  &= \operatorname{softmax}\left(W_Q'(x) \cdot W_K'(c_t')^{T}\right) \cdot W_V'(c_t') \\
  &= \operatorname{softmax}\left(W_Q'(x) \cdot W_K'([\lambda \cdot c_l', c_t])^{T}\right) \cdot W_V'([\lambda \cdot c_l', c_t]) \\
  &= \lambda \cdot \operatorname{softmax}\left(W_Q'(x) \cdot W_K'(c_l')^{T}\right) \cdot W_V'(c_l') \\
  &\quad + \operatorname{softmax}\left(W_Q'(x) \cdot W_K'(c_t)^{T}\right) \cdot W_V'(c_t) \\
  &= \lambda \cdot \operatorname{CA}(x, c_l) + \operatorname{CA}(x, c_t)
\end{aligned}
\tag{11}
\end{equation}

where $x$ denotes the latent noise, and CA is the cross-attention module within the UNet, which receives $x$ as the query and $c_t'$ as the key and value. $W_Q'$, $W_K'$, and $W_V'$ are the projection layers in the UNet. In this manner, the concatenation operation in Eq. (7) is equivalent to fusing the LLM-guided semantic feature into the latent noise for better text-to-image alignment.
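The intuition behind the last steps of Eq. (11), that attending over a concatenated context mixes the result of attending to the LLM tokens with the result of attending to the original text tokens, can be checked numerically on a toy example. The sketch below is an illustration under simplifying assumptions: a single query, keys and values given directly rather than through the UNet projections, and made-up numbers. In exact arithmetic the mixing weights are the shares of softmax mass that fall on each token group, so the fixed weights λ and 1 in Eq. (11) are perhaps best read as a simplified view of this mixture.

```python
import math


def attention(q, keys, values):
    """Single-query softmax attention.

    Returns the output vector and the softmax normaliser Z (sum of exp scores).
    The 1/sqrt(d) scaling is omitted, as in Eq. (11).
    """
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    weights = [math.exp(s) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) / z for j in range(dim)]
    return out, z


# Toy query, plus keys/values for the LLM tokens (l) and the text tokens (t).
q = [0.2, -0.1]
k_l, v_l = [[0.5, 0.1], [-0.3, 0.4]], [[1.0, 0.0], [0.0, 1.0]]
k_t, v_t = [[0.1, 0.2]], [[2.0, 2.0]]

joint, _ = attention(q, k_l + k_t, v_l + v_t)   # attention over [c_l', c_t]
out_l, z_l = attention(q, k_l, v_l)             # attention over c_l' only
out_t, z_t = attention(q, k_t, v_t)             # attention over c_t only

# Share of softmax mass that falls on the LLM tokens.
share_l = z_l / (z_l + z_t)
mixed = [share_l * a + (1.0 - share_l) * b for a, b in zip(out_l, out_t)]
print(joint, mixed)
```

The two printed vectors agree up to floating-point error, which supports the qualitative reading of Eq. (11): the concatenated conditioning injects an LLM-driven attention term alongside the original text-driven one.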

Algorithm. The proposed LLM4GEN is further illustrated in Algorithm 2.

Algorithm 2 LLM4GEN Pipeline
1:  Input: Pre-trained UNet $\epsilon$, pre-trained text encoder $T_{\phi}$, pre-trained LLM $T_{L}$, Cross-Adapter Module $M$, LLaVA-7B $A$, training image-text pairs $\mathbb{S}=\{\mathbb{I},\mathbb{P}\}$.
2:  Offline Process: Apply $A$ to enrich the captions of $\mathbb{S}$, obtaining $\hat{\mathbb{P}}=A(\mathbb{I})$ to replace the original $\mathbb{P}$.
3:  $\rightarrow$ Begin Training.
4:  Freeze $T_{\phi}$ and $T_{L}$; train $M$ and $\epsilon$.
5:  while $L_{simple}$ not converged do
6:     Sample $p$ from $\mathbb{P}$; $t=T$
7:     ${c_{t}^{\prime}}=M(T_{\phi}(p),T_{L}(p))$ using Eq. (7).
8:     Calculate $L_{simple}$ using Eq. (9).
9:     Backward $L_{simple}$ and update $\epsilon$, $M$.
10:  end while
11:  $\rightarrow$ End Training.
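The freeze/train split of Algorithm 2 can be sketched with a toy NumPy loop. This is only an illustration under strong assumptions, not the LLM4GEN training code: the text encoder, the LLM, the adapter, and the denoiser are replaced by small linear maps, the adapter is a simple additive fusion rather than the Cross-Adapter Module of Eq. (7), a single fixed noise level `alpha_bar` is used, and gradients of the mean-squared noise-prediction loss are written out by hand. All names and hyperparameters are hypothetical.

```python
import numpy as np

def train_toy(steps=300, batch=64, dim=4, lr=0.05, alpha_bar=0.5, seed=0):
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(dim)
    # Frozen stand-ins for the text encoder T_phi and the LLM T_L (never updated).
    E_phi = rng.normal(size=(dim, dim)) * scale
    E_llm = rng.normal(size=(dim, dim)) * scale
    # Fixed map that makes the clean toy "image" depend on the prompt.
    G = rng.normal(size=(dim, dim)) * scale
    # Trainable parameters: toy adapter M (W_m) and toy denoiser epsilon (W_x, W_c).
    W_m = np.zeros((dim, dim))
    W_x = np.zeros((dim, dim))
    W_c = np.zeros((dim, dim))
    losses = []
    for _ in range(steps):
        p = rng.normal(size=(batch, dim))      # sample a batch of toy prompts
        x0 = p @ G                             # clean sample tied to the prompt
        noise = rng.normal(size=(batch, dim))
        x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise
        a = p @ E_phi                          # frozen text-encoder features
        b = p @ E_llm                          # frozen LLM features
        c = a + b @ W_m                        # toy adapter: additive fusion
        pred = x_t @ W_x + c @ W_c             # toy noise prediction
        residual = pred - noise
        losses.append(float(np.mean(residual ** 2)))
        # Backward pass of the mean-squared error, written out by hand.
        g = 2.0 * residual / residual.size
        grad_W_x = x_t.T @ g
        grad_W_c = c.T @ g
        grad_W_m = b.T @ (g @ W_c.T)
        # Update only the denoiser and the adapter; the encoders stay frozen.
        W_x -= lr * grad_W_x
        W_c -= lr * grad_W_c
        W_m -= lr * grad_W_m
    return losses

losses = train_toy()
```

Comparing the mean of the first and last few entries of `losses` shows whether the toy noise-prediction loss is decreasing while only the adapter and denoiser parameters change.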
Figure 11. Comparison with ParaDiffusion (Wu et al., 2023b). The generated images of ParaDiffusion are from the original paper.
Figure 12. Comparison with ELLA (Hu et al., 2024). The generated images of ELLA are from the original paper.

8. Additional Experimental Results

Comparison with ParaDiffusion based on dense prompts. To further compare the proposed LLM4GEN with ParaDiffusion, we take three prompts provided in (Wu et al., 2023b) and generate images using our LLM4GEN$_{SD1.5}$ and LLM4GEN$_{SDXL}$. The ParaDiffusion images are taken from the original paper (Wu et al., 2023b). The results demonstrate that LLM4GEN can generate semantically aligned images from long textual descriptions and alleviates failure cases of ParaDiffusion, such as the confusion between the generated character and the background in Fig. 11 (c). LLM4GEN$_{SD1.5}$ can even generate higher-quality images that align better with dense prompts than SDXL (Podell et al., 2023), as in Fig. 11 (b). This demonstrates that our method efficiently integrates the strong semantic representations of LLMs into text-to-image diffusion models to enhance image-text alignment.

Comparison with ELLA (Hu et al., 2024). ELLA (Hu et al., 2024) also utilizes LLMs for text-to-image generation, but it aims to align LLMs with diffusion models from scratch, which incurs significant training costs. In contrast, our proposed LLM4GEN directly enhances the original text encoder with robust LLM semantic embeddings. This approach benefits from the powerful capabilities of LLMs without high-cost training and computing resources. Compared to ELLA$_{SDXL}$, our model requires only 10% of the training data, yet still produces high-quality images using the prompts provided in (Hu et al., 2024). The visual comparison is presented in Fig. 12.

Figure 13. LLM4GEN$_{SD1.5}$ can be integrated with ControlNet (Zhang et al., 2023) to generate images with guided poses.

Compatible with ControlNet (Zhang et al., 2023). We apply LLM4GEN together with ControlNet (Zhang et al., 2023), as shown in Fig. 13. LLM4GEN$_{SD1.5}$ can be integrated with existing guidance tools such as ControlNet, remaining compatible with these standard methods while generating more consistent and text-aligned images.