PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM

Tao Yang (Hong Kong Polytechnic University; ARC Lab, Tencent PCG), Yingmin Luo (ARC Lab, Tencent PCG), Zhongang Qi (ARC Lab, Tencent PCG), Yang Wu (AI Lab, Tencent), Ying Shan (ARC Lab, Tencent PCG), and Chang Wen Chen (Hong Kong Polytechnic University)
Abstract.

Layout generation is the keystone of automated graphic design: it requires arranging the position and size of various multi-modal design elements in a visually pleasing and constraint-following manner. Previous approaches are either inefficient for large-scale applications or lack flexibility for varying design requirements. Our research introduces a unified framework for automated graphic layout generation, leveraging the multi-modal large language model (MLLM) to accommodate diverse design tasks. Our data-driven method employs structured text (JSON format) and visual instruction tuning to generate layouts under specific visual and textual constraints, including user-defined natural language specifications. We conducted extensive experiments and achieved state-of-the-art (SOTA) performance on public multi-modal layout generation benchmarks, demonstrating the effectiveness of our method. Moreover, recognizing existing datasets' limitations in capturing the complexity of real-world graphic designs, we propose two new datasets for much more challenging tasks (user-constrained generation and complicated posters), further validating our model's utility in real-life settings. Marked by its superior accessibility and adaptability, this approach further automates large-scale graphic design tasks. The code and datasets will be publicly available at https://github.com/posterllava/PosterLLaVA.

Layout Generation, Multi-modal LLM, User-constrained, Real-world Poster

1. Introduction

Across diverse kinds of graphic design (commercial posters, mobile app UIs, webpages, video thumbnails, etc.), layout plays a critical role in structuring visual and textual elements to captivate audiences and communicate intended messages. Traditionally, this task has required designers to create layouts manually, demanding extensive expertise and experience. For large-scale design tasks, the efficiency of this manual strategy falls far short of what is needed.

The most naive idea for massive graphic design generation is to use pre-designed templates and replace their content according to requirements. However, the production and selection of templates still involve human labor, and mechanically applying an inappropriate layout can lead to obtrusive designs. Previous researchers attempted to frame layout generation as an optimization problem, tackling it with heuristic algorithms such as genetic algorithms (Rajasekharan et al., 1998) and simulated annealing (Baykasoğlu and Gindy, 2001). However, these methods hinge on crafting well-designed energy functions, a task that still depends heavily on design expertise and lacks generality across different applications.

Figure 1. The overall framework of our proposed content-aware layout generation method. Adopting the multi-modal LLM (Liu et al., 2023b) as the central processing unit, we embed information from both visual and textual domains to generate a reasonable and visually pleasing graphic layout. The result is encoded in JSON format and can be rendered into a real-world poster.

With the advance of deep learning, researchers have embraced data-driven methods (Jyothi et al., 2019; Kikuchi et al., 2021; Gupta et al., 2021; Arroyo et al., 2021; Kong et al., 2022; Inoue et al., 2023) for layout generation. Most of these works focus on adopting the latest generative architecture but overlook the conditional requirements that layouts must satisfy. This limits their applicability in real-world scenarios, which frequently demand the integration of complex multi-modal conditions. Recently, more and more researchers have recognized the importance of multi-modal conditions and started to explore content-aware layout generation. For visual conditions, CGL-GAN (Zhou et al., 2022) and DS-GAN (Hsu et al., 2023) take an innovative step by incorporating the semantic information of background images into layout generation, and some later works (Yu et al., 2022; Yang et al., 2023) also consider the content of foreground elements as conditions. For textual conditions, some preliminary attempts (Kikuchi et al., 2021; Jiang et al., 2023; Lin et al., 2023a) generate layouts under given graphic conditions. However, the constrained optimization processes or specific intermediate representations they introduce increase the training or labeling complexity. An efficient end-to-end framework that can directly translate natural language instructions into desired layouts is still needed.

Although previous approaches have demonstrated progress on certain datasets, most of them rely on highly customized network architectures that lack universality. Such specificity necessitates substantial modifications or complete redesigns to accommodate new or varied layout design challenges. Recognizing this limitation, we develop a unified framework named PosterLLaVa (see Fig. 1) for the layout generation task, inspired by the simplicity and effectiveness of the recently published multi-modal instruction tuning methods (Li et al., 2023; Zhu et al., 2023; Liu et al., 2023b; Zhang et al., 2023b; Ye et al., 2023; Wang et al., 2023). Pre-trained on large amounts of unlabelled corpora and fine-tuned with instruction-following data, MLLMs (Multi-modal Large Language Models) are capable of handling multiple vision-language tasks (e.g., VQA (Zhu et al., 2023; Liu et al., 2023b), visual grounding (Wang et al., 2023; Ye et al., 2023), etc.) according to the given instructions and their background knowledge. For layout generation, we first show how layout information can be naturally represented by structured text in JSON format. With this representation, we can measure the performance of PosterLLaVa on established content-aware generation datasets and compare it with previous benchmarks. To handle multi-modal condition inputs, we utilize the pre-trained visual head of LLaVa (Liu et al., 2023a) to convert visual representations into the textual domain and fine-tune the LLM (Touvron et al., 2023b) to interpret and generate layout data. With the LLM as the central processing unit, our model can manage a wide range of layout generation tasks through simple modifications of the input instructions, eliminating any need for changes in model architecture. Moreover, textual user requirements can be seamlessly integrated into the generation instructions, enhancing the model's responsiveness to specific design needs.

The main contributions of our work can be summarized as follows.

(1) A Unified Layout Generation Tool. We propose a unified content-aware layout generation method utilizing multi-modal LLMs, adaptable across various design scenarios through simple modifications of input instructions. Our approach is validated across multiple public datasets (see Tab. 2) and two newly proposed datasets, showcasing its superior performance and versatility.

(2) Natural Language User Requirements. Our framework's ability to process natural language inputs significantly enhances the intuitiveness and efficiency of the design process. Because LLMs inherently support natural language inputs, our method needs no additional network modules or loss functions and achieves this purpose in an end-to-end manner. We generate large-scale instruction-following data from a small amount of high-quality human-annotated data with the aid of GPT (Brown et al., 2020) and contribute the field's largest constrained layout generation dataset, with 84,200 samples, far more than previous efforts.

(3) Real-world Complicated Posters. We collect a challenging graphic layout dataset named QB-Poster (QQ Browser Poster), composed of 5,188 samples in a design style prevalent on Chinese social media. This dataset is characterized by intricate geometric relationships among many kinds of content. Through comparative analysis with the latest comparable method, our method demonstrates remarkable adaptability and effectiveness in capturing the distribution of complicated real-world layouts.

2. Related Work

2.1. Automatic Graphic Layout Generation

Rule-based Methods Before the appearance of deep learning, layout generation had been studied for decades (Rajasekharan et al., 1998; Baykasoğlu and Gindy, 2001; Yin et al., 2013; Ma and Chen, 2016). As a typical example, Yin et al. (Yin et al., 2013) proposed a series of principles based on widely accepted aesthetic or information-conveying rules, together with a heuristic algorithm to minimize the overall energy function. These methods do not require training. Instead, they perform a runtime search during every inference. The true complexity of these methods lies in the design of the energy function, which requires considerable design experience and expertise. Moreover, these functions must be manually re-designed when a new design element is encountered or when they are applied to a differently styled layout (e.g., from UI to commercial poster).

Content-agnostic Layout Generation Neural networks offer researchers a way to learn design principles implicitly from large amounts of data, saving human effort. Most early works (Li et al., 2019; Jyothi et al., 2019; Zheng et al., 2019; Gupta et al., 2021; Arroyo et al., 2021; Zhang et al., 2023a) focus on generating visually reasonable layouts for mobile UIs, documents, and magazine pages. LayoutGAN (Li et al., 2019) employs the GAN (Generative Adversarial Network) paradigm and designs a differentiable rendering process for connecting the visual and graphic domains. LayoutVAE (Jyothi et al., 2019) and CanvasVAE (Yamaguchi, 2021) adopt the VAE (Variational Auto-Encoder) paradigm, while more recent works adopt the auto-regressive architecture (Gupta et al., 2021; Arroyo et al., 2021; Kong et al., 2022) or the diffusion architecture (Inoue et al., 2023; Zhang et al., 2023a; Hui et al., 2023). Despite their achievements on unconditioned layout generation tasks, these methods are hard to use in real-world scenarios.

Content-aware Layout Generation Recently, some other works (Hsu et al., 2023; Zhou et al., 2022; Yu et al., 2022; Xu et al., 2023; Yang et al., 2023) have turned their attention to commercial-style posters, in which case the graphic designs are usually based on a non-empty background image. CGL-GAN (Zhou et al., 2022) contributes a large dataset with around 60k Chinese commercial posters and proposes a transformer-based GAN that receives a saliency map and the inpainted background as input. Similarly, PosterLayout (Hsu et al., 2023) tackles the problem with a CNN-LSTM network that takes a saliency map as input. ICVT (Cao et al., 2022) adopts a C-VAE (Conditional Variational Auto-Encoder) to predict the layout. LayoutDETR (Yu et al., 2022) designs a DETR-like (Carion et al., 2020) architecture to utilize a pre-trained object detection model and integrates both GAN and VAE for layout generation. It also includes pre-trained ViT (Dosovitskiy et al., 2020) and BERT (Devlin et al., 2018) as visual and textual encoders to obtain embedded features of the design elements.

Interestingly, some works (Kikuchi et al., 2021; Jiang et al., 2023; Lin et al., 2023a) have also attempted to generate layouts that follow specific constraints. As an early example, LayoutGAN++ (Kikuchi et al., 2021) introduces an additional constrained optimization process based on the Lagrangian multiplier method to obtain the desired layout. Later, LayoutFormer++ (Jiang et al., 2023) and Parse-then-place (Lin et al., 2023a) design specific intermediate representations to handle various constraints. The latter also studies the text-to-layout problem, which includes implicitly expressed user requirements and is very similar to ours.

2.2. Multi-modal Large Language Models and Application

LLMs (Large Language Models) (Brown et al., 2020; Achiam et al., 2023; Touvron et al., 2023b) have achieved remarkable success across a wide range of natural language processing (NLP) tasks. With billions of parameters, these models derive extensive knowledge from pre-training on vast unlabeled text corpora. Various instruction-tuning methods have been investigated to enhance the ability of LLMs to comprehend and execute natural language instructions (Ouyang et al., 2022; Wang et al., 2022). While LLMs have proven adept at understanding and generating text, multi-modal LLMs extend them by incorporating additional modalities such as visual and auditory data (Li et al., 2023; Zhu et al., 2023; Liu et al., 2023b). A prevalent approach involves injecting multi-modal information into LLMs and leveraging their robust reasoning capabilities.

LLM-assisted Layout Generation Layouts, which can be encoded in formats such as XML or JSON, are ideally suited to be processed by pre-trained LLMs. Previous works have used domain-specific data to strengthen their code generation ability. LayoutNUWA (Tang et al., 2023) fine-tunes LLaMa (Touvron et al., 2023a) and CodeLLaMa (Roziere et al., 2023) for the content-agnostic layout generation task, achieving SOTA performance on multiple content-agnostic layout datasets. LayoutPrompter (Lin et al., 2023b) introduces an interesting training-free approach, leveraging RAG (Retrieval-Augmented Generation) to strengthen the in-context learning ability of GPT (Brown et al., 2020) by dynamically sourcing examples from a dataset. However, this retrieval-centric strategy is limited to open-domain generation. These works overlook visual-domain features or translate them into hard tokens before feeding them into the LLM, potentially resulting in severe information loss. To tackle this weakness, we adopt a recently proposed multi-modal technique, visual instruction tuning (Liu et al., 2023b), to fine-tune a pre-trained large model, which accepts visual information through a pre-trained and aligned visual adaptation head (Radford et al., 2021). For layout-to-image generation, interestingly, some contemporaneous works such as LayoutGPT (Feng et al., 2023) and TextDiffuser-2 (Chen et al., 2023) also adopt LLMs, showing a promising production pipeline for LLM-based graphic design.

2.3. Methodology

2.3.1. Multi-modal Layout Tokenization

Assuming that all complicated attributes and art styles take their default values, we can explicitly represent a graphic design $L_j$ by defining the position $(x_i, y_i)$, size $(h_i, w_i)$, and content $\mathbf{I}_i$ of every element. The position and size can be further expressed in bounding-box format if rotation and irregular shapes are not involved. The class labels $c_i$ of elements are explicitly given to capture the relationships between different kinds of elements. We obtain the following representation of a poster:

(1) $L_j = \{(x_i, y_i, h_i, w_i), c_i, \mathbf{I}_i\}_{i=0}^{N}$

in which $N$ represents the number of elements. Most previous papers treat $L_j$ in numeric form, which means solving the problem in a continuous space. We instead design the following process to tokenize $L_j$ and feed it into an LLM for next-token prediction. First, we normalize the bounding box coordinates by the background width and height to facilitate multi-resolution generation. Each coordinate of the bounding box vector is truncated to $K$ decimal places to avoid redundancy. For the class label $c_i$, we use the corresponding text label instead, for example $\{\text{text}, \text{logo}, \text{underlay}\}$ for the PosterLayout (Hsu et al., 2023) dataset. Finally, for image elements, $\mathbf{I}_i^{\text{img}}$ is encoded by a pre-trained vision head, which is composed of a ViT (Dosovitskiy et al., 2020) encoder and a linear projection head, namely

(2) $h(\mathbf{I}_i^{\text{img}}) = \mathbf{W}^{T}\,\text{CLIP}(\mathbf{I}_i^{\text{img}}),$

and the content $\mathbf{I}^{\text{txt}}$ of the text element is inherently in a text format.
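The tokenization steps described in this subsection (normalize by the canvas size, truncate to K decimal places, use text labels, serialize as JSON) can be sketched roughly as follows. This is only an illustration under stated assumptions: the JSON keys `label` and `box`, the pixel-space `[left, top, right, bottom]` input, and the helper names `truncate` and `tokenize_layout` are hypothetical and not taken from the released code.

```python
import json
from decimal import Decimal, ROUND_DOWN


def truncate(value, k):
    """Truncate (not round) a number to k decimal places."""
    quantum = Decimal(1).scaleb(-k)  # 10 ** -k, e.g. Decimal('0.001') for k=3
    return float(Decimal(str(value)).quantize(quantum, rounding=ROUND_DOWN))


def tokenize_layout(elements, canvas_w, canvas_h, k=3):
    """Turn pixel-space elements into a JSON string an LLM can read.

    elements: list of dicts with a text "label" and a pixel "box"
              given as [left, top, right, bottom] (assumed input format).
    """
    items = []
    for el in elements:
        left, top, right, bottom = el["box"]
        # Normalize by canvas width/height so the layout is resolution-free.
        box = [
            truncate(left / canvas_w, k),
            truncate(top / canvas_h, k),
            truncate(right / canvas_w, k),
            truncate(bottom / canvas_h, k),
        ]
        # The class is kept as its text label rather than a numeric id.
        items.append({"label": el["label"], "box": box})
    return json.dumps(items)


example = tokenize_layout(
    [
        {"label": "text", "box": [50, 80, 450, 200]},
        {"label": "logo", "box": [20, 20, 120, 70]},
    ],
    canvas_w=500,
    canvas_h=800,
)
print(example)
```

Truncation via `Decimal` is used here instead of floating-point floor arithmetic to avoid binary rounding surprises; any choice of K trades coordinate precision against sequence length.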

2.3.2. Training Scheme

To facilitate the learning of tokenized layout data, we adopt the training scheme proposed by Liu et al. (Liu et al., 2023b), i.e., visual instruction tuning. The original paper, focusing on general vision-language tasks, recommends fine-tuning a pre-trained LLM (Touvron et al., 2023a) in two phases: (1) pre-training for feature alignment, and (2) end-to-end fine-tuning. The alignment phase usually requires numerous image-text pairs to adapt visual information into the language space, while the fine-tuning phase requires relatively little data to produce instruction-following outputs. Recognizing that the primary challenge in layout generation resides in decoding the semantic and geometric relationships between graphic elements, we streamline the training process by reusing the pre-trained linear projection layer and skipping the feature alignment phase. This allows us to reduce training expenditure while maintaining performance comparable to the fully trained model.
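To make the role of the reused projection layer concrete, Eq. 2 can be read as applying one fixed matrix to every patch feature produced by the visual encoder. The sketch below is a toy, dependency-free illustration; the function name `project_patch_features`, the list-of-rows matrix layout, and the tiny dimensions are assumptions made for the example, and in practice the weights would come from a pre-trained checkpoint.

```python
def project_patch_features(patch_features, weight):
    """Apply h = W^T f to every patch feature f (the form of Eq. 2).

    patch_features: list of vectors, each of length d_vis
                    (stand-ins for the visual encoder's outputs).
    weight: d_vis x d_llm matrix given as a list of rows.
    Returns one vector of length d_llm per patch, i.e. one soft token each.
    """
    d_vis = len(weight)
    d_llm = len(weight[0])
    tokens = []
    for f in patch_features:
        if len(f) != d_vis:
            raise ValueError("feature size does not match projection matrix")
        # Component c of W^T f is sum_r W[r][c] * f[r].
        tokens.append(
            [sum(weight[r][c] * f[r] for r in range(d_vis)) for c in range(d_llm)]
        )
    return tokens


# Toy example: d_vis = 2, d_llm = 3, two patches.
W = [[1, 0, 2],
     [0, 1, 3]]
soft_tokens = project_patch_features([[1, 2], [0, 1]], W)
print(soft_tokens)
```

Because the matrix is already aligned to the language space, only the second phase (end-to-end fine-tuning on layout instructions) has to be run.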

User:
<image>
Please help me to place <N> foreground elements over the background of <resolution> to craft a <domain_name>. Remember to avoid unbalance, overlap, misalignment, and occlusion of semantic-meaningful objects on the background image. Return the result by filling in the following JSON file while keeping the number and types of elements unchanged. The initial JSON is defined as: <masked_json>, in which each design element is represented by a bounding box described as [left, top, right, bottom], and each coordinate is a contiguous number in 0-1. The user constraints are defined as: <constraints>, which should be adopted as compulsory design requirements.
Assistant:
Sure! Here is the design result: <json>.
Table 1. Prompt template for applying visual instruction tuning on content-aware generation task. The placeholder tokens in bold type are replaced with specific information during training or inference.
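Since the assistant turn wraps the layout JSON in a short sentence, a consumer of the model output has to extract the JSON and map the normalized boxes back to pixels before rendering. A minimal sketch, assuming the JSON is a list of objects with hypothetical `label` and `box` keys and that boxes are `[left, top, right, bottom]` in 0-1 as the template states:

```python
import json


def parse_layout_response(response, canvas_w, canvas_h):
    """Extract the JSON list from a response and convert boxes to pixels."""
    start = response.find("[")
    end = response.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("no JSON list found in response")
    items = json.loads(response[start:end + 1])

    layout = []
    for item in items:
        left, top, right, bottom = item["box"]
        layout.append({
            "label": item["label"],
            # Scale normalized coordinates back to the canvas resolution.
            "box": (
                round(left * canvas_w),
                round(top * canvas_h),
                round(right * canvas_w),
                round(bottom * canvas_h),
            ),
        })
    return layout


reply = (
    'Sure! Here is the design result: '
    '[{"label": "text", "box": [0.1, 0.1, 0.9, 0.25]}].'
)
print(parse_layout_response(reply, canvas_w=500, canvas_h=800))
```

The bracket search is deliberately simple and would need hardening (for example, schema validation) before being used on arbitrary model output.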

2.3.3. Prompt Template

We introduce the following prompt template for applying the end-to-end fine-tuning phase of visual instruction tuning to various content-aware layout generation tasks. The template is given in Tab. 1. The pre-trained vision head converts the background image into soft tokens (as Eq. 2 shows) to fill <image>. <N> is replaced with the exact number of design elements, and <resolution> is replaced with the canvas resolution. We use a domain indicator <domain_name> to distinguish different tasks and datasets, for example "commercial poster" for the CGL dataset and "advertising banner" for the Ad Banner dataset. The ground-truth layout is expressed in textual form through the process introduced in Sec. 2.3.1 and arranged in JSON format (as in Fig. 1) to replace <json>. For the human instruction, we delete the bounding boxes and preserve the category labels to get <masked_json>. For user-constrained generation tasks, the constraints are given as <constraints>.
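The placeholder substitution described above can be sketched as plain string formatting. The wording of the template follows Tab. 1; everything else is an assumption for illustration: the `label`/`box` keys, the empty list used to mark a deleted box, the `WxH` resolution string, and the choice to append the constraint sentence only when constraints are supplied.

```python
import json

# Wording follows the prompt template in Tab. 1; Python format fields stand
# in for the placeholder tokens.
TEMPLATE = (
    "<image>\n"
    "Please help me to place {n} foreground elements over the background of "
    "{resolution} to craft a {domain_name}. Remember to avoid unbalance, "
    "overlap, misalignment, and occlusion of semantic-meaningful objects on "
    "the background image. Return the result by filling in the following JSON "
    "file while keeping the number and types of elements unchanged. The "
    "initial JSON is defined as: {masked_json}, in which each design element "
    "is represented by a bounding box described as [left, top, right, bottom], "
    "and each coordinate is a contiguous number in 0-1."
)
CONSTRAINT_SUFFIX = (
    " The user constraints are defined as: {constraints}, which should be "
    "adopted as compulsory design requirements."
)


def mask_layout(elements):
    """Delete bounding boxes and keep category labels (assumed mask format)."""
    return [{"label": el["label"], "box": []} for el in elements]


def build_prompt(elements, width, height, domain_name, constraints=None):
    """Fill the template for one sample; constraints are optional."""
    prompt = TEMPLATE.format(
        n=len(elements),
        resolution=f"{width}x{height}",
        domain_name=domain_name,
        masked_json=json.dumps(mask_layout(elements)),
    )
    if constraints:
        prompt += CONSTRAINT_SUFFIX.format(constraints=constraints)
    return prompt


sample = [
    {"label": "text", "box": [0.1, 0.1, 0.9, 0.25]},
    {"label": "logo", "box": [0.04, 0.025, 0.24, 0.087]},
]
print(build_prompt(sample, 513, 750, "commercial poster"))
```

Switching to another task then only changes the arguments (for instance the domain name or the constraint text), which mirrors the claim that no architectural change is needed.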

3. Experiment

Implementation Details Most experiments are conducted on 8 NVIDIA A10 GPUs and can be finished within 12 hours. The MLLM checkpoint adopted is the fully fine-tuned 7B version of LLaVa-v1.5 (Liu et al., 2023a), which is trained with visual instruction tuning on top of LLaMa-2 (Touvron et al., 2023b) 7B as the base model. For most of the following layout datasets, we fine-tune the MLLM for one epoch, but for the banners dataset, we use the checkpoint from the 3rd epoch considering its tiny scale. For the adaptation to the QB-Poster dataset, we start from the model pre-trained on the training sets of Ad Banner, CGL, and PosterLayout to enhance its performance. We increase the maximum token length from 2048 to 4096, as the token length grows with the number of elements. For other training and inference hyper-parameters, we apply the default recipe recommended by LLaVa (Liu et al., 2023b).

3.1. Result on Public Content-aware Layout Dataset

Dataset Description As mentioned in Section 2.1, content-aware layout generation, as a new task, has only received attention since around 2020, and related research is still in its early stages. We extensively investigated datasets published in the past literature to verify the model’s performance on the general content-aware layout generation task. Available public datasets and baselines are listed in Tab. 2.

Table 2. An overall description of the content-aware layout generation datasets. QB-Poster is the complicated real-world poster dataset proposed in this paper, which exceeds previous datasets in both the number of annotation categories and the number of boxes per poster.

| Dataset | Train | Test | Classes | Boxes/img | Total Boxes |
|---|---|---|---|---|---|
| CGL dataset | 60548 | 1000 | 4 | 4.87 | 265818 |
| PosterLayout | 9974 | 905 | 3 | 4.73 | 47024 |
| Ad Banner | 7672 | 1000 | 8 | 2.23 | 16593 |
| YouTube | 10000 | 1000 | 3 | 5.88 | 67223 |
| QB-Poster | 4675 | 513 | 10 | 15.17 | 78723 |

The CGL dataset (Zhou et al., 2022), one of the pioneering content-aware collections, comprises 60548 training samples and 1000 test samples collected from e-commerce platforms. The design elements are divided into 4 categories: logo, text, underlay, and embellishment. The class labels and bounding boxes of elements for each poster in the training set are annotated manually, while the test set includes only the background images. Techniques such as image inpainting (Suvorov et al., 2021) and saliency detection (Bo et al., 2019) are needed to obtain additional visual information. Recognizing the limitations of the CGL dataset, particularly its repetitive content and scarcity of complex layouts featuring over ten elements, Hsu et al. (Hsu et al., 2023) introduce PosterLayout, offering 9974 poster-layout pairs for training and 905 background images for testing. LayoutDETR (Yu et al., 2022) contributes an ad banner dataset with multi-modal information, containing 7,672 samples divided into training and testing subsets in a 9:1 ratio. The background images come from either the Pitt Image Ads Dataset or Google Image, and the bounding boxes, categories, and text contents are extracted automatically by OCR. Unlike CGL and PosterLayout, this dataset contains banners of multiple resolutions. The YouTube (Yang et al., 2023) dataset is another newly proposed dataset, focusing on video thumbnail generation. Compared with the poster datasets above, it incorporates foreground images with rotation angles, thus demanding a more advanced level of multi-modal understanding.

Table 3. Results comparison on the PosterLayout dataset. Evaluations are conducted under PosterLayout's (Hsu et al., 2023) settings. Previous results are copied for comparison. Content-aware metrics: Uti, Occ, Rea. Geometric metrics: Val, Ove, Ali, Und_l, Und_s.

| Methods | Uti ↑ | Occ ↓ | Rea ↓ | Val ↑ | Ove ↓ | Ali ↓ | Und_l ↑ | Und_s ↑ |
|---|---|---|---|---|---|---|---|---|
| Ground-Truth | 0.2222 | 0.1900 | 0.1522 | 0.9999 | 0.0001 | 0.0002 | 0.9965 | 0.9912 |
| Content-aware Methods | | | | | | | | |
| CGL-GAN | 0.2257 | 0.1546 | 0.1715 | 0.7066 | 0.0605 | 0.0062 | 0.8624 | 0.4043 |
| DS-GAN (Hsu et al., 2023) | 0.2541 | 0.2088 | 0.1874 | 0.8788 | 0.0220 | 0.0046 | 0.8315 | 0.4320 |
| LayoutPrompter (Lin et al., 2023b) | 0.2597 | 0.0992 | 0.1723 | 0.9992 | 0.0036 | 0.0036 | 0.8986 | 0.8802 |
| PosterLLaVa (Ours) | 0.2628 | 0.1649 | 0.1142 | 1.0000 | 7.7e-5 | 0.0002 | 1.0000 | 1.0000 |

Table 4. Results comparison on the CGL-GAN dataset. Evaluations are conducted under CGL-GAN's (Zhou et al., 2022) settings. Previous results are copied for comparison. † indicates that we apply BASNet (Qin et al., 2019) for saliency detection rather than PFPN (Bo et al., 2019), since the link to the latter's pre-trained weights has expired. Content-aware metrics: R_com, R_shm, R_sub. Geometric metrics: R_ove, R_und, R_ali, R_occ.

| Methods | R_com ↓ | R_shm ↓ | R_sub ↓ | R_ove ↑ | R_und ↑ | R_ali ↑ | R_occ ↑ |
|---|---|---|---|---|---|---|---|
| Content-unaware Methods | | | | | | | |
| LayoutTransformer (Gupta et al., 2021) | 40.92 | 21.08 | 1.310 | 0.0156 | 0.9516 | 0.0049 | - |
| VTN (Arroyo et al., 2021) | 41.77 | 22.21 | 1.323 | 0.0130 | 0.9628 | 0.0047 | - |
| Content-aware Methods | | | | | | | |
| ContentGAN (Zheng et al., 2019) | 45.59 | 17.08 | 1.143 | 0.0397 | 0.8626 | 0.0071 | 93.4 |
| CGL-GAN (Zhou et al., 2022) | 35.77 | 15.47 | 0.805 | 0.0233 | 0.9359 | 0.0098 | 99.6 |
| PDA-GAN (Xu et al., 2023) | 33.55 | 12.77 | 0.688 | 0.0290 | 0.9481 | 0.0105 | 99.7 |
| PosterLLaVa (Ours) | 34.80 | 8.214 | 0.277† | 2.4e-10 | 1.0000 | 0.0008 | 100 |

Evaluation Metrics For a convenient comparison across datasets, we adopt each dataset's original evaluation measurements without change. The metrics used for the CGL dataset (Zhou et al., 2022) and the PosterLayout (Hsu et al., 2023) dataset are similar. The content-aware metrics are computed with respect to the background or saliency image: $R_{\text{com}}$ and Rea represent the readability of text elements; $R_{\text{shm}}$, $R_{\text{sub}}$, and Occ represent the occlusion of semantically meaningful or salient regions on the background, while Uti indicates the utilization of non-salient regions. The geometric metrics depend only on the predicted bounding boxes: $R_{\text{ove}}$ and Ove represent the overlap ratio; $R_{\text{und}}$, $\text{Und}_l$, and $\text{Und}_s$ indicate whether the underlays are correctly placed under texts; $R_{\text{ali}}$ and Ali represent the alignment; $R_{\text{occ}}$ and Val indicate the ratio of valid (e.g., non-empty) layouts. For the Ad Banner (Yu et al., 2022) and YouTube (Yang et al., 2023) datasets, similarity metrics are included since the ground-truth layouts are available. VB in the YouTube dataset stands for Visual Balance, which measures whether the overall placement is balanced. To avoid redundancy, we refer readers to the original papers for detailed explanations of the metrics.
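As a rough intuition for the box-only (geometric and similarity) metrics, the sketch below computes the standard intersection-over-union of two boxes and a mean pairwise IoU as a simple overlap indicator. It is a generic illustration, not the benchmark code: the exact definitions of Ove, $R_{\text{ove}}$, IoU, and the other metrics are those of the original papers, and the helper names here are hypothetical.

```python
def box_area(box):
    """Area of a [left, top, right, bottom] box; degenerate boxes give 0."""
    left, top, right, bottom = box
    return max(0.0, right - left) * max(0.0, bottom - top)


def intersection_area(a, b):
    """Area shared by two [left, top, right, bottom] boxes."""
    left = max(a[0], b[0])
    top = max(a[1], b[1])
    right = min(a[2], b[2])
    bottom = min(a[3], b[3])
    return max(0.0, right - left) * max(0.0, bottom - top)


def iou(a, b):
    """Intersection over union of two boxes (0 when both are empty)."""
    inter = intersection_area(a, b)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def mean_pairwise_iou(boxes):
    """Average IoU over all unordered pairs; a simple overlap indicator."""
    pairs = [
        (i, j) for i in range(len(boxes)) for j in range(i + 1, len(boxes))
    ]
    if not pairs:
        return 0.0
    return sum(iou(boxes[i], boxes[j]) for i, j in pairs) / len(pairs)


predicted = [
    [0.0, 0.0, 0.5, 0.5],
    [0.25, 0.25, 0.75, 0.75],
    [0.8, 0.8, 1.0, 1.0],
]
print(mean_pairwise_iou(predicted))
```

The same `iou` helper applied between a predicted box and its ground-truth counterpart gives the flavor of the similarity metrics, which are only available when ground-truth layouts exist.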

Table 5. Results comparison on the ad banner dataset under LayoutDETR's (Yu et al., 2022) settings. Results of previous methods are copied for comparison, among which PosterLLaVa achieves SOTA performance in all metrics except misalignment. Similarity metrics: Layout FID, Image FID, IoU, DocSim. Geometric metrics: Overlap, Misalign.

| Methods | Layout FID ↓ | Image FID ↓ | IoU ↑ | DocSim ↑ | Overlap ↓ | Misalign (×10^-2) ↓ |
|---|---|---|---|---|---|---|
| Ground-Truth | - | - | - | - | 0.035 | 1.889 |
| Content-unaware Methods | | | | | | |
| LayoutGAN++ (Kikuchi et al., 2021) | 4.25 | 28.40 | 0.163 | 0.130 | 0.104 | 0.759 |
| READ | 4.45 | 32.10 | 0.177 | 0.141 | 0.093 | 2.867 |
| Vinci | 38.97 | 58.12 | 0.104 | 0.143 | 0.243 | 0.271 |
| LayoutTransformer (Gupta et al., 2021) | 5.47 | 39.70 | 0.080 | 0.115 | 0.127 | 3.632 |
| Content-aware Methods | | | | | | |
| CGL-GAN (Zhou et al., 2022) | 4.69 | 30.50 | 0.154 | 0.127 | 0.116 | 1.191 |
| ICVT (Cao et al., 2022) | 12.54 | 30.11 | 0.163 | 0.137 | 0.423 | 0.682 |
| LayoutDETR-VAE (Yu et al., 2022) | 3.25 | 27.47 | 0.216 | 0.152 | 0.119 | 1.737 |
| PosterLLaVa (Ours) | 2.37 | 24.87 | 0.242 | 0.158 | 0.029 | 1.161 |

Table 6. Results comparison on the YouTube dataset under HPCVTG's (Yang et al., 2023) settings. Previous results are copied for comparison. PosterLLaVa shows promising performance in reducing overlap and saliency occlusion. Similarity metrics: mIoU, FID. Geometric metrics: VB, Overlap, Misalign, Occlusion.

| Methods | mIoU ↑ | FID ↓ | VB ↓ | Overlap ↓ | Misalign ↓ | Occlusion ↓ |
|---|---|---|---|---|---|---|
| Ground-Truth | - | - | 0.93 | 6.29 | 1.55 | 5.88 |
| Content-unaware Methods | | | | | | |
| LayoutGAN++ (Kikuchi et al., 2021) | 4.06 | 145.7 | 6.01 | 151.02 | 1.52 | 21.23 |
| LayoutTransformer (Gupta et al., 2021) | 11.42 | 59.89 | 6.53 | 76.15 | 0.06 | 18.38 |
| Content-aware Methods | | | | | | |
| HPCVTG (Yang et al., 2023) | 14.16 | 18.50 | 2.13 | 47.51 | 3.25 | 14.41 |
| PosterLLaVa (Ours) | 27.50 | 12.14 | 3.10 | 8.17 | 0.49 | 7.24 |

Result Comparison The results presented in Tab. 3, 4, 5, and 6 demonstrate that our method outperforms existing approaches, both content-unaware and content-aware, on most metrics. On the Ad Banner dataset, our model exhibits improvements across all metrics except Misalign. On the PosterLayout dataset, our method markedly improves the geometric metrics, whereas LayoutPrompter (Lin et al., 2023b) achieves a better trade-off between utility and occlusion. This is understandable because all previous methods incorporate additional input (i.e., pre-processed saliency maps), while our method relies solely on the original background image. Similarly, on the CGL dataset, our method outperforms other approaches, particularly in geometric measurements. These results confirm the effectiveness of our method across various datasets and metrics.

3.2. Towards Real-world Poster Design - Two New Content-aware Layout Datasets

Figure 2. The left figure shows a sample of the poster and the corresponding annotation information in the proposed QB-Poster dataset. The right shows the distribution of element numbers estimated by KDE (Kernel Density Estimation) across all included datasets, in which QB-Poster significantly surpasses other datasets.

User-constrained Layout Generation Although content-aware layout generation has been a valuable step toward real-world applications, realistic graphic design problems often involve additional conditions. User constraints are one of them, typically comprising optional suggestions or mandatory requirements for the graphic design product. These constraints, usually articulated in natural language, introduce further complexity due to their potential ambiguity. As mentioned in Section 2.1, several previous works (Kikuchi et al., 2021; Jiang et al., 2023; Lin et al., 2023a) have explored similar topics. Yet a comprehensive end-to-end solution that seamlessly integrates visual content with natural language constraints is still required. Our methodology, leveraging large multi-modal models, is inherently equipped to bridge this gap.

Figure 3. Qualitative results on the PosterLayout (top), Youtube (middle), and QB-Poster (bottom) datasets. PosterLLaVa achieves the highest overall generation quality on all three datasets.
Table 7. Results comparison on the user-constrained poster dataset (top) and the QB-Poster dataset (bottom). On both datasets, PosterLLaVa outperforms LayoutPrompter significantly.
Methods Similarity Content-aware Geometric Constraint
Image FID ↓ IoU ↑ Uti ↑ Occ ↓ Rea ↓ Val ↑ Ove ↓ Ali ↓ Und_l ↑ Und_s ↑ VB ↓ Vio ↓
User-constrained Poster dataset
LayoutPrompter (Lin et al., 2023b) 20.29 0.0961 0.2024 0.2846 0.1038 0.8512 0.0014 0.0018 0.3916 0.2906 0.0781 0.4130
PosterLLaVa(Ours) 3.823 0.1996 0.1751 0.0924 0.1000 0.9432 0.0014 0.0003 0.9962 0.9944 0.0662 0.1156
QB-Poster dataset
LayoutPrompter (Lin et al., 2023b) 96.86 0.0195 0.2467 0.4504 0.1956 0.9509 0.0233 0.0004 0.2686 0.1501 0.2784 -
PosterLLaVa(Ours) 35.97 0.1996 0.2656 0.3377 0.1659 0.9949 0.0117 4.75e-5 0.9418 0.9141 0.1221 -
Figure 4. Qualitative results on the User-constrained Poster dataset. The user requirement texts are shown on the left side; a requirement shown in bold was violated by at least one of the methods.

To this end, we propose a new dataset to validate the constrained generation ability of our approach. First, we ask human annotators to write 3 user constraints based on the original poster layout in the CGL (Zhou et al., 2022) validation set (6,006 samples); these samples are later used as test samples in this experiment. Then, with these high-quality human-annotated constraints serving as in-context learning examples, we use ChatGPT to generate constraints automatically. This approach enables us to expand our constraint dataset to cover the entire training corpus of the CGL dataset (54,546 samples) and the PosterLayout dataset (9,974 samples), thereby assembling a large training dataset that mirrors the diverse demands of real-world graphic design tasks.
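The expansion step described above can be sketched as a few-shot prompt builder. This is only an illustration under assumptions: the prompt wording, the helper name `build_constraint_prompt`, and the layout field names are hypothetical, since the exact prompt sent to ChatGPT is not given here. The resulting string would be passed to a chat model through whatever client is in use.

```python
import json


def build_constraint_prompt(examples, target_layout, n_constraints=3):
    """Assemble a few-shot prompt from human-annotated (layout, constraints) pairs.

    `examples` is a list of (layout, [constraint, ...]) tuples; the layout
    structure and the instruction text are hypothetical placeholders.
    """
    parts = [
        "Write {} user constraints that the following poster layout "
        "satisfies.".format(n_constraints)
    ]
    # Each human-annotated sample becomes one in-context demonstration.
    for layout, constraints in examples:
        parts.append("Layout: " + json.dumps(layout, sort_keys=True))
        parts.append("Constraints:")
        parts.extend("- " + c for c in constraints)
    # The target layout is appended last, leaving the answer slot open.
    parts.append("Layout: " + json.dumps(target_layout, sort_keys=True))
    parts.append("Constraints:")
    return "\n".join(parts)


demo = [([{"category": "logo", "box": [0.1, 0.05, 0.3, 0.15]}],
         ["Place the logo in the top-left corner."])]
prompt = build_constraint_prompt(
    demo, [{"category": "text", "box": [0.2, 0.7, 0.8, 0.8]}])
```

Keeping the demonstrations in the same serialized form as the target layout is a design choice of this sketch, intended to make the expected output format easy for the model to imitate.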

A New Real-world Poster Dataset A notable limitation of existing content-aware datasets is their oversimplification. Typically, these datasets feature layouts with no more than 15 design elements, categorized into fewer than 5 types. Such simplicity falls short of conveying sufficient semantic information and of mirroring the complexity of real-world graphic designs.

To better align with the demands of real-life scenarios, we collect a new dataset named QB-Poster with a much more complicated style. As shown in Fig. 2, the number of elements per poster and the geometric complexity of QB-Poster surpass those of other datasets significantly. The dataset includes 5,188 poster-layout pairs, with 4,675 for training and 513 for testing. It categorizes design elements into 10 categories: title, subtitle, item logo, item, item title, object, text background, decoration, frame, and text. These fine-grained class labels reveal the design pattern of elements and provide the algorithm with additional semantic information. Text elements are organized using a hierarchical classification to indicate their levels of importance. Meanwhile, visual elements are categorized as decoration, text background, object, and frame, which respectively identify decorative icons, underlays, semantically significant objects within background images, and the canvas area.
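As a rough illustration of how such annotations could be represented as structured text, the sketch below encodes the 10 category names listed above and serializes a layout to JSON. The `Element` class, the `box` field, and the normalized (left, top, right, bottom) convention are assumptions made for this example, not the dataset's actual schema.

```python
import json
from dataclasses import dataclass, asdict

# The 10 element categories named in the text.
QB_POSTER_CATEGORIES = (
    "title", "subtitle", "item logo", "item", "item title",
    "object", "text background", "decoration", "frame", "text",
)


@dataclass
class Element:
    category: str
    # Hypothetical convention: (left, top, right, bottom), normalized to [0, 1].
    box: tuple

    def __post_init__(self):
        # Reject labels outside the fixed category set.
        if self.category not in QB_POSTER_CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")


def layout_to_json(elements):
    """Serialize a list of elements as a JSON array of objects."""
    return json.dumps([asdict(e) for e in elements])


sample = [Element("title", (0.1, 0.05, 0.9, 0.2)),
          Element("decoration", (0.0, 0.8, 0.2, 1.0))]
serialized = layout_to_json(sample)
```

Validating the category at construction time keeps malformed labels out of the serialized text, which matters when the same strings are later used as training targets.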

Baseline and Evaluation Metrics For a fair comparison in model scale, we choose LayoutPrompter (Lin et al., 2023b), which also employs an LLM as its central component. We use gpt-3.5-turbo-instruct instead of text-davinci-003 since OpenAI has deprecated the latter model. Unlike our method, which uses a visual encoder, LayoutPrompter only accepts textual input. Thus, for the user-constrained content-aware generation task, we extend the original method by concatenating the pre-extracted saliency bounding box and the constraint texts. Other methods are omitted since they cannot support multi-modal input, and LayoutPrompter already surpasses them by a clear margin on the PosterLayout dataset. For evaluation metrics, since the metrics used in PosterLayout (Hsu et al., 2023) and CGL-GAN (Zhou et al., 2022) are very similar to each other, we choose the PosterLayout style for our evaluation, as it relies less on additional data and pre-trained models. Different from the original paper, however, our setting (which uses the validation split) includes the ground-truth layouts for testing, which enables the computation of similarity metrics. To compute the image FID, we cut patches from the original poster using the ground-truth bounding boxes and resize them to the predicted bounding boxes to form the predicted poster image. IoU is also introduced as a measurement of similarity. For geometric measurements, we adopt the VB (Visual Balance) metric used in HPCVTG (Yang et al., 2023) as an important supplement, reflecting whether the elements’ placement is spatially balanced. Most importantly, to measure the degree to which the model follows the input constraints, we sample a subset (50 layouts) of the test set and ask human annotators to verify the average constraint violation ratio, denoted as Vio. The overall results are shown in Tab. 7.
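Two of the quantities above can be made concrete with a short sketch. It shows only the box-level IoU primitive and one plausible reading of the violation ratio (the per-layout fraction of violated constraints, averaged over layouts). How predicted and ground-truth elements are matched for IoU, and the exact aggregation used for Vio, are not specified in the text, so both function definitions are assumptions.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (left, top, right, bottom)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def violation_ratio(judgements):
    """Average, over layouts, of the fraction of violated constraints.

    `judgements` holds one list of booleans per layout, where True means
    that a human annotator marked the constraint as violated.
    """
    per_layout = [sum(j) / len(j) for j in judgements if j]
    return sum(per_layout) / len(per_layout) if per_layout else 0.0


iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))   # overlap 1, union 7
vio = violation_ratio([[True, False, False], [False, False, False]])
```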

Result Comparison As shown in Tab. 7, PosterLLaVa significantly surpasses LayoutPrompter on nearly all metrics, in both similarity and geometry, showing the power of visual instruction tuning in layout generation. This result differs from that on the PosterLayout dataset shown in Tab. 4, but it is still within expectation once the difference between retrieval-augmented generation (RAG) and fine-tuning is recognized. It shows that, despite the efficiency of the learning-free method, it may fail to fit the target distribution when dealing with complicated and highly customized data. Moreover, although RAG does not use the training set to tune the model directly, it still requires a large database to ensure the retrieval of high-quality and low-variance exemplars, which worsens its performance under data scarcity.

4. Ablation

We design several ablation experiments to verify the necessity of the components of our proposed method. We make two assumptions: (1) considering the small scale of existing content-aware datasets (<100,000 samples), the generation performance of the model is positively correlated with the number of training samples and the model size; (2) the multi-modal information used contributes to the quality of the generated layouts. The ad banner dataset is selected for ablation because it is the most lightweight dataset that still contains sufficient multi-modal information, and the metrics used are stable (in contrast, the reliability of utility and occlusion scores highly depends on the quality of saliency detection).

Table 8. Ablation studies conducted on the ad banner dataset (Yu et al., 2022). Results demonstrate the necessity of applying large models, large datasets, and multi-modal information in content-aware layout generation.
Methods Similarity Geometric
Layout FID ↓ Image FID ↓ IoU ↑ DocSim ↑ Overlap ↓ Misalign (×10^{-2}) ↓
PosterLLaVa(Ours) 2.37 24.87 0.242 0.158 0.029 1.161
+ extra training data 3.91 24.40 0.251 0.160 0.027 0.949
+ 7B → 13B LLM 2.78 23.86 0.262 0.156 0.026 1.676
- textual info 2.98 25.14 0.225 0.115 0.021 1.522
- visual info 8.27 40.59 0.092 0.115 0.020 2.193

Result The results shown in Tab. 8 support the assumptions proposed above. For extra training data, we use the whole training sets of the CGL, PosterLayout, and ad banner datasets (78,194 samples in total) for fine-tuning, which improves all geometric measurements. Surprisingly, it also improves the similarity metrics except for Layout FID, which suggests that knowledge generalizes across content-aware generation datasets. Furthermore, most similarity measurements continue to improve when the pre-trained LLaVa model is upgraded from 7B to 13B. For multi-modal information, we remove the visual input (i.e., background image) and the textual input (i.e., text element content), respectively, and both removals degrade the overall performance (with a slight improvement in the overlap metric, probably because the reduction of information has lowered the learning difficulty). Together, these results demonstrate the effectiveness of multi-modal large models in content-aware layout generation tasks and, given their enormous learning capacity, the corresponding demand for more high-quality layout data.

5. Conclusion

Content-aware layout generation is a highly multi-modal problem. Utilizing the latest multi-modal large model instruction fine-tuning techniques, we propose a method named PosterLLaVa that represents multi-modal layout information as tokens, which are then processed by a Large Language Model (LLM). The proposed method achieves SOTA performance across multiple content-aware layout generation datasets. Additionally, by surveying existing content-aware layout generation datasets, we identify significant shortcomings in the current public datasets, namely the lack of user-constrained data and complicated data, both of which are crucial in real-world applications. We further collect two new datasets to bridge this gap, the user-constrained poster dataset and QB-Poster, on which we verify the extended ability of our method. In summary, to achieve large-scale automated production, high-quality multi-modal layout data and a unified learning approach are still in demand, and our method paves the way toward both.

References

  • Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
  • Arroyo et al. (2021) Diego Martin Arroyo, Janis Postels, and Federico Tombari. 2021. Variational transformer networks for layout generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13642–13652.
  • Baykasoğlu and Gindy (2001) Adil Baykasoğlu and Nabil NZ Gindy. 2001. A simulated annealing algorithm for dynamic layout problem. Computers & Operations Research 28, 14 (2001), 1403–1426.
  • Bo et al. (2019) Wang Bo, Chen Quan, Zhou Min, Zhang Zhiqiang, Jin Xiaogang, and Gai Kun. 2019. Progressive Feature Polishing Network for Salient Object Detection. (2019).
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.
  • Cao et al. (2022) Yunning Cao, Ye Ma, Min Zhou, Chuanbin Liu, Hongtao Xie, Tiezheng Ge, and Yuning Jiang. 2022. Geometry aligned variational transformer for image-conditioned layout generation. In Proceedings of the 30th ACM International Conference on Multimedia. 1561–1571.
  • Carion et al. (2020) Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In European conference on computer vision. Springer, 213–229.
  • Chen et al. (2023) Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. 2023. TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering. arXiv preprint arXiv:2311.16465 (2023).
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
  • Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
  • Feng et al. (2023) Weixi Feng, Wanrong Zhu, Tsu-jui Fu, Varun Jampani, Arjun Akula, Xuehai He, Sugato Basu, Xin Eric Wang, and William Yang Wang. 2023. LayoutGPT: Compositional Visual Planning and Generation with Large Language Models. arXiv preprint arXiv:2305.15393 (2023).
  • Gupta et al. (2021) Kamal Gupta, Justin Lazarow, Alessandro Achille, Larry S Davis, Vijay Mahadevan, and Abhinav Shrivastava. 2021. Layouttransformer: Layout generation and completion with self-attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1004–1014.
  • Hsu et al. (2023) Hsiao Yuan Hsu, Xiangteng He, Yuxin Peng, Hao Kong, and Qing Zhang. 2023. PosterLayout: A New Benchmark and Approach for Content-aware Visual-Textual Presentation Layout. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6018–6026.
  • Hui et al. (2023) Mude Hui, Zhizheng Zhang, Xiaoyi Zhang, Wenxuan Xie, Yuwang Wang, and Yan Lu. 2023. Unifying Layout Generation with a Decoupled Diffusion Model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1942–1951.
  • Inoue et al. (2023) Naoto Inoue, Kotaro Kikuchi, Edgar Simo-Serra, Mayu Otani, and Kota Yamaguchi. 2023. LayoutDM: Discrete Diffusion Model for Controllable Layout Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10167–10176.
  • Jiang et al. (2023) Zhaoyun Jiang, Jiaqi Guo, Shizhao Sun, Huayu Deng, Zhongkai Wu, Vuksan Mijovic, Zijiang James Yang, Jian-Guang Lou, and Dongmei Zhang. 2023. LayoutFormer++: Conditional Graphic Layout Generation via Constraint Serialization and Decoding Space Restriction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18403–18412.
  • Jyothi et al. (2019) Akash Abdu Jyothi, Thibaut Durand, Jiawei He, Leonid Sigal, and Greg Mori. 2019. Layoutvae: Stochastic scene layout generation from a label set. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 9895–9904.
  • Kikuchi et al. (2021) Kotaro Kikuchi, Edgar Simo-Serra, Mayu Otani, and Kota Yamaguchi. 2021. Constrained graphic layout generation via latent optimization. In Proceedings of the 29th ACM International Conference on Multimedia. 88–96.
  • Kong et al. (2022) Xiang Kong, Lu Jiang, Huiwen Chang, Han Zhang, Yuan Hao, Haifeng Gong, and Irfan Essa. 2022. BLT: bidirectional layout transformer for controllable layout generation. In European Conference on Computer Vision. Springer, 474–490.
  • Li et al. (2023) Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023).
  • Li et al. (2019) Jianan Li, Jimei Yang, Aaron Hertzmann, Jianming Zhang, and Tingfa Xu. 2019. Layoutgan: Generating graphic layouts with wireframe discriminators. arXiv preprint arXiv:1901.06767 (2019).
  • Lin et al. (2023a) Jiawei Lin, Jiaqi Guo, Shizhao Sun, Weijiang Xu, Ting Liu, Jian-Guang Lou, and Dongmei Zhang. 2023a. A parse-then-place approach for generating graphic layouts from textual descriptions. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 23622–23631.
  • Lin et al. (2023b) Jiawei Lin, Jiaqi Guo, Shizhao Sun, Zijiang James Yang, Jian-Guang Lou, and Dongmei Zhang. 2023b. LayoutPrompter: Awaken the Design Ability of Large Language Models. arXiv preprint arXiv:2311.06495 (2023).
  • Liu et al. (2023a) Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2023a. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744 (2023).
  • Liu et al. (2023b) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023b. Visual instruction tuning. arXiv preprint arXiv:2304.08485 (2023).
  • Ma and Chen (2016) Shuang Ma and Chang Wen Chen. 2016. Automatic creation of magazine-page-like social media visual summary for mobile browsing. In 2016 IEEE International Conference on Image Processing (ICIP). IEEE, 469–473.
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022), 27730–27744.
  • Qin et al. (2019) Xuebin Qin, Zichen Zhang, Chenyang Huang, Chao Gao, Masood Dehghan, and Martin Jagersand. 2019. BASNet: Boundary-Aware Salient Object Detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748–8763.
  • Rajasekharan et al. (1998) Maheswaran Rajasekharan, Brett A Peters, and Taho Yang. 1998. A genetic algorithm for facility layout design in flexible manufacturing systems. International journal of Production research 36, 1 (1998), 95–110.
  • Roziere et al. (2023) Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. 2023. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023).
  • Suvorov et al. (2021) Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky. 2021. Resolution-robust Large Mask Inpainting with Fourier Convolutions. arXiv preprint arXiv:2109.07161 (2021).
  • Tang et al. (2023) Zecheng Tang, Chenfei Wu, Juntao Li, and Nan Duan. 2023. LayoutNUWA: Revealing the Hidden Layout Expertise of Large Language Models. arXiv preprint arXiv:2309.09506 (2023).
  • Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023a. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
  • Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
  • Wang et al. (2023) Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. 2023. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. arXiv preprint arXiv:2305.11175 (2023).
  • Wang et al. (2022) Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560 (2022).
  • Xu et al. (2023) Chenchen Xu, Min Zhou, Tiezheng Ge, Yuning Jiang, and Weiwei Xu. 2023. Unsupervised Domain Adaption with Pixel-level Discriminator for Image-aware Layout Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10114–10123.
  • Yamaguchi (2021) Kota Yamaguchi. 2021. Canvasvae: Learning to generate vector graphic documents. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5481–5489.
  • Yang et al. (2023) Tao Yang, Fan Wang, Junfan Lin, Zhongang Qi, Yang Wu, Jing Xu, Ying Shan, and Changwen Chen. 2023. Toward Human Perception-Centric Video Thumbnail Generation. In Proceedings of the 31st ACM International Conference on Multimedia. 6653–6664.
  • Ye et al. (2023) Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. 2023. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178 (2023).
  • Yin et al. (2013) Wenyuan Yin, Tao Mei, and Chang Wen Chen. 2013. Automatic generation of social media snippets for mobile browsing. In Proceedings of the 21st ACM international conference on Multimedia. 927–936.
  • Yu et al. (2022) Ning Yu, Chia-Chih Chen, Zeyuan Chen, Rui Meng, Gang Wu, Paul Josel, Juan Carlos Niebles, Caiming Xiong, and Ran Xu. 2022. LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer. arXiv preprint arXiv:2212.09877 (2022).
  • Zhang et al. (2023a) Junyi Zhang, Jiaqi Guo, Shizhao Sun, Jian-Guang Lou, and Dongmei Zhang. 2023a. LayoutDiffusion: Improving Graphic Layout Generation by Discrete Diffusion Probabilistic Models. arXiv preprint arXiv:2303.11589 (2023).
  • Zhang et al. (2023b) Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. 2023b. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199 (2023).
  • Zheng et al. (2019) Xinru Zheng, Xiaotian Qiao, Ying Cao, and Rynson WH Lau. 2019. Content-aware generative modeling of graphic design layouts. ACM Transactions on Graphics (TOG) 38, 4 (2019), 1–15.
  • Zhou et al. (2022) Min Zhou, Chenchen Xu, Ye Ma, Tiezheng Ge, Yuning Jiang, and Weiwei Xu. 2022. Composition-aware graphic layout GAN for visual-textual presentation designs. arXiv preprint arXiv:2205.00303 (2022).
  • Zhu et al. (2023) Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 2023. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023).

6. Appendices

6.1. More Comparisons

In Section 3.2, we only included the state-of-the-art method, LayoutPrompter (Lin et al., 2023b), for comparison on the two proposed datasets, since it serves as a strong representative (another reason is that other methods did not release their code). To ensure a more comprehensive comparison, we reproduced several previous methods and include the results here. We found that most previous methods could not converge well on our proposed QB-Poster dataset since it contains more complicated element categories and spatial layouts. Tab. 9 presents a preliminary reproduction result, demonstrating that our method outperforms the others on most metrics. Although some traditional methods excel in one or two specific metrics, our model achieves the best overall trade-off.

We also included other instruction-tuning techniques for comparison. MiniGPT-4 (Zhu et al., 2023) is an instruction-tuning method based on Q-Former (Li et al., 2023), while mPLUG (Ye et al., 2023) is a more recent method that tunes both the large language model (LLM) and the visual encoder simultaneously. The results indicate that the visual tuning scheme adopted by LLaVa (Liu et al., 2023b) generally performs the best. This is understandable because the primary task in layout generation is to adapt the input and output format, making the LLM the central component for tuning. Additionally, the existing layout data is still limited in both quality and quantity, and aligning the visual encoder using such data would weaken its general feature extraction ability.

Table 9. Additional comparison on the QB-Poster dataset.
Methods Similarity Content-aware Geometric
Image FID ↓ IoU ↑ Uti ↑ Occ ↓ Rea ↓ Val ↑ Ove ↓ Ali ↓ Und_l ↑ Und_s ↑ VB ↓
QB-Poster dataset
DS-GAN (Hsu et al., 2023) 85.19 0.0558 0.5048 0.4146 0.1995 1.0000 0.1541 0.0034 0.3094 0.1627 0.0287
CGL-GAN (Zhou et al., 2022) 67.10 0.0373 0.2908 0.3904 0.1800 0.9959 0.1375 0.0040 0.3726 0.0600 0.0956
ICVT (Cao et al., 2022) 97.59 0.0231 0.1121 0.3629 0.1442 0.9599 0.4666 0.0018 0.4673 0.3617 0.2903
LayoutDM (Inoue et al., 2023) 159.3 0.0144 0.2218 0.4096 0.1850 0.9980 0.2240 0.0003 0.4736 0.3618 0.1223
LayoutPrompter (Lin et al., 2023b) 96.86 0.0195 0.2467 0.4504 0.1956 0.9509 0.0233 0.0004 0.2686 0.1501 0.2784
PosterLLaVa(Ours) 35.97 0.1996 0.2656 0.3377 0.1659 0.9949 0.0117 4.75e-5 0.9418 0.9141 0.1221
Table 10. Training efficiency test.
Methods Training Device Training Time (sec) Training Epochs
DS-GAN (Hsu et al., 2023) 16 X NVIDIA A10 (24GB) 9030 300
CGL-GAN (Zhou et al., 2022) 16 X NVIDIA A10 (24GB) 21667 300
ICVT (Cao et al., 2022) 16 X NVIDIA A10 (24GB) 12030 300
LayoutDM (Inoue et al., 2023) 16 X NVIDIA A10 (24GB) 19740 300
PosterLLaVa (LLaVa-v1.5) 8 X NVIDIA A10 (24GB) 4186 2
PosterLLaVa (LLaVa-v1.5 LoRA) 8 X NVIDIA A10 (24GB) 2093 2
PosterLLaVa (mPLUG-owl2) 16 X NVIDIA A10 (24GB) 1628 2
PosterLLaVa (miniGPT4) 8 X NVIDIA A100 (40GB) 3414 20

6.2. Efficiency Tests

The utilization of LLMs brings better performance but also a larger computational burden. In this section, we present training time experiments to demonstrate that the increased complexity introduced by LLMs is manageable. Results are shown in Table 10. Interestingly, our method, despite incorporating a much larger model (LLaVa-7B or 13B), requires significantly fewer epochs to converge than previous methods (2 epochs vs. 300 epochs) and can thus save more than 50% of the training time (4186 sec vs. 9030 sec). This improvement is likely due to the spatial arrangement knowledge implicitly encoded in the pre-trained LLMs. Additionally, by using the zero3_offload script for DeepSpeed, the LLM can be tuned on constrained GPU devices, such as 8 x NVIDIA A10 GPUs with only 24 GB of memory each, which is the same type of device that the previous methods required. Furthermore, the LoRA scheme can further reduce training time and memory requirements, making it a better alternative to full fine-tuning when adopting larger models (>13B). In summary, using LLMs for layout generation is promising for achieving both better effectiveness and improved efficiency.
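The time saving quoted above follows directly from the figures in Table 10, as the short calculation below shows. The device counts and times are copied from that table; the `gpu_seconds` proxy (device count multiplied by wall-clock time) is an assumption added for illustration and ignores differences in utilization.

```python
# (number of A10 devices, wall-clock training seconds), copied from Table 10.
RUNS = {
    "DS-GAN": (16, 9030),
    "PosterLLaVa (LLaVa-v1.5)": (8, 4186),
    "PosterLLaVa (LLaVa-v1.5 LoRA)": (8, 2093),
}


def time_saving(baseline, ours):
    """Fraction of wall-clock time saved relative to the baseline run."""
    return 1.0 - RUNS[ours][1] / RUNS[baseline][1]


def gpu_seconds(name):
    """Device count times wall-clock seconds; a rough cost proxy (assumption)."""
    n_devices, seconds = RUNS[name]
    return n_devices * seconds


full = time_saving("DS-GAN", "PosterLLaVa (LLaVa-v1.5)")
lora = time_saving("DS-GAN", "PosterLLaVa (LLaVa-v1.5 LoRA)")
print(f"full fine-tuning saves {full:.1%}, LoRA saves {lora:.1%}")
```

Relative to DS-GAN, the fastest of the reproduced baselines in Table 10, full fine-tuning saves roughly 54% of the wall-clock time and LoRA roughly 77%.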