Auto Cherry-Picker: Learning from High-quality Generative Data Driven by Language

Yicheng Chen¹,², Xiangtai Li¹,³, Yining Li¹†, Yanhong Zeng¹,
Jianzong Wu¹,⁴, Xiangyu Zhao¹,⁵, Kai Chen¹†
¹Shanghai AI Laboratory ²Tongji University
³S-Lab, Nanyang Technological University
⁴Peking University ⁵Shanghai Jiao Tong University

Project page: https://yichengchen24.github.io/projects/autocherrypicker

Abstract

Diffusion-based models have shown great potential in generating high-quality images with various layouts, which can benefit downstream perception tasks. However, fully automatic layout generation driven only by language and a suitable metric for measuring multiple generated instances have not been well explored. In this work, we present Auto Cherry-Picker (ACP), a novel framework that generates high-quality multi-modal training examples to augment perception and multi-modal training. Starting with a simple list of natural language concepts, we prompt large language models (LLMs) to generate a detailed description and design reasonable layouts. Next, we use an off-the-shelf text-to-image model to generate multiple images. Then, the generated data are refined using a comprehensively designed metric to ensure quality. In particular, we present a new metric, Composite Layout and Image Score (CLIS), to evaluate the generated images fairly. Our synthetic high-quality examples boost performance in various scenarios by customizing the initial concept list, especially in addressing challenges associated with long-tailed distribution and imbalanced datasets. Experiment results on downstream tasks demonstrate that Auto Cherry-Picker can significantly improve the performance of existing models. In addition, we have thoroughly investigated the correlation between CLIS and performance gains in downstream tasks, and we find that a better CLIS score results in better performance. This finding shows the potential role of evaluation metrics in various visual perception and MLLM tasks. Code will be available.

† Corresponding author.

1 Introduction

Figure 1: Illustration of quality assessment of generated data samples using CLIS. (a) and (c) compare the quality of samples with different CLIS-L and CLIS-I scores, respectively. Samples with low CLIS fail to align accurately with the condition (e.g., contain extraneous objects or exhibit visual flaws). (b) and (d) compare the preferences of CLIS and CLIP score [28].

Recently, diffusion-based image generation methods [64, 71] have made remarkable progress, which has led to various applications, including text-to-image (T2I) generation [22, 23, 63, 67, 77], image editing [4, 27, 55, 66], video generation [24, 30, 31], art creation [18, 65], and more. Compared with previous generative models [15, 34, 57], diffusion models can generate high-quality and high-resolution examples. Thus, one essential usage of the diffusion-based model is to create useful training examples for various downstream vision tasks, such as segmentation [37, 43, 84], detection [8, 46], and visual representation learning [38, 75]. Using generated data alleviates the severe demand for human annotation and provides a more controllable data production process.

Thus, previous works explore how to generate training examples for various tasks conditioned on various references, such as captions [45], layout guidance [8, 80], reference masks [52, 73], and reference images [87]. In particular, InstanceDiffusion [80] generates images with precise instance-level control, including attribute binding and localization, while maintaining high fidelity conditioned on detailed instances. However, these data generation approaches still rely on manual annotations. In addition, due to the inherent randomness of generative models, the quality of generated data tends to vary, potentially impairing the effectiveness of using this data to train downstream tasks. Consequently, appropriate quality assessment metrics must be employed to filter the synthetic data, a problem that has not been extensively studied. Therefore, as illustrated in Figure 2, current methods are limited by expensive manual annotations and a lack of effective quality assessment metrics. This study aims to explore a new generation pipeline that does not require real textual or visual annotations, such as captions and layouts. To address these limitations, we raise several essential questions: 1) How can we reduce the reliance on annotations, including captions and layouts? 2) How do we measure the quality of multiple instances, and can we propose a new metric to select good ones? 3) Does the proposed metric reflect the final downstream performance when used for training?

To this end, we propose Auto Cherry-Picker (ACP), a data generation pipeline based on pre-trained generative models, to generate images and corresponding detailed descriptions and layout annotations simultaneously for perception and reasoning tasks. Our method comprises a generative-model-based raw data generator and a comprehensive data filter. To reduce the reliance on real annotations, our raw data generator alters the paradigm of current data synthesis. It is purely driven by natural language, i.e., an arbitrary list of objects. We first use LLMs to sample scene-level descriptions with fine-grained details, including object attributes, relations, and spatial layouts. Then, we adopt T2I models to generate images conditioned on the previously generated information. This pipeline allows us to easily scale up synthetic examples from a simple list of objects of interest. Besides, we can address the unbalanced distribution problem by adjusting the category proportions, especially in long-tailed scenarios. To ensure the quality of synthetic data, we design a comprehensive metric, namely Composite Layout and Image Score (CLIS), to filter the raw data from the generator. This metric evaluates the reasonableness of layout generation (CLIS-L) and the quality of image generation (CLIS-I). CLIS-L is calculated by measuring the similarity of a generated layout to layout priors drawn from an open-source dataset. As shown in Figure 1(a, b), a high-quality layout assessed by CLIS-L resembles real-world layouts and is more likely to produce high-quality images. CLIS-I is calculated by evaluating both visual quality and alignment with text descriptions. As shown in Figure 1(c, d), a high-quality image assessed by CLIS-I exhibits high visual quality and strong alignment with the corresponding text description. The filtered scene graphs and corresponding images are used as training examples.

Through comprehensive evaluations, our designed CLIS significantly enhances the performance of the state-of-the-art generation model, InstanceDiffusion, from various generation perspectives, including image fidelity, alignment to text, and layout control. Additionally, we observe a positive correlation between the CLIS score and performance gains on downstream tasks. By scaling up the training examples generated by our Auto Cherry-Picker pipeline, we achieve substantial performance improvements for perception and reasoning tasks, particularly in long-tail and open-vocabulary scenarios. Specifically, on the LVIS dataset [21], we observe a +5.2% improvement in AP ${}_{r}^{mask}$ in the long-tail setting using Mask R-CNN [25] and a +1.3% improvement in the open-vocabulary setting using Grounding DINO [51]. Additionally, we achieve an improvement of +80.1 on the MME benchmark and an improvement of +0.4 on the GQA benchmark based on LLaVA-v1.5 [49], validating its effectiveness in multi-modal perception and reasoning settings.

We summarize our technical contributions as follows:

$\bullet$ We propose Auto Cherry-Picker, a novel training data generation pipeline for perception and reasoning tasks. Our method is conditioned solely on an object list, eliminating the need for textual or visual annotations.
$\bullet$ We design a comprehensive metric, CLIS, to filter generated data effectively. We evaluate the reasonableness of generated layouts and the quality of generated images based on priors from real data or pre-trained large-scale models.
$\bullet$ Extensive experiments on visual and cross-modality perception and reasoning benchmarks demonstrate that our method can enhance model performance for various downstream tasks. We also study the correlation between CLIS and performance gains on downstream tasks.

2 Related Work

Text to Image Generation. Diffusion-based approaches [56, 63, 64, 67] model the text-to-image generation process as iterative denoising steps from random noise. Stable Diffusion [64] performs diffusion steps in the latent space of pre-trained autoencoders to achieve efficient training and sampling. Subsequent studies extend text-to-image diffusion models with layout controllability by introducing auxiliary input signals [41, 88] or spatial tokens [86] during training. Another line of work [2, 3, 9, 13, 70, 85] follows a training-free approach by directly intervening in the cross-attention layers during the sampling process. With recent progress in the field of Large Language Models (LLMs), LLMs have been introduced into T2I systems to enhance text understanding and alignment [11, 16, 19, 44, 60, 61, 62, 90]. LMD [44] and LayoutLLM-T2I [62] first use LLMs as text-guided layout generators through in-context learning. Control-GPT [90] queries GPT-4 to generate TikZ-encoded sketch references for T2I models. DiffusionGPT [61] utilizes an LLM to accommodate diverse text input forms and select domain-expert models for superior generation quality. LLM Blueprint [19] leverages LLMs to extract critical components from text prompts. LayoutGPT [16] focuses on numerical and spatial correctness for layout generation. Compared to previous works, we extend the generation paradigm to be conditioned on a simple object list.

Learning from Synthetic Data. Deep learning models, especially for dense prediction tasks, typically require large amounts of data, which can be costly to collect and annotate. Therefore, many works use synthetic data to approximate information gathered or measured in the real world [79, 82]. Synthetic training examples are conditioned on various references. Some works [8, 52, 73] utilize the layout-to-image paradigm to synthesize training samples, conditioning on visual annotations like segmentation masks or bounding boxes. Others [43, 83] utilize an off-the-shelf perception model or adopt a perception head to get dense annotations of synthetic images, which are generated conditioned on detailed text descriptions. Synthetic data can also be used in self-supervised learning [6, 10, 35, 38, 40, 75]. Among all these studies, none explores a fully language-driven pipeline. To fill this gap, our method is driven purely by language without the need for expensive manually annotated dense labels.

Generative Model Evaluation. Assessment of AI-generated content is challenging due to its subjective nature and the complexity of factors contributing to the generation quality. Metrics like Inception Score (IS) [68], Fréchet Inception Distance (FID) [29], and LPIPS [89] are commonly used for quality and diversity assessment. Some methods utilize VLMs to evaluate the alignment between text and generated images. CLIPScore [28] calculates the cosine similarity between text features and generated-image features extracted by CLIP. BLIP-CLIP [7] applies BLIP [39] to generate captions, then calculates the CLIP text-text cosine similarity between the generated captions and text prompts. For layout quality assessment, LayoutDM [33] proposes Maximum IoU (Max.) to measure the similarity between generated and real layouts as the average box IoU of the optimal instance matching. Our proposed CLIS evaluates instance-level results in complex scenes and combines the reasonableness of layout and content quality in a single metric, making it suitable for selecting high-quality data for downstream tasks.

3 Method

Our Auto Cherry-Picker is a training-free cross-modality perception and reasoning dataset generation pipeline. It can produce image and scene graph pairs conditioned on a simple list of objects while automatically selecting the high-quality ones for training downstream models. We adopt an off-the-shelf LLM as a Scene Graph and Layout Generator and combine an off-the-shelf diffusion model as an Image Generator. We first introduce the problem setting in Sec. 3.1. Then, we detail our framework, including the raw data generator and the data filter in Sec. 3.2. Finally, we explain the deployment on various downstream tasks in Sec. 3.3.

3.1 Task Formation

We now present the problem setting of data generation. Given an object list $O=\{o_{1},o_{2},...,o_{n}\}$, the raw generated data is denoted as $D_{r}$, with detailed scene graph $SG$ annotations and a set of raw images $I$.

$$D_{r}=G(O),\quad D_{r}=\{SG,I\} \tag{1}$$

where $G$ represents a raw data generator. Then, a data filter $F$ is applied to the initially generated $D_{r}$ to automatically produce a high-quality dataset $D_{h}$ with a high-quality scene graph $SG_{h}$ and corresponding images $I_{h}$.

$$D_{h}=F(D_{r}),\quad D_{h}=\{SG_{h},I_{h}\} \tag{2}$$

Given $D_{h}$ as newly generated data, we can evaluate it from two aspects: 1) using generation metrics, such as FID [29]; and 2) using $D_{h}$ to co-train downstream models and measuring the performance gains.
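The two-stage formulation in Eqs. (1) and (2) can be read as a simple function composition. The following is a minimal sketch of that reading, not the authors' implementation: the names `Sample`, `run_acp`, `toy_generator`, and `toy_filter` are hypothetical, and the toy stand-ins only mimic the shape of the data, whereas a real generator and filter would call an LLM, a T2I model, and CLIS.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Sample:
    scene_graph: dict   # SG: descriptions, relations, and layouts
    images: List[str]   # I: identifiers of the generated images


def run_acp(objects: Sequence[str],
            generator: Callable[[Sequence[str]], List[Sample]],
            data_filter: Callable[[List[Sample]], List[Sample]]) -> List[Sample]:
    raw = generator(objects)    # Eq. (1): D_r = G(O)
    return data_filter(raw)     # Eq. (2): D_h = F(D_r)


# Toy stand-ins (hypothetical), used only to exercise the composition.
def toy_generator(objects):
    # One scene graph with four raw images, matching the four-images-per-graph setup.
    return [Sample({"objects": list(objects)}, [f"img_{k}" for k in range(4)])]


def toy_filter(raw):
    # Pretend that only the first two images of each sample pass the filter.
    return [Sample(s.scene_graph, s.images[:2]) for s in raw]


high_quality = run_acp(["dog", "frisbee"], toy_generator, toy_filter)
```

Keeping $G$ and $F$ as interchangeable callables mirrors the claim that the pipeline can be adapted to other LLMs and image generators.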

3.2 ACP Framework

We aim to design a high-quality cross-modality training data generator conditioned on a simple object list in natural language, as depicted in Figure 3. The design of ACP comprises two key components: a raw data generator and a data filter. The former aims to generate data, while the latter selects good ones with our proposed CLIS metric.

Raw Data Generator. We propose to generate data samples by harnessing the information from a simple object list. This enables us to easily scale the data size and align it to specific downstream tasks. We term this component the raw data generator.

As shown in Figure 3(a), we first require LLMs to sample descriptions from the initial object list, leveraging their in-context learning capability [5]. The description contains detailed attributes of each object, relations between different objects, and an overall dense caption. Our method involves crafting specific prompt templates that guide the LLM in generating the required description. Then, we utilize the spatial reasoning ability of LLMs to plan layouts given relations and descriptions of objects. The LLM involved in the above process is referred to as the scene graph (description and layouts of the input object list) generator. Conditioned on the synthesized scene graph, we adopt an off-the-shelf diffusion-based image generator to generate a set of initial images by initiating the reverse diffusion process with different random noise. Following the approach used in StableRep [75] and SynCLR [74], we produce four images for each scene graph.

Data Filter. For the raw generated data from stage one, we utilize a data filter to cherry-pick high-quality training data. As depicted in Figure 3(b), we use the Layout Filter with CLIS-L and the Image Filter with CLIS-I to separately assess the quality of layouts and image contents. We will first describe the layout filter and then the image filter.

(a) Layout Filter. To equip models with reasoning abilities, we also emphasize the rationality of scene graphs. We evaluate the rationality of layouts in the scene graph based on priors from ground-truth layouts. Specifically, we extract layouts from real dataset annotations, such as Flickr30K [59], together with their corresponding categories and relations to construct an example pool $E$. Given a subject $s$, an object $o$, and their relation $r$, we consider the relative size, distance, and direction similarity between their corresponding layouts and layouts in $E$ with the same categories and relation. Formally, we define the relative similarity score between $A$ and $B$ as follows:

$$S_{sim}(A,B)=1-\frac{\left|A_{n}-B_{n}\right|}{\max(A_{n},B_{n})} \tag{3}$$

where $A_{n}$ and $B_{n}$ represent the normalization of $A$ and $B$, respectively. Here we have the size score:

$$S_{size}(s,o,r)=\max\limits_{\{s_{e},o_{e}\}\in E(s,o,r)}S_{sim}\left(\frac{{\rm Area}(s)}{{\rm Area}(o)},\frac{{\rm Area}(s_{e})}{{\rm Area}(o_{e})}\right) \tag{4}$$

where ${\rm Area}$ represents the area of the corresponding layout.
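As a concrete reading of Eqs. (3) and (4), the sketch below computes the relative similarity and the size score for axis-aligned boxes. It rests on several assumptions that the text does not specify: boxes are given as `(x_min, y_min, x_max, y_max)` with non-zero area, the inputs to `s_sim` are treated as already normalized non-negative values, two zero inputs are defined to be perfectly similar, and the example pool passed in is the non-empty subset $E(s,o,r)$. All function names are hypothetical.

```python
from typing import Iterable, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x_min, y_min, x_max, y_max)


def s_sim(a: float, b: float) -> float:
    """Eq. (3) for non-negative inputs that are assumed to be normalized already."""
    largest = max(a, b)
    if largest == 0:
        return 1.0  # assumption: two zero quantities count as identical
    return 1.0 - abs(a - b) / largest


def area(box: Box) -> float:
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def size_score(s: Box, o: Box, examples: Iterable[Tuple[Box, Box]]) -> float:
    """Eq. (4): best agreement of the subject/object area ratio over the pool."""
    ratio = area(s) / area(o)
    return max(s_sim(ratio, area(se) / area(oe)) for se, oe in examples)


subject, obj = (0, 0, 2, 2), (0, 0, 1, 1)                 # area ratio 4
pool = [((0, 0, 1, 1), (0, 0, 1, 1)),                     # area ratio 1
        ((0, 0, 2, 1), (0, 0, 1, 1))]                     # area ratio 2
score = size_score(subject, obj, pool)                    # best match is ratio 2
```

In this toy pool the closest example has an area ratio of 2 against the generated ratio of 4, so the score is $1 - |4-2|/4 = 0.5$.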

For the distance score, we consider both IoU and relative distance between centers of two layouts:

$$S_{IoU}(s,o,s_{e},o_{e})=S_{sim}({\rm IoU}(s,o),{\rm IoU}(s_{e},o_{e})) \tag{5}$$
$$S_{RelDist}(s,o,s_{e},o_{e})=S_{sim}({\rm RelDist}(s,o),{\rm RelDist}(s_{e},o_{e})) \tag{6}$$
$$S_{Dist}(s,o,r)=\max\limits_{\{s_{e},o_{e}\}\in E(s,o,r)}\left(\alpha\cdot S_{IoU}(s,o,s_{e},o_{e})+\beta\cdot S_{RelDist}(s,o,s_{e},o_{e})\right) \tag{7}$$

where ${\rm IoU}$ and ${\rm RelDist}$ are functions to calculate IoU and relative distance between two objects. The coefficients $\alpha$ and $\beta$ control the weight of two items.
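Eqs. (5) to (7) can be sketched in the same style. The definition of ${\rm RelDist}$ and the values of $\alpha$ and $\beta$ are not given in the text, so this sketch assumes that ${\rm RelDist}$ is the Euclidean distance between box centers and that $\alpha=\beta=0.5$; both are illustrative choices, and the function names are hypothetical.

```python
import math
from typing import Iterable, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x_min, y_min, x_max, y_max)


def s_sim(a: float, b: float) -> float:
    """Eq. (3) for non-negative inputs; two zeros are treated as identical."""
    largest = max(a, b)
    return 1.0 if largest == 0 else 1.0 - abs(a - b) / largest


def iou(a: Box, b: Box) -> float:
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def rel_dist(a: Box, b: Box) -> float:
    # Assumption: relative distance = Euclidean distance between box centers.
    return math.hypot((a[0] + a[2]) / 2 - (b[0] + b[2]) / 2,
                      (a[1] + a[3]) / 2 - (b[1] + b[3]) / 2)


def dist_score(s: Box, o: Box, examples: Iterable[Tuple[Box, Box]],
               alpha: float = 0.5, beta: float = 0.5) -> float:
    """Eq. (7): best weighted sum of Eq. (5) and Eq. (6) over the example pool."""
    return max(alpha * s_sim(iou(s, o), iou(se, oe))
               + beta * s_sim(rel_dist(s, o), rel_dist(se, oe))
               for se, oe in examples)


s, o = (0, 0, 2, 2), (1, 1, 3, 3)              # IoU 1/7, center distance sqrt(2)
far_example = ((0, 0, 2, 2), (4, 4, 6, 6))     # IoU 0, center distance 4*sqrt(2)
same_example = (s, o)
```

With these inputs, `dist_score(s, o, [far_example])` evaluates to $0.5\cdot 0+0.5\cdot 0.25=0.125$, while an identical example pair yields the maximum of 1.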

For direction score:

$$S_{Dir}(s,o,r)=\max\limits_{\{s_{e},o_{e}\}\in E(s,o,r)}{\rm Norm}\left(\cos\left[{\rm Dir}(s,o),{\rm Dir}(s_{e},o_{e})\right]\right) \tag{8}$$

where ${\rm Dir}$ calculates the direction vector between two layouts.
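A small sketch of Eq. (8) follows. Two details are assumptions rather than statements from the text: ${\rm Dir}(s,o)$ is taken to be the vector from the subject's center to the object's center, and ${\rm Norm}$ is taken to be the affine map $(c+1)/2$ that sends a cosine in $[-1,1]$ to $[0,1]$. A zero-length direction vector is given a cosine of 0. The function names are hypothetical.

```python
import math
from typing import Iterable, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x_min, y_min, x_max, y_max)


def direction(s: Box, o: Box) -> Tuple[float, float]:
    # Assumption: vector pointing from the subject center to the object center.
    return ((o[0] + o[2]) / 2 - (s[0] + s[2]) / 2,
            (o[1] + o[3]) / 2 - (s[1] + s[3]) / 2)


def cosine(u: Tuple[float, float], v: Tuple[float, float]) -> float:
    norm_u, norm_v = math.hypot(*u), math.hypot(*v)
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: undefined directions are treated as orthogonal
    return (u[0] * v[0] + u[1] * v[1]) / (norm_u * norm_v)


def dir_score(s: Box, o: Box, examples: Iterable[Tuple[Box, Box]]) -> float:
    """Eq. (8) with Norm(c) = (c + 1) / 2, an assumed normalization."""
    d = direction(s, o)
    return max((cosine(d, direction(se, oe)) + 1) / 2 for se, oe in examples)


s, o = (0, 0, 2, 2), (4, 0, 6, 2)                    # object lies to the right
same = ((0, 0, 1, 1), (2, 0, 3, 1))                  # also to the right
opposite = ((2, 0, 3, 1), (0, 0, 1, 1))              # object lies to the left
perpendicular = ((0, 0, 1, 1), (0, 2, 1, 3))         # object lies along the other axis
```

Under this normalization an example with the same direction scores 1, a perpendicular one scores 0.5, and an opposite one scores 0.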

In summary, CLIS-L for generated layouts conditioned on subject, object, and relation can be expressed as:

$$\text{CLIS-L}(s,o,r)=w_{1}S_{size}(s,o,r)+w_{2}S_{Dist}(s,o,r)+w_{3}S_{Dir}(s,o,r) \tag{9}$$

where the weights $w_{1}$, $w_{2}$, and $w_{3}$ assign different significance to the different evaluation perspectives.
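Eq. (9) is a weighted sum of the three component scores. The weight values are not stated in this section, so the sketch below takes them as an explicit argument; the equal-weight default and the weights in the worked example are purely illustrative assumptions, and `clis_l` is a hypothetical name.

```python
from typing import Tuple


def clis_l(size: float, dist: float, direction: float,
           weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Eq. (9): weighted combination of the size, distance, and direction scores."""
    w1, w2, w3 = weights  # assumed values; the paper's weights are not given here
    return w1 * size + w2 * dist + w3 * direction


# Worked example with illustrative component scores and weights.
example = clis_l(0.5, 0.125, 1.0, weights=(0.5, 0.25, 0.25))
```

For these inputs the result is $0.5\cdot 0.5+0.25\cdot 0.125+0.25\cdot 1.0=0.53125$.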

(b) Image Filter. To equip a model with strong perception abilities and complex reasoning abilities, we emphasize the visual quality of the image itself and its alignment with the corresponding category description. Therefore, we filter images from these two perspectives. Specifically, we utilize a pre-trained multi-modal caption model to describe the entire image and local parts corresponding to each layout. This ensures that each object in the generated image is of sufficient quality to be identified by perception models. Then, we calculate the similarity between the predicted description from the caption model and the target description from the scene graph. The image filtering by CLIS-I can be formulated as:

$$\text{CLIS-I}={\rm Sim}(SG_{pred},SG)={\rm Sim}(C(I,L),SG) \tag{10}$$

where $L$ represents layouts in the scene graph. We adopt a pre-trained VLM, Qwen-VL [1], for the caption model $C$ and an LLM to calculate the similarity within the text modality. By manually designing task instructions for the LLM, we can assign different significance to different parts of the scene graph.
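The structure of Eq. (10) can be sketched with the caption model $C$ and the similarity function ${\rm Sim}$ passed in as callables. The names below are hypothetical, and the stand-ins are deliberately simple: the toy captioner looks up a fixed string per box instead of calling Qwen-VL, and the token-level Jaccard similarity replaces the LLM-based judgment described above, so the numbers it produces are not CLIS-I values.

```python
from typing import Callable, Dict, Tuple

Box = Tuple[float, float, float, float]


def predict_scene_graph(image, layouts: Dict[str, Box],
                        caption: Callable[[object, Box], str]) -> Dict[str, str]:
    """SG_pred = C(I, L): one caption per layout region."""
    return {name: caption(image, box) for name, box in layouts.items()}


def clis_i(image, layouts: Dict[str, Box], target: Dict[str, str],
           caption: Callable[[object, Box], str],
           similarity: Callable[[Dict[str, str], Dict[str, str]], float]) -> float:
    """Eq. (10): Sim(C(I, L), SG)."""
    return similarity(predict_scene_graph(image, layouts, caption), target)


def jaccard_similarity(pred: Dict[str, str], target: Dict[str, str]) -> float:
    # Stand-in for the LLM judge: mean token overlap per object description.
    scores = []
    for name, text in target.items():
        a, b = set(text.lower().split()), set(pred.get(name, "").lower().split())
        scores.append(len(a & b) / len(a | b) if a | b else 1.0)
    return sum(scores) / len(scores)


layouts = {"dog": (0.0, 0.0, 0.5, 0.5)}
target = {"dog": "a brown dog"}
fake_image = {(0.0, 0.0, 0.5, 0.5): "a black dog"}   # toy image: box -> caption
toy_caption = lambda image, box: image[box]
score = clis_i(fake_image, layouts, target, toy_caption, jaccard_similarity)
```

Here the predicted caption shares two of four distinct tokens with the target, so the stand-in similarity is 0.5.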

3.3 Deployment on Downstream Tasks

We first generate training samples for visual perception tasks in long-tailed instance segmentation and open-vocabulary object detection. Then, we generate cross-modality training samples for perception and reasoning tasks in multi-modal visual question answering. We employ Qwen1.5-14B as our Scene Graph Generator as described in Appendix A.1.

Visual Perception Training Samples. To maintain consistency with the distribution of the original training set, we sample our object list from training annotations. We utilize the image and layout annotations from our generated data. Additionally, we adopt a pre-trained SAM to generate segmentation masks within each layout.

Model	FID $\downarrow$
Stable Diffusion [64]	56.8
BoxDiff-SD [85]	60.0
BoxDiff-GLIGEN [85]	61.0
GLIGEN [41]	63.5
InstanceDiffusion [80]	53.5
GLIGEN w. CLIS	59.9 (-3.6)
InstanceDiffusion w. CLIS	48.9 (-4.6)

Table 1: Generation results of CLIS.

Figure 4: Consistent with human judgement.

Table 2: Correlation between CLIS-I and performance improvements on both long-tailed instance segmentation and open-vocabulary object detection scenarios of LVIS benchmarks. The baseline is trained on the original training set, while the others are trained on the training and synthetic set. The annotations of rare categories are not used for open-vocabulary settings. We use Mask R-CNN R50-FPN (1X schedule) for long-tailed instance segmentation and Grounding-DINO for open-vocabulary object detection.

Method	Range of CLIS-I	Long-tailed AP ${}_{r}^{box}$	Long-tailed AP ${}_{r}^{mask}$	Long-tailed AP^box	Long-tailed AP^mask	Open-vocabulary AP ${}_{r}^{box}$	Open-vocabulary AP^box
baseline	N/A	8.9	9.3	22.5	21.7	44.4	57.3
Image Data Filter	70 - 75	9.9	10.7	22.8	22.1	42.5	57.4
Image Data Filter	75 - 80	10.1	11.5	22.8	22.1	44.2	57.3
Image Data Filter	80 - 85	11.2	12.1	23.4	22.6	45.8	57.7

Cross-modality Perception and Reasoning Training Samples. We further utilize templates to construct question-answer pairs for instruction fine-tuning of MLLMs. Instruction data for perception is constructed based on the generated image and scene graph. We focus on two aspects. 1) Localization: We create instructions that require models to localize an object based on a detailed description. Conversely, we also create instructions that require models to describe an object, given its localization. 2) Attribute-binding: We construct instructions that query models regarding attributes (e.g., color, count) of certain objects. Additionally, we formulate instruction data for reasoning. Relation: We create instructions that query models about the relationship between different objects, which may be spatial or action-based. The expected responses should be descriptive in natural language. Please refer to Appendix A.3 for details of templates used to construct question-answer pairs.
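To illustrate the template-based construction described above, the following sketch turns a scene graph into localization, description, attribute-binding, and relation question-answer pairs. The scene-graph schema and the wording of every template are hypothetical placeholders; the actual templates are those given in Appendix A.3.

```python
from typing import Dict, List, Tuple


def build_qa(scene_graph: Dict) -> List[Tuple[str, str]]:
    """Build (question, answer) pairs from an assumed scene-graph schema."""
    qa = []
    for obj in scene_graph["objects"]:
        # 1) Localization, in both directions.
        qa.append((f"Where is {obj['description']}?", str(obj["box"])))
        qa.append((f"Describe the object at {obj['box']}.", obj["description"]))
        # 2) Attribute-binding.
        for key, value in obj.get("attributes", {}).items():
            qa.append((f"What is the {key} of the {obj['name']}?", value))
    # Relation (reasoning).
    for subject, relation, target in scene_graph.get("relations", []):
        qa.append((f"What is the relationship between the {subject} and the {target}?",
                   f"The {subject} is {relation} the {target}."))
    return qa


scene_graph = {
    "objects": [
        {"name": "dog", "description": "a brown dog jumping",
         "box": [0.1, 0.2, 0.5, 0.6], "attributes": {"color": "brown"}},
        {"name": "frisbee", "description": "a red frisbee in the air",
         "box": [0.6, 0.1, 0.8, 0.2]},
    ],
    "relations": [("dog", "catching", "frisbee")],
}
pairs = build_qa(scene_graph)
```

Because every answer is read directly from the scene graph, the quality of the resulting instruction data depends on how well the image matches that scene graph, which is what the CLIS filter is meant to check.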

CLIS Setting. We employ CLIS to filter high-quality training samples by setting independent score thresholds for layouts (CLIS-L) and images (CLIS-I). Notably, layouts assessed as low-quality are excluded from generating corresponding images.
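The two-threshold setting can be sketched as a two-stage loop in which rejected layouts never reach the image generator. The function names, the scene-graph fields, and the threshold values (70 for layouts, 80 for images) are assumptions chosen for illustration; the scoring and generation functions are passed in as callables and replaced by toy stand-ins here.

```python
def cherry_pick(scene_graphs, score_layout, generate_images, score_image,
                layout_threshold, image_threshold):
    """Keep (scene_graph, image) pairs that pass both independent thresholds."""
    kept = []
    for sg in scene_graphs:
        if score_layout(sg) < layout_threshold:
            continue  # low-quality layout: no images are generated for it
        for image in generate_images(sg):
            if score_image(image, sg) >= image_threshold:
                kept.append((sg, image))
    return kept


# Toy stand-ins (hypothetical): scores are stored directly on the data.
generated_for = []


def toy_generate(sg):
    generated_for.append(sg["id"])              # record which layouts reach the generator
    return [(sg["id"], 85.0), (sg["id"], 60.0)]  # (scene graph id, image score)


scene_graphs = [{"id": 0, "layout_score": 90.0}, {"id": 1, "layout_score": 40.0}]
kept = cherry_pick(scene_graphs,
                   score_layout=lambda sg: sg["layout_score"],
                   generate_images=toy_generate,
                   score_image=lambda image, sg: image[1],
                   layout_threshold=70.0,
                   image_threshold=80.0)
```

In this toy run only scene graph 0 reaches the generator, and only its higher-scoring image is kept, which reflects the saving in generation cost from filtering layouts first.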

4 Experiments

We first validate the proposed CLIS metric from two perspectives: 1) its correlation with image fidelity of generated samples and 2) the performance gain observed in downstream tasks when using CLIS as a training data filter. Next, we proceed to verify the effectiveness of the ACP system on several downstream tasks and demonstrate its potential for continuous scaling up of data size. We verify the designed modules in ACP via a series of ablation studies.

Table 3: Correlation between CLIS-L and performance improvements on the multi-modal perception and reasoning benchmarks MME and GQA. Based on the same pre-trained weights of the LLaVA-v1.5 model, the baseline is instruction fine-tuned on original data from [48], while the others are additionally trained on a synthetic instruction fine-tuning set filtered with different layout score thresholds.

Method	Threshold of CLIS-L	MME	GQA
baseline	N/A	1434.4	58.9
Layout Data Filter	50	1445.2	59.2
Layout Data Filter	70	1494.2	59.5

4.1 Implementation Details

Datasets. We evaluate generation quality using the MS-COCO [47] dataset following [11, 13, 14]. To prevent ambiguous bounding box annotations, we first filter images to those containing at most one instance of a single category. We then randomly sample 1181 images, each with a fixed caption, from the COCO validation set. For downstream tasks, we conduct object detection and instance segmentation experiments on the MS-COCO and LVIS v1.0 [21] datasets. Additionally, we carry out experiments on image-based visual question answering (VQA) using the MME [17] and GQA [32] benchmarks. The MME Perception benchmark is widely used to evaluate the perception abilities of MLLMs. GQA is a comprehensive dataset that assesses visual reasoning abilities.

Baselines. For generation models, we primarily select popular controllable diffusion-based text-to-image models, including GLIGEN [41], BoxDiff [85], and InstanceDiffusion [80], following their official settings. We also include Stable Diffusion [64] with the v1-5 model weight from Huggingface [81] as a T2I model baseline. For downstream tasks, we consider two popular baselines, Mask R-CNN [25] and CenterNet2 [93], for long-tailed instance segmentation. For open-vocabulary object detection, we use Grounding-DINO [51, 92], and for VQA, we employ LLaVA-v1.5 [48, 49]. Please refer to Appendix A.4 for the specific baseline settings.

Evaluation Protocols. For assessing image-level quality, we employ the Fréchet Inception Distance (FID) [29] using the COCO2017 validation set as the reference dataset, computed with Inception V3 [72]. Additionally, we compute a CLIP score to evaluate the alignment between captions and generated images. For layout accuracy, we use the YOLO score [42] to evaluate how precisely the generation model follows the given layout condition. We adopt a pre-trained YOLOv8m following [80] and report the standard average precision (AP), which is averaged over different IoU thresholds (from 0.5 to 0.95) across categories. For instance segmentation and object detection, we use AP as the evaluation metric. We also report AP for novel and rare categories in open-vocabulary and long-tailed scenarios. For VQA benchmarks, we report the averaged score for MME and accuracy for GQA.

4.2 Study Efficacy of CLIS

Generation Results. We first study the efficacy of our proposed CLIS by evaluating its performance from a conventional generative perspective and its alignment with human judgment. As shown in Table 1, our CLIS enhances generation outcomes, evidenced by a decrease in FID scores based on both GLIGEN [41] and InstanceDiffusion [80]. This highlights its efficacy and generalizability. In Figure 4, we present images generated from the same scene graph. Notably, the quality of the images improves as the CLIS increases, confirming its consistency with human judgment.

Correlation with Performance Gains on Downstream Tasks. We further evaluate the efficacy of our proposed CLIS by analyzing the correlation between the scores it assigns and the performance gains in downstream tasks. We evaluate our image filter on visual perception tasks, specifically long-tailed instance segmentation and open-vocabulary object detection, as shown in Table 2. We sample the same number of generated images (10K) across different score ranges assigned by our image filter in CLIS. Our findings indicate that a relatively higher score assigned by the image filter correlates with greater performance gain in visual perception tasks, thereby demonstrating the effectiveness of CLIS in enhancing visual perception tasks.
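The equal-size sampling across score ranges can be sketched as follows. The function name, the half-open interval convention `[low, high)`, and the fixed random seed are assumptions made for illustration; the ranges mirror those listed in Table 2, and a small `n` stands in for the 10K samples used in the experiment.

```python
import random
from typing import Dict, List, Sequence, Tuple


def sample_by_range(scored: Sequence[Tuple[object, float]],
                    ranges: Sequence[Tuple[float, float]],
                    n: int, seed: int = 0) -> Dict[Tuple[float, float], List[object]]:
    """Draw up to n items per score range so that only quality differs between sets."""
    rng = random.Random(seed)
    subsets = {}
    for low, high in ranges:
        pool = [item for item, score in scored if low <= score < high]  # assumed [low, high)
        subsets[(low, high)] = rng.sample(pool, min(n, len(pool)))
    return subsets


# Toy data: 300 items whose scores cycle through 70, 71, ..., 84.
scored = [(i, 70 + i % 15) for i in range(300)]
ranges = [(70, 75), (75, 80), (80, 85)]
subsets = sample_by_range(scored, ranges, n=10)
```

Holding the sample count fixed per range is what allows the differences in Table 2 to be attributed to the score range rather than to the amount of added data.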

We additionally evaluate the layout filter from CLIS for cross-modality perception and reasoning tasks of MLLMs on MME and GQA benchmarks. Initially, we generate an instruction fine-tuning dataset as described in Sec. 3.3. We then filter the synthetic training samples using different layout score thresholds. The quantity of instruction fine-tuning data varies, as a higher score threshold results in fewer training samples. Nonetheless, as shown in Table 3, adding synthetic training samples enhances performance compared to baseline, and a higher score threshold yields better results on both MME and GQA benchmarks. This validates the effectiveness of our layout filter in improving performance on cross-modality perception and reasoning tasks.

Table 4: Results on visual perception downstream tasks. (left): LVIS long-tailed instance segmentation benchmarks. (right): Open-vocabulary object detection benchmarks.

Method	Backbone	AP ${}_{r}^{mask}$	AP^mask
Mask R-CNN [25]	ResNet-50	9.3	21.7
w. ACP	ResNet-50	14.5 (+5.2)	22.8 (+1.1)
CenterNet2 w. Copy-Paste [20]	Swin-B	29.3	39.3
w. ACP	Swin-B	30.7 (+1.4)	39.6 (+0.3)

Dataset	Method	Backbone	AP ${}_{novel}^{box}$	AP^box
LVIS	Grounding-DINO	Swin-T	31.7	48.7
LVIS	w. ACP	Swin-T	33.0 (+1.3)	49.2
COCO	Grounding-DINO	Swin-T	60.4	57.1
COCO	w. ACP	Swin-T	60.8 (+0.4)	56.9

Table 5: (left): ACP improves results on the multi-modal MME and GQA benchmarks. (right): Comparison with the X-Paste data generation method on the LVIS benchmark.

Method	LM Backbone	MME	GQA
LLaVA-1.5	Vicuna-7B	1434.4	58.9
LLaVA-1.5	Vicuna-13B	1438.3	60.7
LLaVA-1.5	LLama-3-8B	1445.3	60.1
LLaVA-1.5 w. ACP	Vicuna-7B	1514.5 (+80.1)	59.3 (+0.4)

Method	Backbone	AP ${}_{r}^{mask}$	AP^mask
CenterNet2 (baseline)	ResNet-50	17.8	26.1
w. X-Paste [91]	ResNet-50	17.9	28.0
w. ACP	ResNet-50	19.2	28.0

4.3 Synthetic Dataset Scale Up

We conduct experiments to verify the generated data on visual perception tasks, including long-tailed instance segmentation and open-vocabulary object detection. We also explore cross-modality perception and reasoning tasks, i.e., visual question-answering tasks.

Long-tailed Instance Segmentation Benchmark. Table 4 (left) presents our results on the LVIS long-tailed instance segmentation benchmark. We utilized Auto Cherry-Picker to construct a total of 50K training samples with detailed descriptions and corresponding annotations. ACP demonstrates significant performance gains over the commonly used Mask R-CNN baseline, with an improvement of 1.1% in $AP^{mask}$ and the most notable improvement in rare categories (+5.2% AP ${}_{r}^{mask}$). We also observed consistent performance improvements with a stronger CenterNet2 baseline, employing Swin-B as the backbone and copy-paste [20] for data augmentation, achieving a 1.4% higher AP ${}_{r}^{mask}$. This underscores the strong generalization ability of ACP across different detector architectures and its effectiveness in conjunction with existing data augmentation methods.

Open-vocabulary Object Detection Benchmark. We further demonstrate the effectiveness of ACP in the challenging open-vocabulary object detection setting. We utilize Grounding-DINO [51], pre-trained on Objects365 [69], GoldG [36], GRIT [58], and V3Det [78], following [92]. The results are shown in Table 4 (right). ACP still outperforms Grounding-DINO by 1.3% in LVIS AP ${}_{novel}^{box}$ and 0.4% in COCO AP ${}_{novel}^{box}$, despite using only an additional 50K generated training samples. This is notable considering that the Grounding-DINO baseline is pre-trained on 61.8 M real images (30 epochs $\times$ 128 batch size $\times$ 16102 iterations). This validates that high-quality synthetic data can complement real data effectively.
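The 61.8 M figure follows directly from the product stated in parentheses. The short check below reproduces it and relates it to the 50K synthetic samples; the variable names are illustrative, and the count is the number of training samples processed during pre-training as given by that product.

```python
# Reproduce the pre-training sample count quoted in the text.
epochs, batch_size, iterations = 30, 128, 16102
real_samples = epochs * batch_size * iterations   # 61,831,680, i.e. about 61.8 M

synthetic_samples = 50_000
synthetic_share = synthetic_samples / real_samples  # fraction of the real sample count

print(f"{real_samples:,} real samples; synthetic share = {synthetic_share:.4%}")
```

The synthetic set is therefore well under 0.1% of the real sample count, which is the sense in which the gain is obtained with only a small amount of additional data.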

Multi-modal Image-based Benchmarks. We further evaluate the effectiveness of ACP on cross-modality perception and reasoning tasks. We adopt LLaVA-v1.5 with Vicuna-7B as our baseline. Table 5 (left) indicates that ACP significantly enhances model perception ability on the MME benchmark, achieving an 80.1 improvement, which exceeds the performance of LLaVA-v1.5 even with stronger language model backbones such as Vicuna-13B and LLama-3-8B. Additionally, ACP improves performance on the widely recognized GQA reasoning benchmark. This validates the effectiveness of our method in cross-modality settings.

Comparison with Previous Methods. We compare quantitatively with X-Paste [91], using CenterNet2 with a ResNet-50 backbone and a 1 $\times$ training schedule. It should be noted that while X-Paste generates 100K images, we utilize only 50K generated images. Additionally, to ensure a fair comparison, both methods are trained on the same number of real images, i.e., 45K iterations with a batch size of 64 for X-Paste and 33.75K iterations with a batch size of 64 for ACP. Results in Table 5 (right) show a 1.3% higher AP ${}_{r}^{mask}$, demonstrating the advantage of synthesizing images with reasonable layouts, as opposed to composing training samples by pasting multiple synthesized instances onto a background.

4.4 Ablation and Analysis

We conduct an ablation study on CLIS-I using InstanceDiffusion [80] as our baseline. For CLIS-I with the visual quality setting, we exclusively employ a multi-modal caption model to identify objects. For alignment, we simplify CLIS-I to the CLIP score. Based on the same raw dataset, we evaluate the filtered data (1K samples) using these methods from both a generation perspective and performance improvements in the downstream task. In particular, we use the COCO detection benchmark as the downstream task, combining the filtered samples and the original training data to train a Faster R-CNN detector on a standard 1 $\times$ training schedule. Table 6 shows that the full CLIS-I yields the largest performance gain in the downstream task, aligning with our initial motivation. While the CLIS-I variant that focuses solely on alignment excels in generation evaluation, it produces suboptimal results in the downstream task. This finding emphasizes that our proposed CLIS has a stronger correlation with downstream task performance gains compared to conventional generation metrics, such as FID, CLIP score, and YOLO score.

Table 6: Ablation study on CLIS-I. We compare three variants of CLIS-I by filtering the same generated raw dataset. We report FID, CLIP, and YOLO scores for generation metrics. We report AP, AP₅₀, and AP₇₅ on the COCO detection benchmark.

Model	Visual Quality	Alignment	FID $\downarrow$	CLIP score $\uparrow$	YOLO score $\uparrow$	AP $\uparrow$	AP₅₀ $\uparrow$	AP₇₅ $\uparrow$
InstanceDiffusion [80]	✗	✗	53.5	25.2	45.6	37.5	58.3	40.8
CLIS-I	✓	✗	48.4	25.6	46.0	37.3	58.3	40.5
CLIS-I	✗	✓	47.8	27.7	48.3	37.5	58.3	40.7
CLIS-I	✓	✓	48.9	25.8	47.9	37.7	58.5	40.9
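The mismatch between generation metrics and downstream gains can be read directly off Table 6. The short script below copies the table values and reports which variant wins each metric; the row labels (`quality_only`, `alignment_only`, `full`) are shorthand introduced here, not names from the paper.

```python
# Values copied from Table 6; row labels are shorthand introduced for this sketch.
table6 = {
    "baseline":       {"fid": 53.5, "clip": 25.2, "yolo": 45.6, "ap": 37.5},
    "quality_only":   {"fid": 48.4, "clip": 25.6, "yolo": 46.0, "ap": 37.3},
    "alignment_only": {"fid": 47.8, "clip": 27.7, "yolo": 48.3, "ap": 37.5},
    "full":           {"fid": 48.9, "clip": 25.8, "yolo": 47.9, "ap": 37.7},
}

variants = [name for name in table6 if name != "baseline"]

# Lower is better for FID; higher is better for the other metrics.
best_fid = min(variants, key=lambda n: table6[n]["fid"])
best_clip = max(variants, key=lambda n: table6[n]["clip"])
best_yolo = max(variants, key=lambda n: table6[n]["yolo"])
best_ap = max(variants, key=lambda n: table6[n]["ap"])

# AP change of each filtered variant relative to the unfiltered baseline.
ap_gain = {
    n: round(table6[n]["ap"] - table6["baseline"]["ap"], 1) for n in variants
}

print("best FID: ", best_fid)
print("best CLIP:", best_clip)
print("best YOLO:", best_yolo)
print("best AP:  ", best_ap)
print("AP gain over baseline:", ap_gain)
```

The alignment-only variant wins all three generation metrics yet leaves AP unchanged, while the full CLIS-I is the only variant that improves AP over the baseline.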

5 Conclusion

In this paper, we propose Auto Cherry-Picker, a cross-modality training data generator conditioned on a simple object list, with a newly designed CLIS metric to ensure the quality of the generated data. Auto Cherry-Picker is effective in various downstream tasks, including perception and reasoning tasks, and particularly improves performance in annotation-scarce scenarios. Our proposed CLIS can be used to pick high-quality generated examples, and we also find that generated data with better CLIS scores leads to better performance on perception tasks. Moreover, our method can be easily adapted to stronger LLMs and image generation models. Our research bridges the gap between high-quality generated data and downstream performance. We hope our results can inspire generation metric design in the future.

References

[1] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023.
[2] Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, et al. ediff-i: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
[3] Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. Multidiffusion: Fusing diffusion paths for controlled image generation. arXiv preprint arXiv:2302.08113, 2023.
[4] Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In CVPR, 2023.
[5] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In NeurIPS, 2020.
[6] Lucy Chai, Jun-Yan Zhu, Eli Shechtman, Phillip Isola, and Richard Zhang. Ensembling with deep generative views. In CVPR, 2021.
[7] Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. TOG, 2023.
[8] Kai Chen, Enze Xie, Zhe Chen, Yibo Wang, Lanqing Hong, Zhenguo Li, and Dit-Yan Yeung. Geodiffusion: Text-prompted geometric control for object detection data generation. arXiv preprint arXiv:2306.04607, 2023.
[9] Minghao Chen, Iro Laina, and Andrea Vedaldi. Training-free layout control with cross-attention guidance. arXiv preprint arXiv:2304.03373, 2023.
[10] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020.
[11] Xiaohui Chen, Yongfei Liu, Yingxiang Yang, Jianbo Yuan, Quanzeng You, Li-Ping Liu, and Hongxia Yang. Reason out your layout: Evoking the layout master from large language models for text-to-image synthesis. arXiv preprint arXiv:2311.17126, 2023.
[12] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023.
[13] G. Couairon, M. Careil, M. Cord, S. Lathuiliere, and J. Verbeek. Zero-shot spatial layout conditioning for text-to-image diffusion models. In ICCV, 2023.
[14] Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, et al. Dreamllm: Synergistic multimodal comprehension and creation. arXiv preprint arXiv:2309.11499, 2023.
[15] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In CVPR, 2021.
[16] Weixi Feng, Wanrong Zhu, Tsu-jui Fu, Varun Jampani, Arjun Akula, Xuehai He, Sugato Basu, Xin Eric Wang, and William Yang Wang. Layoutgpt: Compositional visual planning and generation with large language models. In NeurIPS, 2024.
[17] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2024.
[18] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
[19] Hanan Gani, Shariq Farooq Bhat, Muzammal Naseer, Salman Khan, and Peter Wonka. Llm blueprint: Enabling text-to-image generation with complex and detailed prompts. arXiv preprint arXiv:2310.10640, 2023.
[20] Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, Ekin D Cubuk, Quoc V Le, and Barret Zoph. Simple copy-paste is a strong data augmentation method for instance segmentation. In CVPR, 2021.
[21] Agrim Gupta, Piotr Dollar, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation. In CVPR, 2019.
[22] Yue Han, Jiangning Zhang, Junwei Zhu, Xiangtai Li, Yanhao Ge, Wei Li, Chengjie Wang, Yong Liu, Xiaoming Liu, and Ying Tai. A generalist facex via learning unified facial representation. arXiv preprint arXiv:2401.00551, 2023.
[23] Yue Han, Junwei Zhu, Keke He, Xu Chen, Yanhao Ge, Wei Li, Xiangtai Li, Jiangning Zhang, Chengjie Wang, and Yong Liu. Face adapter for pre-trained diffusion models with fine-grained id and attribute control. arXiv preprint arXiv:2405.12970, 2024.
[24] William Harvey, Saeid Naderiparizi, Vaden Masrani, Christian Weilbach, and Frank Wood. Flexible diffusion modeling of long videos. In NeurIPS, 2022.
[25] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In ICCV, 2017.
[26] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[27] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
[28] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718, 2021.
[29] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, 2017.
[30] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
[31] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. In NeurIPS, 2022.
[32] Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019.
[33] Naoto Inoue, Kotaro Kikuchi, Edgar Simo-Serra, Mayu Otani, and Kota Yamaguchi. Layoutdm: Discrete diffusion model for controllable layout generation. In CVPR, 2023.
[34] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
[35] Ali Jahanian, Xavier Puig, Yonglong Tian, and Phillip Isola. Generative models as a data source for multiview representation learning. arXiv preprint arXiv:2106.05258, 2021.
[36] Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. Mdetr-modulated detection for end-to-end multi-modal understanding. In ICCV, 2021.
[37] Laurynas Karazija, Iro Laina, Andrea Vedaldi, and Christian Rupprecht. Diffusion models for zero-shot open-vocabulary segmentation. arXiv preprint arXiv:2306.09316, 2023.
[38] Daiqing Li, Huan Ling, Amlan Kar, David Acuna, Seung Wook Kim, Karsten Kreis, Antonio Torralba, and Sanja Fidler. Dreamteacher: Pretraining image backbones with deep generative models. In ICCV, 2023.
[39] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022.
[40] Xiaojie Li, Yibo Yang, Xiangtai Li, Jianlong Wu, Yue Yu, Bernard Ghanem, and Min Zhang. Genview: Enhancing view quality with pretrained generative model for self-supervised learning. arXiv preprint arXiv:2403.12003, 2024.
[41] Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. In CVPR, 2023.
[42] Zejian Li, Jingyu Wu, Immanuel Koh, Yongchuan Tang, and Lingyun Sun. Image synthesis from layout with locality-aware mask adaption. In ICCV, 2021.
[43] Ziyi Li, Qinye Zhou, Xiaoyun Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Open-vocabulary object segmentation with diffusion models. In ICCV, 2023.
[44] Long Lian, Boyi Li, Adam Yala, and Trevor Darrell. Llm-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models. arXiv preprint arXiv:2305.13655, 2023.
[45] Long Lian, Boyi Li, Adam Yala, and Trevor Darrell. LLM-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models. TMLR, 2024.
[46] Shaobo Lin, Kun Wang, Xingyu Zeng, and Rui Zhao. Explore the power of synthetic data on few-shot object detection. In CVPR, 2023.
[47] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
[48] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
[49] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2024.
[50] Jialun Liu, Yifan Sun, Chuchu Han, Zhaopeng Dou, and Wenhui Li. Deep representation learning on long-tailed data: A learnable embedding augmentation perspective. In CVPR, 2020.
[51] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
[52] Shuangting Liu, Jiaqi Zhang, Yuxin Chen, Yifan Liu, Zengchang Qin, and Tao Wan. Pixel level data augmentation for semantic image segmentation using generative adversarial networks. In ICASSP, 2019.
[53] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
[54] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[55] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.
[56] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
[57] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In CVPR, 2019.
[58] Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023.
[59] Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, 2015.
[60] Lu Qi, Yi-Wen Chen, Lehan Yang, Tiancheng Shen, Xiangtai Li, Weidong Guo, Yu Xu, and Ming-Hsuan Yang. Generalizable entity grounding via assistance of large language model. arXiv preprint arXiv:2402.02555, 2024.
[61] Jie Qin, Jie Wu, Weifeng Chen, Yuxi Ren, Huixia Li, Hefeng Wu, Xuefeng Xiao, Rui Wang, and Shilei Wen. Diffusiongpt: Llm-driven text-to-image generation system. arXiv preprint arXiv:2401.10061, 2024.
[62] Leigang Qu, Shengqiong Wu, Hao Fei, Liqiang Nie, and Tat-Seng Chua. Layoutllm-t2i: Eliciting layout guidance from llm for text-to-image generation. In ACMMM, 2023.
[63] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
[64] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
[65] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR, pages 22500–22510, 2023.
[66] Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In SIGGRAPH, 2022.
[67] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding, 2022.
[68] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In NeurIPS, 2016.
[69] Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. In ICCV, 2019.
[70] Jaskirat Singh, Stephen Gould, and Liang Zheng. High-fidelity guided image synthesis with latent diffusion models. In CVPR, 2023.
[71] Abhishek Sinha, Jiaming Song, Chenlin Meng, and Stefano Ermon. D2c: Diffusion-decoding models for few-shot conditional generation. In NeurIPS, 2021.
[72] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
[73] Weimin Tan, Siyuan Chen, and Bo Yan. Diffss: Diffusion model for few-shot semantic segmentation. arXiv preprint arXiv:2307.00773, 2023.
[74] Yonglong Tian, Lijie Fan, Kaifeng Chen, Dina Katabi, Dilip Krishnan, and Phillip Isola. Learning vision from models rivals learning vision from data. arXiv preprint arXiv:2312.17742, 2023.
[75] Yonglong Tian, Lijie Fan, Phillip Isola, Huiwen Chang, and Dilip Krishnan. Stablerep: Synthetic images from text-to-image models make strong visual representation learners. In NeurIPS, 2024.
[76] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023.
[77] Chaoyang Wang, Xiangtai Li, Lu Qi, Henghui Ding, Yunhai Tong, and Ming-Hsuan Yang. Semflow: Binding semantic segmentation and image synthesis via rectified flow. arXiv preprint arXiv:2405.20282, 2024.
[78] Jiaqi Wang, Pan Zhang, Tao Chu, Yuhang Cao, Yujie Zhou, Tong Wu, Bin Wang, Conghui He, and Dahua Lin. V3det: Vast vocabulary visual detection dataset. In ICCV, 2023.
[79] Qi Wang, Junyu Gao, Wei Lin, and Yuan Yuan. Learning from synthetic data for crowd counting in the wild. In CVPR, 2019.
[80] Xudong Wang, Trevor Darrell, Sai Saketh Rambhatla, Rohit Girdhar, and Ishan Misra. Instancediffusion: Instance-level control for image generation. arXiv preprint arXiv:2402.03290, 2024.
[81] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.
[82] Erroll Wood, Tadas Baltrušaitis, Charlie Hewitt, Sebastian Dziadzio, Thomas J Cashman, and Jamie Shotton. Fake it till you make it: face analysis in the wild using synthetic data alone. In ICCV, 2021.
[83] Weijia Wu, Yuzhong Zhao, Hao Chen, Yuchao Gu, Rui Zhao, Yefei He, Hong Zhou, Mike Zheng Shou, and Chunhua Shen. Datasetdm: Synthesizing data with perception annotations using diffusion models. NeurIPS, 2023.
[84] Jiahao Xie, Wei Li, Xiangtai Li, Ziwei Liu, Yew Soon Ong, and Chen Change Loy. Mosaicfusion: Diffusion models as data augmenters for large vocabulary instance segmentation. arXiv preprint arXiv:2309.13042, 2023.
[85] Jinheng Xie, Yuexiang Li, Yawen Huang, Haozhe Liu, Wentian Zhang, Yefeng Zheng, and Mike Zheng Shou. Boxdiff: Text-to-image synthesis with training-free box-constrained diffusion. In ICCV, 2023.
[86] Zhengyuan Yang, Jianfeng Wang, Zhe Gan, Linjie Li, Kevin Lin, Chenfei Wu, Nan Duan, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Reco: Region-controlled text-to-image generation. In CVPR, 2023.
[87] Hanrong Ye, Jason Kuen, Qing Liu, Zhe Lin, Brian Price, and Dan Xu. Seggen: Supercharging segmentation models with text2mask and mask2img synthesis. arXiv preprint arXiv:2311.03355, 2023.
[88] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023.
[89] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[90] Tianjun Zhang, Yi Zhang, Vibhav Vineet, Neel Joshi, and Xin Wang. Controllable text-to-image generation with gpt-4. arXiv preprint arXiv:2305.18583, 2023.
[91] Hanqing Zhao, Dianmo Sheng, Jianmin Bao, Dongdong Chen, Dong Chen, Fang Wen, Lu Yuan, Ce Liu, Wenbo Zhou, Qi Chu, et al. X-paste: Revisiting scalable copy-paste for instance segmentation using clip and stablediffusion. arXiv preprint arXiv:2212.03863, 2022.
[92] Xiangyu Zhao, Yicheng Chen, Shilin Xu, Xiangtai Li, Xinjiang Wang, Yining Li, and Haian Huang. An open and comprehensive pipeline for unified object grounding and detection. arXiv preprint arXiv:2401.02361, 2024.
[93] Xingyi Zhou, Vladlen Koltun, and Philipp Krähenbühl. Probabilistic two-stage detection. arXiv preprint arXiv:2103.07461, 2021.

Appendix A Implementation Details

A.1 LLMs for ACP Pipeline

We conduct experiments using a series of LLMs as the scene graph generator in our ACP pipeline on a limited data scale. Specifically, we employ Qwen-1.5-14B, Qwen-1.5-72B, and Qwen-1.5-110B to generate scene graphs. Each model produces 15K training samples from the same input object list. These samples are then combined with the original training data for the COCO detection task. We apply each set separately to a Mask R-CNN baseline under a standard 1 $\times$ training schedule. Table 7 shows that the different LLMs perform comparably in the downstream task. For the experiments described in the main text, we opt for the smallest LLM, Qwen-1.5-14B, due to its faster inference speed.

A.2 Prompts in ACP

We provide the full prompts used in the ACP pipeline for the description generator and the layout generator.

A.3 Templates for Multi-modal Tasks

We provide templates for constructing question-answer pairs for multi-modal downstream tasks. For perception, we design two types of tasks: localization and attribute-binding. Localization tasks necessitate that models pinpoint an object detailed in the instructions or, alternatively, describe an object situated at a specific location. Attribute-binding tasks require models to identify precise attributes of an object within a given location. For reasoning, we craft relation reasoning tasks. These tasks require models to deduce the relationship between a specified subject and object based on the provided description.
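To make the three task types concrete, the sketch below builds question-answer pairs from a toy scene graph. The template wording, the `Obj` structure, and the normalized box format are all hypothetical placeholders chosen for illustration; they are not the templates used in ACP.

```python
from dataclasses import dataclass

@dataclass
class Obj:
    name: str
    attribute: str
    box: tuple  # (x1, y1, x2, y2), assumed normalized to [0, 1]

def fmt_box(box):
    """Render a box as a fixed-precision string."""
    return "[" + ", ".join(f"{v:.2f}" for v in box) + "]"

# Localization, direction 1: object description -> location.
def localization_qa(obj):
    question = f"Where is the {obj.attribute} {obj.name} in the image?"
    return question, fmt_box(obj.box)

# Localization, direction 2: location -> object description.
def region_description_qa(obj):
    question = f"What object is located at {fmt_box(obj.box)}?"
    return question, f"{obj.attribute} {obj.name}"

# Attribute binding: identify the attribute of the object at a location.
def attribute_binding_qa(obj):
    question = f"What is the attribute of the {obj.name} at {fmt_box(obj.box)}?"
    return question, obj.attribute

# Relation reasoning: deduce the relation between a subject and an object.
def relation_qa(subject, relation, obj):
    question = (
        f"What is the relationship between the {subject.name} "
        f"and the {obj.name}?"
    )
    return question, f"The {subject.name} is {relation} the {obj.name}."

dog = Obj("dog", "brown", (0.10, 0.40, 0.50, 0.90))
sofa = Obj("sofa", "red", (0.05, 0.30, 0.95, 0.95))

pairs = [
    localization_qa(dog),
    region_description_qa(dog),
    attribute_binding_qa(sofa),
    relation_qa(dog, "lying on", sofa),
]
for question, answer in pairs:
    print(f"Q: {question}\nA: {answer}")
```

Because the answers come directly from the scene graph used to generate the image, no manual annotation is needed in this sketch.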

Table 7: Different LLMs as scene graph generator in ACP on COCO detection task.

Scene Graph Generator	AP^mask $\uparrow$	AP^box $\uparrow$
Qwen-1.5-14B	34.5	37.8
Qwen-1.5-72B	34.5	37.8
Qwen-1.5-110B	34.2	37.7

A.4 Baseline Settings

Our specific baseline settings in experiments are as follows:

$\bullet$ Mask R-CNN baseline. We follow the same setup outlined in [21]. Specifically, we adopt ResNet-50 [26] with FPN [50] backbone, using the standard 1 $\times$ training schedule.

$\bullet$ CenterNet2 baseline. We follow the setup outlined in [91]. Specifically, we use two configurations: 1) ResNet-50 with a 1 $\times$ training schedule, and 2) Swin-B with a 4 $\times$ training schedule. We employ the AdamW optimizer and utilize repeat factor sampling with an oversample threshold of $10^{-3}$.

$\bullet$ Grounding-DINO baseline. We follow the setup outlined in [92]. Specifically, we use the model pretrained on Objects365 [69], GoldG [36], GRIT [58], and V3Det [78] with Swin-T [53] as the backbone. The fine-tuning process uses the standard 1 $\times$ training schedule. We use the AdamW [54] optimizer with a weight decay of 0.0001. The initial learning rate is 0.00005 and is reduced by a factor of 10 at the 8th and 11th epochs.

$\bullet$ LLaVA-v1.5 baseline. We follow the setup outlined in [48]. We adopt a two-stage training process. For the LLM backbone, we adopt Vicuna-7B [12], Vicuna-13B, and Llama-3-8B [76]. We use an AdamW optimizer with a weight decay of 0. We pre-train for 1 epoch with a learning rate of 1e-3 and a batch size of 32, and fine-tune for 1 epoch with a learning rate of 2e-5 and a batch size of 16. The warmup ratio of the learning rate is 0.03.
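The repeat factor sampling mentioned in the CenterNet2 setting follows the scheme introduced with LVIS [21]: a category that appears in a fraction $f(c)$ of training images gets a repeat factor $\max(1, \sqrt{t / f(c)})$ for threshold $t$, and each image takes the maximum factor over its categories. The sketch below computes these deterministic factors for a toy dataset. The function name and data layout are illustrative, and practical implementations typically add stochastic rounding of the factors each epoch, which is omitted here.

```python
import math
from collections import Counter

def repeat_factors(image_categories, threshold=1e-3):
    """Per-image repeat factors for repeat factor sampling.

    image_categories: one list of category labels per training image.
    threshold: oversample threshold t (1e-3 in the setting above).
    """
    num_images = len(image_categories)

    # Count the number of images that contain each category at least once.
    image_count = Counter()
    for categories in image_categories:
        image_count.update(set(categories))

    # Category-level factor: max(1, sqrt(t / f(c))), with f(c) the image frequency.
    category_factor = {
        c: max(1.0, math.sqrt(threshold / (n / num_images)))
        for c, n in image_count.items()
    }

    # Image-level factor: the largest factor among the categories it contains.
    return [
        max((category_factor[c] for c in set(categories)), default=1.0)
        for categories in image_categories
    ]

# Toy dataset: "common" is in every image, "rare" is in 1 of 10,000 images.
dataset = [["common"]] * 9_999 + [["common", "rare"]]
factors = repeat_factors(dataset, threshold=1e-3)

print(f"image with only 'common': {factors[0]:.3f}")
print(f"image containing 'rare':  {factors[-1]:.3f}")
```

Images that contain only frequent categories keep a factor of 1, while the image with the rare category is repeated roughly three times per epoch, which is how the scheme rebalances long-tailed data.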

Appendix B Limitations and Future Work

While ACP generates data samples by leveraging the capabilities of large-scale pretrained generative models, it also inherits their limitations. This results in certain compromises when compared to real data. Future improvements could potentially come from integrating more advanced models, such as GPT-4. Additionally, the computation of CLIS-L requires a pool of real-data layout examples, which can be resource-intensive. Currently, our layout example pool is constructed from Flickr30K. However, practical limitations in computational resources restrict our capacity to build and evaluate a larger layout example pool. We encourage future studies on the design of generation metrics.

Appendix C More Results of Comparison with Other Metrics

We provide visual results comparing our CLIS with other metrics. Using the same scene graph from our previous generator, we produce images evaluated with our CLIS and other metrics, such as CLIP and YOLO scores. As illustrated in Fig. 5, our CLIS demonstrates superior performance in both textual alignment and visual quality.

Appendix D Visualization of Synthetic Training Samples

We showcase visualizations of our synthetic training samples in Fig. 6 and Fig. 7. By leveraging the extensive vocabulary of large generative models, we can produce high-quality training samples for rare categories. These training samples are closely aligned with their respective scene graphs, effectively capturing both detailed attribute descriptions and complex relationships between multiple objects.