ConceptBed: Evaluating Concept Learning Abilities of Text-to-Image Diffusion Models
Abstract
The ability to understand visual concepts and replicate and compose these concepts from images is a central goal for computer vision. Recent advances in text-to-image (T2I) models have led to high-definition, realistic image generation by learning from large databases of images and their descriptions. However, the evaluation of T2I models has focused on photorealism and limited qualitative measures of visual understanding. To quantify the ability of T2I models in learning and synthesizing novel visual concepts (a.k.a. personalized T2I), we introduce ConceptBed, a large-scale dataset that consists of 284 unique visual concepts and 33K composite text prompts. Along with the dataset, we propose an evaluation metric, Concept Confidence Deviation (CCD), that uses the confidence of oracle concept classifiers to measure the alignment between concepts generated by T2I generators and concepts contained in target images. We evaluate visual concepts that are either objects, attributes, or styles, and also evaluate four dimensions of compositionality: counting, attributes, relations, and actions. Our human study shows that CCD is highly correlated with human understanding of concepts. Our results point to a trade-off between learning the concepts and preserving the compositionality which existing approaches struggle to overcome. The data, code, and interactive demo are available at: https://conceptbed.github.io/
1 Introduction
Humans reason about the visual world by aggregating entities that they see into “visual concepts”: both cats and elephants are animals, and both palms and pines are trees. We use natural language to describe images and things that we see. Although this type of visual concept learning is well-defined in human psychology (Murphy 2004), it remains elusive in the context of data-driven techniques capable of learning and reasoning from images and their natural language descriptions.
Text-to-Image (T2I) generative models are trained to translate natural language phrases into images that correspond to that input. High-quality T2I models, therefore, serve as a link between human-level concepts (expressed in natural language) and their visual representations and are one way to reproduce visual concepts. On the other hand, this has also sparked interest in visual concept learning (a.k.a. personalized T2I) through the procedure of “image inversion” – to translate one or many images corresponding to a visual concept into a latent representation of that visual concept. While earlier methods primarily explored image inversion using generative adversarial networks (Xia et al. 2022), methods such as Textual Inversion (Gal et al. 2022) and Dreambooth (Ruiz et al. 2022) combine image inversion with T2I – this has led to an effective way to quickly learn concepts from a few images and reproduce them in novel combinations and compositions with other concepts, attributes, styles, etc. These methods aim to learn concepts with minimal reference images by fine-tuning pre-trained text-conditioned diffusion models (Figure 1). Therefore this paradigm of T2I and image inversion is a powerful new way of learning and reproducing concepts.
Within this paradigm of novel visual concept learning via image inversion, two primary evaluation criteria have emerged: (1) concept alignment, which assesses the correspondence between the generated images and the target concept images, and (2) composition alignment, which evaluates whether the generated images maintain compositionality. Previous studies have been small scale, evaluating only a small number of hand-picked concepts and compositions; as such, it is difficult to draw general conclusions from their findings. Furthermore, the established evaluation metrics, such as DINO-based cosine similarity (Ruiz et al. 2022) (for measuring concept alignment), KID (Kumari et al. 2022) (for measuring the amount of concept overfitting), and CLIPScore (Hessel et al. 2021) (for evaluating compositionality), struggle to capture human preferences accurately. Consequently, there is a growing need for better automated evaluations.
Therefore, we introduce ConceptBed, a comprehensive dataset and evaluation framework that is aligned with human preferences. The ConceptBed dataset comprises 284 distinct concepts and approximately 33,000 composite text prompts, which can be further extended using the provided automatic realistic dataset creation pipeline. The dataset focuses on four diverse concept learning evaluation tasks: learning styles, learning objects, learning attributes, and compositional reasoning. To gain a deeper understanding of previous methodologies, we incorporate four composition categories: action, attribution, counting, and relations.
We use our large-scale dataset to evaluate concept learners by developing a novel evaluation metric called Concept Confidence Deviation (CCD). We conduct a human study and find that relative evaluations of models in terms of CCD are well aligned with human preferences. Therefore, CCD, combined with the ConceptBed dataset, offers an alternative to existing evaluation strategies, facilitating more effective large-scale evaluations. For each evaluation criterion, we train supervised classifiers (oracles) to detect whether generated concept images are accurate. Subsequently, the confidence scores from these oracles are used to calculate the instance-level concept deviations of the generated concept images relative to the reference ground truth images using the proposed metric. This approach enables us to assess concept and composition alignment more effectively. We further show that CCD calculated using a pre-trained few-shot classifier also maintains a high correlation with human preferences. This allows CCD to measure concept alignment on unseen concepts.
We conduct extensive experiments on four recently proposed concept learning methodologies. In total, we fine-tune approximately 1100 models (one model per concept) and generate over 500,000 concept-specific images. Our results reveal a surprising trade-off between concept alignment and composition alignment, wherein methods excelling at concept alignment tend to fall short in preserving compositions and vice versa. This suggests that previous concept learning approaches are either highly overfitted or severely underfitted. Furthermore, our experiments demonstrate that utilizing a pre-trained CLIP (Radford et al. 2021) textual encoder aids in maintaining compositionality, but it lacks the flexibility required to learn complex concepts, such as sketch.
In summary, we make the following key contributions:
- We introduce ConceptBed, a comprehensive benchmark for grounded quantitative evaluations of text-conditioned concept learners.
- The Concept Confidence Deviation (CCD) evaluation metric measures the learners’ ability to preserve concepts and compositions. We demonstrate a strong correlation between CCD and human preferences.
- Through extensive experiments with 1,100+ models, we identify shortcomings in prior works and suggest future research directions. ConceptBed sets a standard for evaluating personalized text-to-image generative models.
2 Preliminaries
Prior studies on concept learning have focused on text-conditioned diffusion models, such as Textual Inversion (Gal et al. 2022), DreamBooth (Ruiz et al. 2022), and Custom Diffusion (Kumari et al. 2022). These models operate within the T2I paradigm, where a text prompt ($y$) serves as input to generate the corresponding image ($x$) representing the given prompt $y$. A popular approach within T2I is the Latent Diffusion Model (LDM) (Rombach et al. 2022), which incorporates two key modules:
1. Textual Encoder ($c_\phi$): This module generates embeddings corresponding to the input text prompt;
2. Generator ($\epsilon_\theta$): The generator iteratively estimates the noise from a randomly sampled input matrix at timestep ($t$), conditioned on the text.
Since T2I models solely consider text input, the target concept ($c$) is represented in terms of text tokens. These tokens can subsequently be employed to generate images associated with concept $c$. Therefore, in Textual Inversion, the concept learning task is approached as an image inversion problem, aiming to map the target concept back to the text-embedding space.
Let V* denote the text tokens corresponding to the learned concept $c$. Once the optimal mapping from V* to the target concept is determined, we can generate concept-specific images using the LDM by providing V* in the text prompt. Suppose we are provided with images ($X_c$) of the target concept $c$. Now, in order to learn the text tokens V* corresponding to the concept from the set of images $X_c$, the Textual Inversion methodology aims to optimize V* by reconstructing $X_c$ using the objective function of the LDM with frozen parameters $\theta$ and $\phi$:
$$V^{*} = \arg\min_{v} \; \mathbb{E}_{x \sim X_c,\; z \sim \mathcal{E}(x),\; \epsilon \sim \mathcal{N}(0, 1),\; t} \left[ \left\| \epsilon - \epsilon_\theta\left(z_t, t, c_\phi(y)\right) \right\|_2^2 \right] \qquad (1)$$
Here $\mathcal{E}$ is the LDM image encoder, $z_t$ is the latent $z$ noised to timestep $t$, $y$ is a prompt containing the placeholder V*, and $v$ is the embedding assigned to V*.
In the case of DreamBooth and Custom Diffusion, instead of finding the optimal V*, the method optimizes the model parameters associated with the noise estimator ($\theta$). This optimization process enables the model to learn the mapping between a randomly initialized V* and the target concept $c$.
$$\theta^{*} = \arg\min_{\theta} \; \mathbb{E}_{x \sim X_c,\; z \sim \mathcal{E}(x),\; \epsilon \sim \mathcal{N}(0, 1),\; t} \left[ \left\| \epsilon - \epsilon_\theta\left(z_t, t, c_\phi(y)\right) \right\|_2^2 \right] \qquad (2)$$
Once $\theta^{*}$ is obtained, it can be used to generate images related to the target concept.¹
¹ DreamBooth and Custom Diffusion use an additional regularizer to improve compositionality by applying the same objective function to a diverse set of image-caption pairs.
Once the images are generated, in order to evaluate these generated images, it is essential to verify whether they align with the learned concepts while maintaining compositionality.
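To make the difference between optimizing V* and optimizing the model parameters concrete, the following is a toy sketch rather than any of the cited implementations. It assumes a scalar "latent", a scalar concept embedding, a fixed noise level `ALPHA`, and a hand-written frozen noise estimator `frozen_denoiser`; all of these names are hypothetical. Only the embedding is updated, which mirrors the Textual Inversion setting; a DreamBooth-style variant would update the estimator's own parameters instead.

```python
import random
from math import sqrt

ALPHA = 0.5  # hypothetical fixed noise-schedule coefficient (assumption)


def add_noise(z0, eps):
    """Forward diffusion step for a scalar latent at one fixed noise level."""
    return sqrt(ALPHA) * z0 + sqrt(1.0 - ALPHA) * eps


def frozen_denoiser(z_t, v):
    """Frozen toy noise estimator: it treats the conditioning v as the clean latent.

    Its form is never changed during optimization (frozen parameters).
    """
    return (z_t - sqrt(ALPHA) * v) / sqrt(1.0 - ALPHA)


def invert_concept(reference_latents, steps=500, lr=0.05, seed=0):
    """Optimize only the concept embedding v with a noise-prediction loss."""
    rng = random.Random(seed)
    v = 0.0  # arbitrary initialization of the placeholder embedding
    for _ in range(steps):
        z0 = rng.choice(reference_latents)   # sample a reference "image"
        eps = rng.gauss(0.0, 1.0)            # sample noise
        z_t = add_noise(z0, eps)
        residual = frozen_denoiser(z_t, v) - eps
        # Gradient of residual**2 with respect to v (chain rule through the denoiser).
        grad = 2.0 * residual * (-sqrt(ALPHA) / sqrt(1.0 - ALPHA))
        v -= lr * grad
    return v


learned = invert_concept([1.8, 2.0, 2.2])
print(learned)
```

In this toy setting the learned embedding settles near the centre of the reference latents, which is the scalar analogue of finding a token embedding that lets the frozen model reproduce the reference images.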
3 ConceptBed
In this section, we introduce ConceptBed, a comprehensive collection of concepts, designed to accurately estimate concept and composition alignment by quantifying deviations in the generated images. Later, we introduce the novel evaluation framework associated with ConceptBed. Please refer to the Appendix for additional insights on the proposed dataset and evaluation framework.
3.1 ConceptBed: Dataset Construction
ConceptBed incorporates existing datasets such as ImageNet (Deng et al. 2009), PACS (Li et al. 2017), CUB (Wah et al. 2011), and Visual Genome (Krishna et al. 2017), enabling the creation of a labeled dataset. Figure 2 provides an overview of the ConceptBed dataset.
Learning Styles. We use styles from the PACS dataset: Art Painting, Cartoon, Photo, and Sketch. Each style contains images corresponding to seven categories. The concept learner aims to use examples from one style as a reference and generate style-specific images for all seven entities.
Learning Objects. We extract object-level concepts from the ImageNet dataset, which comprises 1000 low-level concepts from the WordNet (Fellbaum 2010) hierarchy. However, because ImageNet images are noisy and many concepts have little relevance to daily life, we employ an automated filtering pipeline to ensure the usefulness and quality of the reference concept images. The pipeline extracts a list of low-level concepts and their parent concepts from ImageNet, and then extracts text phrases from Visual Genome in which the concept appears as the subject of the caption. Concepts with too few such captions (fewer than 10 in Visual Genome), or with none at all, are discarded. This filtering process results in 80 concepts (e.g., brambling, squirrel monkey). We select the top 100 high-quality images for each concept to train the concept learning methodologies.
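The caption-count filter described above can be sketched as follows. This is a minimal illustration, not the released pipeline: it assumes the subject of each Visual Genome caption has already been extracted and is supplied as hypothetical `(subject, caption)` pairs, and the function name `filter_concepts` is invented for the example. Only the threshold of 10 captions comes from the text.

```python
from collections import defaultdict

MIN_CAPTIONS = 10  # concepts with fewer matching captions are discarded


def filter_concepts(concepts, subject_caption_pairs, min_captions=MIN_CAPTIONS):
    """Keep concepts that appear as the caption subject at least min_captions times.

    subject_caption_pairs is an iterable of (subject, caption) tuples; how the
    subject is parsed from a caption is outside the scope of this sketch.
    """
    wanted = {c.lower() for c in concepts}
    captions_by_concept = defaultdict(list)
    for subject, caption in subject_caption_pairs:
        key = subject.lower()
        if key in wanted:
            captions_by_concept[key].append(caption)
    return {
        concept: captions
        for concept, captions in captions_by_concept.items()
        if len(captions) >= min_captions
    }


# Toy data: one concept with enough captions, one without.
pairs = [("squirrel monkey", f"squirrel monkey caption {i}") for i in range(10)]
pairs += [("brambling", f"brambling caption {i}") for i in range(3)]
kept = filter_concepts(["squirrel monkey", "brambling", "tench"], pairs)
print(sorted(kept))
```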
Learning Attributes. Since ImageNet dataset images are not labeled based on the attributes present in the image, it is necessary to rely on datasets that provide attribute-level grounded labels. Therefore, we additionally employ the CUB dataset, which offers attribute-level labels (such as orange wing, blue forehead, etc.), enabling the ConceptBed to perform evaluations and measure the attribute-level performance of concept learners.
Compositional Reasoning. In addition to learning new concepts, it is crucial to maintain prior knowledge and associate the acquired concepts with it. To conduct these evaluations holistically, we use Visual Genome to extract captions in which the concept appears as the subject of the sentence. These captions are categorized into four composition categories (actions, attributes, counting, and relations) through few-shot classification using GPT3 (Brown et al. 2020). This categorization allows us to measure the performance of the baselines on each category and to gain an in-depth understanding of the varying difficulty levels of different compositions.
3.2 ConceptBed: Dataset Statistics
The ConceptBed dataset consists of 284 unique concepts, comprising 80 concepts from ImageNet, 200 concepts from CUB, and 4 concepts from PACS. In total, the dataset contains approximately 33,000 composite prompts for the evaluation of all 80 processed concepts from ImageNet, with each composite prompt having up to two composition categories. Out of these composite prompts, 18987, 16902, 8014, and 1083 prompts contribute to the attribute, relation, action, and counting categories, respectively.
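The counts above can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated in this section; the variable names are illustrative. It shows why the per-category counts sum to more than the number of prompts: a prompt may belong to up to two composition categories.

```python
concepts_per_source = {"ImageNet": 80, "CUB": 200, "PACS": 4}
total_concepts = sum(concepts_per_source.values())

prompts_per_category = {
    "attribute": 18987,
    "relation": 16902,
    "action": 8014,
    "counting": 1083,
}
total_assignments = sum(prompts_per_category.values())

approx_prompts = 33_000  # "approximately 33,000" in the text, so the ratio is approximate
avg_categories_per_prompt = total_assignments / approx_prompts

print(total_concepts, total_assignments, round(avg_categories_per_prompt, 2))
# prints: 284 44986 1.36
```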
Our dataset curation pipeline is flexible enough to be extended to larger datasets such as OpenImages-v7 (Kuznetsova et al. 2020) and LAION-5B (Schuhmann et al. 2022). However, it is important to note that this extension would significantly increase the resource requirements. With the introduction of this dataset, our primary objective is to provide a standardized and benchmarked evaluation framework for concept learners, enhancing research in the field.
3.3 CCD: Concept Confidence Deviation
Table 1: CCD for concept alignment (lower is better). The Composition column reports concept alignment under composite prompts.
Model | Domain (PACS) | Objects (ImageNet) | CUB (object-level) | CUB (attribute-level) | Composition |
---|---|---|---|---|---|
TI (LDM) | 0.0478 | 0.0955 | 0.2289 | 0.1174 | 0.1906 |
TI (SD) | 0.2456 | 0.0472 | 0.0859 | 0.0332 | 0.1090 |
DB | 0.6825 | 0.0678 | 0.0963 | 0.0469 | 0.3527 |
CD | 0.6206 | 0.2085 | 0.3934 | 0.1743 | 0.4916 |
Original | 0.0000 | 0.0000 | 0.0000 | 0.000 | 0.0000 |
Table 2: Composition evaluation per category, reporting CLIP score, VQA accuracy, and CCD.
Models | Relation: CLIP | Relation: VQA | Relation: CCD | Action: CLIP | Action: VQA | Action: CCD | Attribute: CLIP | Attribute: VQA | Attribute: CCD | Counting: CLIP | Counting: VQA | Counting: CCD |
---|---|---|---|---|---|---|---|---|---|---|---|---|
TI (LDM) | 0.6589 | 66.60% | 0.2074 | 0.6523 | 68.69% | 0.2098 | 0.6599 | 72.22% | 0.1331 | 0.6515 | 65.78% | 0.1231 |
TI (SD) | 0.6294 | 70.09% | 0.1735 | 0.6274 | 70.81% | 0.1884 | 0.6360 | 74.75% | 0.1091 | 0.6301 | 68.38% | 0.1020 |
DB | 0.7051 | 82.20% | 0.0542 | 0.6995 | 84.61% | 0.0496 | 0.6862 | 82.24% | 0.0355 | 0.6924 | 78.90% | -0.0016 |
CD | 0.7065 | 82.94% | 0.0471 | 0.7053 | 86.35% | 0.0347 | 0.6940 | 84.20% | 0.0163 | 0.6921 | 79.36% | -0.0054 |
SD | 0.7222 | 83.42% | 0.0403 | 0.7178 | 87.39% | 0.0256 | 0.7053 | 83.85% | 0.0184 | 0.7085 | 81.07% | -0.0206 |
Original | 0.6626 | 87.45% | 0.0000 | 0.6831 | 89.78% | 0.0000 | 0.6306 | 85.79% | 0.0000 | 0.6553 | 78.32% | 0.0000 |
Table 3: Prior metrics and CCD compared with the Human Score (H.S.). The last row gives each metric's correlation with H.S.
Models | PACS: DINO (↑) | PACS: KID (↓) | PACS: CCD (↓) | PACS: H.S. (↑) | ImageNet: DINO (↑) | ImageNet: KID (↓) | ImageNet: CCD (↓) | ImageNet: H.S. (↑) | Composition: CLIP (↑) | Composition: CCD (↓) | Composition: H.S. (↑) |
---|---|---|---|---|---|---|---|---|---|---|---|
TI (LDM) | 0.5073 | 0.0117 | 0.0478 | 4.028 | 0.4708 | 0.0552 | 0.0955 | 4.069 | 0.6611 | 0.1684 | 2.851 |
TI (SD) | 0.4104 | 0.0422 | 0.2456 | 4.084 | 0.4457 | 0.0294 | 0.0472 | 4.159 | 0.6309 | 0.1432 | 3.694 |
DB | 0.3925 | 0.1101 | 0.6825 | 3.083 | 0.4525 | 0.0290 | 0.0678 | 4.075 | 0.6919 | 0.0344 | 3.556 |
CD | 0.3956 | 0.0593 | 0.6206 | 3.164 | 0.4450 | 0.0492 | 0.2085 | 3.803 | 0.6968 | 0.0232 | 4.178 |
Correlation | 0.6557 | -0.8252 | -0.9515 | 1.000 | 0.2787 | -0.5347 | -0.9892 | 1.000 | 0.3486 | -0.7342 | 1.000 |
Table 4: Recall (%) for concept alignment.
Model | PACS: Domain | ImageNet: Object | ImageNet: Composition |
---|---|---|---|
TI (LDM) | 72.84 | 64.53 | 58.28 |
TI (SD) | 52.25 | 70.79 | 65.42 |
DB | 24.71 | 67.45 | 39.42 |
CD | 20.12 | 52.06 | 26.31 |
Problem Statement.
Consider a pre-trained text-conditioned diffusion model $\epsilon_\theta$, which can be further fine-tuned on a specific concept $c$. We assume the availability of concept-specific target images from the ConceptBed dataset, denoted as $X_c$. Denote the concept learner fine-tuned on concept $c$ using $X_c$ as $\epsilon_{\theta_c}$. First, we generate a collection of images using the learned concept $c$, and denote this set of images as $\hat{X}_c = \{\epsilon_{\theta_c}(y_c, s)\}$, where $y_c$ is the concept-specific prompt and $s$ is the random seed.
The alignment between two distributions (i.e., $X_c$ and $\hat{X}_c$) is typically computed by first extracting features with a model $f$ (i.e., $f(X_c)$; $f(\hat{X}_c)$) and then employing a distance metric $d$ (i.e., $d(f(X_c), f(\hat{X}_c))$). Several combinations of models ($f$) and distance measures ($d$) have been used in prior work. For concept alignment, Ruiz et al. (2022) use DINO features with cosine similarity, and Kumari et al. (2022) use Inception features with KID. For composition alignment, all prior work uses CLIP with cosine similarity (CLIPScore). However, these methods fail to accurately capture the concept deviations within the generated images, rendering them ineffective for comparing performance across methodologies (as shown in Section 4.2).
Concept Confidence Deviation (CCD).
To address the above limitations, we propose training an oracle classifier $O$ specifically for the concept detection task, using the ConceptBed training dataset $X^{train}$. One can then simply use the predictions of $O$ on $\hat{X}_c$ to verify whether $\hat{X}_c$ is aligned with $c$. However, measuring accuracy does not allow instance-level evaluations. By leveraging the output probabilities of the oracle (concerning the concept label $c$), we can estimate the deviation associated with each generated image $\hat{x} \in \hat{X}_c$ w.r.t. the output probabilities of real target images $X_c^{test}$. Concept Confidence Deviation is defined as:
$$\mathrm{CCD}(c) = \frac{1}{|\hat{X}_c|} \sum_{\hat{x} \in \hat{X}_c} \left( \mathbb{E}_{x \in X_c^{test}} \left[ p_O(c \mid x) \right] - p_O(c \mid \hat{x}) \right) \qquad (3)$$
CCD first calculates the mean target probability on the test ground truth images and then measures the difference from the probability assigned to each generated image. CCD values that are negative or close to zero indicate that the generated images closely follow the distribution of the ground truth concept images. A positive value suggests that the generated images deviate from the original distribution. Figure 4 shows an intuitive example of CCD by calculating the distance between two probability densities corresponding to the real and generated target concept.
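The description above translates directly into a few lines of code. The sketch below assumes the oracle's probabilities for the target concept have already been computed for the real test images and for the generated images; the function name and the toy probabilities are illustrative, not taken from the released code.

```python
from statistics import mean


def concept_confidence_deviation(real_probs, gen_probs):
    """Return (CCD, per-image deviations) for one concept.

    real_probs: oracle probabilities of the target concept on real test images.
    gen_probs:  oracle probabilities of the target concept on generated images.
    """
    if not real_probs or not gen_probs:
        raise ValueError("both probability lists must be non-empty")
    reference = mean(real_probs)  # mean target probability on ground truth images
    deviations = [reference - p for p in gen_probs]  # instance-level deviations
    return mean(deviations), deviations


# Toy example: the oracle is confident on real images, less so on some generations.
ccd, per_image = concept_confidence_deviation(
    real_probs=[0.90, 0.95, 0.85],
    gen_probs=[0.90, 0.50, 0.70],
)
print(round(ccd, 4), [round(d, 4) for d in per_image])
```

Because the per-image deviations are kept, the same function supports both the aggregate score and instance-level analysis.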
3.4 Task Specific Evaluation Settings
To efficiently leverage the ConceptBed evaluation pipeline, we trained separate oracles on the corresponding ConceptBed datasets. Two different types of evaluations are conducted, each with its respective set of oracles: 1) concept alignment, measured by concept classifiers, and 2) compositional reasoning, measured by a VQA model.
Concept Alignment:
Concept alignment evaluation was performed on all tasks, including the generated concept images with different composite text prompts. To evaluate the style, a ResNet18 (He et al. 2015) model is trained to distinguish the images between four style concepts. To evaluate the objects, a ConvNeXt (Liu et al. 2022) model is fine-tuned on 80 classes from the ConceptBed using the ImageNet training subset. The Concept Embedding Model (CEM) (Zarlenga et al. 2022) was trained on CUB to detect the concepts and attributes. Images corresponding to the concepts were generated for each task by following the prompts: “A photo of V*” for objects and “A photo of a entity-name in the style of V*” for styles. Here, entity-name belongs to the seven classes from PACS. The remaining task, composition, utilizes the same pre-trained ConvNeXt model for concept alignment, as ConceptBed compositions are specifically for 80 ImageNet concepts.
Compositional Reasoning:
To measure the image-text alignment with respect to the input prompts, the concept-specific token (V*) was removed and replaced with the corresponding ground truth label (e.g., dogs, cats). The image-text similarity was then measured. Unlike previous works, CLIP was not used due to its inability to capture compositions (Thrush et al. 2022). Instead, following Cho, Zala, and Bansal (2022), we propose to use a pre-trained ViLT (Kim, Son, and Kim 2021) as a VQA model for composition evaluations. Specifically, from each composite prompt, boolean questions with positive answers are generated (Banerjee et al. 2021). As ViLT is essentially a classifier, CCD can be calculated with respect to the confidence of the model associated with a “yes” answer.
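The prompt templates and the placeholder substitution described in this section can be sketched as follows. The two templates are quoted from the text; the list of seven PACS classes is the standard PACS class list and is not spelled out in this paper, and the composite prompt in the example is hypothetical.

```python
PLACEHOLDER = "V*"

# Standard PACS classes (assumption: the seven "entities" referred to in the text).
PACS_CLASSES = ["dog", "elephant", "giraffe", "guitar", "horse", "house", "person"]


def object_prompt():
    """Prompt used to generate images of a learned object concept."""
    return f"A photo of {PLACEHOLDER}"


def style_prompts():
    """One prompt per PACS class for a learned style concept.

    The template is kept verbatim from the text, including the article "a".
    """
    return [f"A photo of a {name} in the style of {PLACEHOLDER}" for name in PACS_CLASSES]


def ground_prompt(prompt, label):
    """Replace the concept token with its ground truth label before image-text scoring."""
    return prompt.replace(PLACEHOLDER, label)


grounded = ground_prompt("Two V* sitting on a bench", "cats")  # hypothetical composite prompt
print(object_prompt())
print(style_prompts()[0])
print(grounded)
```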
4 Experiments & Results
In this section, we benchmark four state-of-the-art concept learning methodologies. We first explain the experimental setup and report the evaluation results using the ConceptBed framework along with human preferences. Additional details about the experimental setup, results, and human evaluations are in the appendix.
4.1 Experimental Setup
In our experiments, we study four text-conditioned diffusion modeling-based concept learning strategies: Textual Inversion (TI) on LDM and SD, DreamBooth (DB) (Ruiz et al. 2022), and Custom Diffusion (CD) (Kumari et al. 2022). We generate images for all concepts to measure the concept alignment and images for 33K composite text prompts. For a total of 284 concepts, we train all four baselines. This leads to 1,100+ concept-specific fine-tuned models (up to 284 × 4 = 1,136), and we generate a total of over 500,000 images for evaluations. To show the stability of CCD, we report the mean performance across the three seeds of oracle training.
4.2 Results
Concept Alignment.
Table 1 shows the overall performance of the baselines in terms of CCD, where a lower score indicates better performance. First, we observe that CCD for concept alignment is low for the original images, suggesting that the oracle is certain about its predictions. Second, Custom Diffusion performs poorly, while Textual Inversion (SD) outperforms the other methodologies except when learning styles. We attribute this behavior to differences in textual encoders. LDM trains a BERT-style textual encoder from scratch, while SD uses pre-trained CLIP to condition the diffusion model. CLIP contains vast image-text knowledge, leading to better performance on learning objects but less flexibility to learn different styles as a concept. Surprisingly, if we compare the concept alignment performance with and without composite prompts, we observe that performance drops significantly for all baseline methodologies when composite prompts are used. This shows that existing concept learning methodologies find it difficult to maintain the concepts whenever the prompt contains a composition.
Compositional Reasoning.
Previously, we discussed concept alignment on composite prompts. Table 2 summarizes the evaluations on composition tasks. Here, we observe the opposite trend in results: Custom Diffusion outperforms the other approaches across the composition categories. This result shows the trade-off between learning concepts and maintaining compositionality in recent concept learning methodologies. Moreover, CLIPScore rates the baselines higher than the original image-text pairs, which is an inaccurate estimate.
Qualitative Results.
Figure 3 provides qualitative examples of concept learning. Textual Inversion (LDM) learns the sketch concept very well (the first row), while DreamBooth and Custom Diffusion struggle to learn it. All baselines perform comparatively well in reproducing the learned concept (the second row). Interestingly, in the case of compositions, DreamBooth and Custom Diffusion perform well at the cost of losing concept alignment (the last two rows). At the same time, textual inversion approaches cannot reproduce the compositions (such as “Two V*”), but they maintain concept alignment. Overall, these qualitative examples align with our quantitative results and strengthen our evaluation framework.
Human Evaluations.
We perform human evaluations using Amazon Mechanical Turk for both types of evaluations: 1) concept alignment, to measure the alignment between generated images and ground truth reference images on Domain (PACS) and Objects (ImageNet), and 2) compositional reasoning, to measure the image-text alignment. For concept alignment, we ask human annotators to rate how likely it is that the target image shows the same concept as three reference images. For compositional reasoning, we simply ask the annotators to rate how well the image aligns with the corresponding caption. Table 3 summarizes the performance of prior metrics and the proposed metric (CCD) w.r.t. the Human Score. KID performs better for domains than for objects because image dynamics vary considerably across domains. Kumari et al. (2022) proposed to use KID with LAION-retrieved concept images as a reference instead of ground truth due to the scarcity of reference images. ConceptBed alleviates this limitation, so we use actual ground truth images to report KID, which is more accurate. CCD is strongly correlated with human preferences and outperforms the prior evaluation metrics by a large margin.
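The Correlation row of Table 3 can be reproduced from the table itself. The sketch below computes the Pearson correlation between each metric and the Human Score over the four baselines on PACS, using the values printed in Table 3; that the reported figure is a Pearson coefficient is an inference from the fact that the numbers match, not a statement in the text.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# PACS columns of Table 3, in the order TI (LDM), TI (SD), DB, CD.
human_score = [4.028, 4.084, 3.083, 3.164]
metrics = {
    "DINO": [0.5073, 0.4104, 0.3925, 0.3956],
    "KID": [0.0117, 0.0422, 0.1101, 0.0593],
    "CCD": [0.0478, 0.2456, 0.6825, 0.6206],
}

correlations = {name: pearson(values, human_score) for name, values in metrics.items()}
for name, r in correlations.items():
    print(f"{name}: {r:.4f}")
```

With only four baselines per column, each coefficient rests on four data points, so the comparison is indicative rather than statistically tight.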
Percentage of highly aligned instances.
Using CCD, we can further measure the recall of the concept learning models. DINO and KID do not allow us to measure recall, which makes it hard to investigate the actual quality of the generated images. Table 4 shows the recall (the percentage of highly aligned instances) for the concept alignment results shown in Table 1. Custom Diffusion succeeds roughly once in every four generation attempts, whereas Textual Inversion succeeds at least once in every two attempts. At the same time, when composition prompts are provided, Textual Inversion consistently maintains concept alignment at the cost of composition alignment.
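Given instance-level deviations, a recall-style percentage can be sketched as below. The criterion for a "highly aligned" instance is an assumption here: the sketch counts an image as aligned when its deviation is at or below a hypothetical `threshold` (zero by default), which may differ from the criterion used to produce Table 4.

```python
def aligned_percentage(deviations, threshold=0.0):
    """Percentage of generated images whose deviation is at or below the threshold.

    deviations: per-image values (mean real-image probability minus the
    generated image's probability), so smaller means better aligned.
    threshold: hypothetical cut-off for calling an image "highly aligned".
    """
    if not deviations:
        raise ValueError("deviations must be non-empty")
    hits = sum(1 for d in deviations if d <= threshold)
    return 100.0 * hits / len(deviations)


toy_deviations = [-0.02, 0.0, 0.30, 0.60]
print(aligned_percentage(toy_deviations))        # 50.0
print(aligned_percentage(toy_deviations, 0.35))  # 75.0
```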
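Table 5: CCD on Objects (ImageNet) computed with different oracles, and its correlation with the Human Score.
Models | ConvNeXt | Inception | ViT | Few-Shot |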
---|---|---|---|---|
TI (LDM) | 0.0955 | 0.0773 | 0.1165 | 0.0823 |
TI (SD) | 0.0472 | 0.0201 | 0.0599 | 0.0489 |
DB | 0.0678 | 0.0485 | 0.0786 | 0.0596 |
CD | 0.2085 | 0.1845 | 0.2286 | 0.1384 |
Correlation | -0.9892 | -0.9888 | -0.9816 | -0.9763 |
Generalization. Fine-tuned oracles do not generalize to unseen concepts, which makes CCD unreliable on out-of-distribution (OOD) concepts. Hence, we propose to use a few-shot classifier (5-way 5-shot) instead, which allows generalization to unseen concepts while maintaining a high correlation with human preferences (shown in Table 5). This shows the effectiveness of using oracle confidence, and CCD in particular, as an alternative to DINO, KID, and CLIP.
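One way a few-shot classifier can supply the confidence that CCD needs is sketched below. The text does not specify which few-shot classifier is used, so this example swaps in a simple prototype-based classifier (class means plus a softmax over negative squared distances) purely for illustration; the feature vectors are toy 2-D points and all names are hypothetical.

```python
from math import exp


def build_prototypes(support):
    """Average the support features of each class (5 shots per class here)."""
    prototypes = {}
    for label, features in support.items():
        dim = len(features[0])
        prototypes[label] = [sum(f[i] for f in features) / len(features) for i in range(dim)]
    return prototypes


def class_confidence(query, prototypes):
    """Softmax over negative squared Euclidean distances to the prototypes."""
    logits = {
        label: -sum((q - p) ** 2 for q, p in zip(query, proto))
        for label, proto in prototypes.items()
    }
    top = max(logits.values())  # subtract the max for numerical stability
    weights = {label: exp(value - top) for label, value in logits.items()}
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}


# 5-way 5-shot toy episode: class k is centred at (k, 0) with small offsets.
offsets = [(-0.1, 0.0), (0.1, 0.0), (0.0, 0.1), (0.0, -0.1), (0.0, 0.0)]
support = {
    f"class_{k}": [(k + dx, dy) for dx, dy in offsets]
    for k in range(5)
}
prototypes = build_prototypes(support)
confidence = class_confidence((2.1, 0.05), prototypes)
print(max(confidence, key=confidence.get), round(confidence["class_2"], 3))
```

The per-class confidence returned here plays the role of the oracle probability for the target concept, so it can be fed to a CCD computation for concepts unseen during oracle training.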
5 Related Work
Concept Learning. Concept learning encompasses various problem statements and approaches, depending on the perspective adopted. Concept Bottleneck Models (CBMs) (Koh et al. 2020) and Concept Embedding Models (CEMs) (Zarlenga et al. 2022) treat object attributes as concepts and propose classification strategies to identify these concepts. Neuro Symbolic Concept Learner (NS-CL) (Mao et al. 2019) aims to learn visual concepts by associating them with language semantics, enabling the model to perform visual question answering. Image Inversion Style Concept Learning (Xia et al. 2022), takes a different approach. Its objective is to invert a given concept image back into the latent space of a pre-trained model. However, text-based concept composition is not possible for such models.
Text-to-Image Generative Models. With advances in vector quantization (Van Den Oord, Vinyals et al. 2017) and diffusion modeling (Rombach et al. 2022), text-to-image generation has improved substantially. Notable works such as DALL-E (Ramesh et al. 2021) train transformer models, while current state-of-the-art diffusion-based text-to-image models such as GLIDE (Nichol et al. 2022), LDM (Rombach et al. 2022), and Imagen (Saharia et al. 2022) have surpassed prior approaches (such as StackGAN (Zhang et al. 2017), StackGAN++ (Zhang et al. 2018), TReCS (Koh et al. 2021), and DALL-E (Ramesh et al. 2021)). PixArt-α (Chen et al. 2023) and ECLIPSE (Patel et al. 2023) further enhance T2I methods without depending on heavy compute. Additionally, as shown by Saxon and Wang (2023), these T2I models also have multilingual concept understanding to a certain extent.
Text-to-Image Concept Learning. Text-conditioned diffusion models, such as LDM, have demonstrated their potential for learning novel visual concepts with only a few reference images. Textual Inversion (Gal et al. 2022) proposes learning the embedding corresponding to the placeholder (V*) through optimization. DreamBooth (Ruiz et al. 2022) suggests optimizing the UNet parameters instead of optimizing the placeholder embedding. Custom Diffusion (Kumari et al. 2022) combines both approaches by optimizing the placeholder and key/value weights from the cross-attention layers for faster concept learning. These concept learners are essentially text-conditioned diffusion models and inherit the same limitations of diffusion models. One limitation is the overfitting of concepts and language drift. By optimizing model parameters on a handful of reference images, it is highly likely that the model might overfit the given concept and cannot maintain compositionality. Therefore, in this paper, we propose ConceptBed for systematic evaluations.
Text-to-Image Generative Model Evaluations. The FID (Heusel et al. 2017) score is commonly used to measure generated image quality. CLIPScore (Hessel et al. 2021) is another popular evaluation metric for reference-free image-text alignment. Another study focuses on compositional evaluations of text-to-image models on small subsets (CU-Birds and Oxford-Flowers) (Park et al. 2021). DALL-Eval (Cho, Zala, and Bansal 2022) evaluates reasoning skills on synthetic datasets and social biases of text-to-image generative models. DALL-Eval, VISOR (Gokhale et al. 2022), and LAYOUTBENCH (Cho et al. 2023) evaluate spatial reasoning abilities. Parallel work T2I-CompBench (Huang et al. 2023) also adopts the idea of VQA for accurate composition evaluations. Although general text-to-image model evaluations are well-explored, they lack concept-specific assessments and cannot be used for evaluating concept learning. Therefore, ConceptBed attempts to fill this gap in evaluations of novel visual concept learning abilities.
6 Conclusion
In this paper, we introduce a novel benchmark called ConceptBed, designed to assess the efficacy of text-conditioned diffusion models in learning new concepts (a.k.a. personalized T2I). The ConceptBed benchmark encompasses an end-to-end evaluation pipeline, a comprehensive concept library, and a novel Concept Confidence Deviation (CCD) evaluation metric. We conduct evaluations based on two key criteria: concept alignment and composition alignment. Through extensive experiments, we demonstrate that existing text-conditioned diffusion model-based concept learners exhibit significant limitations in their performance. We perform human evaluations to validate the effectiveness of our proposed evaluation metric (CCD), which shows a strong correlation with human preferences. This finding positions CCD as a viable alternative to human judgments, enabling large-scale and comprehensive evaluations. ConceptBed represents the first large-scale concept-learning dataset that facilitates precise and accurate evaluations of personalized text-to-image generative models.
Acknowledgments
This work was supported by NSF RI grants #1750082 and #2132724, and a grant from Meta AI Learning Alliance. The views and opinions of the authors expressed herein do not necessarily state or reflect those of the funding agencies and employers.
References
- Azizi et al. (2023) Azizi, S.; Kornblith, S.; Saharia, C.; Norouzi, M.; and Fleet, D. J. 2023. Synthetic Data from Diffusion Models Improves ImageNet Classification. arXiv preprint arXiv:2304.08466.
- Banerjee et al. (2021) Banerjee, P.; Gokhale, T.; Yang, Y.; and Baral, C. 2021. WeaQA: Weak Supervision via Captions for Visual Question Answering. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 3420–3435.
- Brown et al. (2020) Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J. D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877–1901.
- Chen et al. (2023) Chen, J.; Yu, J.; Ge, C.; Yao, L.; Xie, E.; Wu, Y.; Wang, Z.; Kwok, J.; Luo, P.; Lu, H.; et al. 2023. PixArt-alpha: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis. arXiv preprint arXiv:2310.00426.
- Cho et al. (2023) Cho, J.; Li, L.; Yang, Z.; Gan, Z.; Wang, L.; and Bansal, M. 2023. Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation. arXiv preprint arXiv:2304.06671.
- Cho, Zala, and Bansal (2022) Cho, J.; Zala, A.; and Bansal, M. 2022. Dall-eval: Probing the reasoning skills and social biases of text-to-image generative transformers. arXiv preprint arXiv:2202.04053.
- Couairon et al. (2022) Couairon, G.; Verbeek, J.; Schwenk, H.; and Cord, M. 2022. Diffedit: Diffusion-based semantic image editing with mask guidance. arXiv preprint arXiv:2210.11427.
- Deng et al. (2009) Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, 248–255. IEEE.
- Fellbaum (2010) Fellbaum, C. 2010. WordNet. In Theory and applications of ontology: computer applications, 231–243. Springer.
- Gal et al. (2022) Gal, R.; Alaluf, Y.; Atzmon, Y.; Patashnik, O.; Bermano, A. H.; Chechik, G.; and Cohen-Or, D. 2022. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618.
- Gokhale et al. (2022) Gokhale, T.; Palangi, H.; Nushi, B.; Vineet, V.; Horvitz, E.; Kamar, E.; Baral, C.; and Yang, Y. 2022. Benchmarking Spatial Relationships in Text-to-Image Generation. arXiv preprint arXiv:2212.10015.
- Guo et al. (2017) Guo, C.; Pleiss, G.; Sun, Y.; and Weinberger, K. Q. 2017. On calibration of modern neural networks. In International conference on machine learning, 1321–1330. PMLR.
- He et al. (2015) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.
- Hendrycks, Mazeika, and Dietterich (2019) Hendrycks, D.; Mazeika, M.; and Dietterich, T. 2019. Deep Anomaly Detection with Outlier Exposure. In International Conference on Learning Representations.
- Hertz et al. (2022) Hertz, A.; Mokady, R.; Tenenbaum, J.; Aberman, K.; Pritch, Y.; and Cohen-Or, D. 2022. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626.
- Hessel et al. (2021) Hessel, J.; Holtzman, A.; Forbes, M.; Le Bras, R.; and Choi, Y. 2021. CLIPScore: A Reference-free Evaluation Metric for Image Captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 7514–7528.
- Heusel et al. (2017) Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; and Hochreiter, S. 2017. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in neural information processing systems, 30.
- Huang et al. (2023) Huang, K.; Sun, K.; Xie, E.; Li, Z.; and Liu, X. 2023. T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
- Kim, Son, and Kim (2021) Kim, W.; Son, B.; and Kim, I. 2021. Vilt: Vision-and-language transformer without convolution or region supervision. In International Conference on Machine Learning, 5583–5594. PMLR.
- Koh et al. (2021) Koh, J. Y.; Baldridge, J.; Lee, H.; and Yang, Y. 2021. Text-to-image generation grounded by fine-grained user attention. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 237–246.
- Koh et al. (2020) Koh, P. W.; Nguyen, T.; Tang, Y. S.; Mussmann, S.; Pierson, E.; Kim, B.; and Liang, P. 2020. Concept bottleneck models. In International Conference on Machine Learning, 5338–5348. PMLR.
- Krishna et al. (2017) Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.-J.; Shamma, D. A.; et al. 2017. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123: 32–73.
- Kumari et al. (2022) Kumari, N.; Zhang, B.; Zhang, R.; Shechtman, E.; and Zhu, J.-Y. 2022. Multi-Concept Customization of Text-to-Image Diffusion. arXiv preprint arXiv:2212.04488.
- Kuznetsova et al. (2020) Kuznetsova, A.; Rom, H.; Alldrin, N.; Uijlings, J.; Krasin, I.; Pont-Tuset, J.; Kamali, S.; Popov, S.; Malloci, M.; Kolesnikov, A.; et al. 2020. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Computer Vision, 128(7): 1956–1981.
- Li et al. (2017) Li, D.; Yang, Y.; Song, Y.-Z.; and Hospedales, T. M. 2017. Deeper, broader and artier domain generalization. In Proceedings of the IEEE international conference on computer vision, 5542–5550.
- Liu et al. (2022) Liu, Z.; Mao, H.; Wu, C.-Y.; Feichtenhofer, C.; Darrell, T.; and Xie, S. 2022. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11976–11986.
- Mao et al. (2019) Mao, J.; Gan, C.; Kohli, P.; Tenenbaum, J. B.; and Wu, J. 2019. The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision. In International Conference on Learning Representations.
- Murphy (2004) Murphy, G. 2004. The big book of concepts. MIT press.
- Naeini, Cooper, and Hauskrecht (2015) Naeini, M. P.; Cooper, G.; and Hauskrecht, M. 2015. Obtaining well calibrated probabilities using bayesian binning. In Proceedings of the AAAI conference on artificial intelligence, volume 29.
- Nichol et al. (2022) Nichol, A. Q.; Dhariwal, P.; Ramesh, A.; Shyam, P.; Mishkin, P.; Mcgrew, B.; Sutskever, I.; and Chen, M. 2022. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. In International Conference on Machine Learning, 16784–16804. PMLR.
- Park et al. (2021) Park, D. H.; Azadi, S.; Liu, X.; Darrell, T.; and Rohrbach, A. 2021. Benchmark for compositional text-to-image synthesis. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1).
- Patel et al. (2023) Patel, M.; Kim, C.; Cheng, S.; Baral, C.; and Yang, Y. 2023. ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations. arXiv preprint arXiv:2312.04655.
- Radford et al. (2021) Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning, 8748–8763. PMLR.
- Ramesh et al. (2021) Ramesh, A.; Pavlov, M.; Goh, G.; Gray, S.; Voss, C.; Radford, A.; Chen, M.; and Sutskever, I. 2021. Zero-shot text-to-image generation. In International Conference on Machine Learning, 8821–8831. PMLR.
- Rombach et al. (2022) Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; and Ommer, B. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10684–10695.
- Ruiz et al. (2022) Ruiz, N.; Li, Y.; Jampani, V.; Pritch, Y.; Rubinstein, M.; and Aberman, K. 2022. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242.
- Saharia et al. (2022) Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E. L.; Ghasemipour, K.; Gontijo Lopes, R.; Karagol Ayan, B.; Salimans, T.; et al. 2022. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35: 36479–36494.
- Saxon and Wang (2023) Saxon, M.; and Wang, W. Y. 2023. Multilingual Conceptual Coverage in Text-to-Image Models. In Rogers, A.; Boyd-Graber, J.; and Okazaki, N., eds., Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 4831–4848. Toronto, Canada: Association for Computational Linguistics.
- Schuhmann et al. (2022) Schuhmann, C.; Beaumont, R.; Vencu, R.; Gordon, C. W.; Wightman, R.; Cherti, M.; Coombes, T.; Katta, A.; Mullis, C.; Wortsman, M.; Schramowski, P.; Kundurthy, S. R.; Crowson, K.; Schmidt, L.; Kaczmarczyk, R.; and Jitsev, J. 2022. LAION-5B: An open large-scale dataset for training next generation image-text models. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
- Thrush et al. (2022) Thrush, T.; Jiang, R.; Bartolo, M.; Singh, A.; Williams, A.; Kiela, D.; and Ross, C. 2022. Winoground: Probing vision and language models for visio-linguistic compositionality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5238–5248.
- Trabucco et al. (2023) Trabucco, B.; Doherty, K.; Gurinas, M.; and Salakhutdinov, R. 2023. Effective data augmentation with diffusion models. arXiv preprint arXiv:2302.07944.
- Van Den Oord, Vinyals et al. (2017) Van Den Oord, A.; Vinyals, O.; et al. 2017. Neural discrete representation learning. Advances in neural information processing systems, 30.
- Wah et al. (2011) Wah, C.; Branson, S.; Welinder, P.; Perona, P.; and Belongie, S. 2011. The caltech-ucsd birds-200-2011 dataset.
- Xia et al. (2022) Xia, W.; Zhang, Y.; Yang, Y.; Xue, J.-H.; Zhou, B.; and Yang, M.-H. 2022. Gan inversion: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Zarlenga et al. (2022) Zarlenga, M. E.; Pietro, B.; Gabriele, C.; Giuseppe, M.; Giannini, F.; Diligenti, M.; Zohreh, S.; Frederic, P.; Melacci, S.; Adrian, W.; et al. 2022. Concept embedding models: Beyond the accuracy-explainability trade-off. In Advances in Neural Information Processing Systems, volume 35, 21400–21413. Curran Associates, Inc.
- Zhang et al. (2017) Zhang, H.; Xu, T.; Li, H.; Zhang, S.; Wang, X.; Huang, X.; and Metaxas, D. N. 2017. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE international conference on computer vision, 5907–5915.
- Zhang et al. (2018) Zhang, H.; Xu, T.; Li, H.; Zhang, S.; Wang, X.; Huang, X.; and Metaxas, D. N. 2018. Stackgan++: Realistic image synthesis with stacked generative adversarial networks. IEEE transactions on pattern analysis and machine intelligence, 41(8): 1947–1962.
Appendix A Social Impact
In this paper, we introduce ConceptBed, a novel benchmark and evaluation framework designed for conducting comprehensive studies on few-shot concept learning using T2I diffusion models. Previous evaluations of recent works in this field have been limited to a small number of test concepts, which hinders our understanding of their practical applicability. Through our benchmark, we demonstrate that while current concept learners exhibit impressive performance, a substantial gap remains that must be addressed. As pioneers in constructing this extensive evaluation set, we anticipate that future research will incorporate a broader range of potential concepts. Additionally, we propose a novel evaluation metric and framework that can be applied to any concept learning setting, extending its efficacy beyond the confines of the ConceptBed dataset. Ultimately, this research directly contributes to the advancement of Human-Level Artificial Intelligence (HLAI) objectives, fostering the development of more robust and capable systems.
Appendix B Extended Related Work
Evaluations of T2I Concept Learners. Previous studies on concept learning have conducted evaluations and model comparisons using their own test sets. For instance, Textual Inversion (Gal et al. 2022) employed approximately 20 concepts with around 27 unique compositions, DreamBooth (Ruiz et al. 2022) utilized 30 concepts with 50 unique compositions, and Custom Diffusion (Kumari et al. 2022) employed 10 concepts with 24 unique compositions. Notably, these works were evaluated on a relatively small subset of concepts and a limited list of compositions. To address the limitations of these small, method-specific evaluation sets, we introduce the ConceptBed dataset, which consists of 284 concepts and over 33,000 compositions. Additionally, we present an automated procedure for concept and composition collection, enabling the creation of large-scale datasets.
Downstream Applications of Diffusion Models. In addition to concept learning, diffusion models have demonstrated potential for various downstream applications. For example, approaches such as prompt-to-prompt (Hertz et al. 2022) and DiffEdit (Couairon et al. 2022) have been proposed for image editing tasks. In another case, diffusion-generated images have shown improvements in ImageNet accuracy (Azizi et al. 2023). Furthermore, methods similar to textual inversion have been found to enhance few-shot classification performance (Trabucco et al. 2023).
Out-of-Distribution Detection and Domain Adaptation/Generalization. While out-of-distribution detection and domain adaptation/generalization have largely been explored independently, they share a common focus on measuring and controlling model confidence. Prior works have employed various confidence quantification methods, including: 1) Expected Calibration Error (ECE), a popular metric for assessing classifier calibration that measures the difference between model accuracy and its predicted probability (Naeini, Cooper, and Hauskrecht 2015), and 2) Expected Uncertainty Calibration Error (UCE), a more recently proposed metric that quantifies the miscalibration of uncertainty by calculating the difference between model error and its uncertainty (Guo et al. 2017). Given the high variance observed in diffusion models with respect to hyperparameters, we introduce a novel method, leveraging the ConceptBed dataset, to quantify generation variances and measure deviations using CCD. ECE and UCE can serve as alternative metrics for quantifying deviations and evaluating concept learners. Our experimental results in Appendix G.1 demonstrate that ECE performs as well as CCD in assessing concept alignment. In the context of concept alignment, ECE and UCE can be computed based on generated concept-specific images, without considering the performance on the ground truth target images. Lower values of these metrics indicate better performance, albeit at the cost of explainability regarding the source of errors (e.g., overconfidence or lack of confidence in the model). To address these potential ambiguities, we propose CCD, which measures the discrepancy in probabilities between ground truth and generated concept-specific images, thereby facilitating a more nuanced understanding of the limitations of concept learners.
Appendix C Preliminaries on text-conditioned diffusion models
Diffusion Models: The training procedure of Stable Diffusion can be described as follows. Given a training pair $(x, y)$, the input image $x$ is first mapped to a latent vector $z_0$, which is then noised to obtain $z_t := \alpha_t z_0 + \sigma_t \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$ is a noise term and $\alpha_t, \sigma_t$ are terms that control the noise schedule and sample quality. At training time, the time-conditioned UNet $\epsilon_\theta$ is optimized to predict the noise $\epsilon$ and recover the initial $z_0$, conditioned on the text prompt $y$ through the text encoder $c_\phi$. The model is trained with a squared-error loss on the predicted noise term:
$$\mathcal{L}_{LDM} = \mathbb{E}_{z_0, y, \epsilon, t}\left[\lVert \epsilon - \epsilon_\theta(z_t, t, c_\phi(y)) \rVert_2^2\right] \tag{4}$$
where $t$ is uniformly sampled from $\{1, \dots, T\}$.
At inference time, Stable Diffusion is sampled by iteratively denoising $z_T \sim \mathcal{N}(0, I)$ conditioned on the text prompt $y$. Specifically, at each denoising step $t = T, \dots, 1$, $z_{t-1}$ is obtained from both $z_t$ and the noise term predicted by the UNet, whose inputs are $z_t$ and the text prompt $y$. After the final denoising step, $z_0$ is mapped back to yield the generated image $\hat{x}$.
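The noising step and the squared-error objective described above can be illustrated with a small NumPy sketch. This is only a toy illustration under stated assumptions: the latent shape, the schedule values `alpha_t` and `sigma_t`, and the stand-in "perfect" predictor are hypothetical, and no real UNet or text encoder is involved.

```python
import numpy as np

def noisy_latent(z0, eps, alpha_t, sigma_t):
    """Forward noising: z_t = alpha_t * z0 + sigma_t * eps."""
    return alpha_t * z0 + sigma_t * eps

def noise_prediction_loss(eps_pred, eps):
    """Squared L2 error between true and predicted noise, averaged over the batch."""
    return float(np.mean(np.sum((eps - eps_pred) ** 2, axis=-1)))

rng = np.random.default_rng(0)
z0 = rng.normal(size=(4, 16))      # toy batch of latents (hypothetical shape)
eps = rng.normal(size=z0.shape)    # Gaussian noise term
alpha_t, sigma_t = 0.8, 0.6        # toy schedule values (assumption)
z_t = noisy_latent(z0, eps, alpha_t, sigma_t)

# Stand-in predictor that inverts the noising using z0. A real UNet only
# sees (z_t, t, text embedding) and has to learn this mapping.
eps_perfect = (z_t - alpha_t * z0) / sigma_t
loss_perfect = noise_prediction_loss(eps_perfect, eps)

# A trivial predictor that always outputs zeros incurs a positive loss.
loss_zero = noise_prediction_loss(np.zeros_like(eps), eps)
```

A predictor that recovers the noise exactly drives the loss to (numerically) zero, which is the quantity the training objective minimizes in expectation over timesteps and noise samples.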
Textual-Inversion (TI): TI uses the pre-trained Stable Diffusion model and fine-tunes it to learn a specific concept from a few images. Given a small set of images depicting the target concept, and a rare token (i.e., V*), we want to learn the embedding $v^*$ corresponding to V*. The input text condition can be represented as “A photo of a V*”.
TI follows the same process as Stable Diffusion. Unlike Stable Diffusion, TI optimizes the text conditional encoder ($c_\phi$) with respect to the rarely occurring token V* using the Latent Diffusion Model (LDM) objective function:
$$v^* = \arg\min_{v} \; \mathbb{E}_{z_0, y, \epsilon, t}\left[\lVert \epsilon - \epsilon_\theta(z_t, t, c_\phi(y)) \rVert_2^2\right]$$
Note that $z_t$ is the noised $z_0$, where $z_t = \alpha_t z_0 + \sigma_t \epsilon$. Intuitively, the objective is to correctly remove the added noise during training while optimizing $c_\phi$ only with respect to the embedding of V*. At inference time, a random noise tensor is sampled and a text prompt containing the rare token V* is used to generate the image with the fine-tuned $c_\phi$.
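DreamBooth: While Textual Inversion can be used to learn various concepts depending on the training images and the corresponding set of text prompts, DreamBooth is proposed to learn the specific properties of the target subject: “A photo of a V* dog”. In the case of DreamBooth, we do not optimize $c_\phi$; instead, we optimize the UNet $\epsilon_\theta$.
To overcome the challenges of fine-tuning the full model (overfitting and language drift), DreamBooth adds a class-specific prior-preservation loss. Essentially, this method uses samples generated by the pre-trained diffusion model ($x_{pr}$) to supervise the training. Here, $x_{pr}$ is generated by the frozen pre-trained model from the class prompt $y_{pr}$ (e.g., “A photo of a dog”), with conditioning vector $c_{pr} := c_\phi(y_{pr})$. Therefore, the proposed loss becomes:
$$\mathcal{L}_{DB} = \mathbb{E}_{z_0, y, \epsilon, \epsilon', t, t'}\left[\lVert \epsilon - \epsilon_\theta(z_t, t, c_\phi(y)) \rVert_2^2 + \lambda \lVert \epsilon' - \epsilon_\theta(z^{pr}_{t'}, t', c_{pr}) \rVert_2^2\right]$$
where $z^{pr}_{t'}$ is the noised latent of $x_{pr}$ and $\lambda$ weights the prior-preservation term.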
Custom-Diffusion: For single-concept learning, Custom Diffusion is essentially a combination of Textual Inversion and DreamBooth. The objective function of Custom Diffusion is the same as that of DreamBooth, but instead of optimizing the whole UNet (i.e., $\epsilon_\theta$), Custom Diffusion optimizes the embedding corresponding to V* in $c_\phi$ and the key/value weights of the cross-attention layers of the UNet.
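To make the difference in trainable parameters concrete, the sketch below selects only the cross-attention key/value projections from a list of parameter names. The names are hypothetical examples written in the style used by the Hugging Face Diffusers UNet, where `attn2` is assumed to denote cross-attention and `attn1` self-attention; this is an illustration, not the configuration used in the experiments.

```python
def select_cross_attention_kv(param_names):
    """Return the names of key/value projection weights in cross-attention layers."""
    return [
        name for name in param_names
        if ".attn2." in name and (".to_k." in name or ".to_v." in name)
    ]

# Hypothetical parameter names for a single transformer block.
example_names = [
    "down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_k.weight",
    "down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_q.weight",
    "down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_k.weight",
    "down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_v.weight",
    "down_blocks.0.resnets.0.conv1.weight",
]

trainable = select_cross_attention_kv(example_names)
```

Under this selection, DreamBooth would correspond to training every name in the list, Textual Inversion to training none of them (only the V* embedding), and Custom Diffusion to training the two selected projections plus the V* embedding.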
Appendix D ConceptBed Dataset
D.1 ImageNet Dataset Generation Pipeline
As mentioned in the main text, ImageNet contains 1000 classes, but not all of them appear in day-to-day interactions. Moreover, running experiments on all 1000 classes is computationally very expensive, as one needs to train 4000 models and generate 400,000 images. Therefore, it is important to retain only the concepts that are frequently used in daily life. To measure this real-life importance, we check whether a concept (such as dog) is the subject of a caption in the Visual Genome dataset. If at least 10 captions have the concept as their subject, we add the concept to the ConceptBed library. Additionally, concept learning methodologies can learn new concepts using as few as 4 images. Using all ImageNet images as training data can potentially add noise, as these images are not high resolution. Hence, we further select the top 100 images based on the percentage of object pixels within the image (with a ratio of at least 0.4). Algorithm 2 summarizes the data generation pipeline. It is worth noting that this pipeline can be used to extend ConceptBed and even to train future concept learning methodologies.
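The two filtering steps described above can be sketched as follows. This is a minimal illustration rather than Algorithm 2 itself: it assumes that caption subjects have already been extracted from Visual Genome and that an object-pixel ratio is available per image, and the function names and toy data are hypothetical.

```python
from collections import Counter

def select_concepts(caption_subjects, candidate_concepts, min_captions=10):
    """Keep candidates that are the subject of at least `min_captions` captions."""
    counts = Counter(caption_subjects)
    return [c for c in candidate_concepts if counts[c] >= min_captions]

def select_images(images, min_ratio=0.4, top_k=100):
    """Keep the `top_k` images with the largest object-pixel ratio (>= min_ratio).

    `images` is a list of (image_id, object_pixel_ratio) pairs.
    """
    kept = [item for item in images if item[1] >= min_ratio]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:top_k]

# Toy data: "dog" is a caption subject 12 times, "yawl" only 3 times.
subjects = ["dog"] * 12 + ["yawl"] * 3
concepts = select_concepts(subjects, ["dog", "yawl", "cat"])

toy_images = [("a", 0.9), ("b", 0.3), ("c", 0.5)]
selected = select_images(toy_images, top_k=2)
```

The thresholds (10 captions, ratio 0.4, top 100) follow the text; everything else is a placeholder for the actual data loading.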
D.2 Concept Statistics
Our dataset, ConceptBed, comprises a total of 284 concepts. Among these concepts, 200 are sourced from the CUB dataset, 80 from ImageNet, and 4 from PACS. The concepts and their respective categories are presented in Table 11. We use the CUB dataset for attribute-level analysis, while the ImageNet concepts are included to ensure a diversity of concepts.
D.3 Composition Categorization
We leverage the Visual Genome dataset to create composite prompts for each of the 80 ImageNet concepts. This process yields over 33,000 compositions, resulting in a rich variety of prompts. Table 12 provides detailed statistics on the compositions for each concept. Furthermore, Figure 5 illustrates the distribution of composition categories within the ConceptBed dataset. For the sake of simplicity, ConceptBed contains composite prompts that combine up to two different compositions. To determine the composition type, we employ the GPT-3 (text-davinci-003) model for few-shot classification. Figure 6 showcases the instruction and in-context examples used to categorize each text phrase.
D.4 Question Generations
Caption | Generated Questions |
---|---|
two birds standing on branches | are there two birds ? ; are there branches ? |
tongue hanging out a dog’s mouth | is there tongue ? ; is there a dog’s mouth ? |
cat is licking herself | is there cat ? ; is the cat licking herself ? |
the monkey is holding onto a red bar | is there the monkey ? ; is there a red bar ? |
Propeller blade on an aircraft | is there propeller blade ? ; is there an aircraft ? |
the vine is around clock | is there the vine ? ; is there clock ? |
man driving the boat | is there man ? ; is there the boat ? |
Rather than relying solely on image-text similarity, we evaluate compositions through VQA performance using synthetically generated boolean (yes/no) questions based on the composite text phrases. To create these questions, we manually extract salient words such as nouns, attributes, and verbs, and formulate a question corresponding to each of them. This process yields existence-related questions for which the ground truth answer is always yes. Table 6 provides examples of the questions generated for different composite text prompts.
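The templating step for existence questions can be sketched as below. This is an assumption-laden illustration: the salient phrases and their plurality are taken as given (they are selected manually in our setup), the helper names are hypothetical, and verb-based questions such as "is the cat licking herself ?" are not covered by this simple template.

```python
def existence_question(phrase, plural=False):
    """Turn a salient phrase into a yes/no existence question."""
    verb = "are" if plural else "is"
    return f"{verb} there {phrase} ?"

def questions_for_caption(salient_phrases):
    """`salient_phrases` is a list of (phrase, is_plural) pairs for one caption."""
    return [existence_question(phrase, plural) for phrase, plural in salient_phrases]

# Caption: "two birds standing on branches"
bird_questions = questions_for_caption([("two birds", True), ("branches", True)])

# Caption: "the monkey is holding onto a red bar"
monkey_questions = questions_for_caption([("the monkey", False), ("a red bar", False)])
```

Because every question asks about something stated in the caption, the expected answer for each generated question is "yes".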
Appendix E Experimental Setup
Hyper-Parameter | Textual Inversion (LDM) | Textual Inversion (SD) | DreamBooth | Custom Diffusion |
---|---|---|---|---|
Base Model | LDM | Stable Diffusion - v1.5 | Stable Diffusion - v1.5 | Stable Diffusion - v1.5 |
Optimized | V* | V* | UNet | V* + CrossAttn(k,v) |
Optimization Steps | 3000 | 3000 | 400 | 250 |
Learning Rate | 5e-4 | 5e-4 | 5e-6 | 1e-5 |
Place-Holder Token | * | object | sks | new1 |
Regularizer | - | - | ✔ | ✔ |
Regularization Images | - | - | 200 | 200 |
# of inference steps | 50 | 50 | 50 | 50 |
Guidance Scale | 7.5 | 7.5 | 7.5 | 7.5 |
Noise Scheduler | - | PNDMScheduler | PNDMScheduler | PNDMScheduler |

Hyper-Parameter | Domain (PACS) | Object (ImageNet) | Fine-Grained (CUB) | Compositions |
---|---|---|---|---|
Model Architecture | ResNet18 | ConvNeXt-base | - | ViLT |
Pre-training Dataset | ImageNet | ImageNet | - | MSCOCO, GCC, SBU, VG |
# of target concepts | 4 | 80 | 200/112 | 1 |
Objective Function | NLL | NLL (w/ outlier exposure) | NLL | NLL |
Table 7 presents the hyperparameter details for the baseline methodologies employed in our benchmarking process. To ensure fair comparisons, we generate images for all concepts using the same number of inference steps and guidance scale. Specifically, Textual Inversion (SD), DreamBooth, and Custom Diffusion utilize the Stable Diffusion v1.5 pre-trained model (https://huggingface.co/runwayml/stable-diffusion-v1-5). Regarding Textual Inversion, we explore two variants: 1) Latent Diffusion Model-based, and 2) Stable Diffusion-based. By incorporating different pre-trained models, we aim to investigate their impact on learning novel concepts, and our findings reveal significant differences. For instance, Textual Inversion (LDM) outperforms Textual Inversion (SD) when learning style as a concept, while the SD version excels in adapting object-level concepts.
Table 8 outlines the hyperparameter settings for each oracle. It is important to note that any type of classifier can be employed as an oracle, as elaborated in Appendix G.2. In our approach, we take pre-trained models and fine-tune them on the concepts using the negative log-likelihood objective. However, it is well established that classifiers may misclassify inputs with high confidence. To address this, for the Object (ImageNet) oracle, we additionally incorporate an outlier-exposure objective function (Hendrycks, Mazeika, and Dietterich 2019).
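As a rough illustration of how an outlier-exposure term combines with the negative log-likelihood objective, the NumPy sketch below adds a cross-entropy-to-uniform penalty on outlier samples, following the general form proposed by Hendrycks, Mazeika, and Dietterich (2019). The weight `lam` and the toy logits are assumptions for illustration; the values used for the oracles are not specified here.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def nll(logits, labels):
    """Mean negative log-likelihood of the true labels."""
    logp = log_softmax(logits)
    return float(-logp[np.arange(len(labels)), labels].mean())

def uniform_cross_entropy(logits):
    """Mean cross-entropy between the uniform distribution and the predictions."""
    return float(-log_softmax(logits).mean(axis=-1).mean())

def outlier_exposure_loss(in_logits, labels, out_logits, lam=0.5):
    """NLL on in-distribution data plus a penalty pushing outliers toward uniform.

    `lam` is a hypothetical weighting, not a value taken from the experiments.
    """
    return nll(in_logits, labels) + lam * uniform_cross_entropy(out_logits)

num_classes = 4
in_logits = np.zeros((2, num_classes))    # toy in-distribution logits
labels = np.array([0, 1])
out_logits = np.zeros((3, num_classes))   # toy outlier logits (already uniform)
loss = outlier_exposure_loss(in_logits, labels, out_logits, lam=0.5)

# Confident predictions on outliers are penalized more than uniform ones.
confident_outliers = np.array([[10.0, 0.0, 0.0, 0.0]])
```

The second term is minimized when the classifier assigns a uniform distribution to outliers, which discourages confident predictions on images that do not depict any target concept.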
Appendix F Human Annotations
To evaluate the effectiveness of our ConceptBed evaluation framework, we conducted human evaluations consisting of three distinct studies: object-based concept similarity, style-based concept similarity, and traditional image-text similarity for evaluating compositions. For the object and style-based concept similarity, we asked participants to rate the likelihood of the target image being the same as three reference images on a scale of 1 to 5. A rating of 1 indicated the least similarity, while a rating of 5 indicated an exact match in terms of concept. We ensured that human annotators did not compare generated images from different concept learning strategies; instead, they rated each image independently. Regarding the composition evaluation, we simply asked annotators to rate the image-text similarities on the same 1-5 scale, with 1 representing the least similarity and 5 representing an exact match. Figures 7, 8, and 9 present screenshots of the MTurk interface used for each type of human evaluation.
To ensure comprehensive coverage, we randomly selected 100 generated images for each of the three studies and obtained evaluations from three unique workers for each image, resulting in a total of 900 evaluations from human annotators (3 studies × 100 images × 3 workers). To assess the relationship between the human evaluations and the various baseline evaluation metrics, as well as our method, we computed Pearson’s correlation. Our findings indicate a strong correlation between the human evaluations and our CCD evaluation metric.
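For reference, Pearson’s correlation between a metric and human ratings can be computed as in the sketch below. The ratings and metric values are made-up toy numbers, not results from our study; they only illustrate that a metric where lower is better yields a negative coefficient against ratings where higher is better.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# Hypothetical ratings from three annotators per image (1-5 scale), averaged per image.
ratings = np.array([[5, 5, 5], [4, 4, 4], [4, 5, 3], [2, 2, 2], [1, 1, 1]])
human = ratings.mean(axis=1)

# Hypothetical per-image metric values where lower means better alignment.
metric = [0.02, 0.10, 0.08, 0.45, 0.60]

r = pearson(metric, human)
```

In practice, `np.corrcoef(metric, human)[0, 1]` gives the same value; the explicit version is shown only to make the computation visible.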
Appendix G Ablations
G.1 Different Confidence Measures
Table 9 presents a comprehensive comparison of various confidence quantification metrics employed in Out-Of-Distribution (OOD) detection. Notably, all these metrics outperform the baseline metrics DINO and KID, as evidenced by their consistently high absolute correlation with human scores. This implies that our evaluation framework supports multiple metrics for measuring alignment, since the oracles are trained with supervised learning. Importantly, Accuracy and ECE measure performance over a large collection of generated images, whereas MSP and CCD measure performance at the instance level, which is more useful in practical scenarios where a large set of generated images is not available. Although MSP also achieves a high correlation, an oracle can predict the wrong class with high confidence, and MSP cannot detect this because it is independent of the class label. For instance, MSP on domain alignment attains a noticeably lower correlation with human preferences. Hence, the conditional probability of the target concept is important for measuring instance-level alignment. It is worth noting that the negative sign in the correlation coefficients stems from the opposite orientation of these metrics: lower values of CCD and ECE indicate better performance, while higher human ratings indicate better performance.
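The difference between these measures can be made concrete with a small NumPy sketch operating on oracle softmax outputs. Accuracy, MSP, and ECE follow their standard definitions. The `ccd` function is a simplified reading of the metric (mean target-class confidence on reference images minus that on generated images), stated here as an assumption; the exact definition is given in the main text. All probabilities are toy values.

```python
import numpy as np

def accuracy(probs, labels):
    """Fraction of images whose top prediction matches the target concept."""
    return float((probs.argmax(axis=1) == labels).mean())

def msp(probs):
    """Mean maximum softmax probability (ignores the target label)."""
    return float(probs.max(axis=1).mean())

def target_confidence(probs, labels):
    """Probability assigned to the target concept for each image."""
    return probs[np.arange(len(labels)), labels]

def ece(probs, labels, n_bins=10):
    """Expected calibration error with equal-width confidence bins."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            total += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(total)

def ccd(probs_ref, labels_ref, probs_gen, labels_gen):
    """Simplified CCD (assumption): reference confidence minus generated confidence."""
    return float(target_confidence(probs_ref, labels_ref).mean()
                 - target_confidence(probs_gen, labels_gen).mean())

# Toy oracle outputs for two concepts.
probs_ref = np.array([[0.90, 0.10], [0.20, 0.80]])
labels_ref = np.array([0, 1])
probs_gen = np.array([[0.65, 0.35], [0.75, 0.25]])   # second image misses its concept
labels_gen = np.array([0, 1])

gen_accuracy = accuracy(probs_gen, labels_gen)
gen_msp = msp(probs_gen)
gen_ece = ece(probs_gen, labels_gen)
gen_ccd = ccd(probs_ref, labels_ref, probs_gen, labels_gen)
```

In this toy case the second generated image is confidently assigned to the wrong concept: MSP stays high because it ignores the label, while the label-conditioned confidence gap is large.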
G.2 Choice of Classifiers for Oracles
In Table 10, we explore the impact of utilizing different types of classifiers as oracles. Our analysis encompasses four distinct classifiers with an increasing number of parameters. Intriguingly, the choice of classifier appears to have a negligible effect, as CCD consistently demonstrates strong Pearson correlations with human scores across all four classifiers.
Models | Accuracy (↑) | MSP (↑) | ECE (↓) | CCD (↓) |
---|---|---|---|---|
Textual Inversion (LDM) | 80.07% | 0.8734 | 0.0755 | 0.0955 |
Textual Inversion (SD) | 84.30% | 0.9022 | 0.0623 | 0.0472 |
DreamBooth | 83.17% | 0.8923 | 0.0647 | 0.0678 |
Custom Diffusion | 69.73% | 0.8311 | 0.1382 | 0.2085 |
Original | 89.31% | 0.9276 | 0.0436 | 0.0000 |
Appendix H Qualitative Results
Figure 10 presents qualitative examples showcasing the performance of various baseline methods across different style concepts. Notably, the textual inversion methods demonstrate limitations in preserving object-specific features and accurately learning the desired style. Furthermore, both DreamBooth and Custom Diffusion exhibit challenges in effectively capturing and reproducing the intended styles. In Figure 11, we delve into the object-specific learned concepts obtained through the baseline methodologies. Notably, Custom Diffusion struggles in acquiring and comprehending new concepts, thus explaining its relatively lower performance in terms of concept alignment. To gain further insights, Figure 12 offers a comparison of the generated images using Custom Diffusion at different random seeds. The results indicate that Custom Diffusion successfully generates the learned concepts in three out of four instances. However, when tasked with generating concept-specific images based on composite text prompts, Custom Diffusion struggles to maintain fidelity to the learned concept.
To facilitate a more comprehensive understanding of the ConceptBed benchmark and its results, we have developed an online results explorer, which provides readers with a user-friendly interface for exploring and analyzing the benchmark outcomes.
Appendix I Limitations
We introduce the first comprehensive benchmark for large-scale concept learning, encompassing 284 distinct concepts and a vast collection of 33,000 composite prompts. However, there are infinitely many concepts, and evaluating all of them is next to impossible. Therefore, we recommend that future works benchmark novel methodologies with a combination of ConceptBed and selected qualitative examples. While training and evaluating numerous models on an expanded subset of concepts can be resource-intensive, our approach, ConceptBed, employs an automated strategy that effortlessly scales to incorporate an extensive range of concepts. Our benchmark primarily evaluates concept learning strategies derived from Stable Diffusion models. However, the dataset and evaluation framework we present in ConceptBed can serve as a good foundation for assessing any text-conditioned concept learners, including inversion methodologies. It is important to note that the limitations inherent to Stable Diffusion models, which form the core of our experiments, such as difficulty with spatial relationships, extend to the concept learners built on them. Hence, while ConceptBed utilizes composite text prompts pre-trained on text-to-image models, future work will explore strategies to enable concept learners to adapt rapidly to novel concepts and achieve state-of-the-art performance on our benchmark. In addition to the above, concept learning holds promise for enhancing performance in various application domains, such as refining existing concepts to mitigate potential biases present in Stable Diffusion models and incorporating spatial relations like left/right. These areas offer fertile ground for further exploration and can contribute to the advancement of concept learning techniques. By addressing these limitations and exploring potential application areas, we aim to propel the development of concept learning methods that consistently push the boundaries of performance on the ConceptBed benchmark.
Models | ResNet18 | Inception-V4 | ViT-Large | ConvNeXt |
---|---|---|---|---|
Textual Inversion (LDM) | 0.0107 | 0.0773 | 0.1165 | 0.0955 |
Textual Inversion (SD) | -0.0100 | 0.0201 | 0.0599 | 0.0472 |
DreamBooth | 0.0214 | 0.0485 | 0.0786 | 0.0678 |
Custom Diffusion | 0.1538 | 0.1845 | 0.2286 | 0.2085 |
Original | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
Concept Source | Concepts |
---|---|
PACS |
Art-Painting Cartoon Photo Sketch |
ImageNet |
langur hand-held_computer guenon brambling desktop_computer speedboat titi airship tiger_cat organ squirrel_monkey bluetick siamang yawl lifeboat ambulance beagle digital_clock fire_engine Walker_hound gondola pill_bottle fireboat proboscis_monkey moving_van rotisserie slide_rule Irish_wolfhound junco cab magpie robin jeep colobus airliner gibbon letter_opener garbage_truck limousine English_foxhound borzoi baboon basset capuchin convertible analog_clock redbone canoe spider_monkey bulbul Afghan_hound goldfinch patas tabby web_site grand_piano laptop chickadee Dutch_oven black-and-tan_coonhound marmoset chimpanzee macaque police_van tow_truck cleaver howler_monkey bloodhound pickup house_finch beer_bottle notebook water_ouzel orangutan Madagascar_cat gorilla indri beach_wagon jay indigo_bunting |
CUB | Black_footed_Albatross Laysan_Albatross Sooty_Albatross Groove_billed_Ani Crested_Auklet Least_Auklet Parakeet_Auklet Rhinoceros_Auklet Brewer_Blackbird Red_winged_Blackbird Rusty_Blackbird Yellow_headed_Blackbird Bobolink Indigo_Bunting Lazuli_Bunting Painted_Bunting Cardinal Spotted_Catbird Gray_Catbird Yellow_breasted_Chat Eastern_Towhee Chuck_will_Widow Brandt_Cormorant Red_faced_Cormorant Pelagic_Cormorant Bronzed_Cowbird Shiny_Cowbird Brown_Creeper American_Crow Fish_Crow Black_billed_Cuckoo Mangrove_Cuckoo Yellow_billed_Cuckoo Gray_crowned_Rosy_Finch Purple_Finch Northern_Flicker Acadian_Flycatcher Great_Crested_Flycatcher Least_Flycatcher Olive_sided_Flycatcher Scissor_tailed_Flycatcher Vermilion_Flycatcher Yellow_bellied_Flycatcher Frigatebird Northern_Fulmar Gadwall American_Goldfinch European_Goldfinch Boat_tailed_Grackle Eared_Grebe Horned_Grebe Pied_billed_Grebe Western_Grebe Blue_Grosbeak Evening_Grosbeak Pine_Grosbeak Rose_breasted_Grosbeak Pigeon_Guillemot California_Gull Glaucous_winged_Gull Heermann_Gull Herring_Gull Ivory_Gull Ring_billed_Gull Slaty_backed_Gull Western_Gull Anna_Hummingbird Ruby_throated_Hummingbird Rufous_Hummingbird Green_Violetear Long_tailed_Jaeger Pomarine_Jaeger Blue_Jay Florida_Jay Green_Jay Dark_eyed_Junco Tropical_Kingbird Gray_Kingbird Belted_Kingfisher Green_Kingfisher Pied_Kingfisher Ringed_Kingfisher White_breasted_Kingfisher Red_legged_Kittiwake Horned_Lark Pacific_Loon Mallard Western_Meadowlark Hooded_Merganser Red_breasted_Merganser Mockingbird Nighthawk Clark_Nutcracker White_breasted_Nuthatch Baltimore_Oriole Hooded_Oriole Orchard_Oriole Scott_Oriole Ovenbird Brown_Pelican White_Pelican Western_Wood_Pewee Sayornis American_Pipit Whip_poor_Will Horned_Puffin Common_Raven White_necked_Raven American_Redstart Geococcyx Loggerhead_Shrike Great_Grey_Shrike Baird_Sparrow Black_throated_Sparrow Brewer_Sparrow Chipping_Sparrow Clay_colored_Sparrow House_Sparrow Field_Sparrow Fox_Sparrow Grasshopper_Sparrow Harris_Sparrow Henslow_Sparrow Le_Conte_Sparrow Lincoln_Sparrow Nelson_Sharp_tailed_Sparrow Savannah_Sparrow Seaside_Sparrow Song_Sparrow Tree_Sparrow Vesper_Sparrow White_crowned_Sparrow White_throated_Sparrow Cape_Glossy_Starling Bank_Swallow Barn_Swallow Cliff_Swallow Tree_Swallow Scarlet_Tanager Summer_Tanager Artic_Tern Black_Tern Caspian_Tern Common_Tern Elegant_Tern Forsters_Tern Least_Tern Green_tailed_Towhee Brown_Thrasher Sage_Thrasher Black_capped_Vireo Blue_headed_Vireo Philadelphia_Vireo Red_eyed_Vireo Warbling_Vireo White_eyed_Vireo Yellow_throated_Vireo Bay_breasted_Warbler Black_and_white_Warbler Black_throated_Blue_Warbler Blue_winged_Warbler Canada_Warbler Cape_May_Warbler Cerulean_Warbler Chestnut_sided_Warbler Golden_winged_Warbler Hooded_Warbler Kentucky_Warbler Magnolia_Warbler Mourning_Warbler Myrtle_Warbler Nashville_Warbler Orange_crowned_Warbler Palm_Warbler Pine_Warbler Prairie_Warbler Prothonotary_Warbler Swainson_Warbler Tennessee_Warbler Wilson_Warbler Worm_eating_Warbler Yellow_Warbler Northern_Waterthrush Louisiana_Waterthrush Bohemian_Waxwing Cedar_Waxwing American_Three_toed_Woodpecker Pileated_Woodpecker Red_bellied_Woodpecker Red_cockaded_Woodpecker Red_headed_Woodpecker Downy_Woodpecker Bewick_Wren Cactus_Wren Carolina_Wren House_Wren Marsh_Wren Rock_Wren Winter_Wren Common_Yellowthroat |
Concept | Action | Attribute | Counting | Relation | Overall |
---|---|---|---|---|---|
laptop | 17 | 18 | 2 | 40 | 52 |
tow_truck | 97 | 348 | 35 | 409 | 645 |
hand-held_computer | 17 | 18 | 2 | 40 | 52 |
gorilla | 9 | 11 | 0 | 11 | 19 |
chimpanzee | 9 | 11 | 0 | 11 | 19 |
pickup | 97 | 348 | 35 | 409 | 645 |
yawl | 116 | 178 | 36 | 466 | 567 |
beagle | 380 | 807 | 24 | 496 | 1222 |
bulbul | 62 | 270 | 11 | 142 | 374 |
spider_monkey | 9 | 11 | 0 | 11 | 19 |
borzoi | 380 | 807 | 24 | 496 | 1222 |
analog_clock | 1 | 61 | 10 | 71 | 109 |
letter_opener | 11 | 10 | 0 | 21 | 31 |
water_ouzel | 62 | 270 | 11 | 142 | 374 |
web_site | 17 | 18 | 2 | 40 | 52 |
garbage_truck | 97 | 348 | 35 | 409 | 645 |
bloodhound | 380 | 807 | 24 | 496 | 1222 |
basset | 380 | 807 | 24 | 496 | 1222 |
proboscis_monkey | 9 | 11 | 0 | 11 | 19 |
Dutch_oven | 58 | 85 | 13 | 111 | 194 |
fireboat | 116 | 178 | 36 | 466 | 567 |
black-and-tan_coonhound | 380 | 807 | 24 | 496 | 1222 |
speedboat | 116 | 178 | 36 | 466 | 567 |
beach_wagon | 98 | 213 | 17 | 363 | 497 |
airliner | 4 | 8 | 2 | 11 | 20 |
titi | 9 | 11 | 0 | 11 | 19 |
marmoset | 9 | 11 | 0 | 11 | 19 |
beer_bottle | 1 | 9 | 0 | 14 | 20 |
magpie | 62 | 270 | 11 | 142 | 374 |
Irish_wolfhound | 380 | 807 | 24 | 496 | 1222 |
lifeboat | 116 | 178 | 36 | 466 | 567 |
brambling | 62 | 270 | 11 | 142 | 374 |
rotisserie | 58 | 85 | 13 | 111 | 194 |
junco | 62 | 270 | 11 | 142 | 374 |
ambulance | 98 | 213 | 17 | 363 | 497 |
gondola | 116 | 178 | 36 | 466 | 567 |
tabby | 424 | 992 | 42 | 727 | 1592 |
cleaver | 11 | 10 | 0 | 21 | 31 |
limousine | 98 | 213 | 17 | 363 | 497 |
desktop_computer | 17 | 18 | 2 | 40 | 52 |
colobus | 9 | 11 | 0 | 11 | 19 |
house_finch | 62 | 270 | 11 | 142 | 374 |
chickadee | 62 | 270 | 11 | 142 | 374 |
cab | 98 | 213 | 17 | 363 | 497 |
notebook | 17 | 18 | 2 | 40 | 52 |
squirrel_monkey | 9 | 11 | 0 | 11 | 19 |
digital_clock | 1 | 61 | 10 | 71 | 109 |
canoe | 116 | 178 | 36 | 466 | 567 |
indri | 9 | 11 | 0 | 11 | 19 |
English_foxhound | 380 | 807 | 24 | 496 | 1222 |
airship | 4 | 8 | 2 | 11 | 20 |
capuchin | 9 | 11 | 0 | 11 | 19 |
tiger_cat | 424 | 992 | 42 | 727 | 1592 |
bluetick | 380 | 807 | 24 | 496 | 1222 |
Afghan_hound | 380 | 807 | 24 | 496 | 1222 |
moving_van | 97 | 348 | 35 | 409 | 645 |
jay | 62 | 270 | 11 | 142 | 374 |
police_van | 97 | 348 | 35 | 409 | 645 |
howler_monkey | 9 | 11 | 0 | 11 | 19 |
langur | 9 | 11 | 0 | 11 | 19 |
gibbon | 9 | 11 | 0 | 11 | 19 |
redbone | 380 | 807 | 24 | 496 | 1222 |
organ | 3 | 24 | 12 | 41 | 68 |
slide_rule | 17 | 18 | 2 | 40 | 52 |
goldfinch | 62 | 270 | 11 | 142 | 374 |
pill_bottle | 1 | 9 | 0 | 14 | 20 |
siamang | 9 | 11 | 0 | 11 | 19 |
convertible | 98 | 213 | 17 | 363 | 497 |
baboon | 9 | 11 | 0 | 11 | 19 |
Walker_hound | 380 | 807 | 24 | 496 | 1222 |
guenon | 9 | 11 | 0 | 11 | 19 |
indigo_bunting | 62 | 270 | 11 | 142 | 374 |
grand_piano | 3 | 24 | 12 | 41 | 68 |
fire_engine | 97 | 348 | 35 | 409 | 645 |
robin | 62 | 270 | 11 | 142 | 374 |
macaque | 9 | 11 | 0 | 11 | 19 |
orangutan | 9 | 11 | 0 | 11 | 19 |
jeep | 98 | 213 | 17 | 363 | 497 |
patas | 9 | 11 | 0 | 11 | 19 |
Madagascar_cat | 9 | 11 | 0 | 11 | 19 |
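
Two patterns in the per-concept prompt counts above are easy to miss by eye. Many concepts share identical rows, and the Overall column is smaller than the sum of the four category columns. The sketch below checks both on a hand-copied subset of rows. One possible reading, which the table itself does not state, is that related concepts draw on a shared prompt pool and that a single prompt can be counted under more than one category. The names `ROWS`, `groups`, and `excess` are illustrative only.

```python
from collections import defaultdict

# A subset of rows copied from the table above:
# concept -> (action, attribute, counting, relation, overall)
ROWS = {
    "laptop":             (17, 18, 2, 40, 52),
    "hand-held_computer": (17, 18, 2, 40, 52),
    "gorilla":            (9, 11, 0, 11, 19),
    "chimpanzee":         (9, 11, 0, 11, 19),
    "tow_truck":          (97, 348, 35, 409, 645),
    "pickup":             (97, 348, 35, 409, 645),
    "beagle":             (380, 807, 24, 496, 1222),
    "borzoi":             (380, 807, 24, 496, 1222),
    "tabby":              (424, 992, 42, 727, 1592),
    "analog_clock":       (1, 61, 10, 71, 109),
}

# Group concepts whose five counts are identical.
groups = defaultdict(list)
for concept, counts in ROWS.items():
    groups[counts].append(concept)

# How much the four category counts together exceed the Overall count.
excess = {concept: sum(counts[:4]) - counts[4] for concept, counts in ROWS.items()}

for counts, concepts in groups.items():
    print(counts, "->", ", ".join(concepts))
for concept, extra in excess.items():
    print(f"{concept}: category sum exceeds Overall by {extra}")
```

On this subset the ten concepts collapse into six distinct rows, and the category sum exceeds Overall for every concept (for example, 77 versus 52 for `laptop`).
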