
¹ Dartmouth College, Hanover NH 03755, USA
² Meta, FAIR
{maxwell.m.aladago.gr, LT, soroush.voshoughi}@dartmouth.edu

Semantic Compositions Enhance Vision-Language Contrastive Learning

Maxwell Mbabilla Aladago¹, Lorenzo Torresani¹,², Soroush Vosoughi¹
Abstract

In the field of vision-language contrastive learning, models such as CLIP capitalize on matched image-caption pairs as positive examples and leverage within-batch non-matching pairs as negatives. This approach has led to remarkable outcomes in zero-shot image classification, cross-modal retrieval, and linear evaluation tasks. We show that the zero-shot classification and retrieval capabilities of CLIP-like models can be improved significantly through the introduction of semantically composite examples during pretraining. Inspired by CutMix in vision categorization, we create semantically composite image-caption pairs by merging elements from two distinct instances in the dataset via a novel procedure. Our method fuses the captions and blends 50% of each image to form a new composite sample. This simple technique (termed CLIP-𝒞 for CLIP Compositions), devoid of any additional computational overhead or increase in model parameters, significantly improves zero-shot image classification and cross-modal retrieval. The benefits of CLIP-𝒞 are particularly pronounced in settings with relatively limited pretraining data.

Keywords:
multimodal learning · semantic composition pretraining

1 Introduction

Recent advancements in vision-language pretraining have propelled a multitude of tasks, including zero-shot image classification [41, 22, 32], video understanding [52, 59], and various multi-modal applications [27, 54, 36]. These successes echo the transformative trajectory initiated by large-scale pretraining efforts in Computer Vision (CV) [26, 17] and later in Natural Language Processing (NLP) [11, 42, 43, 1]. A prominent recent example in the vision-language interplay is the Contrastive Language-Image Pre-training (CLIP) model [41], which has become a benchmark for language-supervised training.

The objective of contrastive language-supervised pretraining is straightforward yet powerful: to align embeddings of corresponding image-text pairs in a shared embedding space while distancing the non-matching pairs [4, 5, 16]. CLIP has pioneered this direction with a dual-encoder framework, training on an expansive dataset of image-caption pairs sourced from the internet, using a bidirectional contrastive loss [37].

Subsequent studies aimed at improving the data efficiency of CLIP have introduced supplementary objectives, such as within-modality self-supervision [35, 32], multi-crop supervision [32], and the use of captions generated by large language models [13, 28, 59]. These methods typically necessitate additional computational steps, such as multiple forward passes or the inclusion of extra encoders.

Our contribution, termed CLIP-𝒞 (illustrated in Fig. 1), enhances data efficiency through an innovative but straightforward compositional approach, merging original image-caption pairs into novel compound examples. This approach builds upon the achievements of CutMix [56] in the domain of vision categorization tasks, adapting it to the vision-language pretraining context. This adaptation results in significant enhancements in model performance across various downstream tasks, achieved without incurring extra computational costs or increasing model complexity.

Our technique involves composing together two image-caption pairs from the dataset. This is done by conjoining the captions with “and” serving as the conjunction and merging the central crops of both images. The result is a set of compound instances that embody a broader array of concepts than the individual pairs, presenting the model with expanded semantic challenges that drive learning (Fig. 5). The two image-caption pairs constituting the composite instance are sampled dynamically in each iteration based on a predefined probability, empowering the model to uncover novel combinations of examples throughout training.

Distinguishing itself from stylistic variation methods that manipulate single examples, CLIP-𝒞 leverages “semantic composition” to introduce contextually varied training examples. Interestingly, we discover that the benefits of CLIP-𝒞 arise from other factors besides the diverse nature of these novel semantic associations. Counter-intuitively, we find that the model learns to match compound image-caption pairs more easily than the original plain image-caption pairs. This then initiates a positive spillover effect where the model is able to learn better representations of plain unmodified examples in CLIP-𝒞 compared to CLIP.

Like [13], our approach improves the performance of CLIP in low-data scenarios. However, we enhance the data with compositions of both the images and captions without any reliance on external systems. As a result, we can efficiently and flexibly generate novel captions and images online during training. Moreover, our method does not increase the batch size or the number of iterations needed, maintaining operational parity with CLIP. Indeed, we demonstrate that training CLIP longer or with a higher batch size than CLIP-𝒞 is not sufficient to close the performance gap between the two methods.

In downstream applications, CLIP-𝒞 exhibits a competitive edge, surpassing CLIP by over 5% in cross-modal retrieval accuracy on Flickr30k [53] and showing substantial improvements on MS-COCO [33] in both image-to-text and text-to-image retrieval tasks. Additionally, our model demonstrates impressive gains in zero-shot classification, with a 2% increase on ImageNet [9], and superior linear evaluation results without necessitating any additional model parameters, memory, or dependence on external language processing systems. Even when evaluated on relatively large datasets such as CC12M [3] and RedCaps [10], CLIP-𝒞 still outperforms the baseline CLIP in cross-modal retrieval and zero-shot classification tasks, albeit with decreased margins.

Finally, we believe that CLIP-𝒞 will be particularly beneficial in contexts where image-text datasets do not exist in large quantities or are not easily accessible (e.g., medical images, satellite images, etc.). However, it is not feasible to carry out comprehensive evaluations in these domains precisely because there are no established benchmarks of images with captions for them. Thus, we use evaluations on medium-size Web-derived datasets to demonstrate the potential value of our approach in application scenarios where in-domain data is not as abundantly available as for general natural images.

2 Related Works

The use of language as an effective supervisory signal for learning visual representations has a rich history in machine learning [40, 15, 24, 30, 41]. Early influential works such as DeViSE [15] first learned semantic relations using unannotated textual data before mapping the images into that semantic space using class labels. More recently, models like CLIP [41], ALIGN [22], and others [35, 32, 20, 21] further improved the capabilities of joint vision-language embedding models by training on massive image-text paired datasets contrastively using the InfoNCE loss [37]. Our work aligns with these prior arts but focuses on incorporating semantic compositions during pretraining to improve data efficiency and enhance performance.

Both CLIP [41] and ALIGN [22] use huge datasets: 400 million and 1B image-text pairs for CLIP and ALIGN, respectively. DeCLIP [32] improves the data efficiency of CLIP by incorporating several training objectives, including self-supervision within each modality [5, 11], nearest-neighbor supervision, and multi-view supervision [2]. SLIP [35], on the other hand, adds image self-supervision, SimCLR [4], to the language supervision. In [50], Wu et al. show good zero-shot results in the low data regime through soft image-text matches via optimal transport distillation. These methods, however, require multiple passes through the image encoder [35, 32] for each update or a first-in-first-out feature queue [32] to generate the representations for the extra objectives. Our method is free of these additional complexities.

Most similar to our work are data augmentation methods such as CutMix [56] and MixUP [58], which have been very effective in training categorization models in computer vision. Our work brings the benefits of these established augmentation techniques in image understanding to the vision-language joint-embedding space. In addition to the incorporation of language, our method differs from CutMix by concatenating the image crops instead of pasting one crop on the other. Additionally, we train our models using contrastive loss. As our method is a pre-training mechanism, we do not discuss works [23, 19, 57, 31, 55] that use open-source CLIP checkpoints in transfer learning scenarios.

Figure 1: CLIP-𝒞: We use the center half crops spanning the width (as in this illustration) or the height of the image. The captions are concatenated with the delimiter “and”. We vary the positions of the captions on either side of the conjunction, i.e., the output caption can be either (a) {caption1 and caption2} or (b) {caption2 and caption1}. We emphasize that only a fraction of the batch in each iteration constitutes composite samples. The colored boxes and texts shown here are for illustrative purposes.

3 Method

This section reviews the baseline method and then describes the core components of the CLIP-𝒞 framework.

3.1 Background

Contrastive Language-Image Pre-training (CLIP) from Radford et al. [41] has emerged as a highly successful approach for training vision-language models. CLIP is a dual encoder model with separate encoders $f_I$ and $f_T$ for extracting visual and textual features, respectively. It also has two dedicated projection functions $g_I$ and $g_T$ that map the outputs of the encoders to a shared embedding space. Given a batch of $B$ image-text pairs $\{x_I^{(i)}, x_T^{(i)}\}_{i=1}^{B}$ in each training step, CLIP computes the embeddings $z_I^{(i)} = g_I(f_I(x_I^{(i)}))$ and $z_T^{(i)} = g_T(f_T(x_T^{(i)}))$, where $z_I^{(i)} \in \mathbb{R}^{d}$ represents the normalized features of image $x_I^{(i)}$ and $z_T^{(i)} \in \mathbb{R}^{d}$ denotes the normalized features of the corresponding caption $x_T^{(i)}$. The loss is evaluated using InfoNCE [37], whereby matching image-text pairs $\{x_I^{(i)}, x_T^{(i)}\}$ constitute the positive samples and non-matching pairs $\{x_I^{(i)}, x_T^{(j)}\}\ \forall j \neq i$ form the negative examples.
A bidirectional loss is computed as

$$\mathcal{L}_{I_{2}T} = -\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp\left(\frac{1}{\tau}\,\text{sim}\left(z_I^{(i)}, z_T^{(i)}\right)\right)}{\sum_{j=1}^{B}\exp\left(\frac{1}{\tau}\,\text{sim}\left(z_I^{(i)}, z_T^{(j)}\right)\right)} \tag{1}$$

$$\mathcal{L}_{T_{2}I} = -\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp\left(\frac{1}{\tau}\,\text{sim}\left(z_I^{(i)}, z_T^{(i)}\right)\right)}{\sum_{k=1}^{B}\exp\left(\frac{1}{\tau}\,\text{sim}\left(z_I^{(k)}, z_T^{(i)}\right)\right)} \tag{2}$$

where the temperature $\tau$ is typically a learnable parameter used to scale the logits. $\tau$ is fixed in all of our ablation experiments, as it has a noticeable impact on the model [29], which makes comparisons across different experiments difficult. $\text{sim}(\cdot,\cdot)$ is a function scoring the similarity between the features. In CLIP [41] and our experiments, $\text{sim}(\cdot,\cdot)$ is set as the dot product function. The total loss is an average of the two losses in Eq. 1 and Eq. 2:

$$\mathcal{L} = \left(\mathcal{L}_{I_{2}T} + \mathcal{L}_{T_{2}I}\right)/2. \tag{3}$$
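To make Eqs. 1 to 3 concrete, the following is a minimal, framework-free sketch of the bidirectional loss. It assumes the embeddings are already L2-normalized and uses the dot product as the similarity. The function name `info_nce` and the list-of-lists representation are illustrative choices, not part of the original implementation, which would use batched tensor operations.

```python
import math


def info_nce(z_img, z_txt, tau=0.01):
    """Bidirectional InfoNCE loss for B matched (image, text) embedding pairs.

    z_img, z_txt: lists of B vectors (lists of floats), assumed L2-normalized;
    row i of z_img matches row i of z_txt.
    """
    # logits[i][j] = sim(z_I^(i), z_T^(j)) / tau, with sim as the dot product.
    logits = [
        [sum(a * b for a, b in zip(zi, zt)) / tau for zt in z_txt]
        for zi in z_img
    ]

    def mean_diagonal_nll(rows):
        # Average negative log-softmax of the diagonal (matching) entries.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
            total += row[i] - log_sum_exp
        return -total / len(rows)

    loss_i2t = mean_diagonal_nll(logits)  # Eq. 1: normalize over captions j
    transposed = [list(col) for col in zip(*logits)]
    loss_t2i = mean_diagonal_nll(transposed)  # Eq. 2: normalize over images k
    return (loss_i2t + loss_t2i) / 2  # Eq. 3
```

With a correctly aligned batch, the loss is lower than with the captions permuted, which is the behaviour the objective rewards.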

3.2 CLIP-𝒞

In each training step, CLIP-𝒞 samples a batch of examples of size $B$, $\{\hat{x}_I^{(i)}, \hat{x}_T^{(i)}\}_{i=1}^{B}$. Any given paired instance $(\hat{x}_I^{(i)}, \hat{x}_T^{(i)})$ is either the original example $(x_I^{(i)}, x_T^{(i)})$ or a composition of that example and another example $(x_I^{(i')}, x_T^{(i')})$, $i \neq i'$, drawn from the dataset. Note that index $i'$ is taken with respect to the dataset size and not the batch size $B$, i.e., sample $i'$ may not be present in the current mini-batch. The proportion of composed samples in any mini-batch is controlled by a sampling rate hyper-parameter $\rho$. The impact of this parameter is discussed in Sec. 6.2.
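One way to realize this sampling step is sketched below. It assumes that each example in the mini-batch is composed independently with probability ρ and that the partner index i′ is drawn uniformly from the rest of the dataset. Both are assumptions made for illustration, since the text only states that ρ controls the proportion of composed samples. The name `sample_partners` is hypothetical.

```python
import random


def sample_partners(batch_indices, dataset_size, rho, rng=random):
    """Return, for each dataset index i in the batch, a partner index i' != i
    (the example is composed) or None (the example is kept as-is).

    Assumption: independent per-example decision with probability rho and a
    uniform draw of i' over the whole dataset, excluding i itself.
    """
    partners = []
    for i in batch_indices:
        if dataset_size > 1 and rng.random() < rho:
            j = rng.randrange(dataset_size - 1)
            if j >= i:
                j += 1  # shift past i so that i' != i
            partners.append(j)
        else:
            partners.append(None)
    return partners
```

Because i′ ranges over the dataset rather than the batch, the partner example generally has to be loaded in addition to the examples already in the mini-batch.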

In the case where $(\hat{x}_I^{(i)}, \hat{x}_T^{(i)})$ is a composite sample, the new caption $\hat{x}_T^{(i)}$ is a concatenation of the two original captions involved: $\hat{x}_T^{(i)} = [x_T^{(i)}, x_T^{(i')}]$, where $[\cdot,\cdot]$ is a string concatenation function with the word “and” as a conjunction. The positions of the captions on either side of this conjunction change, with $x_T^{(i)}$ appearing first fifty percent of the time.
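The caption composition can be sketched in a few lines. The function name `compose_captions` and the optional `rng` argument are illustrative. The only behaviour taken from the text is the “and” conjunction and the 50% chance of either ordering.

```python
import random


def compose_captions(caption_i, caption_j, rng=random):
    """Join two captions with "and", placing caption_i first half of the time."""
    if rng.random() < 0.5:
        return f"{caption_i} and {caption_j}"
    return f"{caption_j} and {caption_i}"
```

For example, `compose_captions("a dog on a beach", "a red bicycle")` yields either "a dog on a beach and a red bicycle" or "a red bicycle and a dog on a beach".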

The new image is composed of the center half crops spanning either the height or the width of each image. For example, if the images have resolution $(S \times S)$, either $(\frac{S}{2} \times S)$ or $(S \times \frac{S}{2})$ center crops are taken from both images and concatenated as illustrated in Fig. 1. We experiment with other forms of image augmentation methods such as MixUP [58] and CutMix [56] in Tab. 8.
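The image composition can be illustrated with the sketch below, which represents an image as a nested list of rows for simplicity. It keeps the central half of each image along one axis and the full extent along the other, then concatenates the two crops along the cropped axis so that the output has the same size as the inputs. The function names and the `axis` convention (0 for rows, 1 for columns) are assumptions for illustration.

```python
def center_half_crop(image, axis):
    """Keep the central half of `image` along `axis` (0: rows, 1: columns)."""
    height, width = len(image), len(image[0])
    if axis == 1:
        half = width // 2
        start = (width - half) // 2
        return [row[start:start + half] for row in image]
    half = height // 2
    start = (height - half) // 2
    return [list(row) for row in image[start:start + half]]


def compose_images(image_a, image_b, axis):
    """Concatenate the center half crops of two equally sized images."""
    crop_a = center_half_crop(image_a, axis)
    crop_b = center_half_crop(image_b, axis)
    if axis == 1:
        # Place the two crops side by side, row by row.
        return [row_a + row_b for row_a, row_b in zip(crop_a, crop_b)]
    # Stack the two crops on top of each other.
    return crop_a + crop_b
```

For two S × S inputs with even S, the result is again S × S, so the composite can be fed to the image encoder without any change to the input pipeline.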

After assembling the mini-batch as described above, CLIP-𝒞 proceeds to extract the image and text features as in CLIP: $\hat{z}_I^{(i)} = g_I(f_I(\hat{x}_I^{(i)}))$ and $\hat{z}_T^{(i)} = g_T(f_T(\hat{x}_T^{(i)}))$. With $\hat{z}_I^{(i)}$ and $\hat{z}_T^{(i)}$ computed, Eq. 1, Eq. 2, and Eq. 3 are used to compute the InfoNCE loss.

The sampling strategy CLIP-𝒞 employs exposes the model to a much higher diversity of images and their corresponding captions compared to the vanilla pretraining pipeline. As a result, we observe much more significant improvements in downstream transfer when the pretraining dataset is small. It is reasonably expected that relatively larger datasets such as RedCaps [10] are already sufficiently diverse and, therefore, may not benefit from our method. Nonetheless, CLIP-𝒞 still does better than CLIP on these large datasets.

4 Experimental Setup

All our experiments use the CLIP framework due to its demonstrated effectiveness, simplicity, and widespread usage. We emphasize that we do not use pretrained CLIP checkpoints from prior works as our method is a pretraining mechanism. Thus, we retrain CLIP on our pretraining datasets and compare it to our approach. Finally, due to resource constraints, we conduct our experiments in the low data and small model regimes. Consequently, we are unable to compare with prior large-scale training systems.

Pretraining Datasets. We use three widely adopted web-crawled datasets of varying sizes and distributions for our experiments: Conceptual Captions [44], Conceptual 12M [3], and RedCaps [10]. These three datasets together enable us to assess the effectiveness of our method across pretraining datasets of different sizes and qualities.

Models. We use Vision Transformer [12] models of various sizes as in [35]. The vision encoder is set to ViT-S/16 [45] in all our ablation experiments unless explicitly specified otherwise. We use ViT-B/16 [12, 45] as the image encoder to demonstrate the efficacy of our method at scale, as we are unable to run much bigger models such as ViT-L/16 because of resource constraints. The text encoder in all our experiments is set to the 38M parameter text Transformer model from [41]. Following previous methods, Byte-Pair encoding is used for tokenization with a context length of 77 and a vocabulary size of 49k. Finally, we fix the temperature parameter at 0.01, the maximum value used in CLIP [41].

Hyper-parameters. We train all our models using PyTorch [39] with a global batch size of 2,048 split across 8 GPUs in a single machine. AdamW [34] is the optimizer during pretraining. All models are pretrained for 40 epochs using a cosine decay learning rate schedule with a base rate of 0.003, a warm-up period of 5 epochs, and a final learning rate of 1e-5. The weight decay parameter is always set to 0.1. Random cropping is the only augmentation applied to the images during pretraining. We refer the reader to the Supplemental for more detailed information about these and other hyper-parameters.
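For reference, the learning rate schedule described above can be sketched as follows. The base rate, final rate, warm-up length, and total number of epochs come from the text. The linear shape of the warm-up and the per-epoch (rather than per-step) granularity are assumptions, and the function name `learning_rate` is illustrative.

```python
import math


def learning_rate(epoch, base_lr=0.003, final_lr=1e-5,
                  warmup_epochs=5, total_epochs=40):
    """Cosine decay from base_lr to final_lr after a warm-up phase.

    `epoch` may be fractional. The warm-up is assumed to be linear from 0.
    """
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    return final_lr + (base_lr - final_lr) * cosine
```

Under these assumptions the rate peaks at 0.003 at the end of epoch 5 and reaches 1e-5 at epoch 40.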

Evaluation. We perform zero-shot evaluation on several classification benchmarks using class names and prompts provided by [41, 35]. First, the embeddings for all classes in a given benchmark are computed with each class embedding being an ensemble of multiple prompt templates. The highest cosine similarity between the image embedding and the class embeddings is then used as the zero-shot prediction.
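The zero-shot procedure above can be sketched as follows, assuming the prompt and image embeddings have already been produced by the text and image encoders. Averaging the normalized prompt embeddings is one common way to form the ensemble and is an assumption here. The function names are illustrative.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def class_embedding(prompt_embeddings):
    """Ensemble one class: average its normalized prompt embeddings, then
    re-normalize (assumed form of the prompt ensemble)."""
    normalized = [l2_normalize(p) for p in prompt_embeddings]
    mean = [sum(column) / len(normalized) for column in zip(*normalized)]
    return l2_normalize(mean)


def zero_shot_predict(image_embedding, class_embeddings):
    """Return the index of the class with the highest cosine similarity."""
    image = l2_normalize(image_embedding)
    similarities = [
        sum(a * b for a, b in zip(image, c)) for c in class_embeddings
    ]
    return max(range(len(similarities)), key=similarities.__getitem__)
```

Since all vectors are unit-normalized, the dot product used in `zero_shot_predict` equals the cosine similarity.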

We test our model on eleven downstream datasets including ImageNet [9], CIFAR-10 [25], CIFAR-100 [25], Caltech-101 [14], Oxford Pets [38], Country211 [41], DTD [7], Sun397 [51], STL-10 [8], RESISC-45 [6], and EuroSAT [18]. Following previous works [35, 13], we use “mean per class accuracy” as the metric for Oxford Pets and Caltech-101. Accuracy is the metric for all other datasets. In addition to the zero-shot analysis, we also conduct zero-shot retrieval in Sec. 5.3, and linear probing evaluations in Sec. 5.4. For the linear probing experiments, we use the standard “train” and “test” splits for training and evaluation whenever possible. In instances where the standard splits are not present (RESISC-45 [6], and EuroSAT [18]), we randomly split the dataset into an 80%-20% ratio for training and testing respectively.
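The difference between the two metrics matters when classes are imbalanced, as the short sketch below shows. The function names are illustrative, and labels are assumed to be hashable class identifiers.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of examples whose prediction matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_per_class_accuracy(y_true, y_pred):
    """Average of the per-class accuracies, weighting every class equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)
```

For labels `[0, 0, 0, 1]` and predictions `[0, 0, 0, 0]`, plain accuracy is 0.75 while mean per class accuracy is 0.5, because the minority class is entirely misclassified.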

5 Results

This section outlines our key comparisons between CLIP and CLIP-𝒞 (our method) on zero-shot image classification, cross-modal retrieval, and linear probing. Before presenting these comparisons, we first explain why our method works.

Figure 2: Counter-intuitively, the model learns to match the composite examples faster compared to the plain instances.
Figure 3: CLIP-𝒞 generally produces higher cosine similarity for matching pairs than CLIP.

5.1 Why is CLIP-𝒞 an Effective Method?

Why would combining multiple different image-caption pairs into single instances during pretraining lead to improvements in downstream evaluations? In other words, why does CLIP-𝒞 work? To investigate this question, we examined the pretraining losses and cosine similarities of both the composite examples and plain examples as the model evolves. This fine-grained tracking of training mechanics provides insights into how the model handles plain examples versus composite examples, and whether there are any differences between the two groups.

Contrary to the expectation that compound examples would be more challenging for the model (since they are multiple examples condensed into single instances), we observed precisely the opposite: as shown in Fig. 3, the loss on the composite examples is lower than the loss on plain examples, especially in the early stages. Our hypothesis for this empirical observation is that the model more easily recognizes compound image-caption pairs because they tend to be structurally different from plain examples. The more interesting development arising from this phenomenon, however, is that the model is encouraged to dedicate more effort to learning the plain examples in CLIP-𝒞 compared to CLIP, as seen in Fig. 3. We believe this elevated learning of plain examples, together with the use of dynamic semantic compositions (see Sec. 6), contributes to the superior capabilities of our method as discussed in the next sections.

Table 1: Zero-shot Image Classification: CLIP-𝒞 is our method. CLIP is the model from [41] trained in our setting. CC3M CLIP-𝒞 models use ρ = 0.3 while CC12M and RedCaps models use ρ = 0.15. Bold numbers are the best in each dataset and architecture comparison.

| Vision Encoder | PT Dataset | Method | Food-101 | CIFAR-10 | CIFAR-100 | Caltech-101 | Pets | DTD | Country211 | Sun397 | STL-10 | RESISC45 | EuroSAT | ImageNet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ViT-S/16 | CC3M | CLIP | 11.6 | 56.1 | 22.7 | 46.9 | 12.9 | 10.5 | 0.6 | 20.5 | 77.0 | 24.5 | 23.7 | 18.5 |
| ViT-S/16 | CC3M | CLIP-𝒞 | 15.1 | 66.4 | 26.9 | 51.9 | 14.5 | 14.8 | 0.7 | 27.2 | 84.6 | 25.4 | 30.7 | 20.5 |
| ViT-B/16 | CC3M | CLIP | 13.8 | 54.8 | 20.4 | 49.8 | 14.9 | 12.2 | 0.7 | 21.9 | 76.0 | 22.7 | 19.6 | 19.6 |
| ViT-B/16 | CC3M | CLIP-𝒞 | 15.7 | 58.0 | 28.5 | 50.1 | 11.4 | 14.2 | 0.7 | 27.8 | 86.8 | 26.1 | 21.3 | 21.2 |
| ViT-B/16 | CC12M | CLIP | 46.9 | 78.0 | 43.0 | 76.2 | 57.2 | 19.3 | 4.8 | 41.2 | 89.7 | 33.8 | 27.8 | 37.9 |
| ViT-B/16 | CC12M | CLIP-𝒞 | 48.1 | 76.8 | 44.8 | 73.5 | 60.8 | 21.9 | 5.0 | 41.1 | 90.3 | 36.2 | 36.1 | 38.5 |
| ViT-B/16 | RedCaps | CLIP | 78.8 | 72.8 | 38.7 | 72.1 | 76.0 | 16.2 | 6.1 | 27.5 | 92.9 | 36.5 | 30.9 | 40.7 |
| ViT-B/16 | RedCaps | CLIP-𝒞 | 79.0 | 73.7 | 42.2 | 72.1 | 77.1 | 18.1 | 6.6 | 29.4 | 94.2 | 41.1 | 34.8 | 41.6 |

5.2 Zero-shot Image Classification

We conduct a thorough study of the transfer learning capabilities of our model in zero-shot image classification on many downstream benchmarks, including ImageNet [9], in Tab. 1. Across different pretraining datasets, our method substantially improves over CLIP. For ViT-S/16, CLIP-$\mathcal{C}$ achieves a 2% top-1 improvement over the baseline CLIP model on ImageNet while outperforming CLIP on 12 out of 12 downstream datasets when pretraining on CC3M. Furthermore, these enhancements are maintained when we scale the vision encoder from ViT-S/16 to ViT-B/16, showing the continued effectiveness of our method over CLIP in a bigger model. When pretraining on RedCaps and CC12M, the gains of CLIP-$\mathcal{C}$ over CLIP on ImageNet are 0.9% and 0.4%, respectively. These results are remarkable, considering that our approach and CLIP both use the same number of parameters, memory, and computational resources during pretraining. Even in the relatively data-rich settings of CC12M and RedCaps, CLIP-$\mathcal{C}$ still improves over CLIP on 11 out of 12 benchmarks for RedCaps and 9 of the 12 benchmarks for CC12M.

Table 2: Zero-shot Cross-modal Retrieval. $\rho$ is set to 0.3 for CC3M and 0.15 for CC12M and RedCaps. As with zero-shot classification, our semantic composition model is nontrivially better than CLIP on zero-shot retrieval.
| Vision Encoder | PT Dataset | Method | Flickr30k Image→Text R@1 | Flickr30k Image→Text R@5 | Flickr30k Text→Image R@1 | Flickr30k Text→Image R@5 | MS-COCO Image→Text R@1 | MS-COCO Image→Text R@5 | MS-COCO Text→Image R@1 | MS-COCO Text→Image R@5 |
|---|---|---|---|---|---|---|---|---|---|---|
| ViT-S/16 | CC3M | CLIP | 35.2 | 62.3 | 25.4 | 49.12 | 17.3 | 39.0 | 13.1 | 31.2 |
| ViT-S/16 | CC3M | CLIP-$\mathcal{C}$ | 40.7 | 70.9 | 30.6 | 57.9 | 21.4 | 45.6 | 16.2 | 36.5 |
| ViT-B/16 | CC3M | CLIP | 36.1 | 65.1 | 26.3 | 52.4 | 18.6 | 41.1 | 13.9 | 32.8 |
| ViT-B/16 | CC3M | CLIP-$\mathcal{C}$ | 39.6 | 69.4 | 31.2 | 58.3 | 22.9 | 46.7 | 17.0 | 37.9 |
| ViT-B/16 | CC12M | CLIP | 61.5 | 87.2 | 46.1 | 74.9 | 36.2 | 64.2 | 25.3 | 49.7 |
| ViT-B/16 | CC12M | CLIP-$\mathcal{C}$ | 66.0 | 87.8 | 49.5 | 75.6 | 38.4 | 65.6 | 26.4 | 51.5 |
| ViT-B/16 | RedCaps | CLIP | 26.8 | 51.9 | 20.5 | 42.5 | 24.3 | 44.8 | 16.7 | 35.7 |
| ViT-B/16 | RedCaps | CLIP-$\mathcal{C}$ | 32.3 | 57.2 | 23.6 | 44.9 | 27.1 | 49.2 | 18.2 | 38.4 |

Table 3: Linear Probing: CLIP-$\mathcal{C}$ (ours) is very competitive with CLIP in linear probe experiments on CC12M and RedCaps. Our method outperforms CLIP on a majority of benchmarks when using CC3M. Additionally, on our largest downstream dataset, ImageNet, CLIP-$\mathcal{C}$ beats CLIP in all settings except when pretraining on CC12M. All CLIP-$\mathcal{C}$ models here are trained using a sampling probability $\rho = 0.3$.

| Vision Encoder | PT Dataset | Method | Food-101 | CIFAR-10 | CIFAR-100 | Caltech-101 | Pets | DTD | Country211 | Sun397 | STL-10 | RESISC45 | EuroSAT | ImageNet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ViT-S/16 | CC3M | CLIP | 63.7 | 84.9 | 65.7 | 79.6 | 69.6 | 60.9 | 12.3 | 63.5 | 91.6 | 88.5 | 95.8 | 55.0 |
| ViT-S/16 | CC3M | CLIP-$\mathcal{C}$ | 64.7 | 85.3 | 66.6 | 81.2 | 69.3 | 61.8 | 12.7 | 64.6 | 92.9 | 88.1 | 95.5 | 56.8 |
| ViT-B/16 | CC3M | CLIP | 66.6 | 85.7 | 67.1 | 79.0 | 71.9 | 59.1 | 12.6 | 63.9 | 91.8 | 89.4 | 96.3 | 58.4 |
| ViT-B/16 | CC3M | CLIP-$\mathcal{C}$ | 67.6 | 86.2 | 67.1 | 81.0 | 72.4 | 62.2 | 13.7 | 66.1 | 93.2 | 90.0 | 95.9 | 59.5 |
| ViT-B/16 | CC12M | CLIP | 79.4 | 91.7 | 74.9 | 88.8 | 83.4 | 67.4 | 16.6 | 72.3 | 95.2 | 91.6 | 97.0 | 68.6 |
| ViT-B/16 | CC12M | CLIP-$\mathcal{C}$ | 79.7 | 91.2 | 75.0 | 88.7 | 83.3 | 68.6 | 17.0 | 72.8 | 95.7 | 91.7 | 96.7 | 68.3 |
| ViT-B/16 | RedCaps | CLIP | 88.7 | 91.4 | 73.5 | 88.3 | 90.2 | 69.1 | 15.7 | 68.6 | 96.8 | 91.7 | 95.4 | 70.4 |
| ViT-B/16 | RedCaps | CLIP-$\mathcal{C}$ | 88.9 | 90.8 | 74.1 | 88.6 | 89.9 | 69.7 | 16.0 | 69.7 | 96.9 | 92.3 | 97.0 | 70.7 |

5.3 Zero-shot Cross-Modal Retrieval

In addition to the zero-shot transfer results detailed in Sec. 5.2, we also analyze the performance of CLIP-$\mathcal{C}$ versus CLIP on zero-shot cross-modal retrieval in Tab. 2. For these evaluations, we use MS-COCO [33] and Flickr30k [53] as the downstream benchmarks. As in the zero-shot transfer setting, CLIP-$\mathcal{C}$ yields significant improvements over the baseline model on both MS-COCO and Flickr30k across different pretraining datasets and model sizes. For example, when using CC3M as the pretraining dataset with ViT-S/16, our method outperforms CLIP on Flickr30k by over 5% absolute top-1 retrieval accuracy in both image-to-text and text-to-image retrieval. The corresponding enhancement on MS-COCO is 4% on image-to-text and 3% on text-to-image retrieval. For both CLIP and our method, we noticed low retrieval results when pretraining on RedCaps, which we believe is related to the data distribution. We leave that analysis to future work.
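For readers unfamiliar with the R@K columns of Tab. 2, the sketch below shows one common way to compute Recall@K from a query-by-candidate similarity matrix. It is an illustration under stated assumptions, not the authors' evaluation code: the function name and argument layout are hypothetical, and each query is allowed several correct candidates because benchmarks such as MS-COCO and Flickr30k pair every image with multiple captions.

```python
def recall_at_k(similarity, ground_truth, k):
    """Percentage of queries with at least one correct candidate in the top k.

    similarity[q][c] is the score of candidate c for query q (higher is better).
    ground_truth[q] is the set of candidate indices that are correct for query q.
    """
    hits = 0
    for q, scores in enumerate(similarity):
        # Rank candidate indices by decreasing similarity to the query.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if any(c in ground_truth[q] for c in ranked[:k]):
            hits += 1
    return 100.0 * hits / len(similarity)
```

For image-to-text retrieval the queries are image embeddings and the candidates are caption embeddings; text-to-image retrieval swaps the two roles.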

5.4 Linear Probe Evaluations

Having verified the efficacy of our method using joint-embedding features in the zero-shot settings, we conduct several linear-probing evaluations in Tab. 3 to test the quality of the learned image features. In these experiments, a randomly initialized linear layer is added on top of the pretrained image encoder $f_I$, which is frozen. The text encoder $f_T$, along with the linear projections $g_I$ and $g_T$, is discarded. We train the linear layer for 50 epochs using a stochastic gradient descent optimizer with a weight decay of 0 and a momentum of 0.9. The linear probe learning rate and mini-batch size for each downstream dataset are provided in Tab. 9 in the Supplemental.
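The protocol above can be illustrated with a small, dependency-free sketch. It assumes the features have already been extracted by the frozen image encoder, so only the linear layer is trained. The 50 epochs, momentum of 0.9, and weight decay of 0 follow the text; the learning rate of 0.1, the mini-batch size of 1, and all function names are placeholder assumptions (the actual per-dataset values are in the Supplemental).

```python
import math
import random


def train_linear_probe(features, labels, num_classes, epochs=50, lr=0.1,
                       momentum=0.9, weight_decay=0.0, seed=0):
    """Train a softmax linear layer on frozen features with SGD + momentum."""
    rng = random.Random(seed)
    dim = len(features[0])
    # Randomly initialized linear layer (weights W, bias b) and momentum buffers.
    W = [[rng.gauss(0.0, 0.01) for _ in range(dim)] for _ in range(num_classes)]
    b = [0.0] * num_classes
    vW = [[0.0] * dim for _ in range(num_classes)]
    vb = [0.0] * num_classes
    order = list(range(len(features)))
    for _ in range(epochs):
        rng.shuffle(order)
        for idx in order:  # mini-batch size of 1 keeps the sketch short
            x, y = features[idx], labels[idx]
            logits = [sum(w * xi for w, xi in zip(W[c], x)) + b[c]
                      for c in range(num_classes)]
            m = max(logits)
            exps = [math.exp(l - m) for l in logits]
            z = sum(exps)
            for c in range(num_classes):
                # Gradient of cross-entropy w.r.t. logit c: softmax minus one-hot.
                g = exps[c] / z - (1.0 if c == y else 0.0)
                for d in range(dim):
                    grad = g * x[d] + weight_decay * W[c][d]
                    vW[c][d] = momentum * vW[c][d] + grad
                    W[c][d] -= lr * vW[c][d]
                vb[c] = momentum * vb[c] + g
                b[c] -= lr * vb[c]
    return W, b


def predict(W, b, x):
    """Return the class with the highest linear score for feature vector x."""
    scores = [sum(w * xi for w, xi in zip(row, x)) + bc for row, bc in zip(W, b)]
    return max(range(len(scores)), key=lambda c: scores[c])
```

Because the encoder is never updated, differences in probe accuracy reflect the quality of the pretrained image features alone.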

Using our proposed compositions, CLIP-$\mathcal{C}$ surpasses CLIP in several linear probe experiments when the pretraining dataset is relatively small, e.g., CC3M, indicating that our method also learns more discriminative image features than CLIP in that regime. These superior generalization capabilities come as a result of exposing the image encoder to a more diverse array of images through the compositions. We stress again that these gains are obtained using the same pretraining datasets (without any external augmentations), computational costs, and number of parameters as the CLIP baseline. The linear probing results are competitive on all downstream benchmarks when using larger pretraining datasets.

6 Ablations

We ablate the various components of our framework including (1) providing more training resources to the CLIP model, (2) the sampling probability $\rho$, (3) semantic versus stylistic compositions, (4) the impact of stochasticity in drawing the second example, and (5) the composition function used for the images. We cover further ablations in Sec. C of the Supplemental, detailing different mechanisms of pairing the examples and other ways of merging captions, including large language model rewrites.

These ablation experiments underscore the importance of using semantically diverse examples in compositions (Sec. 6.3). They also reveal that while incorporating a proportion of CLIP-$\mathcal{C}$ examples in the mini-batch contributes positively to performance, exclusively using such compositions during training detracts from downstream transfer capabilities (Sec. 6.2). Finally, the results demonstrate the necessity of generating compound examples dynamically during training rather than relying on a static set of pre-generated instances (Sec. 6.4). Collectively, these insights affirm the effectiveness of the design principles underpinning our method.

Due to computational constraints, all ablation experiments are conducted using CC3M with ViT-S/16 as the image encoder. Additionally, we present only the zero-shot results of CIFAR-10, CIFAR-100, and ImageNet for the ablation experiments. Also, we are unable to conduct multiple runs of each experiment because of the number and scale of our ablations. As a result, we execute most of our experiments once using a shared fixed random seed.

Table 4: CLIP-$\mathcal{C}$ beats CLIP despite using half the batch size of the CLIP model.

| Method | Batch Size | CIFAR-10 | CIFAR-100 | ImageNet |
|---|---|---|---|---|
| CLIP | 2,048 | 56.1 | 22.7 | 18.5 |
| CLIP-$\mathcal{C}$ (Ours) | 1,024 | 67.7 | 31.1 | 20.1 |

Table 5: Both CLIP and CLIP-$\mathcal{C}$ are consistent across the three initializations.

| Method | CIFAR-10 | CIFAR-100 | ImageNet |
|---|---|---|---|
| CLIP | 57.8 ± 2.36 | 24.6 ± 1.62 | 18.6 ± 0.24 |
| CLIP-$\mathcal{C}$ | 64.7 ± 2.04 | 27.6 ± 0.53 | 20.4 ± 0.35 |

To assess the impact of relying on a single shared random seed, we train three models each for CLIP and CLIP-$\mathcal{C}$ on CC3M with three different random seeds. The results in Tab. 5 indicate that zero-shot performance is consistent across different random initializations.

6.1 Is CLIP-$\mathcal{C}$ a CLIP Model Exposed to More Data?

It could be argued that our method sees many more examples due to the compositions we employ, and that this may be the reason for the observed improvements. In Tab. 4, we show that a CLIP-$\mathcal{C}$ model that uses a batch size of 1,024 examples outperforms the equivalent CLIP model trained with a batch size of 2,048 by 1.6% on ImageNet, strongly indicating that our method is different from, and more impactful than, a technique to increase the CLIP batch size. Similarly, we show in Fig. 4 that training CLIP for $(1+\rho)$ times the number of epochs used for the CLIP-$\mathcal{C}$ model does not close the performance gap (compare CLIP at 52 epochs with CLIP-$\mathcal{C}$ at 40 epochs in Fig. 4). Indeed, the results in Fig. 4 highlight the strong regularization effect of CLIP-$\mathcal{C}$, as its superiority over CLIP emerges towards the later stages of pretraining. The advantage of CLIP-$\mathcal{C}$ over CLIP grows as training duration increases, extending the improvement from +2% zero-shot accuracy on ImageNet when both models are trained for 40 epochs to over +3% when both are trained for 52 epochs. All these empirical results indicate that the gains stem from concrete beneficial qualities of CLIP-$\mathcal{C}$, as discussed in Sec. 5.1, and not from any implicit amplified exposure to data.
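As a worked instance of the epoch comparison above, assume the CC3M sampling probability $\rho = 0.3$ reported in Tab. 1. A 40-epoch CLIP-$\mathcal{C}$ schedule is then matched against a CLIP schedule of

```latex
\[
(1+\rho)\cdot 40 \;=\; (1+0.3)\cdot 40 \;=\; 52 \ \text{epochs}.
\]
```

This is consistent with the 52-epoch CLIP and 40-epoch CLIP-$\mathcal{C}$ runs being compared. The factor $(1+\rho)$ presumably accounts for the second example that each composite instance draws, so that on average a mini-batch touches $(1+\rho)$ times as many original pairs.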

Figure 4: CLIP-$\mathcal{C}$ vs. CLIP. Pretraining CLIP longer than CLIP-$\mathcal{C}$ does not close the performance gap. CLIP-$\mathcal{C}$ becomes even more superior as training duration increases.

Figure 5: Sampling probability $\rho$. Our method is very effective when between 10% and 50% of the mini-batch are CLIP-$\mathcal{C}$ compositions but performs poorly when the entire batch consists of composite instances.

6.2 Sampling Probability $\rho$

The probability with which we create a composite sample instead of using the original image-caption pair is an important parameter in our method: it determines the percentage of the mini-batch that consists of compound instances. When $\rho = 0$, our method is identical to CLIP, as no composition is performed. At the other extreme, when $\rho = 1$, all the examples in each mini-batch are instances of our composition method. As shown in Fig. 5, using a small non-zero sampling rate is more effective than CLIP. However, the performance deteriorates when more than fifty percent of the mini-batch consists of these compound image-text pairs. These results indicate that maintaining a reasonable percentage of the original examples is necessary, likely because the streamlined, non-contradictory learning signal is significantly reduced when a majority of the batch are compositions. Also, since downstream evaluations do not involve such compositions, some exposure to examples with uniform semantic content during pretraining is important for effective transfer.
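The role of $\rho$ can be sketched in a few lines. The snippet assumes an independent coin flip per example, which is one plausible reading of the description above; the function name `composite_mask` is hypothetical.

```python
import random


def composite_mask(batch_size, rho, rng):
    """Decide, independently per example, whether it becomes a composite.

    rho = 0 reproduces plain CLIP batches; rho = 1 makes every example a
    composite. In expectation, a fraction rho of the batch is composite.
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    # rng.random() is uniform on [0, 1), so the comparison is True w.p. rho.
    return [rng.random() < rho for _ in range(batch_size)]
```

With `rng = random.Random(0)`, `composite_mask(2048, 0.3, rng)` marks roughly 30% of a 2,048-example batch for composition, and the marked positions change on every call.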

6.3 Why Semantic Compositions?

We call CLIP-$\mathcal{C}$ compositions semantic because the new instances are not just stylistically different from the constituent original examples; they are also semantically different. Thus, it is fair to question whether this semantic differentiation is important in producing the observed favorable results over CLIP. After all, purely stylistic augmentations that use content from the same examples also increase data diversity and could yield the same outcomes as our semantic compositions. We investigate this prospect in this section. We train a model using two augmentations of the same example instead of two distinct examples as outlined in Sec. 3.2. On the image side, two random crops of the image are taken, simulating two instances. For the text, we employ “Easy Data Augmentation (EDA)” [48] to generate a caption for the second crop, while the first crop uses the original caption. These two stylistically generated examples are then combined using CLIP-$\mathcal{C}$.

Table 6: Semantic Compositions: Using semantically distinct examples is better than stylistic augmentations.

| Method | CIFAR-10 | CIFAR-100 | ImageNet |
|---|---|---|---|
| CLIP | 56.1 | 22.7 | 18.5 |
| Stylistic | 53.7 | 25.0 | 19.0 |
| Semantic (Ours) | 66.4 | 26.9 | 20.5 |

In Tab. 6, it is evident that such stylistic augmentations are sub-optimal compared to the semantic compositions we employ in CLIP-$\mathcal{C}$. On ImageNet, the CLIP-$\mathcal{C}$ model achieves a 1.5% higher absolute top-1 accuracy than the stylistic augmentations model. This suggests that the content of the new instances is important, as the model benefits from the use of distinct examples in the composition. We also note that just increasing the diversity of examples is helpful, as the stylistic augmentations method yields a 0.5% zero-shot accuracy gain over CLIP on ImageNet.

6.4 Impact of Stochasticity During Sampling

Whenever a CLIP-$\mathcal{C}$ composition is activated, the second example is usually chosen randomly from the dataset. This allows every image-caption pair to be paired with any other image-caption pair in the dataset. Moreover, the pairings differ from one epoch to another, thus uncovering novel combinations of examples throughout pretraining.

Table 7: Impact of Stochasticity: Dynamic assignments are more effective than fixed pairings.

| Method | CIFAR-10 | CIFAR-100 | ImageNet |
|---|---|---|---|
| CLIP | 56.1 | 22.7 | 18.5 |
| Fixed | 60.4 | 28.0 | 19.7 |
| Dynamic (Ours) | 66.4 | 26.9 | 20.5 |

We examine the impact of this dynamic nature of CLIP-$\mathcal{C}$ versus using fixed pairs of examples. To do this, for every example $x$, we allocate only one other example $x'$ that is fixed throughout training. Then, whenever $x$ is involved in a CLIP-$\mathcal{C}$ composition, $x'$ is used. The results in Tab. 7 suggest that dynamic compositions lead to better downstream results than fixed compositions. This makes intuitive sense because in the dynamic case, if a particular composition is unhelpful, there is a possibility of changing it in subsequent epochs. This possibility does not exist when the combinations are fixed. In Sec. C.2 of the Supplemental, we also investigate scenarios in which we combine examples whose captions are either close or far apart in the feature space of a sentence embedding model [47].
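The two pairing strategies compared in this ablation can be contrasted in a short sketch. It assumes partners are drawn uniformly from all other examples in the dataset; the helper names `make_fixed_partners` and `dynamic_partner` are hypothetical and not taken from the paper.

```python
import random


def _other_index(i, dataset_size, rng):
    """Draw an index uniformly from {0, ..., dataset_size - 1} excluding i."""
    j = rng.randrange(dataset_size - 1)
    return j if j < i else j + 1  # shift past i so that j != i


def make_fixed_partners(dataset_size, seed=0):
    """Fixed setting: assign each example x one partner x' before training."""
    rng = random.Random(seed)
    return [_other_index(i, dataset_size, rng) for i in range(dataset_size)]


def dynamic_partner(i, dataset_size, rng):
    """Dynamic setting: draw a fresh partner each time example i is composed."""
    return _other_index(i, dataset_size, rng)
```

In the fixed setting, `partners[i]` is looked up whenever example `i` is composed, so an unhelpful pairing persists for the whole run. In the dynamic setting, the same example can meet a different partner in every epoch.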

6.5 Image Composition Function

In this section, we compare our image mixing method with established systems such as CutMix [56] and MixUP [58]. When activated, MixUP executes a weighted pixel-wise summation of the two images, $\hat{x}_I^{(i)} = \omega \cdot x_I^{(i)} + (1-\omega) \cdot x_I^{(j)}$, with the weighting factor $\omega$ sampled from the beta distribution, $\omega \sim \beta(1,1)$. CutMix, on the other hand, takes a random crop from one of the images and pastes it at the same spatial location on the other image. The crop’s dimensions are scaled by the value $\alpha = \sqrt{1-\omega}$, $\omega \sim \beta(1,1)$. That is, $H_{\text{cut}} = \alpha \cdot H$ and $W_{\text{cut}} = \alpha \cdot W$, where $H$ and $W$ are the height and width of the image, respectively.
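The two baseline formulas above translate into the following sketch. To stay dependency-free, images are assumed to be single-channel nested lists of shape H x W. The CutMix box is placed by sampling a top-left corner so that it always fits inside the image, which is a simplification of the original CutMix box sampling.

```python
import math
import random


def mixup(img_i, img_j, omega):
    """Pixel-wise blend: omega * x_i + (1 - omega) * x_j."""
    return [[omega * a + (1.0 - omega) * b for a, b in zip(row_i, row_j)]
            for row_i, row_j in zip(img_i, img_j)]


def cutmix(img_i, img_j, omega, rng):
    """Paste a crop of img_j onto img_i at the same spatial location.

    The crop has size (alpha * H, alpha * W) with alpha = sqrt(1 - omega),
    so it covers roughly a (1 - omega) fraction of the image area.
    """
    H, W = len(img_i), len(img_i[0])
    alpha = math.sqrt(1.0 - omega)
    h_cut, w_cut = int(alpha * H), int(alpha * W)
    top = rng.randrange(H - h_cut + 1)
    left = rng.randrange(W - w_cut + 1)
    out = [row[:] for row in img_i]  # copy so the input is not modified
    for r in range(top, top + h_cut):
        out[r][left:left + w_cut] = img_j[r][left:left + w_cut]
    return out


def sample_omega(rng):
    """Draw the mixing weight from Beta(1, 1), i.e. uniform on [0, 1]."""
    return rng.betavariate(1.0, 1.0)
```

Both functions take $\omega$ as an argument so that the same draw from `sample_omega` can be reused for either baseline.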

Table 8: Image Composition: Our strategy outperforms CutMix and MixUP.

| Function | CIFAR-10 | CIFAR-100 | ImageNet |
|---|---|---|---|
| MixUP [58] | 50.8 | 22.3 | 20.2 |
| CutMix [56] | 54.2 | 26.9 | 20.4 |
| CLIP-$\mathcal{C}$ (Ours) | 66.4 | 26.9 | 20.5 |

Unlike MixUP, our method as depicted in Fig. 1 preserves the integrity of each crop, and does not paste parts of one image on the other as in CutMix. Additionally, using the center-half crop of each image guarantees that substantial portions of both images are represented in the output image. We believe these characteristics of our method are important as demonstrated by its superior zero-shot results over MixUP and CutMix in Tab. 8.
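A rough sketch of a center-half-crop composition is given below for contrast with the two baselines. Several details are assumptions, since the full procedure is specified elsewhere in the paper: the crops are concatenated side by side (or stacked when `axis="height"`), the captions are joined with the word "and", and images are single-channel nested lists of shape H x W. All function names are hypothetical.

```python
def center_half_crop(img, axis):
    """Keep the central half of the image along the given axis."""
    H, W = len(img), len(img[0])
    if axis == "width":
        start = W // 4
        return [row[start:start + W // 2] for row in img]
    start = H // 4
    return [row[:] for row in img[start:start + H // 2]]


def compose_pair(img_i, cap_i, img_j, cap_j, axis="width"):
    """Build one composite image-caption pair from two distinct examples.

    Each source image contributes an intact center-half crop, so the output
    has the original size and 50% of it comes from each image.
    """
    crop_i = center_half_crop(img_i, axis)
    crop_j = center_half_crop(img_j, axis)
    if axis == "width":
        image = [row_i + row_j for row_i, row_j in zip(crop_i, crop_j)]
    else:
        image = crop_i + crop_j
    caption = f"{cap_i} and {cap_j}"  # assumed caption-fusion rule
    return image, caption
```

Unlike `mixup`-style blending, no pixel in the output mixes content from both sources, and unlike CutMix, neither image is partially covered by the other.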

7 Conclusion

We have demonstrated in this study that fast and simple compositions of different image-caption pairs can significantly enhance the effectiveness of language-supervised visual representation learning models. This approach proves particularly beneficial when pretraining on smaller datasets. Our comprehensive analysis shows that CLIP-$\mathcal{C}$, our proposed model, delivers substantial improvements in zero-shot learning tasks over the baseline CLIP model and performs robustly in linear evaluation settings. Our ablation studies provide crucial insights, emphasizing that the observed performance improvements stem not from a mere increase in data augmentation but from the strategic use of semantically distinct examples in compositions. We anticipate that these findings will encourage further exploration into novel and efficient uses of small-scale datasets for vision-language pretraining, especially in settings where it is difficult to curate massive amounts of paired data (e.g., medical and satellite images).

References

  • [1] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language models are few-shot learners. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems. vol. 33, pp. 1877–1901 (2020), https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
  • [2] Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., Joulin, A.: Unsupervised learning of visual features by contrasting cluster assignments. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems. vol. 33, pp. 9912–9924 (2020), https://proceedings.neurips.cc/paper_files/paper/2020/file/70feb62b69f16e0238f741fab228fec2-Paper.pdf
  • [3] Changpinyo, S., Sharma, P., Ding, N., Soricut, R.: Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In: CVPR (2021)
  • [4] Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: III, H.D., Singh, A. (eds.) Proceedings of the 37th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 119, pp. 1597–1607. PMLR (13–18 Jul 2020), https://proceedings.mlr.press/v119/chen20j.html
  • [5] Chen, X., He, K.: Exploring simple siamese representation learning. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 15745–15753 (2021). https://doi.org/10.1109/CVPR46437.2021.01549
  • [6] Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (Oct 2017). https://doi.org/10.1109/jproc.2017.2675998, http://dx.doi.org/10.1109/JPROC.2017.2675998
  • [7] Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., Vedaldi, A.: Describing textures in the wild. In: Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) (2014)
  • [8] Coates, A., Ng, A., Lee, H.: An Analysis of Single Layer Networks in Unsupervised Feature Learning. In: AISTATS (2011)
  • [9] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255 (2009). https://doi.org/10.1109/CVPR.2009.5206848
  • [10] Desai, K., Kaul, G., Aysola, Z.T., Johnson, J.: Redcaps: Web-curated image-text data created by the people, for the people. In: Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1) (2021), https://openreview.net/forum?id=VjJxBi1p9zh
  • [11] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186. Association for Computational Linguistics (Jun 2019). https://doi.org/10.18653/v1/N19-1423, https://aclanthology.org/N19-1423
  • [12] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations (2021), https://openreview.net/forum?id=YicbFdNTTy
  • [13] Fan, L., Krishnan, D., Isola, P., Katabi, D., Tian, Y.: Improving clip training with language rewrites. arXiv:2305.20088 (2023)
  • [14] Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. CVPR Workshop (2004)
  • [15] Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M.A., Mikolov, T.: Devise: A deep visual-semantic embedding model. In: Burges, C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K. (eds.) Advances in Neural Information Processing Systems. vol. 26 (2013), https://proceedings.neurips.cc/paper_files/paper/2013/file/7cce53cf90577442771720a370c3c723-Paper.pdf
  • [16] He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 9726–9735 (2020). https://doi.org/10.1109/CVPR42600.2020.00975
  • [17] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 770–778 (2016). https://doi.org/10.1109/CVPR.2016.90
  • [18] Helber, P., Bischke, B., Dengel, A., Borth, D.: Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification (2017)
  • [19] Hong, T., Guo, X., Ma, J.: Itmix: Image-text mix augmentation for transferring clip to image classification. In: 2022 16th IEEE International Conference on Signal Processing (ICSP). vol. 1, pp. 129–133 (2022). https://doi.org/10.1109/ICSP56322.2022.9965292
  • [20] Hu Xu, S.X., Tan, X.E., Huang, P.Y., Howes, R., Sharma, V., Li, S.W., Ghosh, G., Zettlemoyer, L., Feichtenhofer, C.: Demystifying clip data (2023)
  • [21] Ilharco, G., Wortsman, M., Wightman, R., Gordon, C., Carlini, N., Taori, R., Dave, A., Shankar, V., Namkoong, H., Miller, J., Hajishirzi, H., Farhadi, A., Schmidt, L.: Openclip (Jul 2021). https://doi.org/10.5281/zenodo.5143773
  • [22] Jia, C., Yang, Y., Xia, Y., Chen, Y., Parekh, Z., Pham, H., Le, Q.V., Sung, Y., Li, Z., Duerig, T.: Scaling up visual and vision-language representation learning with noisy text supervision. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning, ICML 2021. Proceedings of Machine Learning Research, vol. 139, pp. 4904–4916 (2021), http://proceedings.mlr.press/v139/jia21b.html
  • [23] Jiang, K., He, X., Xu, R., Wang, X.E.: Comclip: Training-free compositional image and text matching (2023)
  • [24] Joulin, A., van der Maaten, L., Jabri, A., Vasilache, N.: Learning visual features from large weakly supervised data. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) Computer Vision – ECCV 2016. pp. 67–84. Springer International Publishing (2016)
  • [25] Krizhevsky, A.: Learning multiple layers of features from tiny images. Tech. rep., University of Toronto (2009)
  • [26] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Pereira, F., Burges, C., Bottou, L., Weinberger, K. (eds.) Advances in Neural Information Processing Systems. vol. 25 (2012), https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf
  • [27] Kuo, W., Piergiovanni, A., Kim, D., Luo, X., Caine, B., Li, W., Ogale, A., Zhou, L., Dai, A., Chen, Z., Cui, C., Angelova, A.: Mammut: A simple vision-encoder text-decoder architecture for multimodal tasks. Transactions on Machine Learning Research (2023), https://arxiv.org/abs/2303.16839
  • [28] Lai, Z., Zhang, H., Wu, W., Bai, H., Timofeev, A., Du, X., Gan, Z., Shan, J., Chuah, C.N., Yang, Y., Cao, M.: From scarcity to efficiency: Improving clip training via visual-enriched captions (2023)
  • [29] Lazaridou, A., Kuncoro, A., Gribovskaya, E., Agrawal, D., Liska, A., Terzi, T., Gimenez, M., de Masson d'Autume, C., Kocisky, T., Ruder, S., Yogatama, D., Cao, K., Young, S., Blunsom, P.: Mind the gap: Assessing temporal generalization in neural language models. In: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems. vol. 34, pp. 29348–29363 (2021), https://proceedings.neurips.cc/paper_files/paper/2021/file/f5bf0ba0a17ef18f9607774722f5698c-Paper.pdf
  • [30] Li, A., Jabri, A., Joulin, A., van der Maaten, L.: Learning visual n-grams from web data. In: 2017 IEEE International Conference on Computer Vision (ICCV). pp. 4193–4202. IEEE Computer Society (2017). https://doi.org/10.1109/ICCV.2017.449, https://doi.ieeecomputersociety.org/10.1109/ICCV.2017.449
  • [31] Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In: ICML (2023)
  • [32] Li, Y., Liang, F., Zhao, L., Cui, Y., Ouyang, W., Shao, J., Yu, F., Yan, J.: Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. In: International Conference on Learning Representations (2022), https://openreview.net/forum?id=zq1iJkNk3uN
  • [33] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) Computer Vision – ECCV 2014. pp. 740–755 (2014)
  • [34] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2019), https://openreview.net/forum?id=Bkg6RiCqY7
  • [35] Mu, N., Kirillov, A., Wagner, D., Xie, S.: Slip: Self-supervision meets language-image pre-training. arXiv:2112.12750 (2021)
  • [36] Naeem, M.F., Xian, Y., Zhai, X., Hoyer, L., Gool, L.V., Tombari, F.: Silc: Improving vision language pretraining with self-distillation (2023)
  • [37] van den Oord, A., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding (2019)
  • [38] Parkhi, O.M., Vedaldi, A., Zisserman, A., Jawahar, C.V.: Cats and dogs. In: IEEE Conference on Computer Vision and Pattern Recognition (2012)
  • [39] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: Pytorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems 32, pp. 8024–8035 (2019), http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
  • [40] Quattoni, A., Collins, M., Darrell, T.: Learning visual representations using images with captions. In: 2007 IEEE Conference on Computer Vision and Pattern Recognition. pp. 1–8 (2007). https://doi.org/10.1109/CVPR.2007.383173
  • [41] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision (2021)
  • [42] Radford, A., Narasimhan, K.: Improving language understanding by generative pre-training (2018)
  • [43] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners (2019)
  • [44] Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 2556–2565. Association for Computational Linguistics (Jul 2018). https://doi.org/10.18653/v1/P18-1238, https://aclanthology.org/P18-1238
  • [45] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jegou, H.: Training data-efficient image transformers & distillation through attention. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 139, pp. 10347–10357. PMLR (18–24 Jul 2021), https://proceedings.mlr.press/v139/touvron21a.html
  • [46] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C.C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P.S., Lachaux, M.A., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E.M., Subramanian, R., Tan, X.E., Tang, B., Taylor, R., Williams, A., Kuan, J.X., Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S., Scialom, T.: Llama 2: Open foundation and fine-tuned chat models (2023)
  • [47] Wang, W., Wei, F., Dong, L., Bao, H., Yang, N., Zhou, M.: Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers (2020)
  • [48] Wei, J., Zou, K.: EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). pp. 6382–6388. Association for Computational Linguistics (Nov 2019). https://doi.org/10.18653/v1/D19-1670, https://aclanthology.org/D19-1670
  • [49] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Le Scao, T., Gugger, S., Drame, M., Lhoest, Q., Rush, A.: Transformers: State-of-the-art natural language processing. In: Liu, Q., Schlangen, D. (eds.) Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. pp. 38–45. Association for Computational Linguistics (Oct 2020). https://doi.org/10.18653/v1/2020.emnlp-demos.6, https://aclanthology.org/2020.emnlp-demos.6
  • [50] Wu, B., Cheng, R., Zhang, P., Gao, T., Gonzalez, J.E., Vajda, P.: Data efficient language-supervised zero-shot recognition with optimal transport distillation. In: International Conference on Learning Representations (2022), https://openreview.net/forum?id=G89-1yZLFHk
  • [51] Xiao, J., Hays, J., Ehinger, K.A., Oliva, A., Torralba, A.: Sun database: Large-scale scene recognition from abbey to zoo. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. pp. 3485–3492 (June 2010). https://doi.org/10.1109/CVPR.2010.5539970
  • [52] Xu, H., Ghosh, G., Huang, P.Y., Okhonko, D., Aghajanyan, A., Metze, F., Zettlemoyer, L., Feichtenhofer, C.: VideoCLIP: Contrastive pre-training for zero-shot video-text understanding. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. pp. 6787–6800. Association for Computational Linguistics (Nov 2021). https://doi.org/10.18653/v1/2021.emnlp-main.544, https://aclanthology.org/2021.emnlp-main.544
  • [53] Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 2, 67–78 (2014). https://doi.org/10.1162/tacl_a_00166, https://aclanthology.org/Q14-1006
  • [54] Yu, J., Wang, Z., Vasudevan, V., Yeung, L., Seyedhosseini, M., Wu, Y.: Coca: Contrastive captioners are image-text foundation models. Transactions on Machine Learning Research (2022), https://openreview.net/forum?id=Ee277P3AYC
  • [55] Yuksekgonul, M., Bianchi, F., Kalluri, P., Jurafsky, D., Zou, J.: When and why vision-language models behave like bags-of-words, and what to do about it? In: The Eleventh International Conference on Learning Representations (2023), https://openreview.net/forum?id=KRLUvxh8uaX
  • [56] Yun, S., Han, D., Oh, S.J., Chun, S., Choe, J., Yoo, Y.: Cutmix: Regularization strategy to train strong classifiers with localizable features. In: ICCV (2019)
  • [57] Zhai, X., Wang, X., Mustafa, B., Steiner, A., Keysers, D., Kolesnikov, A., Beyer, L.: Lit: Zero-shot transfer with locked-image text tuning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 18123–18133 (June 2022)
  • [58] Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical risk minimization. In: International Conference on Learning Representations (2018), https://openreview.net/forum?id=r1Ddp1-Rb
  • [59] Zhao, Y., Misra, I., Krähenbühl, P., Girdhar, R.: Learning video representations from large language models. In: arXiv:2212.04501 (2022)
Semantic Compositions Enhance Vision-Language Contrastive Learning

Supplementary Material

Table 9: Linear probe evaluation hyper-parameters: We use stochastic gradient descent optimizer with a decay of 0 and a momentum of 0.9 for all linear probe experiments. All linear probe experiments are tuned for 50 epochs.
Hyper-parameter Food-101 CIFAR-10 CIFAR-100 Caltech-101 Pets DTD Country211 Sun397 STL-10 RESISC45 EuroSAT ImageNet
Batch Size 16 16 64 64 16 16 64 16 16 16 16 1024
Learning Rate 0.1 0.1 0.1 0.05 0.1 0.1 0.05 0.05 0.1 0.1 0.1 0.1

A Implementation Details

We discuss other hyper-parameters in our setup in addition to those described in Sec. 4. During pretraining, AdamW [34] with $\beta_1 = 0.9$ and $\beta_2 = 0.98$ is used as the optimizer. We use a Stochastic Gradient Descent optimizer with a momentum of 0.9 for linear probing. The learning rate schedule is always set to Cosine Annealing with a warm-up period of 5 epochs. In all pretraining settings, we train for 40 epochs and use the best checkpoint determined by the top-1 zero-shot accuracy on ImageNet [9] for downstream evaluations. In linear probing experiments, we use the final results after training for 50 epochs on the respective dataset.
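
To make the schedule concrete, the following is a minimal sketch of cosine annealing with a warm-up period. The shape of the warm-up (linear, starting from 0), the final learning rate (0), and per-epoch granularity are assumptions made for illustration; the text only states that cosine annealing with a 5-epoch warm-up is used. The function name `lr_at_epoch` is hypothetical.

```python
import math

def lr_at_epoch(epoch, base_lr, total_epochs, warmup_epochs=5):
    """Learning rate at a (possibly fractional) epoch index.

    Assumptions (not stated in the text): the warm-up is linear from 0,
    and the cosine phase decays to 0 at the end of training.
    """
    if epoch < warmup_epochs:
        # Linear ramp from 0 up to base_lr over the warm-up period.
        return base_lr * epoch / warmup_epochs
    # Cosine decay over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Pretraining values from the text: 40 epochs, base learning rate 0.003.
schedule = [lr_at_epoch(e, base_lr=0.003, total_epochs=40) for e in range(41)]
```

Under these assumptions the learning rate peaks at the end of the warm-up (epoch 5) and falls to half its peak midway through the cosine phase.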

In pretraining, the batch-size and learning rate are set to 2,048 and 0.003, respectively. For linear probe experiments, we use a batch-size of 1024 and a learning rate of 0.1 when evaluating on ImageNet. For all other downstream datasets, we select the batch-size and learning rate using the ViT-B/16 CC3M pretrained model as follows. First, we choose the best batch-size from the set $\{16, 64, 256\}$ before selecting a learning rate from the set $\{0.05, 0.1, 0.2, 0.3\}$ using the chosen batch-size. After selecting a batch-size and learning rate using the CC3M ViT-B/16 model, we then apply those parameters to all other models without any further hyperparameter searches. Table 9 contains details about the combinations of batch-sizes and base learning rates we use for linear probing on all downstream datasets. Every linear probe run is executed on a single NVIDIA RTX A6000 GPU.
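
The two-stage selection described above can be sketched as follows. This is an illustration only: `evaluate` is a hypothetical callable standing in for a full linear-probe run, and `initial_lr` (the learning rate held fixed while sweeping batch sizes) is an assumption, since the text does not say which value was used at that stage.

```python
def select_hyperparameters(evaluate, batch_sizes=(16, 64, 256),
                           learning_rates=(0.05, 0.1, 0.2, 0.3),
                           initial_lr=0.1):
    """Pick a batch size first, then a learning rate for that batch size.

    `evaluate(batch_size, lr)` is a hypothetical callable returning a
    score where higher is better. `initial_lr` is an assumed value.
    """
    best_bs = max(batch_sizes, key=lambda bs: evaluate(bs, initial_lr))
    best_lr = max(learning_rates, key=lambda lr: evaluate(best_bs, lr))
    return best_bs, best_lr

# Toy scoring function standing in for a real linear-probe run.
def toy_score(batch_size, lr):
    return -abs(batch_size - 64) / 64 - abs(lr - 0.05)

chosen = select_hyperparameters(toy_score)
```

Searching the two axes in sequence needs 3 + 4 = 7 runs per dataset rather than the 12 required by a full grid, at the cost of possibly missing the jointly best combination.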

The image resolution is always set to $224 \times 224$. Given an image of arbitrary size, we invoke the “RandomResizedCrop” from PyTorch [39] with bicubic interpolation to generate an appropriately sized crop for training. The “scale” argument of “RandomResizedCrop” is set to $(0.6, 1.0)$ during pretraining and $(0.08, 1.0)$ during linear probing. In linear evaluations, the output crop is further flipped horizontally with a probability of 0.5. Following previous works, we normalize all images using the ImageNet mean and standard deviation values irrespective of the dataset. We do not use cropping in the zero-shot classification and retrieval settings; we simply resize the image to $224 \times 224$ before normalization. In all our experiments, we used a fixed seed for the PyTorch and Numpy random number generators. We also enabled PyTorch’s CUDNN deterministic setting.
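
To illustrate what the “scale” argument controls, here is a pure-Python sketch of how a random crop box can be sampled before it is resized to the target resolution. It loosely follows the sampling scheme of torchvision's “RandomResizedCrop” but is not that implementation: the aspect-ratio range of (3/4, 4/3) is the library default and is assumed here because the text does not mention it, and the fallback is simplified to returning the full image. The name `sample_crop_box` is hypothetical.

```python
import math
import random

def sample_crop_box(height, width, scale=(0.6, 1.0), ratio=(3 / 4, 4 / 3),
                    rng=random, max_tries=10):
    """Return (top, left, crop_h, crop_w) for a random crop.

    The crop covers a random fraction of the image area drawn from `scale`,
    with a log-uniform aspect ratio drawn from `ratio` (assumed range).
    The caller would then resize the crop to 224 x 224.
    """
    area = height * width
    log_lo, log_hi = math.log(ratio[0]), math.log(ratio[1])
    for _ in range(max_tries):
        target_area = area * rng.uniform(scale[0], scale[1])
        aspect = math.exp(rng.uniform(log_lo, log_hi))
        crop_w = round(math.sqrt(target_area * aspect))
        crop_h = round(math.sqrt(target_area / aspect))
        if 0 < crop_w <= width and 0 < crop_h <= height:
            top = rng.randint(0, height - crop_h)
            left = rng.randint(0, width - crop_w)
            return top, left, crop_h, crop_w
    # Simplified fallback: use the whole image if no valid box was found.
    return 0, 0, height, width
```

With `scale=(0.6, 1.0)` every crop keeps at least roughly 60% of the image area, whereas `scale=(0.08, 1.0)` permits much more aggressive crops.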

Table 10: Pretraining Datasets: Size is the number of image-text pairs in the dataset. Min, Avg, and Max are respectively the minimum, average, and maximum number of words in a caption obtained by splitting the string at spaces. CC12M tends to have the longest captions on average.
Dataset Size Min. Avg. Max.
CC3M [44] 2.85M 4 10.25 50
CC12M [3] 9.55M 1 17.76 242
RedCaps [10] 12.01M 1 9.50 70
Table 11: Downstream Datasets: Train and Test respectively denote the sizes of the training and evaluation sets. “Classes” is the number of categories while “Metric” is the evaluation metric. “Acc” represents top-1 accuracy over the entire dataset and “P/C Acc” represents the mean of per-category top-1 accuracy.
Dataset Train Test Classes Metric
Food-101 75,750 25,250 101 Acc
CIFAR-10 50,000 10,000 10 Acc
CIFAR-100 50,000 10,000 100 Acc
Caltech-101 3,060 6,084 102 P/C Acc
Pets 3,721 3,669 37 P/C Acc
DTD 1,880 1,880 47 Acc
Country211 31,650 21,100 211 Acc
Sun397 76,127 21,750 397 Acc
STL-10 5,000 8,000 10 Acc
RESISC45 25,200 6,300 45 Acc
EuroSAT 21,600 5,400 10 Acc
ImageNet 1,281,167 50,000 1000 Acc
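
The difference between the two metrics in Tab. 11 can be illustrated with a short sketch. The function names and the toy labels below are made up for illustration; they are not taken from the evaluation code.

```python
from collections import defaultdict

def top1_accuracy(labels, predictions):
    """Fraction of all examples predicted correctly ("Acc")."""
    correct = sum(y == p for y, p in zip(labels, predictions))
    return correct / len(labels)

def mean_per_class_accuracy(labels, predictions):
    """Average of the per-category accuracies ("P/C Acc")."""
    hits, counts = defaultdict(int), defaultdict(int)
    for y, p in zip(labels, predictions):
        counts[y] += 1
        hits[y] += int(y == p)
    return sum(hits[c] / counts[c] for c in counts) / len(counts)

# Toy imbalanced example: class 0 has four examples, class 1 has one.
labels = [0, 0, 0, 0, 1]
predictions = [0, 0, 0, 0, 0]
acc = top1_accuracy(labels, predictions)                # 4 of 5 correct
pc_acc = mean_per_class_accuracy(labels, predictions)   # (1.0 + 0.0) / 2
```

On class-imbalanced test sets the two numbers can differ substantially, because the per-category mean weights every class equally regardless of its size.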

B Datasets

We cover basic characteristics of our pretraining datasets in Tab. 10 and the downstream benchmarks in Tab. 11. We downloaded the images of the pretraining datasets from the internet using the URLs provided in Sharma et al. [44], Changpinyo et al. [3], and Desai et al. [10]. As a result, we could not retrieve the full original datasets as some of the links are now broken. Most of the downstream datasets were downloaded from TensorFlow Datasets (https://www.tensorflow.org/datasets/catalog/overview) and Torchvision datasets (https://pytorch.org/vision/stable/datasets.html).
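
The caption-length statistics in Tab. 10 are described as word counts obtained by splitting each caption at spaces. A minimal sketch of that computation is given below; the function name and the example captions are invented for illustration and are not drawn from any of the datasets.

```python
def caption_length_stats(captions):
    """Return (min, mean, max) word counts, splitting each caption at spaces.

    Note: splitting on a literal space counts empty strings when a caption
    contains consecutive spaces; whether the original statistics handled
    that case differently is not stated.
    """
    lengths = [len(caption.split(" ")) for caption in captions]
    return min(lengths), sum(lengths) / len(lengths), max(lengths)

# Made-up captions for illustration only.
toy_captions = [
    "a dog on a beach",
    "sunset",
    "two people riding bicycles in the city",
]
stats = caption_length_stats(toy_captions)
```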

C Additional Ablations

Table 12: Modality Involved in Composition: Applying semantic compositions on both modalities is the most consistently effective method across different downstream datasets and tasks.
Modality CIFAR-10 CIFAR-100 ImageNet
Text only 55.2 27.0 21.3
Images only 55.9 23.1 19.4
Text & Images 66.4 26.9 20.5

C.1 Impact of Modality Used in Composition

Since our inputs are of different modalities, visual and textual, it is important to examine whether compositions in each of these modalities produce similar effects. To that end, in Tab. 12, we conduct an analysis where our method is applied on (1) only the captions, (2) only the images, and (3) both the captions and images. Of these three variations, executing the compositions on both the captions and images is the most effective, probably due to the symmetry of transforming both modalities. The second most effective is the captions-only approach. Option (2) is the least effective method, likely because the images are naturally augmented (random cropping) in the baseline method whereas the captions are fixed. These observations suggest that our method is more helpful in learning representations of the texts relative to those of the images. They also help elucidate why we obtain much bigger improvements over the baseline in zero-shot settings compared to linear evaluations.

Table 13: Zero-shot Results on Ways of Pairing Examples: For each row, we pretrain a ViT-S/16 model on the 100k CC3M subset explained in Sec. C.2. L-C-S stands for Largest Caption-Similarity while S-C-S represents Smallest Caption-Similarity. Random is the case where we perform random assignments at the beginning and fix them for the rest of the training. We pretrain for 40 epochs using a reduced learning rate of $3 \times 10^{-4}$, a batch-size of 256, and a warm-up period of 20 epochs.
Method Food-101 CIFAR-10 CIFAR-100 Caltech-101 Pets DTD Country211 Sun397 STL-10 RESISC45 EuroSAT ImageNet
CLIP 1.55 18.1 4.93 8.73 2.64 2.98 0.36 0.6 28.7 4.97 20.4 1.43
Random 2.22 16.7 5.38 7.86 2.95 2.18 0.56 0.69 25.6 8.46 11.4 1.52
L-C-S 2.63 20.4 5.66 9.36 2.64 2.71 0.38 1.12 27.5 8.31 19.2 1.56
S-C-S 2.17 23.6 5.87 8.46 1.93 3.03 0.52 0.75 27.4 8.69 12.7 1.49
CLIP-$\mathcal{C}$ 1.83 21.9 5.68 8.15 2.74 3.19 0.37 0.84 25.3 4.90 12.3 1.61

C.2 Other Ways of Pairing Examples

In CLIP-$\mathcal{C}$, the second example is chosen randomly from the dataset whenever the composition is active. In this section, we explore other ways of choosing that second image-caption pair. The configurations studied here include selecting a second instance whose caption is (1) the closest to, or (2) the farthest from the first caption of the anchor, with distance measured by the pair-wise cosine similarity between features obtained from a sentence embedding model [47]. We also consider the scenario where examples are paired randomly. However, the pairings are fixed once generated at the beginning of training. (CLIP-$\mathcal{C}$ uses random assignments that change in every epoch.)
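
The distinction between fixed random pairings and random assignments that change every epoch can be sketched as follows. This is an illustrative sketch, not the training code: the function names are hypothetical, deriving the per-epoch seed from the epoch index is an assumed design choice, and the sketch ignores that the composition is only active for some examples.

```python
import random

def random_partners(num_examples, rng):
    """Draw one random partner index for every example (self-pairs possible)."""
    return [rng.randrange(num_examples) for _ in range(num_examples)]

def partners_for_epoch(num_examples, epoch, seed=0, fixed=False):
    """Partner indices used in a given epoch.

    fixed=True  -> one assignment generated once and reused every epoch.
    fixed=False -> a fresh random assignment for each epoch.
    Seeding by epoch is an illustrative choice, not the authors' code.
    """
    rng = random.Random(seed if fixed else f"{seed}-{epoch}")
    return random_partners(num_examples, rng)
```

For example, `partners_for_epoch(100, 0, fixed=True)` and `partners_for_epoch(100, 7, fixed=True)` return the same list, while the default dynamic mode draws a different list for each epoch.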

Since computing the embeddings of all captions as well as the pair-wise cosine similarities of all examples is extraordinarily expensive computationally, we randomly selected a subset of 100,000 examples from CC3M for these studies. We then computed the normalized features for all captions using the All-MiniLM-L6-v2 (https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) pretrained model from HuggingFace [49]. Afterward, we generated the $100\text{k} \times 100\text{k}$ pair-wise cosine similarities for this experiment. We also run CLIP, our method as discussed in the paper, and the fixed random assignment described above on this subset.
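
A minimal sketch of similarity-based pairing on toy vectors is shown below. It is not the authors' implementation: the function names are hypothetical, the 2-D vectors stand in for real sentence-encoder features, and excluding an example from being paired with itself is an assumption.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def pair_by_caption_similarity(embeddings, largest=True):
    """For each caption, return the index of its most (or least) similar peer.

    With unit-norm features, cosine similarity is a plain dot product.
    Self-pairs are excluded (an assumption).
    """
    feats = [normalize(v) for v in embeddings]
    pick = max if largest else min
    partners = []
    for i, anchor in enumerate(feats):
        candidates = [j for j in range(len(feats)) if j != i]
        partners.append(pick(
            candidates,
            key=lambda j: sum(a * b for a, b in zip(anchor, feats[j])),
        ))
    return partners

# Toy 2-D "embeddings"; real features would come from a sentence encoder.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.2]]
lcs = pair_by_caption_similarity(toy, largest=True)    # closest captions
scs = pair_by_caption_similarity(toy, largest=False)   # farthest captions
```

The all-pairs comparison grows quadratically with the number of captions: a 100,000 × 100,000 similarity matrix has 10^10 entries, which would occupy about 40 GB if stored as 32-bit floats.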

In zero-shot evaluations in this setting, we do not observe any significant differences between pairing examples whose captions are close or far in the embedding space. More importantly, CLIP-$\mathcal{C}$ beats all other setups on a plurality of the downstream benchmarks including on ImageNet. This performance, along with its simplicity, efficiency, and scalability (since there is no need for pair-wise comparisons of examples) all further justify the adoption of dynamic random sampling in our method.

Table 14: Zero-shot Results on Text Composition Methods: For each row, we pretrain a ViT-S/16 model on the 100k CC3M subset detailed in Sec. C.3. CLIP-$\mathcal{C}_{\text{w/ AND}}$ (default) represents the scenario where the captions are concatenated with “and” as a conjunction. CLIP-$\mathcal{C}_{\text{w/o AND}}$ denotes the case where we omit the “and”. Finally, CLIP-$\mathcal{C}_{\text{LLM}}$ is the language model rewriting method. We pretrain for 40 epochs using a reduced learning rate of $3 \times 10^{-4}$, a batch-size of 256, and a warm-up period of 20 epochs.
Method Food-101 CIFAR-10 CIFAR-100 Caltech-101 Pets DTD Country211 Sun397 STL-10 RESISC45 EuroSAT ImageNet
CLIP-$\mathcal{C}_{\text{w/ AND}}$ 1.92 15.6 5.58 8.49 4.35 3.24 0.50 1.15 31.0 8.97 19.8 1.70
CLIP-$\mathcal{C}_{\text{w/o AND}}$ 2.34 19.48 5.33 7.67 2.34 4.04 0.40 1.29 31.6 6.49 18.5 1.92
CLIP-$\mathcal{C}_{\text{LLM}}$ 1.94 15.6 5.68 8.22 3.81 3.19 0.45 1.13 26.7 6.17 9.05 1.83

C.3 Different Methods of Merging Captions

As in Sec. 6.5 where we looked at different ways of composing the merged image, we study other ways of merging the captions in this section. In CLIP-$\mathcal{C}$, we adopt the very simple and highly flexible process of concatenating the two captions using “and” as a conjunction. Here, we compare that method against omitting the conjunction or rewriting the merged caption with a large language model (LLM).

Similarly to the ablation in Sec. C.2, we randomly selected 100,000 examples from CC3M for this ablation because of the costs involved in the LLM rewriting method. Each example in the subset is associated with a second example (drawn from the full dataset) at random. We generate these associations once and use them for all ablation methods in this section: when (1) concatenating with “and”, (2) concatenating without “and”, and (3) rewriting the output of (1) with an LLM.
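
Options (1) and (2) amount to simple string concatenation, sketched below. The function name and the example captions are made up for illustration; details such as whitespace stripping are assumptions.

```python
def merge_captions(first, second, use_and=True):
    """Concatenate two captions, optionally joined by the conjunction "and"."""
    joiner = " and " if use_and else " "
    return first.strip() + joiner + second.strip()

# Made-up captions for illustration only.
with_and = merge_captions("a dog on a beach", "a red vintage car")
without_and = merge_captions("a dog on a beach", "a red vintage car",
                             use_and=False)
```

Here `with_and` is "a dog on a beach and a red vintage car", and `without_and` is the same text without the conjunction.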

We employ the open-source 70B parameter model Llama 2-Chat [46] for the LLM rewriting method. The temperature and top-p arguments of the model are set to 0.8 and 0.95, respectively. The maximum sequence length is set to 256. To rewrite a given caption, we (a) instruct the model to act as a text completion assistant, and (b) prompt the model as: “rephrase the sentence {merged caption}”, where merged caption is from option (1) above. We use the first generated sentence as the new caption. The LLM-based captions are generated offline and saved to file before training begins. We emphasize that only 100,000 rewrites are undertaken, one for each entry in the sampled subset. Some examples of these rewritten captions are provided in Fig. 6.
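
The prompt construction and post-processing steps can be sketched without calling any model. Everything below is illustrative: the exact wording of the system instruction, the key names in `GENERATION_ARGS`, and the sentence-splitting heuristic are assumptions, since the text gives only the prompt template, the sampling values, and the rule of keeping the first generated sentence.

```python
import re

# Hypothetical wording; the text only says the model is instructed to act
# as a text completion assistant.
SYSTEM_INSTRUCTION = "You are a text completion assistant."

# Values from the text; the key names are hypothetical.
GENERATION_ARGS = {"temperature": 0.8, "top_p": 0.95, "max_seq_len": 256}

def build_prompt(merged_caption):
    """User prompt in the form quoted in the text."""
    return f"rephrase the sentence {merged_caption}"

def first_sentence(generated_text):
    """Keep only the first sentence of the model output.

    Splitting after '.', '!' or '?' followed by whitespace is a simple
    heuristic assumed here; the actual delimiter rule is not stated.
    """
    parts = re.split(r"(?<=[.!?])\s+", generated_text.strip(), maxsplit=1)
    return parts[0]
```

Because the rewrites are produced offline, a pipeline along these lines would run once over the sampled subset and store the resulting captions before training starts.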

On zero-shot classification in this limited setting, our simple concatenation variants, with and without “and” as a conjunction, each obtained the highest accuracy on 6 out of 14 downstream datasets, while the LLM rewrite option achieved the highest performance on the remaining 2 datasets, as shown in Tab. 14. These results show that besides being highly efficient and flexible, our method is also comparable with the more resource-intensive large language model rewriting method. Additionally, while the LLM rewrite method may hallucinate non-grounded captions (see Fig. 6), our method is devoid of such hallucinations. We use concatenation with the conjunction as the default caption composition method because of its coherence.

Figure 6: Sample of captions generated using the LLM-based method and our CLIP-$\mathcal{C}$ procedure.