Descriptor and Word Soups: Overcoming the Parameter Efficiency
Accuracy Tradeoff for Out-of-Distribution Few-shot Learning

Christopher Liao
Boston University
[email protected]

Theodoros Tsiligkaridis
MIT Lincoln Laboratory
[email protected]

Brian Kulis
Boston University
[email protected]

Abstract

Over the past year, a large body of multimodal research has emerged around zero-shot evaluation using GPT descriptors. These studies boost the zero-shot accuracy of pretrained VL models with an ensemble of label-specific text generated by GPT. A recent study, WaffleCLIP, demonstrated that similar zero-shot accuracy can be achieved with an ensemble of random descriptors. However, both zero-shot methods are untrainable and consequently sub-optimal when some few-shot out-of-distribution (OOD) training data is available. Inspired by these prior works, we present two more flexible methods called descriptor and word soups, which do not require an LLM at test time and can leverage training data to increase OOD target accuracy. Descriptor soup greedily selects a small set of textual descriptors using generic few-shot training data, then calculates robust class embeddings using the selected descriptors. Word soup greedily assembles a chain of words in a similar manner. Compared to existing few-shot soft prompt tuning methods, word soup requires fewer parameters by construction and less GPU memory, since it does not require backpropagation. Both soups outperform current published few-shot methods, even when combined with SoTA zero-shot methods, on cross-dataset and domain generalization benchmarks. Compared with SoTA prompt and descriptor ensembling methods, such as ProDA and WaffleCLIP, word soup achieves higher OOD accuracy with fewer ensemble members. Please check out our code: github.com/Chris210634/word_soups

1 Introduction

Problem Setting

There is extensive interest in the computer vision community in training classifiers that are robust to distribution shifts. Pioneering works in this area [23, 67, 47] focused on optimizing for simple shifts in the image distribution, such as sketch-to-real adaptation. As the topic evolved, the community proposed increasingly harder adaptation problems by eliminating some restrictive assumptions. For the domain generalization (DG) problem [68, 55], we do not assume access to unlabeled target data; for the cross-dataset generalization (XD) problem [70], we allow source and target label spaces to be different; and for the parameter efficient learning (PEFT) problem [20, 57, 37], we impose a tight budget on the number of parameters that can be tuned. Our work lies at the confluence of these three topics. Similar to CoOp [70] and MaPLe [25], we do assume access to labeled few-shot generic source data, such as ImageNet. Since we assume nothing about the relationship between source and target datasets, this setting can be more useful in practice than strict zero-shot learning. In this paper, we propose two parameter efficient few-shot methods, called word and descriptor soups, that finetune vision-language (VL) models to generalize to target datasets which may contain unseen labels and/or shifts in the image distribution. Our methods achieve state-of-the-art accuracy on some benchmarks without additional gradient-based tuning, but can also improve state-of-the-art gradient-based finetuning methods with an additional diversity loss.

Figure 1: Illustration of word and descriptor soups. We conceptually position our two soup methods along the tradeoff between parameter efficiency and flexibility; we then list the pros and cons of our soups compared to prior work. Firstly, word soup is more parameter efficient than soft prompt tuning, because it uses discrete tokens (see Fig. 2). Secondly, word soup does not require an LLM or handcrafted prompts. Lastly, word soup attains higher target accuracy than prior descriptor methods by allowing a descriptor to be any permutation of words and explicitly maximizing its accuracy on training data (see Fig. 3). However, word soup achieves this flexibility by sacrificing the explainability of descriptors. On the other hand, descriptor soup is interpretable (see Table 1), but less flexible than word soup, since it is limited to selecting from the pool of GPT descriptors.

Motivation

Our work is motivated by the recent success of classification by description methods [35, 42, 24] in both zero-shot (ZS) classification and open-vocabulary object detection. These methods ask an LLM like GPT to generate a list of short descriptions for each class, then aggregate predictions from the descriptions to improve ZS accuracy, see Fig. 1(a). It is often claimed that the impressive gain in ZS accuracy comes from additional information given by the GPT descriptions. However, a recent study called WaffleCLIP [45] observed that random descriptors or even strings of random words can achieve similar ZS accuracy to GPT descriptors, when ensembled together (see Fig. 5). Therefore, gains in ZS accuracy achieved by descriptor methods are mostly driven by ensembling rather than the content of the descriptors themselves. Inspired by this observation, we propose descriptor and word soups, two methods which outperform WaffleCLIP by selecting descriptors or chains of words that maximize few-shot accuracy. Word soup has 3 advantages: (1) it outperforms existing descriptor-inspired ZS methods in the few-shot OOD setting since it directly maximizes classification accuracy (see Fig. 1(c)); (2) it is more parameter efficient than existing few-shot methods since the model is frozen and only the discrete descriptor tokens need to be stored; and (3) it does not require an LLM. The pros and cons of both descriptor and word soups are concisely stated in Figure 1 and discussed more in the method section.

Method Overview

According to the above motivation, we design a progression of three methods: descriptor soup, word soup, and word soup training with diversity loss. These methods build upon each other but can be used independently and in combination with prior methods. We opted for this style of presentation, since there are motivating empirical insights at each stage, and each method achieves state-of-the-art depending on resource constraints (such as availability of an LLM at training time or parameter storage budget). Descriptor soup is loosely inspired by model soups [58]; “soup” refers to a set of descriptors. We calculate an aggregate prediction based on the centroid of descriptors in the soup. We start with the most accurate descriptor on the training data and greedily add descriptors to the soup if training accuracy increases, see Fig. 1(b). Similarly, for word soups, we assemble a chain of words by greedily appending a word if it increases the training accuracy of the word chain, see Fig. 1(c). Finally, we present a diversity loss that can be used to optimize the CLIP model, using the word soup as an initialization. This loss is required to maintain the initial diversity among word soup members throughout finetuning.

Contributions

We make the following contributions to the computer vision literature:

• We present word soup, which improves SoTA on few-shot cross-dataset (XD) and domain-generalization (DG) benchmarks by 1% and 0.8%, respectively.
• Our word soup uses fewer parameters than SoTA parameter efficient methods while achieving higher accuracy than parameter-free ZS methods in both few-shot settings.
• We propose a diversity loss to train VL models initialized with word soup. This allows our method to seamlessly combine with prior few-shot finetuning methods.
• We present qualitative results (e.g. Tab. 1) to understand what it means for a descriptor to be “good”, and analyze the generalizability of these descriptors (Fig. 3). These results extend the current understanding of how descriptor and prompting methods work.

2 Related Work

Few-shot CLIP finetuning

We follow the problem settings of CoOp [70], CoCoOp [69], MaPLe [25], and CLIPood [51], which finetune a CLIP-like model [43] on few-shot ImageNet in a manner that generalizes to OOD target datasets. Many prompt tuning methods build on top of CoOp by using different loss functions [5, 62, 3, 9, 41], using clever optimization techniques [71], ensembling multiple prompts [31, 6], leveraging different sources of information [49, 12, 22], leveraging synergy between modalities [65, 30, 26], or using different network architectures [61, 8]. We take a fundamentally different approach from these prior methods, drawing inspiration from classification by description [35]. Specifically, prior methods tune a soft prompt while our method tunes a sequence of discrete tokens.

Zero-shot CLIP

Many recent papers use LLM descriptors to aid ZS or open-vocabulary visual tasks, including classification [35, 42] and detection [24]. WaffleCLIP [45] observed that the impressive gains in accuracy reported by these works are mostly driven by ensembling and dataset-level concepts. WaffleCLIP ensembles random descriptors and uses an LLM to discover dataset-level concepts, while we design an optimization procedure to learn good descriptors from data. Our algorithm is loosely related to model averaging methods [58, 59]. However, unlike model soups [58], we do not generate multiple training trajectories, since all descriptors share the same model weights. ZS accuracy can also be improved with hierarchical label sets [38] or handcrafted prompts [1]. Test-time prompt tuning methods [50, 10, 33, 48] train a sample-specific prompt that maximizes agreement between predictions based on a set of image augmentations. These methods suffer from long inference times due to test-time optimization.

Parameter efficient finetuning (PEFT)

Our word soup can be considered a PEFT [17, 20] method, but specialised to finetuning VL models in the OOD setting. Prior PEFT methods include shallow text prompt tuning [70, 71, 22, 31], visual prompt tuning [20], bias tuning [64], adapters [16, 11, 66, 54, 39], LoRA [17], SSF [29], side-tuning [53], and others [21, 63, 32, 18]. Unlike the above works, our word soup tunes fewer parameters by leveraging discrete text tokens. Similar to LST [53], we use minimal GPU memory, since no backpropagation is required. We empirically compare with a representative subset of PEFT methods in the OOD settings in Fig. 2. Clearly, our word soup establishes a better tradeoff between parameter efficiency and OOD accuracy, compared to prior work.

3 Method

This section is organized into 4 parts. Section 3.1 reviews the classification by description [35] and WaffleCLIP [45] methods, which motivate our soup methods. Section 3.2 presents descriptor soup, a novel intermediary method which still uses GPT descriptors at training time but not at test time. Section 3.3 presents word soup, which is similarly motivated but only requires a list of English words at training time. Section 3.4 describes the diversity loss used to finetune the CLIP model using word soup as the initialization. Please use Fig. 1 as a reference. We organize the methods in this section in order of increasing flexibility, since it is more natural to motivate word soups this way. However, word soups can also be motivated in the opposite direction by shortcomings of soft prompt tuning, as noted in Fig. 1; this motivation is included in Appendix B. We also propose a token offset trick in Appendix C to augment descriptor soups.

Color-coded by source: ImageNet, Pets, DTD, Random
Target: ImageNet	Alignment	Accuracy
no descriptor	0.301	67.1
which typically brightly colored.	0.305 (+0.004)	68.2 (+1.1)
which has usually white or off-white.	0.310 (+0.009)	68.4 (+1.3)
which is a long, low-slung body.	0.312 (+0.011)	68.3 (+1.2)
which is a curved or rectangular shape.	0.309 (+0.008)	68.6 (+1.5)
which can vary in size from small to large.	0.315 (+0.014)	68.5 (+1.4)
which has reddish brown fur.	0.300 (-0.001)	66.2 (-0.9)
which is a hard skeleton.	0.295 (-0.006)	66.6 (-0.5)
which is a medium-sized, short-haired cat.	0.291 (-0.010)	66.0 (-1.1)
which has sharp claws.	0.299 (-0.002)	66.6 (-0.5)
which is a repeating pattern.	0.295 (-0.006)	66.1 (-1.0)
which is a sign with the shop’s name.	0.295 (-0.006)	66.7 (-0.4)

Target: Pets	Alignment	Accuracy
no descriptor	0.322	88.4
a type of pet. (handcrafted; for reference)	0.331 (+0.009)	89.0 (+0.6)
which is a large, powerful cat.	0.321 (-0.001)	89.8 (+1.4)
which has sharp claws.	0.324 (+0.002)	89.9 (+1.5)
which has soulful eyes.	0.317 (-0.005)	89.9 (+1.5)
which is a long arm with a claw …	0.324 (+0.002)	87.8 (-0.6)
which is a medium-sized, short-haired cat.	0.327 (+0.005)	91.4 (+3.0)
which is a boat with sails.	0.293 (-0.029)	81.5 (-6.9)
which often used by knights and soldiers.	0.315 (-0.007)	80.8 (-7.6)
which can vary in size from small to large.	0.333 (+0.011)	88.6 (+0.2)
which typically has a yellow or brownish color.	0.335 (+0.013)	89.3 (+0.9)

Target: Textures (DTD)	Alignment	Accuracy
no descriptor	0.273	44.3
a type of texture. (handcrafted; for reference)	0.287 (+0.014)	44.1 (-0.2)
which may be decorated with a pattern or logo.	0.286 (+0.013)	47.2 (+2.9)
which is a sign with the shop’s name.	0.261 (-0.012)	45.3 (+1.0)
which is a backdrop.	0.280 (+0.007)	46.6 (+2.3)
which is a repeating pattern.	0.283 (+0.010)	46.3 (+2.0)
which typically has a pattern or design.	0.295 (+0.022)	45.5 (+1.2)
which is a guard tower.	0.243 (-0.030)	43.4 (-0.9)
which has loud crow.	0.253 (-0.020)	42.4 (-1.9)
which can be brightly colored or patterned.	0.283 (+0.010)	44.5 (+0.2)
which is a curved or rectangular shape.	0.281 (+0.008)	44.4 (+0.1)

Table 1: Qualitative comparison of descriptors. We select descriptors based on a source dataset using Alg. 1 and test on a target dataset. The tables are organized by the target dataset; the color of the highlight indicates the source dataset. We include randomly selected descriptors in gray for comparison. Alignment refers to the average cosine similarity between image embeddings and the corresponding text embeddings. Observe that selected descriptors tend to describe the source dataset as a whole and improve both accuracy and alignment. Also observe that a descriptor soup trained on ImageNet (blue) generalizes to other datasets, but not vice versa.

Target: ImageNet	Alignment	Uniformity	Accuracy
no descriptor	0.301	0.173	67.1
dat they … difficulties.	0.306 (+0.005)	0.174 (+0.001)	68.9 (+1.8)
similar vary … mention etc.	0.314 (+0.013)	0.183 (+0.010)	69.1 (+2.0)
separately aspects … adopted.	0.315 (+0.014)	0.181 (+0.008)	69.2 (+2.1)
tue alot … itself.	0.303 (+0.002)	0.178 (+0.005)	69.0 (+1.9)
bufing beginner … status.	0.311 (+0.010)	0.181 (+0.008)	68.8 (+1.7)
soviet vbulletin … inexpensive.	0.320 (+0.019)	0.195 (+0.022)	62.0 (-5.1)
ideal ips … filename.	0.314 (+0.013)	0.196 (+0.023)	59.7 (-7.4)

Table 2: Example of a 5 member word soup trained on ImageNet (in blue) along with random chains of words (in gray) for comparison. Comparing with Tab. 1, we observe that the word soup descriptors achieve higher accuracy than descriptor soups, since word soup is more flexible from an optimization perspective. Here, we include uniformity scores, since chains of random words improve alignment at the expense of increasing uniformity. Uniformity is the average cosine similarity between image and text embeddings with different labels.
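The alignment and uniformity columns of Tables 1 and 2 follow directly from the definitions in the captions. The snippet below is a minimal sketch of those two definitions, assuming precomputed image embeddings and one text embedding per class; the function names and array layout are illustrative assumptions, not taken from the released code.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def alignment_and_uniformity(image_emb, text_emb, labels):
    """Hypothetical helper (names and shapes are assumptions).

    image_emb: (N, dim) image embeddings
    text_emb:  (C, dim) one text embedding per class, C >= 2
    labels:    (N,) integer class labels in [0, C)
    """
    sims = l2_normalize(image_emb) @ l2_normalize(text_emb).T  # (N, C)
    match = np.zeros(sims.shape, dtype=bool)
    match[np.arange(len(labels)), labels] = True
    # Alignment: mean cosine similarity of each image with its own class text.
    alignment = float(sims[match].mean())
    # Uniformity: mean cosine similarity of each image with all other class texts.
    uniformity = float(sims[~match].mean())
    return alignment, uniformity
```

Under these definitions, a descriptor is helpful when it raises alignment without raising uniformity by a comparable amount, which is the pattern the random word chains in Table 2 fail to show.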

3.1 LLM Descriptors and WaffleCLIP

Several works use LLM descriptors to supplement class names in VL models [35, 42, 24]. These methods ask an LLM to describe the object being classified and incorporate this information into the textual input by forming sentences such as “a photo of a tench, which is a freshwater fish” or “a photo of a goldfish, which has small black eyes”. The LLM generates on average 5.8 such descriptors per label, and the centroids of the resulting text embeddings are used for zero-shot classification of images. The improvement in zero-shot accuracy can be attributed to (1) additional information coming from the LLM and (2) ensembling. In WaffleCLIP, Roth et al. [45] claim that most of the gain in accuracy reported by Menon and Vondrick [35] can be attributed to ensembling. They showed that appending a similar number of randomly selected descriptors to the class names can achieve similar zero-shot accuracies as the GPT descriptors. We confirm this result in Fig. 5. Observe in this figure that both random descriptors (labeled as “random soup”) and chains of random nonsensical words (labeled as “waffle CLIP”) perform better than classification by description (“GPT centroids”) for the same number of descriptors per label ( $m$ ). This is a surprising result. We reason that selecting descriptors which maximize few-shot training accuracy would achieve higher accuracy than random descriptors; this motivates descriptor soup.
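The two ways of aggregating descriptors that appear in the tables ("centroid" and "score mean") can be sketched as follows. This is an illustrative simplification that assumes the same $m$ descriptors are appended to every class name (as in the soup methods), whereas GPT descriptors vary in number per class. Treating score mean as an average of per-descriptor cosine similarities is an assumption here, and the function names and array layout are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def predict_centroid(image_emb, desc_text_emb):
    """Classify with the L2-normalized centroid of the descriptor embeddings.

    image_emb:     (N, dim)
    desc_text_emb: (m, C, dim), one text embedding per (descriptor, class)
    """
    centroids = l2_normalize(l2_normalize(desc_text_emb).mean(axis=0))  # (C, dim)
    return (l2_normalize(image_emb) @ centroids.T).argmax(axis=1)

def predict_score_mean(image_emb, desc_text_emb):
    """Classify by averaging per-descriptor cosine similarities (assumed form)."""
    scores = np.einsum(
        "nd,mcd->mnc", l2_normalize(image_emb), l2_normalize(desc_text_emb)
    )  # (m, N, C)
    return scores.mean(axis=0).argmax(axis=1)
```

Centroid evaluation needs only one text embedding per class at test time, while score mean keeps all $m$ embeddings per class; with a single descriptor the two coincide.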

		Source	Cross-dataset (XD) Evaluation Targets											Domain Generalization Targets
	$m$	INet	Caltech	Pets	Cars	Flowers	Food	Aircraft	SUN	DTD	EuroSAT	UCF	Mean	INet-V2	Sketch	INet-A	INet-R	Mean
CLIP ZS [70]	1	67.1	93.3	89.0	65.4	71.0	85.7	25.0	63.2	43.6	46.7	67.4	65.02	61.0	46.6	47.2	74.1	57.22
Ensemble [43]	80	68.4	93.5	88.8	66.0	71.1	86.0	24.8	66.0	43.9	45.0	68.0	65.31	61.9	48.5	49.2	77.9	59.36
GPT centroids [35]	5.8	68.2	94.1	88.4	65.8	71.5	85.7	24.7	67.5	44.7	46.6	67.4	65.63	61.5	48.2	48.9	75.1	58.40
GPT score mean [35]	5.8	68.6	93.7	89.0	65.1	72.1	85.7	23.9	67.4	44.0	46.4	66.8	65.42	61.8	48.1	48.6	75.2	58.42
Random descriptors	16	67.9	94.1	87.6	65.6	71.5	85.6	24.9	66.1	44.7	49.1	67.2	65.65	61.6	48.7	50.0	76.7	59.22
+ offset trick (ours)	96	68.5	93.5	89.2	65.8	72.0	85.7	25.2	66.1	44.4	53.0	68.2	66.29	61.9	48.9	50.6	77.5	59.76
Waffle CLIP [45]	16	68.1	93.5	88.4	65.4	72.0	85.9	25.9	66.2	44.1	46.3	68.0	65.58	61.8	48.6	49.8	76.2	59.08
+ offset trick (ours)	96	68.6	93.1	89.5	65.9	72.1	86.1	26.3	66.2	44.2	52.5	68.8	66.49	62.1	48.9	50.2	77.1	59.59
Descriptor soup (ours)	16.7	68.9	94.7	89.4	66.2	72.2	86.2	25.5	67.3	45.1	46.6	68.7	66.18	62.1	48.7	49.7	76.4	59.25
+ offset trick (ours)	100	69.1	93.8	89.8	66.0	72.9	86.2	25.4	66.8	45.0	51.6	69.1	66.67	62.6	49.0	50.5	77.2	59.82
Word soup (ours)	8	69.2	94.4	89.5	65.4	72.3	85.8	25.8	67.4	44.7	53.5	68.4	66.72	62.9	48.7	50.2	77.0	59.69
Word soup score mean (ours)	8	69.4	94.3	89.6	65.4	72.4	85.9	25.9	67.3	45.2	55.8	68.5	67.03	63.0	49.0	50.4	77.2	59.90
gain over GPT		+0.8	+0.6	+0.6	+0.3	+0.3	+0.2	+2.0	-0.1	+1.2	+9.4	+1.7	+1.6	+1.2	+0.9	+1.8	+2.0	+1.5
gain over Waffle		+1.3	+0.8	+1.2	+0.0	+0.4	+0.0	-0.0	+1.1	+1.1	+9.5	+0.5	+1.5	+1.2	+0.4	+0.6	+1.0	+0.8

Table 3: Comparison with ZS methods. All baseline methods in this table use prompts/descriptors on top of the pretrained model in a ZS manner. Note that the soup methods are not truly zero-shot because they require some training data. However, we do compare against all baselines in the few-shot setting in Table 6. We use the ViT/B-16 CLIP model trained by Open-AI. All non-deterministic numbers are an average of 3 random seeds. $m$ indicates the number of descriptors used. “Ensemble” refers to the set of 80 handcrafted prompts created by Open-AI; GPT score mean corresponds to the classification by description method. We use centroid evaluation unless “score mean” is explicitly stated. We achieve substantial gains over GPT descriptors and waffle CLIP as indicated in the bottom two rows.

3.2 Descriptor Soup

We reference Alg. 1 in the Appendix throughout this section. Let $\mathcal{D}=\{d_{1},...,d_{n}\}$ denote a set of $n$ descriptors such as “which is a freshwater fish”. These descriptors are obtained by combining all descriptors generated by GPT for 1,000 ImageNet classes [35], and keeping only unique entries. Descriptors are no longer connected to their original classes. We wish to select a set of $m$ descriptors that maximizes accuracy on few-shot training data. We define the loss function $\ell(\mathcal{S}_{\text{train}},\mathcal{T}_{\text{train}}(d))$ to be the 0-1 loss of the model using descriptor $d$ over the entire training dataset $\mathcal{S}_{\text{train}}$. $\mathcal{T}_{\text{train}}(d)$ denotes the label text embeddings calculated by the text encoder by appending descriptor $d$ to all class names. Since all parameters of the vision model remain constant, we ignore vision model parameters in the notation. We aim to find a set of $m$ descriptors whose centroids in the text embedding space minimize the 0-1 loss:

$$\mathcal{D}^{*}_{m}=\{d_{1}^{*},...,d_{m}^{*}\}=\operatorname*{arg\,min}_{d_{1:m}\in\mathcal{D}}\ell\left(\mathcal{S}_{\text{train}},\frac{1}{m}\sum_{i=1}^{m}\mathcal{T}_{\text{train}}(d_{i})\right) \tag{1}$$

Note that $\frac{1}{m}\sum_{i=1}^{m}\mathcal{T}_{\text{train}}(d_{i})$ denotes the L2-normalized centroid of text embeddings for each class. We always normalize the centroid so it can be used to calculate the cosine similarity with image embeddings; this is omitted from the math to avoid clutter.

Eq. 1 is an intractable combinatorial problem, but we can approximately solve it via a greedy approach or by solving the continuous version of the problem using gradient descent. We use a greedy approach, inspired by Wortsman et al. [58]. The algorithm can be summarized as follows (see Alg. 1):

1. Calculate $\ell(\mathcal{S}_{\text{train}},\mathcal{T}_{\text{train}}(d))$ for all $d\in\mathcal{D}$. Sort the descriptors by increasing loss / decreasing accuracy. With slight abuse of notation, denote the sorted list as $\mathcal{D}=[d_{0},...,d_{n}]$.
2. Initialize the “descriptor soup” $\mathcal{D}^{*}=\{d_{0}\}$ with the best descriptor.
3. For $i$ in $1:n$: Add $d_{i}$ to $\mathcal{D}^{*}$ if it decreases the loss of $\mathcal{D}^{*}$.
4. Return the first $m$ descriptors in $\mathcal{D}^{*}$.
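The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the reference implementation in Alg. 1: it assumes the text embeddings for every (descriptor, class) pair and the image embeddings have been precomputed and L2-normalized, and the names `soup_accuracy` and `greedy_descriptor_soup` are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def soup_accuracy(image_emb, labels, text_emb, members):
    """Accuracy (1 minus the 0-1 loss) of the normalized centroid of `members`.

    image_emb: (N, dim) L2-normalized image embeddings
    labels:    (N,) integer class labels
    text_emb:  (n, C, dim) L2-normalized text embeddings, one per (descriptor, class)
    members:   list of descriptor indices currently in the soup
    """
    centroids = l2_normalize(text_emb[members].mean(axis=0))  # (C, dim)
    preds = (image_emb @ centroids.T).argmax(axis=1)
    return float((preds == labels).mean())

def greedy_descriptor_soup(image_emb, labels, text_emb, m):
    n = text_emb.shape[0]
    # Step 1: score every descriptor on its own and sort by decreasing accuracy.
    single = [soup_accuracy(image_emb, labels, text_emb, [i]) for i in range(n)]
    order = sorted(range(n), key=lambda i: -single[i])
    # Step 2: start the soup with the best single descriptor.
    soup, best = [order[0]], single[order[0]]
    # Step 3: add a descriptor only if it strictly improves the soup's accuracy.
    for i in order[1:]:
        acc = soup_accuracy(image_emb, labels, text_emb, soup + [i])
        if acc > best:
            soup.append(i)
            best = acc
    # Step 4: keep the first m members.
    return soup[:m]
```

Because only precomputed embeddings are averaged and compared, each candidate evaluation is a matrix product with no backpropagation, which is why the procedure is cheap in GPU memory.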

Please find ZS results for descriptor soup in Tab. 3.

Building Intuition

A natural question to ask is: descriptor soup members no longer describe individual classes, so why does Alg. 1 work? The answer has two parts: (1) Alg. 1 finds descriptors which describe the dataset as a whole, rather than individual labels; these descriptors are orthogonal to the classification problem and increase classification accuracy by increasing alignment between corresponding image and text embeddings. (2) Descriptor soups generalize when the target classification problem has a narrower scope than the source classification problem. Prior work (e.g. [70, 25, 45]) suggests that handcrafted dataset-specific descriptors such as “a type of aircraft” or “a type of pet” improve ZS accuracy. Dataset-level descriptors like these are easier to design than label-level descriptors, so using dataset-level descriptors is currently standard practice. We hypothesize that these descriptors improve accuracy by increasing alignment between corresponding image and text embeddings; we demonstrate this in Tab. 1. For example, “a type of pet” improves pet classification accuracy by 0.6% and alignment by roughly 0.01 (from 0.322 to 0.331).

We further hypothesize that descriptor soup members learn to mimic the behavior of handcrafted dataset-level descriptors. We display examples of descriptor soups trained on three different datasets in Table 1 in support of this intuition. Descriptors trained on pets (in pink) mention “claws”, “eyes”, and “hair”, which are concepts common to most pets. In a similar vein, descriptors trained on textures/DTD (in yellow) mention “pattern”, “logo”, and “design”. Meanwhile, ImageNet is a broader dataset, so descriptors trained on ImageNet (in blue) are generally non-specific (e.g. “which could be brown or grey”). This is intuitive, since ImageNet is a dataset with diverse classes. A descriptor such as “which is a type of dog” would be detrimental to the zero-shot accuracy, since it would bias the classifier toward labels that are types of dogs. Table 1 shows that individual descriptor soup members increase both the alignment and classification accuracy, when the source and target datasets are the same. The next paragraph addresses the issue of generalizability when source and target datasets are different.

Generalizability

Descriptor soups trained on ImageNet generalize to target datasets with narrower scopes, but not vice versa. This is because ImageNet concepts are a superset of narrower target datasets; e.g. ImageNet classes contain types of cars and pets. Table 1 shows that descriptors trained on ImageNet (blue) improve both the alignment and accuracy on Pets and Textures; but descriptors trained on the latter two datasets (pink and yellow) decrease the same metrics on ImageNet. To further support the generalizability of descriptor soups, we show a positive correlation between ImageNet accuracy and average target dataset accuracy in Fig. 3 (right). Finally, we train a descriptor soup on test data to maximize average accuracy of 10 datasets; we call this the “descriptor soup upper bound” in the middle of Tab. 6. The upper bound only achieves marginal improvement over the descriptor soup trained on ImageNet (three rows above the upper bound in Tab. 6). This suggests that greedily maximizing the descriptor soup accuracy on ImageNet training data is a good approximation of maximizing the target accuracy; i.e. the generalization gap is small.

3.3 Word Soup

Descriptor soup achieves impressive state-of-the-art performance, but it is still reliant on an LLM at training time to generate a list of candidate descriptors and is limited to this fixed descriptor list. In order to remove the reliance on LLMs and make the optimization process more flexible, we propose to generate descriptors in a greedy fashion using individual words selected from a dictionary. We use the list of 10,000 most commonly-used words on the web (github.com/first20hours/google-10000-english) as the candidate pool of words.

We are given a list of $n$ words $\mathcal{W}=\{w_{1},...,w_{n}\}$ (we reuse some notation from the previous section with slight abuse, since the word soup is a separate method). Descriptors are allowed to be any sequence of words, as long as the length does not exceed $p$. Concretely,

$$\begin{split}&\mathcal{D}^{*}_{m}=\{d_{1}^{*},...,d_{m}^{*}\}=\operatorname*{arg\,min}_{d_{1:m}\in D^{\prime}}\ell\left(\mathcal{S}_{\text{train}},\frac{1}{m}\sum_{i=1}^{m}\mathcal{T}_{\text{train}}(d_{i})\right)\\ &D^{\prime}:=\{\text{all $q$ permutations of $\mathcal{W}$, $\forall q\leq p$}\}\end{split} \tag{2}$$

The word soup problem described by Eq. 2 is again intractable, so we propose an approximate greedy solution using the following steps (see Alg. 2 in the Appendix):

1. Initialization: Sort $\mathcal{W}$ by decreasing ZS accuracy to filter out unsuitable words (see Fig. 3 left). For this step, we only consider single word descriptors (e.g. “a photo of a cat, the.”). Select the top-$k_{0}$ and top-$k_{1}$ words, denoted as $\mathcal{W}_{\text{top}k_{0}}$ and $\mathcal{W}_{\text{top}k_{1}}$, respectively, with $k_{0}<k_{1}$.
2. Randomly select a word $w$ from $\mathcal{W}_{\text{top}k_{0}}$ and initialize the descriptor $d=w$.
3. Shuffle $\mathcal{W}_{\text{top}k_{1}}$. Then, for $w^{\prime}\in\mathcal{W}_{\text{top}k_{1}}$, append $w^{\prime}$ to $d$ only if it increases the accuracy of $d$.
4. Return $d$.
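The steps above can be sketched in plain Python. This is a hedged illustration, not Alg. 2 itself: `accuracy_fn` is a hypothetical callable that returns the few-shot training accuracy of a word chain (in practice it would run the frozen text encoder), and the stopping rule shown for the patience parameter (stop after `patience` consecutive rejected words) is an assumed interpretation. Skipping words already in the chain is likewise an assumption made to keep the chain free of repeats.

```python
import random

def greedy_word_chain(ranked_words, accuracy_fn, k0, k1, patience, rng):
    """Build one descriptor (a list of words) greedily.

    ranked_words: words sorted by decreasing single-word accuracy (step 1)
    accuracy_fn:  hypothetical callable, list of words -> training accuracy
    """
    top_k0 = ranked_words[:k0]
    top_k1 = list(ranked_words[:k1])
    # Step 2: initialize the chain with a random word from the top-k0 pool.
    chain = [rng.choice(top_k0)]
    best = accuracy_fn(chain)
    # Step 3: shuffle the top-k1 pool and append words that improve accuracy.
    rng.shuffle(top_k1)
    misses = 0
    for word in top_k1:
        if word in chain:
            continue
        acc = accuracy_fn(chain + [word])
        if acc > best:
            chain.append(word)
            best, misses = acc, 0
        else:
            misses += 1
            if misses >= patience:  # assumed meaning of the patience parameter
                break
    # Step 4: return the finished chain.
    return chain

def word_soup(words, accuracy_fn, m, k0, k1, patience, seed=0):
    """Repeat steps 2-4 m times to obtain m loosely independent descriptors."""
    rng = random.Random(seed)
    ranked = sorted(words, key=lambda w: -accuracy_fn([w]))  # step 1
    return [
        greedy_word_chain(ranked, accuracy_fn, k0, k1, patience, rng)
        for _ in range(m)
    ]
```

The only stored state per soup member is a short list of word indices, which is where the parameter efficiency relative to soft prompts comes from.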

We obtain a total of $m$ independent (in a loose sense) descriptors by repeating steps 2-4. In these steps, we randomly select from $\mathcal{W}_{\text{top}k_{0}}$ and shuffle $\mathcal{W}_{\text{top}k_{1}}$ to encourage diversity among the $m$ selected descriptors. Instead of truncating all descriptors to a pre-determined length $p$ , we introduce a patience parameter in Alg. 2, which implicitly controls the average descriptor length. We now motivate word soup.

Motivation from descriptor soup

The descriptor soup method has some intuitive properties covered in the previous sub-section, but is limited by the small number of good descriptors. Fig. 3 left shows that only about 1,200 descriptors (green line) in $\mathcal{D}$ are better than no descriptor (vanilla ZS; red line). The descriptor soup is limited to various combinations of these 1,200 “good” descriptors. On the contrary, when we expand the hypothesis space to be $D^{\prime}$ , any permutation of a set of words, there are many more good descriptors to choose from, as indicated by the orange line in Fig. 3 left. In other words, word soup improves classification accuracy by increasing the size of the hypothesis class. Tab. 2 supports this assertion by showing that individual word soup descriptors achieve higher accuracies on ImageNet than descriptor soup members.

	$m$	Source	XD Mean	DG Mean
		INet	(10 datasets)	(4 datasets)
CLIP ZS	1	67.1	65.02	57.22
Vanilla CoOp	1	70.0	66.52	59.25
+ word soup	8	69.6	66.59	59.26
CoOp ensemble	8	69.8	66.68	59.18
CoOp regularized towards initialization	1	70.2	66.97	59.94
+ word soup	8	69.9	66.69	60.05
CoOp with label smoothing	1	70.1	66.37	60.09
+ word soup	8	69.9	66.13	60.16
CoOp + word soup ( $\lambda=0$ )	8	69.8	66.21	59.15
+ our diversity loss ( $\lambda=0.25$ )	8	70.2	67.23	60.20

Table 4: Ablation results to support the diversity loss. “Vanilla CoOp + word soup” refers to appending the word soup descriptors directly to soft CoOp prompts. “CoOp ensemble” refers to ensembling $m$ randomly-initialized soft descriptors trained with CoOp. Observe that the model trained with our diversity loss ($\lambda=0.25$) achieves a 1% increase in accuracy on average. This increase in accuracy cannot be achieved with label smoothing or regularization towards the initialization as in MIRO [4] and ProGrad [71]. For detailed results, see Tab. 9 in the Appendix.

		Cross-dataset Evaluation Target Mean
	$m$	B/32 $\dagger$	B/16 $\dagger$	L/14 $\ddagger$	CoCa L/14 $\ddagger$	g/14 $\ddagger$
ZS	1	61.32	65.02	73.11	74.82	77.58
GPT score mean	5.8	61.22	65.42	73.08	75.48	77.14
Waffle CLIP	16	62.13	65.58	73.25	75.37	77.72
Desc. soup + offsets	100	62.79	66.67	73.19	75.95	78.04
Word soup (ours)	8	62.24	67.03	73.56	76.08	78.09
		Domain Generalization Evaluation Target Mean
	$m$	B/32 $\dagger$	B/16 $\dagger$	L/14 $\ddagger$	CoCa L/14 $\ddagger$	g/14 $\ddagger$
ZS	1	47.68	57.22	64.88	67.94	71.37
GPT score mean	5.8	47.95	58.42	64.96	67.67	71.26
Waffle CLIP	16	49.07	59.08	64.47	67.85	70.99
Desc. soup + offsets	100	50.05	59.82	65.81	68.32	72.21
Word soup (ours)	8	50.00	59.90	65.73	68.73	72.05

Table 5: Comparison with ZS baselines at different model scales. $\dagger$ indicates a model trained by Open-AI [43]; $\ddagger$ indicates a model trained by Open-CLIP [19]. For detailed results, see Tab. 12 in the Appendix.

3.4 Diversity loss

Word soup already achieves competitive performance on most benchmarks. A reasonable next step would be to finetune using the word soup descriptors as an initialization. A variety of methods exist for few-shot finetuning of CLIP, e.g. CoOp, CLIPood, and MaPLe. However, in many cases we actually see a slight decline in target accuracy after finetuning in Tab. 4 ($\lambda=0$). This is because finetuning all descriptors on the same few-shot data forces text prototypes to converge to the same locations in the embedding space, eliminating the initial diversity. Given fixed word soup descriptors $\mathcal{D}^{*}=\{d^{*}_{1},...,d^{*}_{m}\}$, our training loss is:

$$\ell_{\text{train}}=\mathbb{E}_{d^{*}_{i}\sim\mathcal{D}^{*}}\left[\text{CE}(\hat{y}_{d^{*}_{i}},(1-\lambda)y_{\text{truth}}+\lambda\hat{y}_{d^{*}_{i},0})\right] \tag{3}$$

where CE denotes the cross entropy loss, $\hat{y}_{d^{*}_{i}}\in\Delta_{c}$ ( $c$ is the number of classes) denotes the soft prediction of the model with descriptor $d^{*}_{i}$ ; $y_{\text{truth}}$ denotes the one-hot encoding of the true label; and $\hat{y}_{d^{*}_{i},0}\in\Delta_{c}$ denotes the soft prediction of the initial model with descriptor $d^{*}_{i}$ . $\lambda\in[0,1]$ is a hyperparameter controlling the amount of regularization. $\hat{y}_{d^{*}_{i}}$ is the quantity being optimized. $\hat{y}_{d^{*}_{i},0}$ is the output of a softmax with temperature $\tau_{0}$ (the teacher temperature). As in classical knowledge distillation, it is often useful to set the teacher temperature to be different than the training temperature. Training the expectation directly in Eq. 3 requires storing $mc$ forward and backward passes of the text encoder in memory, which is not scalable. In practice, we use one descriptor per mini-batch and rotate among the $m$ descriptors in a round-robin fashion, but we train for the same number of iterations as finetuning with one descriptor.
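For a single descriptor $d^{*}_{i}$, the term inside the expectation of Eq. 3 can be computed as in the sketch below. This is an illustration under assumptions: it evaluates the loss value with NumPy only (actual training would use an automatic differentiation framework), the temperatures are applied as divisors of the logits, and the names `diversity_loss`, `descriptor_for_step`, `tau`, and `tau0` are hypothetical.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def diversity_loss(logits, init_logits, labels, lam, tau, tau0):
    """Cross entropy against a mix of the true label and the initial prediction.

    logits:      (N, C) current model outputs with descriptor d_i*
    init_logits: (N, C) initial (frozen) model outputs with the same descriptor
    labels:      (N,) integer class labels
    lam:         regularization weight lambda in [0, 1]
    tau, tau0:   training and teacher temperatures (assumed to divide the logits)
    """
    y_hat = softmax(logits, tau)                 # quantity being optimized
    y_init = softmax(init_logits, tau0)          # teacher prediction
    y_truth = np.eye(logits.shape[1])[labels]    # one-hot true labels
    target = (1.0 - lam) * y_truth + lam * y_init
    return float(-(target * np.log(y_hat + 1e-12)).sum(axis=1).mean())

def descriptor_for_step(step, m):
    """Round-robin choice of the descriptor index used for a mini-batch."""
    return step % m
```

Setting `lam=0` recovers standard cross entropy, and rotating the descriptor index per mini-batch approximates the expectation over $\mathcal{D}^{*}$ while keeping only one descriptor's text embeddings in memory at a time.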

Our training loss biases the model prediction toward the initial prediction of the model using each descriptor, thereby maintaining the diversity of predictions present at initialization. Fig. 4 verifies this interpretation by showing that training with $\lambda=0.25$ results in a higher average KL divergence between descriptor predictions $\hat{y}_{d^{*}_{i}}$ and a higher average target accuracy than training with smaller values of $\lambda$. Additionally, Tab. 4 displays results for a naive CoOp ensemble and CoOp trained with regularization towards the initialization. These results show that the gains from our diversity loss cannot be obtained by simply ensembling or regularizing predictions towards the initialization as in [4, 71]. Training does not take longer than standard cross entropy training, since only one model is trained for all descriptors. Descriptor tokens are fixed.

Method	$m$	Source (INet)	XD Mean (10 datasets)	DG Mean (4 datasets)
CLIP ZS [43]	1	67.1	65.02	57.22
CoOp [70] $\dagger$		71.5	63.88	59.3
Co-CoOp [69] $\dagger$		71.0	65.74	59.9
MaPLe [25] $\dagger$		70.7	66.30	60.3
CLIPood [51] $\dagger$		71.6		60.5
Cross Entropy (CE)	1	72.3	66.80	60.39
+ GPT score mean [35]	5.8	71.7	66.86	59.92
+ Random descriptors	32	71.6	66.89	60.69
+ Waffle CLIP [45]	32	71.6	66.58	60.65
+ Descriptor soup (ours)	16.7	72.1	67.10	60.70
+ offset trick (ours)	100	72.1	67.51	61.01
+ Word soup centroids (ours)	8	71.8	67.16	61.22
+ Word soup score mean (ours)	8	71.7	67.43	61.32
+ Descriptor soup upper bound	11	71.7	67.62	61.01
ProGrad [71]	1	69.8	66.48	58.96
KgCoOp [22]	1	69.2	66.16	58.64
ProDA [31]	32	70.0	66.23	58.83
Vanilla CoOp [70]	1	70.0	66.52	59.25
+ Word soup score mean (ours)	8	70.2	67.30	60.25
Vanilla MaPLe [25]	1	70.7	66.44	59.32
+ Word soup score mean (ours)	8	70.8	66.65	60.20
Vanilla CLIPood [51]	1	72.9	66.50	60.47
+ Word soup score mean (ours)	8	72.0	67.42	61.23

Table 6: Comparison with few-shot methods and few-shot methods stacked with ZS methods. $\dagger$ indicates author-reported numbers on the same datasets with the same train-test splits. Other numbers are our reproductions. All methods except the upper bound were trained on 3 random 16-shot splits of ImageNet. $m$ indicates the number of descriptors used. Either our descriptor soup with the offset trick or our word soup achieves the best accuracy on average. We use the ViT-B/16 CLIP model. For detailed results, see Tab. 10 in the Appendix.

4 Results

We present the main few-shot results in Tab. 6. The goal of this section is to demonstrate the following in the OOD setting:

1. Complementary to existing few-shot methods: Stacking either descriptor soup or word soup on top of traditional finetuning baselines (Cross Entropy, MaPLe, CLIPood, or CoOp) improves target accuracy, exceeding the current published state of the art. (Tab. 6)
2. Parameter Efficiency: Our method is more parameter efficient than CoOp due to the discrete nature of word soup tokens. We additionally compare to other PEFT methods: VPT [20], BitFit [64], CLIP-adapter [11], SSF [29], LoRA [17], and adapter [16]. (Fig. 2)
3. Descriptor Efficiency: We outperform prior state-of-the-art ZS methods with only 1 or 2 descriptors. Therefore, unlike some prior methods, our method is not primarily driven by ensembling. (Fig. 5)

Datasets

We train on random 16-shot splits of ImageNet-1K [46] and test on 14 unseen target datasets: Caltech-101 [28], Oxford-Pets [40], Stanford-Cars [27], Flowers-102 [36], Food-101 [2], FGVC-Aircraft [34], SUN-397 [60], Describable-Textures (DTD) [7], EuroSAT [13], UCF-101 (an action recognition dataset) [52], ImageNet-V2 [44], ImageNet-Sketch [56], ImageNet-A (natural adversarial examples) [15], and ImageNet-R [14]. The last four datasets are domain-shifted versions of ImageNet containing images from the ImageNet-1K label space.

Experimental Setting

All baselines and methods are trained on 16-shot ImageNet-1K data and tested on the indicated target datasets. Hyperparameters: We tune parameters on a withheld validation set. Word soup (Alg. 2) has three parameters: $k_{0}$ , $k_{1}$ and patience. The diversity loss has two parameters: $\lambda$ and $\tau_{0}$ . These 5 parameters are constant across all experiments. We tune the learning rate separately for each baseline, but keep all other training parameters consistent across methods. We report temperature, batch size, optimizer, EMA setting, token length, initialization and other training details in Appendix A. We discuss the difference between centroid and score mean evaluation in Appendix D.

Discussion

In Tab. 6, we first observe that stacking our word soup method on top of CE, CoOp, MaPLe, or CLIPood achieves an approximately 0.8-1.0% increase in average target accuracy for both XD and DG benchmarks. Due to space limitations, we only compare word soup with other ZS methods when combined with CE, since CE achieves the highest XD accuracy out of the 4 finetuning methods. $m$ indicates the average number of descriptors per label. The greedy descriptor soup can be augmented using our token offset trick, which uses 6 augmented copies of each descriptor. The token offset trick improves accuracy by 0.4% and 0.3% on XD and DG, respectively, but at a significant computational cost. The greedy word soup matches the performance of the augmented descriptor soup without the additional computational cost. Overall, the best OOD accuracy is achieved by either the descriptor soup with token offsets or word soup.

Ablation Study

An ablation study on our soup methods with varying $m$ is presented in Fig. 5. On both benchmarks, our word soup performs best for all $m$ . We note that the word soup with $m=2$ already outperforms all ZS baselines for all values of $m$ up to 64. This result indicates that, unlike state-of-the-art ZS methods, ensembling is not the main ingredient of our method. Additional ablation studies are presented in Appendix E.

Parameter Efficiency and Computational Efficiency

A discussion regarding efficiency of our methods is deferred to Appendix E.

5 Conclusion

In this paper, we proposed descriptor and word soups to tackle the cross-dataset and domain generalization problems. Descriptor soup greedily selects a set of descriptors by maximizing training accuracy on a source dataset. Word soup builds a chain of words using a similar greedy procedure. These greedy soup methods achieve higher target classification accuracy than prior descriptor-based methods by explicitly maximizing training accuracy. We further proposed a loss function to preserve word soup diversity throughout finetuning. When using word soup for initialization and finetuning with the diversity loss, we can significantly improve the accuracy of existing few-shot OOD finetuning methods. Compared to all baselines, word soup achieves the best trade-off between parameter efficiency and target accuracy.

Acknowledgements

DISTRIBUTION STATEMENT A. Approved for public release. Distribution is unlimited.

This material is based upon work supported by the Under Secretary of Defense for Research and Engineering under Air Force Contract No. FA8702-15-D-0001. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Under Secretary of Defense for Research and Engineering.

References

Allingham et al. [2023] James Urquhart Allingham, Jie Ren, Michael W Dusenberry, Xiuye Gu, Yin Cui, Dustin Tran, Jeremiah Zhe Liu, and Balaji Lakshminarayanan. A simple zero-shot prompt weighting technique to improve prompt ensembling in text-image models. In International Conference on Machine Learning, pages 547–568. PMLR, 2023.
Bossard et al. [2014] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 – mining discriminative components with random forests. In European Conference on Computer Vision, 2014.
Bulat and Tzimiropoulos [2023] Adrian Bulat and Georgios Tzimiropoulos. Lasp: Text-to-text optimization for language-aware soft prompting of vision & language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23232–23241, 2023.
Cha et al. [2022] Junbum Cha, Kyungjae Lee, Sungrae Park, and Sanghyuk Chun. Domain generalization by mutual-information regularization with pre-trained models. In European Conference on Computer Vision, pages 440–457. Springer, 2022.
Chen et al. [2022] Guangyi Chen, Weiran Yao, Xiangchen Song, Xinyue Li, Yongming Rao, and Kun Zhang. Prompt learning with optimal transport for vision-language models. arXiv preprint arXiv:2210.01253, 2022.
Cho et al. [2023] Junhyeong Cho, Gilhyun Nam, Sungyeon Kim, Hunmin Yang, and Suha Kwak. Promptstyler: Prompt-driven style generation for source-free domain generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15702–15712, 2023.
Cimpoi et al. [2013] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. CoRR, abs/1311.3618, 2013.
Das et al. [2023] Rajshekhar Das, Yonatan Dukler, Avinash Ravichandran, and Ashwin Swaminathan. Learning expressive prompting with residuals for vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3366–3377, 2023.
Derakhshani et al. [2023] Mohammad Mahdi Derakhshani, Enrique Sanchez, Adrian Bulat, Victor G Turrisi da Costa, Cees GM Snoek, Georgios Tzimiropoulos, and Brais Martinez. Bayesian prompt learning for image-language model generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15237–15246, 2023.
Feng et al. [2023] Chun-Mei Feng, Kai Yu, Yong Liu, Salman Khan, and Wangmeng Zuo. Diverse data augmentation with diffusions for effective test-time prompt tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2704–2714, 2023.
Gao et al. [2023] Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters. International Journal of Computer Vision, pages 1–15, 2023.
He et al. [2022] Xuehai He, Diji Yang, Weixi Feng, Tsu-Jui Fu, Arjun Akula, Varun Jampani, Pradyumna Narayana, Sugato Basu, William Yang Wang, and Xin Eric Wang. Cpl: Counterfactual prompt learning for vision and language models. arXiv preprint arXiv:2210.10362, 2022.
Helber et al. [2019] Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019.
Hendrycks et al. [2021a] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8349, 2021a.
Hendrycks et al. [2021b] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15262–15271, 2021b.
Houlsby et al. [2019] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019.
Hu et al. [2021] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
Hu et al. [2023] Zi-Yuan Hu, Yanyang Li, Michael R Lyu, and Liwei Wang. Vl-pet: Vision-and-language parameter-efficient tuning via granularity control. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3010–3020, 2023.
Ilharco et al. [2021] Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip, 2021.
Jia et al. [2022] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In European Conference on Computer Vision, pages 709–727. Springer, 2022.
Jie and Deng [2022] Shibo Jie and Zhi-Hong Deng. Convolutional bypasses are better vision transformer adapters. arXiv preprint arXiv:2207.07039, 2022.
Kan et al. [2023] Baoshuo Kan, Teng Wang, Wenpeng Lu, Xiantong Zhen, Weili Guan, and Feng Zheng. Knowledge-aware prompt tuning for generalizable vision-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15670–15680, 2023.
Kang et al. [2019] Guoliang Kang, Lu Jiang, Yi Yang, and Alexander G Hauptmann. Contrastive adaptation network for unsupervised domain adaptation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4893–4902, 2019.
Kaul et al. [2023] Prannay Kaul, Weidi Xie, and Andrew Zisserman. Multi-modal classifiers for open-vocabulary object detection. arXiv preprint arXiv:2306.05493, 2023.
Khattak et al. [2023a] Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Maple: Multi-modal prompt learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19113–19122, 2023a.
Khattak et al. [2023b] Muhammad Uzair Khattak, Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Self-regulating prompts: Foundational model adaptation without forgetting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15190–15200, 2023b.
Krause et al. [2013] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In Proceedings of the IEEE international conference on computer vision workshops, pages 554–561, 2013.
Li et al. [2022] Fei-Fei Li, Marco Andreeto, Marc’Aurelio Ranzato, and Pietro Perona. Caltech 101, 2022.
Lian et al. [2022] Dongze Lian, Daquan Zhou, Jiashi Feng, and Xinchao Wang. Scaling & shifting your features: A new baseline for efficient model tuning. Advances in Neural Information Processing Systems, 35:109–123, 2022.
Lin et al. [2023] Zhiqiu Lin, Samuel Yu, Zhiyi Kuang, Deepak Pathak, and Deva Ramanan. Multimodality helps unimodality: Cross-modal few-shot learning with multimodal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19325–19337, 2023.
Lu et al. [2022] Yuning Lu, Jianzhuang Liu, Yonggang Zhang, Yajing Liu, and Xinmei Tian. Prompt distribution learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5206–5215, 2022.
Luo et al. [2023] Gen Luo, Minglang Huang, Yiyi Zhou, Xiaoshuai Sun, Guannan Jiang, Zhiyu Wang, and Rongrong Ji. Towards efficient visual adaption via structural re-parameterization. arXiv preprint arXiv:2302.08106, 2023.
Ma et al. [2023] Xiaosong Ma, Jie Zhang, Song Guo, and Wenchao Xu. Swapprompt: Test-time prompt adaptation for vision-language models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
Maji et al. [2013] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew B. Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. CoRR, abs/1306.5151, 2013.
Menon and Vondrick [2022] Sachit Menon and Carl Vondrick. Visual classification via description from large language models. arXiv preprint arXiv:2210.07183, 2022.
Nilsback and Zisserman [2008] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics and Image Processing, pages 722–729, 2008.
Niss et al. [2024] Laura Niss, Kevin Vogt-Lowell, and Theodoros Tsiligkaridis. Quantified task misalignment to inform PEFT: An exploration of domain generalization and catastrophic forgetting in CLIP. In ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models, 2024.
Novack et al. [2023] Zachary Novack, Julian McAuley, Zachary Chase Lipton, and Saurabh Garg. Chils: Zero-shot image classification with hierarchical label sets. In International Conference on Machine Learning, pages 26342–26362. PMLR, 2023.
Pantazis et al. [2022] Omiros Pantazis, Gabriel Brostow, Kate Jones, and Oisin Mac Aodha. Svl-adapter: Self-supervised adapter for vision-language pretrained models. arXiv preprint arXiv:2210.03794, 2022.
Parkhi et al. [2012] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In 2012 IEEE conference on computer vision and pattern recognition, pages 3498–3505. IEEE, 2012.
Peng et al. [2023] Fang Peng, Xiaoshan Yang, Linhui Xiao, Yaowei Wang, and Changsheng Xu. Sgva-clip: Semantic-guided visual adapting of vision-language models for few-shot image classification. IEEE Transactions on Multimedia, 2023.
Pratt et al. [2023] Sarah Pratt, Ian Covert, Rosanne Liu, and Ali Farhadi. What does a platypus look like? generating customized prompts for zero-shot image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15691–15701, 2023.
Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021.
Recht et al. [2019] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? CoRR, abs/1902.10811, 2019.
Roth et al. [2023] Karsten Roth, Jae Myung Kim, A Koepke, Oriol Vinyals, Cordelia Schmid, and Zeynep Akata. Waffling around for performance: Visual classification with random words and broad concepts. arXiv preprint arXiv:2306.07282, 2023.
Russakovsky et al. [2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
Saito et al. [2019] Kuniaki Saito, Donghyun Kim, Stan Sclaroff, Trevor Darrell, and Kate Saenko. Semi-supervised domain adaptation via minimax entropy. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8050–8058, 2019.
Samadh et al. [2023] Jameel Hassan Abdul Samadh, Hanan Gani, Noor Hazim Hussein, Muhammad Uzair Khattak, Muzammal Naseer, Fahad Khan, and Salman Khan. Align your prompts: Test-time prompting with distribution alignment for zero-shot generalization. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
Shi and Yang [2023] Cheng Shi and Sibei Yang. Logoprompt: Synthetic text images can be good visual prompts for vision-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2932–2941, 2023.
Shu et al. [2022] Manli Shu, Weili Nie, De-An Huang, Zhiding Yu, Tom Goldstein, Anima Anandkumar, and Chaowei Xiao. Test-time prompt tuning for zero-shot generalization in vision-language models. Advances in Neural Information Processing Systems, 35:14274–14289, 2022.
Shu et al. [2023] Yang Shu, Xingzhuo Guo, Jialong Wu, Ximei Wang, Jianmin Wang, and Mingsheng Long. Clipood: Generalizing clip to out-of-distributions, 2023.
Soomro et al. [2012] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. CoRR, abs/1212.0402, 2012.
Sung et al. [2022a] Yi-Lin Sung, Jaemin Cho, and Mohit Bansal. Lst: Ladder side-tuning for parameter and memory efficient transfer learning. Advances in Neural Information Processing Systems, 35:12991–13005, 2022a.
Sung et al. [2022b] Yi-Lin Sung, Jaemin Cho, and Mohit Bansal. Vl-adapter: Parameter-efficient transfer learning for vision-and-language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5227–5237, 2022b.
Vogt-Lowell et al. [2023] Kevin Vogt-Lowell, Noah Lee, Theodoros Tsiligkaridis, and Marc Vaillant. Robust fine-tuning of vision-language models for domain generalization. In IEEE High Performance Extreme Computing Conference (HPEC), 2023.
Wang et al. [2019] Haohan Wang, Songwei Ge, Eric P. Xing, and Zachary C. Lipton. Learning robust global representations by penalizing local predictive power. CoRR, abs/1905.13549, 2019.
Wang et al. [2022] Zifeng Wang, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, and Tomas Pfister. Learning to prompt for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 139–149, 2022.
Wortsman et al. [2022a] Mitchell Wortsman, Gabriel Ilharco, Samir Ya Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International Conference on Machine Learning, pages 23965–23998. PMLR, 2022a.
Wortsman et al. [2022b] Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, et al. Robust fine-tuning of zero-shot models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7959–7971, 2022b.
Xiao et al. [2010] Jianxiong Xiao, James Hays, Krista A. Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 3485–3492, 2010.
Xing et al. [2022] Yinghui Xing, Qirui Wu, De Cheng, Shizhou Zhang, Guoqiang Liang, and Yanning Zhang. Class-aware visual prompt tuning for vision-language pre-trained model. arXiv preprint arXiv:2208.08340, 2022.
Yao et al. [2023] Hantao Yao, Rui Zhang, and Changsheng Xu. Visual-language prompt tuning with knowledge-guided context optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6757–6767, 2023.
Yu et al. [2023] Tao Yu, Zhihe Lu, Xin Jin, Zhibo Chen, and Xinchao Wang. Task residual for tuning vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10899–10909, 2023.
Zaken et al. [2021] Elad Ben Zaken, Shauli Ravfogel, and Yoav Goldberg. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. arXiv preprint arXiv:2106.10199, 2021.
Zang et al. [2022] Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, and Chen Change Loy. Unified vision and language prompt learning. arXiv preprint arXiv:2210.07225, 2022.
Zhang et al. [2021] Renrui Zhang, Rongyao Fang, Wei Zhang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-adapter: Training-free clip-adapter for better vision-language modeling. arXiv preprint arXiv:2111.03930, 2021.
Zhang et al. [2019] Yuchen Zhang, Tianle Liu, Mingsheng Long, and Michael Jordan. Bridging theory and algorithm for domain adaptation. In International conference on machine learning, pages 7404–7413. PMLR, 2019.
Zhou et al. [2022a] Kaiyang Zhou, Ziwei Liu, Yu Qiao, Tao Xiang, and Chen Change Loy. Domain generalization: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022a.
Zhou et al. [2022b] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16816–16825, 2022b.
Zhou et al. [2022c] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022c.
Zhu et al. [2023] Beier Zhu, Yulei Niu, Yucheng Han, Yue Wu, and Hanwang Zhang. Prompt-aligned gradient for prompt tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15659–15669, 2023.


Supplementary Material

Algorithm 1 Descriptor soup pseudo-code, PyTorch-like

# '@' denotes matrix multiplication in Python.
# '+' denotes concatenation when operating on lists.
# Inputs: L2-normalized image_embeddings, y_truth
#   classnames: list of classnames in English
#   model, tokenizer: CLIP-style model and its tokenizer
#   descriptions: list of descriptions from an LLM,
#     e.g. ['which has legs.', 'which can swim.', ...]
# Hyperparameters: m (number of members in the soup)
import torch
from torch.nn.functional import normalize

def get_accuracy(image_embeddings, text_embeddings, y_truth):
    scores = image_embeddings @ text_embeddings.T
    # cast the boolean matches to float before averaging
    return (scores.argmax(dim=1) == y_truth).float().mean().item()

def get_description_embeddings(description):
    d = tokenizer(['a photo of ' + classname + ', ' + description
                   for classname in classnames])
    return normalize(model.encode_text(d), dim=-1)

accuracies = []
for description in descriptions:
    text_embeddings = get_description_embeddings(description)
    accuracies.append(get_accuracy(image_embeddings, text_embeddings, y_truth))

# sort descriptions by accuracy, best first
order = sorted(range(len(descriptions)), key=lambda i: accuracies[i], reverse=True)
descriptions_sorted = [descriptions[i] for i in order]

# initialize with the best descriptor
soup, accuracy = [descriptions_sorted[0]], accuracies[order[0]]

# greedy selection: keep a candidate only if it raises the soup accuracy
for description in descriptions_sorted[1:]:
    soup_embeddings = torch.stack(
        [get_description_embeddings(d) for d in soup + [description]])
    text_embeddings = normalize(soup_embeddings.mean(dim=0), dim=-1)
    new_accuracy = get_accuracy(image_embeddings, text_embeddings, y_truth)
    if new_accuracy > accuracy:
        soup, accuracy = soup + [description], new_accuracy

soup = soup[:m]  # the final descriptor soup keeps at most m members

Algorithm 2 Word soup pseudo-code, PyTorch-like

# Hyperparameters: k0, k1, m, patience
# Inputs: L2-normalized image_embeddings, y_truth
#   words: list of candidate words, e.g. ["the", "of", "and", ...]
# get_accuracy and get_description_embeddings are defined in Algorithm 1.
import random

accuracies = []
for word in words:
    text_embeddings = get_description_embeddings(word)
    accuracies.append(get_accuracy(image_embeddings, text_embeddings, y_truth))

# sort words by accuracy, best first
order = sorted(range(len(words)), key=lambda i: accuracies[i], reverse=True)
words = [words[i] for i in order]

soup = []
for _ in range(m):
    # start each chain from a random word among the top k0
    first_word = random.choice(words[:k0])
    word_chain = first_word
    accuracy = get_accuracy(image_embeddings,
                            get_description_embeddings(word_chain), y_truth)
    # draw `patience` random candidates from the top k1 words
    candidates = words[:k1]
    random.shuffle(candidates)
    words_k1 = candidates[:patience]

    # greedy selection
    for word in words_k1:
        text_embeddings = get_description_embeddings(word_chain + " " + word)
        next_accuracy = get_accuracy(image_embeddings, text_embeddings, y_truth)
        if next_accuracy > accuracy:
            word_chain = word_chain + " " + word
            accuracy = next_accuracy  # raise the bar for the next candidate
    soup = soup + [word_chain]

# soup now holds the m word chains

Limitations

Similar to many related works, the main limitation of our work is that we require the source dataset to cover a broad range of classes (e.g. ImageNet). As a counterexample, we cannot hope to train on pets classification and generalize to ImageNet. We highlighted this limitation in Table 1 of the main paper (top) with qualitative examples.

Appendix A Training details

Images are not augmented during the greedy descriptor selection process; image augmentation during finetuning is consistent with prior work. Descriptors are always selected using the pretrained model parameters. Selecting descriptors based on finetuned model weights would be sub-optimal, since the pretrained text encoder captures a richer set of textual information. Remaining details are organized in Table 7. Mini-batches are randomly sampled, but with exactly one sample per label per batch. Cross entropy and CLIPood both tune the last three layers of the image and text encoders, in addition to a shallow text prompt (like CoOp) at a higher learning rate. The only difference between Cross entropy and CLIPood is the loss function; the latter method uses an adaptive margin. We use cross entropy loss for all baselines except ProDA and ProGrad. ProDA and ProGrad consume more GPU memory during training, so we were unable to fit them onto a single A40 GPU when training with cross entropy. Consequently, we were forced to use a CLIP-like contrastive loss for these two methods to reduce the number of text encoder evaluations.

General Parameters
batch size	64
learning rate	tuned per method
weight decay	1e-5
number of iterations	750
learning rate decay	none
softmax temperature	60
optimizer	SGD momentum=0.9
label smoothing	0
EMA weight averaging $\beta$	0.995
Prompt Tuning Parameters
CoOp prompt length	3
CoOp prompt depth	1 (shallow)
MaPLe prompt depth	3
MaPLe prompt length	3
CoOp prompt initialization	“a photo of”
text prompt learning rate multiplier	10 $\times$
Word Soup and Diversity Loss Parameters
$k_{0}$	250
$k_{1}$	1000
patience	250
$\lambda$	0.25
$\tau_{0}$	10
Optimal Learning Rates
Cross entropy	2e-5
CLIPood [51]	2e-5
CoOp [70]	8e-5
MaPLe [25]	0.025
KgCoOp [22]	4e-5
ProDA [31]	3.2e-4
ProGrad [71]	1.28e-3
VPT [20]	0.8
bitfit [64]	1.25e-4
CLIP-adapter [11]	6e-3
SSF [29]	1e-4
adapter [16]	2.5e-3
LoRA [17]	1e-5

Table 7: Miscellaneous training details for training on 16-shot ImageNet-1K in the OOD setting.

Appendix B Additional Word Soup Motivation

A natural baseline for word soup is soft prompt tuning (CoOp), since the former method can be thought of as “discrete” prompt tuning. Soft prompt tuning optimizes over a continuous parameter space using gradient descent, whereas word soup optimizes over a discrete parameter space using a greedy algorithm. Many prior works (e.g. [58, 59]) observe that gradient descent is limited to a narrow convex basin around the initialization, when finetuning a pretrained deep model. This can be shown by linearly interpolating between the pretrained and finetuned parameters, similar to Fig. 6. In this figure, we plot in orange both the source and target error for interpolations between a randomly initialized descriptor (orange star) and the finetuned soft descriptor. The resulting soft descriptor lies at the bottom of a sharp loss basin. On the other hand, the word soup initialized descriptor (blue star) lies at an equally low but much flatter region of the loss landscape. Finetuning from this initialization leads to a lower error on both source and target data, as indicated in blue. This visualization suggests that our word soup algorithm finds robust flat minima, since it is not limited to a narrow loss basin like gradient descent methods.
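The linear interpolation used for this kind of loss-landscape visualization can be sketched as follows. This is a minimal illustration under stated assumptions. The parameters are flattened into plain lists, the function names are hypothetical, and `toy_error` is a made-up quadratic standing in for the source or target error that would be measured on real data.

```python
def interpolate(theta_init, theta_finetuned, alpha):
    """Point (1 - alpha) * theta_init + alpha * theta_finetuned."""
    return [(1.0 - alpha) * a + alpha * b
            for a, b in zip(theta_init, theta_finetuned)]


def error_along_path(theta_init, theta_finetuned, error_fn, num_points=11):
    """Evaluate error_fn at evenly spaced points between the two endpoints."""
    alphas = [i / (num_points - 1) for i in range(num_points)]
    return [(alpha, error_fn(interpolate(theta_init, theta_finetuned, alpha)))
            for alpha in alphas]


def toy_error(theta):
    """Hypothetical stand-in for source or target error: a quadratic bowl."""
    return sum((x - 1.0) ** 2 for x in theta)


# Error curve from an initialization at the origin to a minimum at (1, 1).
curve = error_along_path([0.0, 0.0], [1.0, 1.0], toy_error)
```

Plotting the second element of each pair against `alpha` gives one curve of the kind described above. A sharp basin shows up as a steep rise in error away from the finetuned endpoint.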

	Source	Cross-dataset Evaluation Targets											Domain Generalization Targets
	INet	Caltech	Pets	Cars	Flowers	Food	Aircraft	SUN	DTD	EuroSAT	UCF	Mean	INet-V2	Sketch	INet-A	INet-R	Mean
CLIP ZS	67.1	93.3	89.0	65.4	71.0	85.7	25.0	63.2	43.6	46.7	67.4	65.02	61.0	46.6	47.2	74.1	57.22
Word soup	68.8	94.1	89.5	65.9	72.6	86.3	26.1	67.2	45.3	53.9	67.8	66.87	62.6	49.0	50.4	77.0	59.73
Vanilla CoOp	68.7	94.4	90.2	66.1	70.9	85.8	26.0	66.7	47.4	50.1	68.9	66.63	61.9	48.6	49.8	76.7	59.26
+ Word soup	69.1	94.6	91.1	65.2	71.8	86.0	25.1	67.4	46.0	51.9	69.1	66.82	62.7	49.4	50.3	78.0	60.09

Table 8: Experiments using a different source dataset (a 16-shot subset of LAION-2B queried using ImageNet label names). Settings are identical to Table 10 (the expanded form of Table 6 in the main paper).

Appendix C Token offset trick (for Descriptor Soup)

We propose a novel trick to augment/diversify the descriptors at test time to further increase the target accuracy of descriptor soups. This trick does not improve the performance of word soups significantly. Unlike the vision encoder, which has a cls token at a fixed position (either prepended or appended to the image tokens), the CLIP text encoder does not have a separate cls token. Instead, CLIP uses the output embedding which corresponds to the position of the end-of-sentence token in the input. In classification problems, the text inputs are generally short compared to the context size (number of total tokens). Consequently, the end-of-sentence token is always near the beginning of the sequence, with the remainder padded by null tokens. In this regime, there is never any information at the end of the input token sequence to attend to, so a large portion of the information in the pretrained model is not used. We remedy this inefficient use of pretrained parameters by shifting the description toward the end of the sequence by $t$ tokens. For example, if $t=5$ , we have:

• original: a photo of a dog, which may be large or small.
• augmented: a photo of a dog, ! ! ! ! ! which may be large or small. (“!” denotes the null token)

For all experiments with token offsets, we use $t\in\{0,5,10,15,20,25\}$, for a total of 6 augmented copies per descriptor. This diversifies the text embeddings at the expense of a 6-fold increase in the text centroid evaluation time.
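The offset trick can be sketched as follows. This is an illustrative sketch rather than the released implementation: it operates on whitespace-separated words instead of CLIP's actual subword tokens, and the function names and the `NULL_TOKEN` constant are hypothetical.

```python
from typing import List, Sequence

NULL_TOKEN = "!"                   # the example above renders the null token as "!"
OFFSETS = (0, 5, 10, 15, 20, 25)   # offsets t used in the experiments


def offset_descriptor(label_tokens: Sequence[str],
                      descriptor_tokens: Sequence[str],
                      t: int,
                      null_token: str = NULL_TOKEN) -> List[str]:
    """Shift the descriptor toward the end of the sequence by t null tokens."""
    if t < 0:
        raise ValueError("offset t must be non-negative")
    return list(label_tokens) + [null_token] * t + list(descriptor_tokens)


def augmented_copies(label_tokens: Sequence[str],
                     descriptor_tokens: Sequence[str],
                     offsets: Sequence[int] = OFFSETS) -> List[List[str]]:
    """Return one shifted copy of the prompt per offset."""
    return [offset_descriptor(label_tokens, descriptor_tokens, t)
            for t in offsets]


# Word-level tokens are an assumption made only for readability.
label = "a photo of a dog,".split()
descriptor = "which may be large or small.".split()
copies = augmented_copies(label, descriptor)
shifted = " ".join(copies[1])      # the copy with t = 5
```

Each of the 6 copies is then encoded by the text encoder, which is the source of the 6-fold increase in text evaluation time. A real implementation would also need to keep the shifted sequence within the encoder's context size.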

Appendix D Centroid vs. Score Mean Evaluation

In this work, we presented both centroid and score mean results for both our soup methods and ensemble baselines. Centroid evaluation refers to averaging the text features among descriptors before calculating the cosine similarity between image and text features. Score mean evaluation refers to calculating the cosine similarity between image and text features and then averaging the similarity scores among descriptors.

Concretely, let there be $m$ descriptors and $c$ classes. Let $\mathbf{x}_{I}$ denote a normalized image feature and $\mathbf{x}_{T,k}^{j}$ denote the normalized text feature corresponding to class $k$ and descriptor $j$ ; $k\in[1:c]$ and $j\in[1:m]$ .

The predicted score for class $k$ using centroid evaluation, $s_{k}$ , is defined as:

\overline{\mathbf{x}}_{T,k}=\frac{1}{m}\sum_{j=1}^{m}\mathbf{x}_{T,k}^{j}

s_{k}=\left\langle\mathbf{x}_{I},\frac{\overline{\mathbf{x}}_{T,k}}{\|\overline{\mathbf{x}}_{T,k}\|}\right\rangle

The predicted score for class $k$ using score mean evaluation is defined as:

s_{k}=\frac{1}{m}\sum_{j=1}^{m}\left\langle\mathbf{x}_{I},\mathbf{x}_{T,k}^{j}\right\rangle

Empirically, we found that score mean evaluation usually leads to small numerical improvements. However, in large scale applications where retrieval speed is crucial, centroid evaluation can be more efficiently implemented than score mean evaluation, due to the existence of fast nearest neighbor retrieval frameworks.
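The two evaluation rules defined above can be compared on a toy example. The sketch below is dependency-free for clarity and is not the released code; the helper names and the two-dimensional features are hypothetical, and `text_feats[k][j]` plays the role of $\mathbf{x}_{T,k}^{j}$.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def normalize(v: Vector) -> List[float]:
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [a / norm for a in v]


def centroid_scores(x_img: Vector,
                    text_feats: Sequence[Sequence[Vector]]) -> List[float]:
    """Average the m text features per class, re-normalize, then take the dot product."""
    scores = []
    for class_feats in text_feats:
        m = len(class_feats)
        dim = len(class_feats[0])
        mean = [sum(f[d] for f in class_feats) / m for d in range(dim)]
        scores.append(dot(x_img, normalize(mean)))
    return scores


def score_mean_scores(x_img: Vector,
                      text_feats: Sequence[Sequence[Vector]]) -> List[float]:
    """Take the dot product with each of the m text features, then average the scores."""
    return [sum(dot(x_img, f) for f in class_feats) / len(class_feats)
            for class_feats in text_feats]


# Toy data: c = 2 classes, m = 2 descriptors, 2-dimensional unit features.
x_img = normalize([1.0, 0.0])
text_feats = [
    [normalize([1.0, 0.0]), normalize([0.0, 1.0])],  # class 0
    [normalize([0.0, 1.0]), normalize([0.0, 1.0])],  # class 1
]
centroid = centroid_scores(x_img, text_feats)
score_mean = score_mean_scores(x_img, text_feats)
```

By linearity of the inner product, the score mean for class $k$ equals the centroid score multiplied by $\|\overline{\mathbf{x}}_{T,k}\|$, so under these definitions the two rules can rank classes differently only when the centroid norms differ across classes.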

Appendix E Additional Ablation Studies

We present additional ablation studies in Table 8 and Figure 7. Table 8 presents OOD generalization results with a different source dataset. Figure 7 presents results with different numbers of shots.

Parameter Efficiency

Fig. 2 compares the parameter efficiency of our word soups against PEFT baselines. We observe that word soup can achieve the maximal CoOp accuracy using $25\times$ and $70\times$ fewer parameters on the XD and DG benchmarks, respectively. This reduction in parameter storage requirements is due to the discrete nature of word soup parameters: a discrete token requires only one integer parameter, while a soft token requires 512 floating-point parameters.
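The parameter counts behind this comparison can be reproduced with simple arithmetic. In the sketch below, the 512-dimensional soft token comes from the text, while the value of 12 tokens per word soup descriptor is an assumption inferred from the $m=1$ row of Table 11 (0.012 thousand parameters); the function names are hypothetical.

```python
SOFT_TOKEN_DIM = 512        # floating-point parameters per soft token (from the text)
TOKENS_PER_DESCRIPTOR = 12  # assumption: inferred from the m=1 row of Table 11


def soft_prompt_params(num_tokens: int, dim: int = SOFT_TOKEN_DIM) -> int:
    """Number of floating-point parameters in a soft prompt."""
    return num_tokens * dim


def word_soup_params(m: int,
                     tokens_per_descriptor: int = TOKENS_PER_DESCRIPTOR) -> int:
    """Number of integer parameters (token ids) in an m-descriptor word soup."""
    return m * tokens_per_descriptor


coop_3_tokens = soft_prompt_params(3)   # 3 soft tokens
soup_m8 = word_soup_params(8)           # 8 discrete descriptors
raw_ratio = coop_3_tokens / soup_m8     # ratio of raw parameter counts
```

Note that `raw_ratio` compares raw parameter counts for two particular configurations. The $25\times$ and $70\times$ figures quoted above instead compare the number of parameters needed to reach the same accuracy in Fig. 2.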

Computational Efficiency

We emphasize that our method adds negligible test time computation, despite requiring $m$ text encoder evaluations per label. For classification tasks, more time is spent processing image data compared to text data. For example, the evaluation of the $m=8$ word soup in Table 6 took 239 seconds, of which 234 seconds were spent evaluating image embeddings and only 4.6 seconds were spent evaluating text embeddings.

		Source	Cross-dataset Evaluation Targets											Domain Generalization Targets
	$m$	INet	Caltech	Pets	Cars	Flowers	Food	Aircraft	SUN	DTD	EuroSAT	UCF	Mean	INet-V2	Sketch	INet-A	INet-R	Mean
CLIP ZS	1	67.1	93.3	89.0	65.4	71.0	85.7	25.0	63.2	43.6	46.7	67.4	65.02	61.0	46.6	47.2	74.1	57.22
Vanilla CoOp	1	70.0	94.6	91.2	65.4	71.2	86.3	24.6	66.9	48.0	48.3	68.7	66.52	63.2	48.4	49.2	76.2	59.25
+ word soup	8	69.6	94.6	90.8	65.2	70.3	86.0	24.8	66.9	47.6	50.7	69.0	66.59	62.9	48.2	49.6	76.3	59.26
CoOp ensemble	8	69.8	94.4	91.5	66.2	72.6	86.6	25.7	67.7	46.4	47.9	67.8	66.68	63.0	48.4	49.6	75.8	59.18
CoOp regularized towards initialization	1	70.2	94.8	91.1	65.4	72.1	86.2	24.8	67.6	46.2	52.7	69.0	66.97	63.6	49.1	49.6	77.5	59.94
+ word soup	8	69.9	94.7	90.1	64.7	71.8	85.5	25.0	67.4	45.5	53.6	68.7	66.69	63.4	49.2	49.9	77.7	60.05
CoOp with label smoothing	1	70.1	94.5	90.6	64.9	72.0	85.8	24.6	67.3	45.4	50.0	68.6	66.37	63.4	49.1	50.2	77.6	60.09
+ word soup	8	69.9	94.5	89.9	64.9	71.7	85.2	25.0	66.8	44.8	50.0	68.3	66.13	63.6	49.3	50.1	77.7	60.16
CoOp + word soup ( $\lambda=0$ )	8	69.8	94.3	90.8	64.8	71.1	86.0	24.1	67.2	46.8	48.4	68.8	66.21	63.2	48.3	49.0	76.1	59.15
+ our diversity loss ( $\lambda=0.25$ )	8	70.2	94.7	91.0	65.4	72.3	86.0	24.8	67.8	45.9	55.2	69.2	67.23	63.6	49.3	50.1	77.9	60.20

Table 9: Ablation results to support the diversity loss. “Vanilla CoOp + word soup” refers to naively appending the word soup descriptors trained on the pretrained model to the separately trained soft CoOp prompts. “CoOp ensemble” refers to ensembling $m$ randomly-initialized soft descriptors. This requires running CoOp $m$ times, but offers negligible gains in accuracy. In the second half of the table, we fix the descriptor tokens and train the prompt tokens only. We first run CoOp with standard CE training ($\lambda=0$) and observe a decrease in accuracy compared to the naive “Vanilla CoOp + word soup” baseline, caused by the diversity collapse issue observed in Figure 4. We then attempt to simply minimize the KL divergence between the training prediction and the initial prediction; this shows that the diversity loss is not simply a form of regularization towards the initialization as in MIRO [4] and ProGrad [71]. Finally, we train using our diversity loss with $\lambda=0.25$, which achieves a 1% increase in accuracy on average. Average of 3 trials. This is an expanded version of Table 4 in the main paper.

		Source	Cross-dataset Evaluation Targets											Domain Generalization Targets
	$m$	INet	Caltech	Pets	Cars	Flowers	Food	Aircraft	SUN	DTD	EuroSAT	UCF	Mean	INet-V2	Sketch	INet-A	INet-R	Mean
CLIP ZS [43]	1	67.1	93.3	89.0	65.4	71.0	85.7	25.0	63.2	43.6	46.7	67.4	65.02	61.0	46.6	47.2	74.1	57.22
CoOp [70] $\dagger$		71.51	93.70	89.14	64.51	68.71	85.30	18.47	64.15	41.92	46.39	66.55	63.88	64.20	47.99	49.71	75.21	59.3
Co-CoOp [69] $\dagger$		71.02	94.43	90.14	65.32	71.88	86.06	22.94	67.36	45.73	45.37	68.21	65.74	64.07	48.75	50.63	76.18	59.9
MaPLe [25] $\dagger$		70.72	93.53	90.49	65.57	72.23	86.20	24.74	67.01	46.49	48.06	68.69	66.30	64.07	49.15	50.90	76.98	60.3
CLIPood [51] $\dagger$		71.6												64.9	49.3	50.4	77.2	60.5
Cross Entropy (CE)	1	72.3	94.6	89.8	64.9	72.4	86.3	25.3	68.1	45.7	51.5	69.4	66.80	65.4	49.4	49.8	77.0	60.39
+ GPT score mean [35]	5.8	71.7	94.3	89.9	64.5	72.1	86.0	24.5	68.6	46.6	53.8	68.4	66.86	64.9	49.4	48.8	76.6	59.92
+ Random descriptors	32	71.6	94.6	89.3	64.7	72.1	86.0	25.3	67.5	45.4	55.2	68.8	66.89	64.8	49.9	50.2	77.9	60.69
+ Waffle CLIP [45]	32	71.6	94.1	89.8	65.0	72.6	86.1	26.1	67.7	45.0	50.9	68.4	66.58	65.1	49.7	50.3	77.4	60.65
+ Descriptor soup (ours)	16.7	72.1	94.7	89.9	65.0	72.4	86.3	25.6	68.0	45.6	53.9	69.5	67.10	65.3	49.7	50.1	77.7	60.70
+ offset trick (ours)	100	72.1	94.1	90.4	66.3	73.3	86.3	26.1	67.8	46.4	55.0	69.4	67.51	65.3	49.8	50.8	78.2	61.01
+ Word soup centroids (ours)	8	71.8	94.4	90.4	65.0	72.3	86.1	25.3	68.2	45.5	55.4	69.1	67.16	65.2	50.2	50.7	78.7	61.22
+ Word soup score mean (ours)	8	71.7	94.5	90.2	65.1	72.4	86.2	25.6	68.1	45.6	57.3	69.3	67.43	65.3	50.3	50.9	78.7	61.32
+ Descriptor soup upper bound	11	71.7	94.4	90.2	66.5	72.9	86.1	26.3	67.4	46.4	57.2	68.6	67.62	64.9	49.7	50.9	78.6	61.01
ProGrad [71]	1	69.8	94.4	91.5	65.8	72.4	86.4	25.3	66.6	47.2	46.3	69.0	66.48	63.2	48.2	48.6	75.9	58.96
KgCoOp [22]	1	69.2	94.3	89.9	63.9	71.0	85.7	23.7	66.2	44.4	54.4	68.3	66.16	62.3	48.0	48.8	75.5	58.64
ProDA [31]	32	70.0	94.2	90.2	64.7	70.8	85.7	23.1	67.0	45.8	51.4	69.4	66.23	63.0	48.1	48.4	75.7	58.83
Vanilla CoOp [70]	1	70.0	94.6	91.2	65.4	71.2	86.3	24.6	66.9	48.0	48.3	68.7	66.52	63.2	48.4	49.2	76.2	59.25
+ Word soup score mean (ours)	8	70.2	94.7	90.9	65.4	72.0	86.0	25.0	67.7	45.9	56.2	69.2	67.30	63.6	49.3	50.1	77.9	60.25
Vanilla MaPLe [25]	1	70.7	93.7	91.2	65.4	71.9	86.2	25.0	67.2	46.2	48.6	68.9	66.44	63.9	48.6	48.4	76.3	59.32
+ Word soup score mean (ours)	8	70.8	94.1	91.2	65.2	71.8	85.8	24.0	67.0	46.0	53.5	68.0	66.65	64.0	49.6	49.2	77.9	60.20
Vanilla CLIPood [51]	1	72.9	94.8	89.8	64.9	72.2	85.9	25.8	67.8	46.4	48.7	68.7	66.50	66.0	49.5	49.5	76.9	60.47
+ Word soup score mean (ours)	8	72.0	94.4	90.8	64.8	72.4	86.0	25.4	67.9	46.0	57.6	68.9	67.42	65.5	50.2	50.8	78.5	61.23

Table 10: Comparison with few-shot methods and few-shot methods stacked with ZS methods. $\dagger$ indicates author-reported numbers on the same datasets with the same train-test splits. Other numbers are from our reproductions using our GitHub code. We tune all baselines on a withheld validation set, so our numbers are different from published numbers. The descriptor soup upper bound was trained to maximize average cross-dataset accuracy (on test data); this loosely approximates the maximally achievable accuracy on these benchmarks without using extra information. All other methods were trained on 3 random 16-shot splits of ImageNet. $m$ indicates number of descriptors used. All methods are evaluated on top of 3 models finetuned with different random seeds. Due to space limitations, we only compare with ZS baselines stacked on top of the CE-finetuned few-shot model, since this is the best finetuned model. Either our descriptor soup with the offset trick or our word soup achieves the best accuracy on most datasets. Finally, we stack our word soup method on top of CoOp, MaPLe, and CLIPood finetuned models to show that word soup is complementary to most existing robust finetuning methods. Average of 3 trials. This is an expanded version of Table 6 in the main paper.

		Source	Cross-dataset Evaluation Targets											Domain Generalization Targets
	parameters (thousands)	INet	Caltech	Pets	Cars	Flowers	Food	Aircraft	SUN	DTD	EuroSAT	UCF	Average	INet-V2	INet-Sketch	INet-A	INet-R	Average
VPT shallow 1 token	0.768	68.7	93.8	90.0	65.1	69.5	85.3	24.2	66.0	44.7	41.9	67.8	64.84	62.1	47.9	47.9	76.7	58.67
VPT shallow 2 tokens	2	68.7	93.8	90.0	65.2	69.5	85.2	24.2	66.2	44.8	42.3	67.1	64.84	62.2	48.0	47.3	76.7	58.54
VPT shallow 3 tokens	2	68.7	93.9	90.0	65.6	70.2	85.3	24.8	66.2	44.7	43.8	67.5	65.20	62.4	48.1	47.0	76.6	58.52
VPT shallow 3 tokens	2	68.6	93.8	89.5	64.8	70.1	85.3	24.1	66.1	44.5	45.4	67.7	65.12	62.1	48.0	47.1	76.4	58.41
VPT deep 2 layers	5	68.8	93.5	89.7	65.0	70.3	85.4	24.0	65.9	44.7	49.3	67.6	65.54	62.2	48.2	46.9	76.6	58.47
VPT deep 3 layers	7	68.7	93.5	89.4	65.3	70.4	85.3	24.2	66.2	44.8	45.0	67.5	65.16	62.3	48.2	46.8	76.4	58.42
MaPLe 1 layer	396	70.1	94.2	91.1	64.3	71.1	86.1	24.5	67.0	47.3	51.8	68.6	66.61	63.4	48.4	48.8	76.3	59.22
MaPLe 2 layers	397	70.4	93.6	91.8	64.3	71.3	85.9	24.7	67.0	46.9	48.1	68.5	66.21	63.7	48.3	49.2	76.1	59.34
MaPLe 3 layers	399	70.7	93.7	91.2	65.4	71.9	86.2	25.0	67.2	46.2	48.6	68.9	66.44	63.9	48.6	48.4	76.3	59.32
bitfit last layer	17	68.3	94.1	89.5	65.2	71.4	85.9	24.9	65.7	44.7	46.9	67.9	65.62	61.7	48.0	48.5	75.9	58.51
bitfit last 2 layers	34	68.8	93.9	89.9	65.3	71.4	85.9	25.1	66.4	45.1	47.4	68.4	65.88	62.1	48.6	48.5	76.6	58.93
bitfit last 3 layers	51	69.1	93.9	90.0	65.3	71.7	85.8	25.0	66.7	45.4	48.3	68.4	66.05	62.6	48.7	48.5	76.8	59.12
CoOp 1 token	0.512	69.4	94.3	91.4	64.4	71.7	86.3	24.6	67.2	47.3	49.1	68.5	66.49	63.1	48.2	49.0	76.1	59.08
CoOp 2 tokens	1	69.9	94.6	91.6	65.5	72.0	86.1	25.0	66.8	48.2	49.6	69.4	66.89	63.2	48.5	48.8	76.3	59.20
CoOp 3 tokens	2	70.2	94.5	91.0	66.0	71.6	86.3	24.6	66.8	47.6	49.0	68.9	66.63	63.4	48.5	49.5	76.3	59.45
ProGrad 1 token	0.512	69.4	94.2	91.0	65.6	72.7	86.4	25.1	66.2	46.0	48.2	68.5	66.39	62.8	48.1	48.5	75.7	58.77
ProGrad 2 tokens	1	69.5	94.1	90.8	65.7	72.6	86.3	24.8	66.5	45.5	47.7	68.7	66.28	62.8	48.0	48.5	75.7	58.75
ProGrad 3 tokens	2	69.8	94.4	91.5	65.8	72.4	86.4	25.3	66.6	47.2	46.3	69.0	66.48	63.2	48.2	48.6	75.9	58.96
KgCoOp 1 token	0.512	68.6	93.4	89.4	63.4	70.9	85.9	23.8	65.6	44.9	52.5	68.1	65.80	62.0	47.8	49.1	75.7	58.63
KgCoOp 2 tokens	1	69.0	93.3	89.3	62.8	70.2	85.8	23.8	66.0	45.4	53.0	69.0	65.85	62.4	48.0	49.1	75.9	58.85
KgCoOp 3 tokens	2	69.2	94.3	89.9	63.9	71.0	85.7	23.7	66.2	44.4	54.4	68.3	66.16	62.3	48.0	48.8	75.5	58.64
ProDA ensemble size 4	20	70.5	94.3	90.4	65.3	71.2	86.1	24.9	67.2	46.4	50.4	69.4	66.54	63.6	48.6	49.4	76.0	59.43
ProDA ensemble size 8	41	70.1	93.8	90.3	65.1	71.0	85.8	24.9	67.4	45.5	49.4	68.4	66.15	63.3	48.8	49.5	76.6	59.55
ProDA ensemble size 16	82	69.9	94.3	90.5	64.5	70.8	85.6	24.3	66.6	45.2	48.4	68.8	65.90	63.1	48.4	48.9	76.1	59.13
ProDA ensemble size 32	164	70.0	94.2	90.2	64.7	70.8	85.7	23.1	67.0	45.8	51.4	69.4	66.23	63.0	48.1	48.4	75.7	58.83
ProDA ensemble size 64	328	69.4	94.4	90.0	64.5	69.5	85.1	22.7	66.4	44.9	49.6	67.8	65.49	62.7	48.0	48.7	76.2	58.91
CLIP-adapter reduction=128	4	67.1	93.3	89.0	65.3	70.9	85.7	25.1	63.3	43.5	46.6	67.4	65.00	60.9	46.6	47.2	74.1	57.18
CLIP-adapter reduction=64	8	67.1	93.3	88.8	65.4	71.1	85.7	24.9	63.3	43.5	46.5	67.2	64.97	60.9	46.5	47.2	74.0	57.17
CLIP-adapter reduction=32	16	67.4	93.2	88.4	65.2	70.1	85.6	24.9	64.1	44.0	46.3	66.8	64.84	60.9	46.9	47.9	74.5	57.55
CLIP-adapter reduction=16	33	67.6	93.3	88.3	64.9	70.1	85.6	24.5	64.4	43.9	46.7	66.8	64.86	61.2	47.2	48.4	75.1	57.98
CLIP-adapter reduction=8	66	67.9	93.4	88.7	65.4	70.2	85.7	24.8	65.1	44.3	46.6	66.7	65.09	61.5	47.5	48.5	75.3	58.21
CLIP-adapter reduction=4	131	67.8	93.4	89.0	65.2	70.2	85.7	24.5	65.2	44.2	46.0	66.8	65.02	61.5	47.5	48.3	75.1	58.12
SSF last layer	12	68.1	94.0	89.5	65.4	71.0	85.7	24.7	65.6	45.3	51.6	68.5	66.13	61.6	47.8	46.4	75.7	57.87
SSF last 2 layers	25	68.5	94.1	89.9	65.1	71.2	85.8	24.8	66.3	45.9	49.1	68.2	66.04	62.1	48.3	47.2	76.3	58.46
SSF last 3 layers	37	68.5	94.2	89.5	64.9	71.2	85.3	24.4	66.2	45.8	49.3	67.8	65.86	62.1	48.1	47.2	76.3	58.44
LoRA rank=1	18	67.3	93.5	89.3	65.4	71.3	85.7	25.1	64.2	44.4	47.9	67.6	65.43	61.4	47.1	46.9	74.9	57.59
LoRA rank=2	37	67.6	93.7	90.0	65.7	71.2	85.7	25.3	65.6	45.9	49.6	67.8	66.05	61.9	47.7	45.3	75.6	57.62
LoRA rank=4	74	67.6	93.8	90.1	65.7	71.5	85.7	25.2	65.4	46.0	50.9	67.7	66.19	61.8	47.7	46.2	76.0	57.93
LoRA rank=8	147	68.0	93.9	90.0	65.7	71.4	85.4	25.5	65.9	46.3	52.6	67.2	66.39	61.9	47.1	42.2	74.4	56.40
ResBlock-adapter reduction=128	55	68.0	93.8	89.2	64.0	71.1	84.7	23.3	65.1	45.3	46.0	67.6	65.01	61.2	47.4	47.2	75.5	57.81
ResBlock-adapter reduction=64	111	68.8	94.0	89.7	64.2	70.8	85.0	23.5	65.8	45.5	46.9	68.0	65.35	61.8	48.0	48.0	76.3	58.52
ResBlock-adapter reduction=32	221	69.1	94.2	90.0	64.4	71.4	85.3	23.2	66.1	45.2	46.8	67.4	65.41	62.5	48.1	48.3	76.8	58.94
ResBlock-adapter reduction=16	442	69.3	94.2	89.9	64.2	71.3	85.3	23.8	66.4	45.6	47.5	67.9	65.60	62.8	48.4	48.4	76.9	59.12
ResBlock-adapter reduction=8	885	69.5	94.1	89.5	64.6	71.3	85.6	23.6	66.6	44.8	45.3	67.9	65.33	63.0	48.6	48.8	77.0	59.36
ResBlock-adapter reduction=4	1769	69.7	94.1	89.5	64.8	71.2	85.5	24.0	66.8	44.9	46.8	67.8	65.55	63.1	48.7	49.0	77.1	59.48
Word Soup $m=1$	0.012	68.6	93.9	89.2	64.6	71.8	86.0	24.7	65.9	44.2	48.0	67.7	65.61	62.1	47.9	49.7	76.3	59.01
Word Soup $m=2$	0.024	69.0	94.1	90.3	65.6	72.5	86.0	25.5	66.9	45.0	52.0	68.6	66.64	62.4	48.8	50.2	76.6	59.50
Word Soup $m=4$	0.048	69.3	94.1	89.9	65.9	72.4	86.5	25.7	67.1	45.8	53.6	68.7	66.96	62.9	48.9	50.3	77.2	59.80
Word Soup $m=8$	0.096	69.4	94.1	89.9	65.7	72.5	86.4	25.9	67.0	44.9	54.6	68.8	66.99	63.1	49.0	50.5	77.3	59.95
Word Soup $m=16$	0.192	69.5	94.0	89.9	65.9	72.5	86.3	26.1	67.4	45.2	54.8	68.8	67.08	63.2	49.0	50.7	77.2	60.02
Word Soup $m=32$	0.384	69.6	94.2	89.9	65.9	72.4	86.5	26.2	67.4	45.1	54.7	69.0	67.12	63.2	49.0	50.6	77.3	60.04
Word Soup $m=64$	0.767	69.5	94.1	90.0	65.9	72.5	86.4	26.2	67.4	45.2	55.1	69.0	67.17	63.3	49.1	50.7	77.4	60.11
Word Soup + CoOp $m=4$	2	70.2	94.5	91.0	65.6	72.3	86.0	25.1	67.7	45.7	56.1	68.6	67.26	63.7	49.3	50.1	77.9	60.26
Word Soup + CoOp $m=8$	2	70.2	94.4	91.0	65.3	72.1	86.1	25.2	67.7	45.5	55.5	68.7	67.15	63.5	49.3	50.2	78.0	60.25
Word Soup + CoOp $m=16$	2	70.2	94.5	91.0	65.7	72.6	86.1	24.9	67.8	45.6	55.5	69.2	67.30	63.7	49.5	50.5	77.9	60.39

Table 11: Detailed numerical results for the PEFT comparison plotted in Figure 2 of the main paper. Average of 3 trials. See Section 7 (Results) for a discussion.

		Source	Cross-dataset Evaluation Targets											Domain Generalization Targets
	$m$	INet	Caltech	Pets	Cars	Flowers	Food	Aircraft	SUN	DTD	EuroSAT	UCF	Mean	INet-V2	Sketch	INet-A	INet-R	Mean
Open-AI CLIP ViT-B/32
ZS	1	61.9	91.5	87.4	60.3	66.4	80.2	19.1	62.2	42.3	40.3	63.5	61.32	54.6	40.7	29.1	66.3	47.68
GPT score mean	5.8	63.0	91.8	88.1	60.0	66.6	80.2	19.1	64.4	43.1	36.2	62.7	61.22	55.4	41.0	29.4	65.9	47.95
Waffle CLIP	16	63.3	91.8	88.0	60.9	67.4	80.4	19.6	63.8	41.7	44.8	63.0	62.13	55.8	41.6	31.1	67.8	49.07
Desc. soup + offsets	100	64.1	91.5	87.7	60.7	66.9	80.4	19.9	64.4	43.6	48.3	64.5	62.79	56.5	42.6	31.8	69.3	50.05
Word soup	8	64.5	91.5	88.0	60.4	67.0	80.9	19.3	64.6	42.0	45.5	63.2	62.24	56.9	42.5	32.0	68.7	50.00
Open CLIP ViT-L/14
ZS	1	73.3	96.4	92.9	92.0	75.8	85.7	34.1	72.7	57.3	52.1	72.1	73.11	65.6	61.0	47.2	85.7	64.88
GPT score mean	5.8	73.6	96.7	92.8	91.2	76.5	85.3	33.7	72.7	58.6	51.6	71.7	73.08	66.1	61.2	47.5	85.1	64.96
Waffle CLIP	16	72.7	96.1	92.4	91.7	76.4	85.8	34.4	72.4	58.6	52.2	72.5	73.25	65.3	60.7	46.5	85.4	64.47
Desc. soup + offsets	100	74.0	96.6	92.8	92.0	76.3	85.5	34.5	72.7	59.1	50.0	72.3	73.19	66.0	61.9	48.7	86.6	65.81
Word soup	8	74.3	96.5	92.1	92.2	76.0	86.0	35.0	73.6	58.5	52.9	73.0	73.56	66.8	61.6	48.2	86.3	65.73
Open CLIP CoCa-L/14
ZS	1	75.1	97.6	93.8	92.7	77.3	87.5	36.6	73.6	57.2	58.5	73.4	74.82	67.5	63.5	53.8	87.0	67.94
GPT score mean	5.8	74.9	97.6	93.7	92.4	76.2	87.3	36.3	73.9	58.9	64.9	73.6	75.48	67.6	63.5	52.8	86.8	67.67
Waffle CLIP	16	75.0	97.5	93.9	92.7	77.3	87.5	37.4	73.1	57.5	63.0	73.9	75.37	67.5	63.8	52.8	87.3	67.85
Desc. soup + offsets	100	75.5	97.5	93.9	92.6	77.5	87.3	37.2	73.8	61.1	63.6	75.0	75.95	68.0	64.2	53.2	87.9	68.32
Word soup	8	75.9	97.5	93.8	92.8	77.8	87.7	38.4	74.1	60.5	63.5	74.7	76.08	68.8	64.0	54.3	87.9	68.73
Open CLIP ViT-g/14
ZS	1	77.7	97.7	93.6	93.5	81.6	90.0	44.1	74.3	65.3	55.8	80.0	77.58	70.4	66.4	59.7	89.0	71.37
GPT score mean	5.8	77.6	97.2	93.7	93.6	81.4	89.6	43.1	74.7	63.1	58.7	76.3	77.14	71.0	66.3	58.8	88.9	71.26
Waffle CLIP	16	77.3	97.8	93.5	93.7	81.3	89.8	44.1	74.1	65.8	58.0	78.9	77.72	70.1	65.9	59.0	88.9	70.99
Desc. soup + offsets	100	78.0	97.8	94.1	93.9	80.7	89.2	43.1	75.0	67.0	60.4	79.2	78.04	71.5	67.2	60.2	90.0	72.21
Word soup	8	78.4	97.6	93.7	93.9	81.4	89.8	44.0	75.0	66.0	60.0	79.5	78.09	71.6	67.1	60.0	89.6	72.05

Table 12: Detailed numerical results for different model scales. This is an expanded version of Table 5. Average of 3 trials.
