
1 Conservatoire national des arts et métiers, CEDRIC, F-75141 Paris, France
2 Univ. Gustave Eiffel, ENSG, IGN, LASTIG, F-94160 Saint-Mandé, France
3 Sorbonne Université, CNRS, ISIR, F-75005 Paris, France
Email: {marc.lafon, elias.ramzi}@cnam.fr

GalLoP: Learning Global and Local Prompts for Vision-Language Models

Marc Lafon (ORCID: 0009-0002-0688-177X) 1, Elias Ramzi (ORCID: 0000-0002-0131-2458) 1, Clément Rambour (ORCID: 0000-0002-9899-3201) 1, Nicolas Audebert (ORCID: 0000-0001-6486-3102) 1,2, Nicolas Thome (ORCID: 0000-0003-4871-3045) 3
Abstract

Prompt learning has been widely adopted to efficiently adapt vision-language models (VLMs), e.g. CLIP, for few-shot image classification. Despite their success, most prompt learning methods trade off between classification accuracy and robustness, e.g. in domain generalization or out-of-distribution (OOD) detection. In this work, we introduce Global-Local Prompts (GalLoP), a new prompt learning method that learns multiple diverse prompts leveraging both global and local visual features. The training of the local prompts relies on local features with an enhanced vision-text alignment. To focus only on pertinent features, this local alignment is coupled with a sparsity strategy in the selection of the local features. We enforce diversity on the set of prompts using a new “prompt dropout” technique and a multiscale strategy on the local prompts. GalLoP outperforms previous prompt learning methods in accuracy on eleven datasets, in different few-shot settings and with various backbones. Furthermore, GalLoP shows strong robustness in both domain generalization and OOD detection, even outperforming dedicated OOD detection methods. Code and instructions to reproduce our results will be open-sourced.

Keywords:
Vision-language models · Few-shot classification · Prompt learning · Local and global prompts · Robustness · OOD detection
Equal contribution.

1 Introduction

Vision-Language Models (VLMs), e.g. CLIP [35] or ALIGN [20], have shown impressive performances for zero-shot image classification. Prompt learning [52, 51, 3, 21, 22, 33, 26] has been among the leading approaches to efficiently adapt VLMs to a specific downstream dataset. These methods train a learnable context in the form of soft prompts to optimize the text/image alignment. Prompt learning methods benefit from the strong generalization capability of VLMs’ textual encoder and are effective even when only a few labeled examples are available.

Despite their success, we observe that these methods trade off between classification accuracy and robustness. This is illustrated in Fig. 1(a), where methods exhibiting the best accuracy sacrifice out-of-distribution (OOD) detection performance, e.g. PromptSRC [22], while those excelling in OOD detection often have poor accuracy, e.g. LoCoOp [29]. A similar observation can be made in domain generalization, see Fig. 1(b): PromptSRC [22] presents two different versions, one optimized for accuracy and the other for domain generalization, highlighting the intrinsic conflict between both criteria.

[Figure 1 panels: (a) OOD detection, (b) Domain generalization, (c) Global-local]
Figure 1: Our GalLoP method demonstrates excellent performances in accuracy plus robustness, i.e. out-of-distribution detection (a) and domain generalization (b), while state-of-the-art prompt learning methods compromise between these aspects. Additionally, unlike recent methods utilizing ineffective local zero-shot CLIP features, GalLoP learns discriminative local prompts precisely aligned with sparse image regions at various scales, facilitating the discriminability between classes. GalLoP integrates both global and local prompts, with their diversity explicitly enforced during few-shot learning, which significantly enhances the performance of their combination (c).

To boost classification accuracy, prompt learning can involve learning multiple prompts [1] to emulate “prompt ensembling”, e.g. prompts specialized for specific classes [33, 45] or Transformer layers [21, 22], or casting multiple-prompt learning within a probabilistic framework [26]. The key challenge in prompt ensembling lies in learning diverse prompts to optimize the combination. However, since these approaches only operate on global visual representations, they cannot utilize diverse prompts aligned with specific image regions to maximize their diversity.

Recently, attempts have been made to use local image representations in prompt learning, e.g. LoCoOp [29] or PLOT [3]. Although these approaches are promising, their accuracy and robustness are suboptimal compared to state-of-the-art results, see Fig. 1(a),(b). Their limited performance stems from two main factors: i) they use “dense” (i.e. all) local features from CLIP, which include irrelevant or noisy regions for a given concept, and ii) these local features are not as well aligned with the text due to CLIP’s pre-training with the global representation. As a consequence, the performance of prompts trained with those local features is much lower than that of their global counterparts, and this degradation hurts performance when they are combined with global prompts, as illustrated in Fig. 1(c).

In this paper, we introduce Global-Local Prompts (GalLoP), a new method to learn a diverse set of prompts by leveraging both global and local visual representations. GalLoP learns sparse discriminative local features, i.e. text prompts are aligned to a sparse subset of regions at multiple scales. This enables fine-grained and accurate text-to-image matching, making GalLoP local prompts highly competitive. Moreover, we train GalLoP with diverse global and local prompts, unlocking the complementarity between both sets and significantly improving their combination, as shown in Fig. 1(c).

To achieve this, GalLoP relies on two main methodological contributions:

  • Effective local prompts learning. In GalLoP, we propose to align local prompts with sparse subsets of $k$ image regions, enabling text-to-image matching that captures fine-grained semantics. To adapt visual representations to the downstream dataset, we refine the textual alignment of visual local features by employing a simple linear projection amenable to few-shot learning.

  • Enforcing ensemble diversity. We learn both global prompts aligned with the whole image and local spatially-localized prompts, and enforce diversity between them to improve their combination. We induce diversity through randomization using a new “prompt dropout” strategy, which enhances generalization when learning multiple prompts. Additionally, we employ a multiscale strategy to align local prompts with image regions of varying sizes, capturing different visual aspects of a concept’s semantics.

We conduct an extensive experimental validation of GalLoP on 11 few-shot image classification datasets and 8 datasets evaluating robustness. We show that GalLoP outperforms state-of-the-art prompt learning methods on classification accuracy, OOD detection, and domain generalization, therefore improving the observed tradeoff in these 3 criteria. We validate that our two main contributions, i.e. learning strong local prompts and diverse representations, are essential for reaching excellent performances.

2 Related work

2.0.1 Prompt learning.

Prompt learning has emerged as an efficient way to adapt VLMs to downstream datasets. These methods, e.g. CoOp [52], learn soft prompts to adapt CLIP textual features to specific labels without the need for a cumbersome step of “prompt engineering” as performed in [35]. Following these seminal works, many variants have been proposed. [51] uses a meta-network to bias the learnable prompt using the global visual representation of the input image. To boost prompt learning performances, recent works have focused on learning multiple prompts [3, 26, 21, 22]. MaPLe [21] introduces prompts in several layers of both textual and visual encoders. PromptSRC [22] builds upon this work by introducing several regularization losses, boosting both accuracy and robustness performances. We note that PromptSRC uses a set of hand-crafted prompts to regularize the learning of the textual prompts, which is not fully aligned with the initial motivation behind prompt learning. Furthermore, both MaPLe and PromptSRC are limited to the use of vision transformer architectures. ProDA [26] models the distribution over the textual representation of classes using a multivariate Gaussian distribution, and indirectly learns the distribution over prompts using a surrogate loss. PromptStyler [4] learns several prompts that represent different “styles” to perform source-free domain generalization. These two approaches achieve prompt diversity by enforcing orthogonality among the prompts. In GalLoP, we induce diversity with a “prompt dropout” technique, which randomly drops subsets of prompts during training, thus avoiding the introduction of an additional loss, while limiting prompt over-fitting observed in [22] with a method inspired by a standard deep learning approach, i.e. Dropout [40]. To further improve diversity, we specialize the local prompts on different image scales, thus aligning them with different sets of attributes for each class.

2.0.2 Prompt learning using visual local features.

There has been a growing interest in leveraging CLIP’s local features in prompt learning methods [3, 41, 29]. PLOT [3] learns a set of prompts by using the optimal transport (OT) [44] distance between them and the set of local features, which is prohibitive to compute. Furthermore, the OT distance enforces the prompts to use information from all local visual features during training, including possibly detrimental ones. Also, PLOT adds the global visual features to the local features to achieve strong results on the ImageNet dataset. In GalLoP, we use a sparse mechanism to learn localized prompts. This removes the negative influence of background features while being computationally efficient. Finally, GalLoP learns prompts from the local features without any access to CLIP’s original global visual feature. LoCoOp [29] introduced an entropy loss leveraging “irrelevant” local visual features in an outlier exposure fashion [18] to improve out-of-distribution detection but at the expense of accuracy. [41] introduces a method specifically designed for multi-label classification, which learns prompts using local visual features. While these methods obtained promising results, we show in this work that their performance is intrinsically limited by the lower discriminative power of CLIP’s zero-shot local visual features.

2.0.3 Prompt learning for OOD detection.

As VLMs are becoming increasingly prevalent in few-shot classification applications, their zero-shot OOD detection capabilities have received increasing attention. In the seminal work [28], the authors proposed the maximum concept matching (MCM) score to detect OOD examples. Recently, [30] improved upon the MCM score by combining zero-shot global and local visual information to construct the GL-MCM score, which achieves strong zero-shot OOD detection results. In addition to the previously mentioned LoCoOp [29], other works tackle the few-shot OOD detection problem using prompt learning. To avoid degraded accuracy, [31] introduces the concept of negative prompts to perform OOD detection by “Learning to Say No” (LSN). OOD samples are then detected by computing the difference in MCM scores between positive and negative prompts. GalLoP already achieves strong OOD detection performance without introducing an extra loss or additional negative prompts. Indeed, GalLoP has strong classification accuracy, compared to zero-shot CLIP used in [28] or LoCoOp [29], which helps OOD detection. Similarly, our discriminative local features further increase the gain of using local features for OOD detection observed in GL-MCM [30].

[Figure 2 diagram labels: input image $\bm{x}$, local features $\mathcal{Z}_l$, global feature $\bm{z}_g$, class text feature $\bm{t}_c$, sparse similarity $\text{sim}_{\text{top-}k}(\mathcal{Z}_l, \bm{t}_c)$ (Eq. 2), global prompts $\mathcal{P}_g$ (Eq. 7), local prompts $\mathcal{P}_l$ (Eq. 8)]

Figure 2: Illustration of GalLoP. GalLoP learns a diverse set of global prompts and local prompts. Pertinent local prompts are learned using only the most relevant regions of the image for each class. We further improve the limited text-vision alignment of CLIP’s local features using a simple linear layer. The diversity is encouraged using a new “prompt dropout” technique for global prompts, and a multiscale loss for local prompts.

3 Combining global and local prompts with GalLoP

In this section, we describe our proposed method, GalLoP, which seeks to learn an ensemble of diverse prompts from both CLIP’s global and local visual representations. As illustrated in Fig. 2, GalLoP learns two specialized sets of prompts: the “global prompts” receiving a signal from the global visual representation, and the “local prompts” trained using local features only.

Formally, let us consider a set of $n$ learnable local prompts $\mathcal{P}_l = (\bm{p}^l_1, \cdots, \bm{p}^l_n)$ and a set of $m$ learnable global prompts $\mathcal{P}_g = (\bm{p}^g_1, \cdots, \bm{p}^g_m)$. Each prompt $\bm{p}$ is composed of $V$ learnable embeddings, i.e. $\bm{p} := [p^1, \dots, p^V] \in \mathbb{R}^{V \times d'}$, and is prepended to the class name embeddings $\bm{c}$ to perform classification. Let $\mathcal{D} = \{(\bm{x}, y)\}$ denote the downstream dataset, where $\bm{x}$ is an image and $y$ its class, and let $\mathcal{T}$ and $\mathcal{V}$ denote CLIP’s text and vision encoders, respectively.
The textual encoder produces a normalized textual representation $\bm{t}_c = \mathcal{T}([\bm{p}, \bm{c}]) \in \mathbb{R}^d$ of the $c^{th}$ class. Given the input image $\bm{x}$, the visual encoder produces a visual representation $\bm{z}$. For learning global prompts, $\bm{z}$ is a global vector, i.e. the global visual feature on which CLIP has been pre-trained. For local prompts, $\bm{z}$ is a set of localized features output by the encoder. From its visual representation $\bm{z}$, the probability for the image $\bm{x}$ to be classified into the class $y_c$ can be expressed as:

$$p(y = y_c \mid \bm{x};~\bm{p}) = \frac{\exp(\text{sim}(\bm{z},~\bm{t}_c)~/~\tau)}{\sum_{c'} \exp(\text{sim}(\bm{z},~\bm{t}_{c'})~/~\tau)}, \tag{1}$$

where $\text{sim}(\cdot,\cdot)$ is a measure of similarity, and $\tau$ is a fixed temperature scaling parameter. With this general definition of the probability in Eq. 1, we can train a prompt $\bm{p}$ using the standard cross-entropy loss $\mathcal{L}_{\text{CE}}(p(y = y_c \mid \bm{x};~\bm{p}))$.

To train a global prompt $\bm{p}^g_i \in \mathcal{P}_g$, we use the global visual representation of the image $\bm{x}$, i.e. $\bm{z} = \bm{z}_g \in \mathbb{R}^d$. The similarity between the global vector $\bm{z}_g$ and the prompt simply reduces to cosine similarity, i.e. $\text{sim}(\bm{z}_g,~\bm{t}_c) = \langle \bm{z}_g,~\bm{t}_c \rangle$, and the global prompt $\bm{p}^g_i$ can be trained by minimizing $\mathcal{L}_{\text{CE}}(p(y = y_c \mid \bm{x};~\bm{p}^g_i))$.
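As a minimal sketch of Eq. 1 in the global case, the snippet below computes the class probabilities from cosine similarities and the associated cross-entropy loss. It uses plain Python lists for readability. The helper names (`normalize`, `class_probabilities`, `cross_entropy`), the toy vectors, and the temperature value `tau=0.07` are illustrative assumptions, not values taken from the paper.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Inner product; equals cosine similarity for unit-norm inputs."""
    return sum(x * y for x, y in zip(a, b))

def class_probabilities(similarities, tau):
    """Eq. 1: softmax over per-class similarities divided by the temperature."""
    logits = [s / tau for s in similarities]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Negative log-probability of the ground-truth class."""
    return -math.log(probs[label])

# Toy example (hypothetical values): one global feature z_g, three class text features t_c.
z_g = normalize([1.0, 0.2, 0.0])
text_features = [normalize(t) for t in ([0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0])]
sims = [dot(z_g, t) for t in text_features]
probs = class_probabilities(sims, tau=0.07)  # tau is an assumed value
loss = cross_entropy(probs, label=0)
```

In an actual training loop the text features would come from the text encoder applied to the learnable prompt, and the loss gradient would flow back to the prompt embeddings only.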

In Sec. 3.1, we introduce a relevant similarity measure $\text{sim}(\bm{z},~\bm{t}_c)$ for implementing Eq. 1 on local prompts. We rely on a sparsification strategy that only considers a small subset of class-relevant regions of the image. Furthermore, we use a linear projection to improve the vision-text alignment of local features, thus enhancing the quality of the learned prompts. In Sec. 3.2 we describe how we learn a diverse set of global and local prompts, whose combination can improve prediction performance. We introduce “prompt dropout” to increase the diversity of global prompts by randomly selecting a subset of prompts for each image. Finally, we introduce a multiscale loss by dedicating each local prompt to select different sub-region sizes of the input image.

3.1 Learning prompts from local visual representations

Figure 3: GalLoP sparse local similarity $\text{sim}(\mathcal{Z}_l,~\bm{t}_c)$ between class prompt $\bm{t}_c$ and visual features $\mathcal{Z}_l$ is the average of the top-$k$ highest similarities (here, $k=3$).

In this section, we temporarily consider a single local prompt $\bm{p}^l_j \in \mathcal{P}_l$ without loss of generality. In this case, the visual representation $\bm{z}$ that we consider is the set of visual local features, i.e. $\bm{z} = \mathcal{Z}_l \in \mathbb{R}^{L \times d}$, obtained following [7] (see details in supplementary A.1). Here, we cannot directly compute the probability of Eq. 1 as we need to define the similarity between the set of vectors $\mathcal{Z}_l = (\bm{z}^l_1, \cdots, \bm{z}^l_L)$ and the textual representation of the $c^{th}$ class, $\bm{t}_c = \mathcal{T}([\bm{p}^l_j, \bm{c}])$.

Sparse local similarity. A naive way to obtain a single similarity for all regions is to average the similarities of each spatial location with the textual representation of the class. However, a substantial portion of the local features are irrelevant to the class, e.g. features from background areas, which may introduce noise and perturb the learning process. To solve this problem, we adopt a sparse approach, where only local features semantically related to the class are kept to perform classification. As illustrated in Fig. 3, we select the top-$k$ local features with the highest similarities with the prompted class textual representation, and average their similarities to measure $\text{sim}(\mathcal{Z}_l,~\bm{t}_c)$.

Formally, we define the similarity between a prompt $\bm{t}_c$ and the set of visual features $\mathcal{Z}_l$ as the average similarity for the $k$ most similar regions:

$$\text{sim}_{\text{top-}k}(\mathcal{Z}_l,~\bm{t}_c) \coloneqq \frac{1}{k} \sum_{i=1}^{L} \mathbb{1}_{\text{top-}k}(i) \cdot \langle \bm{z}^l_i,~\bm{t}_c \rangle \tag{2}$$
$$\text{where} \quad \mathbb{1}_{\text{top-}k}(i) = \left\{ \begin{array}{ll} 1 & \text{if} \quad \text{rank}_i(\langle \bm{z}^l_i,~\bm{t}_c \rangle) \leq k, \\ 0 & \text{otherwise,} \end{array} \right. \tag{5}$$

which we plug into Eq. 1 to compute the probability for class $c$. We show in Sec. 4.3 that relying on sparsity is mandatory for local prompt learning, boosting performance by almost 20pt in top-1 accuracy.
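The sparse similarity of Eq. 2 can be sketched in a few lines: compute the dot product of every local feature with the class text feature, keep the $k$ largest, and average them. The function name `topk_similarity` and the toy features below are hypothetical and only meant to contrast the sparse average with the dense one.

```python
def topk_similarity(local_features, text_feature, k):
    """Eq. 2: mean of the k largest dot products between local features and t_c."""
    if not 1 <= k <= len(local_features):
        raise ValueError("k must be between 1 and the number of local features")
    sims = [sum(a * b for a, b in zip(z, text_feature)) for z in local_features]
    top = sorted(sims, reverse=True)[:k]  # regions ranked in the top-k
    return sum(top) / k

# Toy example (hypothetical values): L = 5 regions with d = 2.
Z_l = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0], [0.6, 0.8]]
t_c = [1.0, 0.0]
dense = topk_similarity(Z_l, t_c, k=len(Z_l))  # naive average over all regions
sparse = topk_similarity(Z_l, t_c, k=3)        # average over the 3 best regions
```

With these toy values, the dense average is pulled down by the irrelevant and opposite regions, while the sparse average only reflects the regions that match the class.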

Improving local text-vision alignment. While previous works [50, 41, 29] have exploited the text-vision alignment of CLIP’s local features, we empirically verify in Sec. 4.3 that using these features leads to poor zero-shot classification results on ImageNet. This is expected, as CLIP is pre-trained to align the global visual features with its textual representation. Local features are thus suboptimal to learn effective prompts for image classification. Motivated by this observation, we propose to improve the discriminative power of CLIP’s local visual features by realigning them with the textual representations of the class labels of the downstream dataset. To do so, we propose to use a simple linear projection $h_{\bm{\theta}}$. To ease the learning process, we initialize the linear layer $h_{\bm{\theta}}$ to identity, so that the initial features are close to CLIP’s representations. Henceforth, we use the set of linearly transformed local visual features $h_{\bm{\theta}}(\mathcal{Z}_l)$ to compute the probability of Eq. 1, which becomes:

$$p(y = y_c \mid \bm{x};~\bm{p}^l_j, k, \bm{\theta}) = \frac{\exp(\text{sim}_{\text{top-}k}(h_{\bm{\theta}}(\mathcal{Z}_l),~\bm{t}_c)~/~\tau)}{\sum_{c'} \exp(\text{sim}_{\text{top-}k}(h_{\bm{\theta}}(\mathcal{Z}_l),~\bm{t}_{c'})~/~\tau)}. \tag{6}$$

Thus, a local prompt can be optimized by maximizing this probability with the cross-entropy loss. These design choices in GalLoP allow us to train a powerful classifier for local features: the sparsity helps to focus on the most relevant regions of an image and to remove potential background noise, while the linear projection enhances the text-vision alignment and boosts the fine-grained discriminating power of the local features. We study these design choices in Sec. 4.3.
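To make the two design choices concrete, the sketch below chains an identity-initialized linear projection with the top-$k$ similarity and the softmax of Eq. 6. It is an illustration under stated assumptions: the projection is taken to be a bias-free matrix, no re-normalization is applied after projection, and `tau=0.07`, `k=2` and the toy features are arbitrary choices not taken from the paper.

```python
import math

def identity_matrix(d):
    """Identity initialization for the d x d projection h_theta."""
    return [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]

def project(W, z):
    """Apply the linear projection h_theta(z) = W z (bias-free, an assumption)."""
    return [sum(w * x for w, x in zip(row, z)) for row in W]

def topk_similarity(features, text_feature, k):
    """Eq. 2: mean of the k largest dot products."""
    sims = [sum(a * b for a, b in zip(z, text_feature)) for z in features]
    return sum(sorted(sims, reverse=True)[:k]) / k

def local_probabilities(Z_l, text_features, W, k, tau):
    """Eq. 6: softmax over classes of the top-k similarity of projected features."""
    projected = [project(W, z) for z in Z_l]
    logits = [topk_similarity(projected, t, k) / tau for t in text_features]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example (hypothetical values): 4 regions, 2 classes, d = 2.
Z_l = [[1.0, 0.0], [0.8, 0.6], [0.6, 0.8], [-1.0, 0.0]]
text_features = [[1.0, 0.0], [0.0, 1.0]]
W = identity_matrix(2)  # at initialization, projected features equal CLIP's
probs = local_probabilities(Z_l, text_features, W, k=2, tau=0.07)
```

Because `W` starts as the identity, the first training step sees exactly CLIP's local features, and the projection only departs from them as the cross-entropy loss updates it.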

3.2 Learning multiple diverse prompts

In this section, we describe how we induce diversity among the learned prompts. Besides exploiting different sources of information (the global and local ones), we introduce two mechanisms to increase diversity: “prompt dropout” and multiscale training.

Prompt dropout. Motivated by the success of the “dropout” [40, 10] technique classically used in deep learning, we introduce “prompt dropout” into the prompt learning framework. In “prompt dropout”, we randomly mask a subset of prompts for each image of the batch. Alternatively, from the perspective of each prompt, we select a different subset of the batch of images, thus inducing diversity in the learning process of the prompts through input randomization.

The training of our set of global prompts is performed with the following loss:

$$\mathcal{L}_{\text{global}}(\mathcal{P}_g) = \sum_{i=1}^{m} \mathcal{L}_{\text{CE}}(\bm{p}^g_i). \tag{7}$$

[Figure 4 panels: (a) Prompt dropout, (b) Multiscale loss]
Figure 4: (a) Prompt dropout induces diversity by randomly selecting different subsets of prompts for each image of the batch. In (a), each image will be used by half the prompts. (b) To learn diverse local prompts, we specialize each one of them using a different number of regions, and therefore a different level of sparsity.
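One possible reading of prompt dropout is sketched below: for each image, a random subset of prompts is kept, and each prompt's cross-entropy is averaged over the images that kept it before summing over prompts as in Eq. 7. The number of kept prompts per image (`num_kept`), the per-prompt averaging, and the function names are assumptions made for illustration; the text above does not specify these details.

```python
import random

def prompt_dropout_mask(batch_size, num_prompts, num_kept, rng):
    """mask[b][i] is True when prompt i is trained on image b."""
    mask = []
    for _ in range(batch_size):
        kept = set(rng.sample(range(num_prompts), num_kept))
        mask.append([i in kept for i in range(num_prompts)])
    return mask

def masked_global_loss(per_sample_losses, mask):
    """Sum over prompts of the mean CE over the images kept for that prompt.

    per_sample_losses[b][i] is the cross-entropy of prompt i on image b.
    Prompts that received no image in this batch contribute nothing.
    """
    total = 0.0
    num_prompts = len(mask[0])
    for i in range(num_prompts):
        kept = [per_sample_losses[b][i] for b in range(len(mask)) if mask[b][i]]
        if kept:
            total += sum(kept) / len(kept)
    return total

# Toy example: 8 images, 4 prompts, each image used by half of the prompts.
rng = random.Random(0)
mask = prompt_dropout_mask(batch_size=8, num_prompts=4, num_kept=2, rng=rng)
losses = [[1.0] * 4 for _ in range(8)]  # dummy per-sample cross-entropies
loss = masked_global_loss(losses, mask)
```

Seen column-wise, the same mask gives each prompt its own random subset of the batch, which is the input randomization described above.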

Multiscale training. To specifically improve the diversity of the local prompts, we specialize each local prompt to select a different number of class-specific visual patches (scales). In this way, prompts dedicated to small scales will get more signals from classes corresponding to small visual concepts, e.g. “daisy flower” or “tailed frog”, while prompts learned with larger scales will receive more signals from images with wider concepts, e.g. “castle” or “valley”. More formally, let $(k_1,~k_1 + \Delta_k, \cdots,~k_1 + (n-1) \cdot \Delta_k)$ denote a set of increasing scales with $k_1$ the first scale and $\Delta_k$ the expansion factor. Each local prompt $\bm{p}^l_j$ will be learned with its associated scale $k_j = k_1 + (j-1) \cdot \Delta_k$.

The training of our $n$ local prompts is then performed by optimizing the probability defined in Eq. 6 for each prompt with a different scale, i.e. value of $k$:

$$\mathcal{L}_{\text{x-scale}}(\mathcal{P}_l, \bm{\theta}) = \sum_{j=1}^{n} \mathcal{L}_{\text{CE}}(\bm{p}^l_j,~\bm{\theta},~k_j). \tag{8}$$

The overall loss to train our set of prompts $\mathcal{P} = \mathcal{P}_l \cup \mathcal{P}_g$ is the sum of the local multiscale and global losses:

$$\mathcal{L}_{\text{total}}(\mathcal{P}, \bm{\theta}) = \mathcal{L}_{\text{global}}(\mathcal{P}_g) + \mathcal{L}_{\text{x-scale}}(\mathcal{P}_l, \bm{\theta}) \tag{9}$$
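As an illustration of Eq. 8, the following minimal Python sketch computes the scales $k_j$ and sums one cross-entropy term per local prompt. It is not the authors' implementation: the function names are hypothetical, the sparse local score is assumed to be the mean of the top-$k$ patch similarities (Eq. 6 is not reproduced here), and the temperature is omitted.

```python
import math


def scales(k1, delta_k, n):
    """Return the n increasing scales k_j = k1 + (j - 1) * delta_k, j = 1..n."""
    return [k1 + j * delta_k for j in range(n)]


def topk_mean(values, k):
    """Mean of the k largest values (assumed form of the sparse local score)."""
    top = sorted(values, reverse=True)[:k]
    return sum(top) / len(top)


def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for a single sample."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def multiscale_loss(sims_per_prompt, target, ks):
    """Sum over local prompts of the cross-entropy at each prompt's own scale.

    sims_per_prompt[j][c] holds the patch-to-text similarities between the
    local features and class c, computed with local prompt j.
    """
    total = 0.0
    for class_sims, k in zip(sims_per_prompt, ks):
        logits = [topk_mean(patch_sims, k) for patch_sims in class_sims]
        total += cross_entropy(logits, target)
    return total


# Scales for k_1 = 10, Delta_k = 10 and n = 4 local prompts.
print(scales(10, 10, 4))  # [10, 20, 30, 40]
```

Because each prompt only sees its own top-$k_j$ patches, prompts with small $k_j$ are driven by the few most class-specific regions, while prompts with large $k_j$ aggregate wider context.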

4 Experimental results

In this section, we present the experimental validation of GalLoP. We first show in Sec. 4.1 that GalLoP outperforms previous methods on top-1 accuracy on a collection of 11 datasets with ViT-B/16 [8]. We also show that GalLoP performs well in different few-shot settings on ImageNet and with a ResNet-50 [14]. In Sec. 4.2, we compare the robustness performances of GalLoP and other prompt learning methods in domain generalization and OOD detection, and show that GalLoP achieves a better trade-off between robustness and top-1 accuracy than previous methods. In Sec. 4.3, we conduct ablation studies of the different components of GalLoP.

Table 1: Top-1 accuracy with ViT-B/16 backbone. Comparison of GalLoP to other prompt learning methods on several standard benchmarks. results based on our own re-implementation.

| Method | ImageNet [6] | Caltech101 [9] | OxfordPets [34] | Cars [23] | Flowers102 [32] | Food101 [2] | Aircraft [27] | SUN397 [47] | DTD [5] | EuroSAT [15] | UCF101 [39] | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP [35] | 66.7 | 92.2 | 88.4 | 65.5 | 70.7 | 84.8 | 24.8 | 62.3 | 44.1 | 48.3 | 64.7 | 64.8 |
| Linear Probe | 67.3 | 95.4 | 85.3 | 80.4 | 97.4 | 82.9 | 45.4 | 73.3 | 70.0 | 87.2 | 82.1 | 78.8 |
| CoOp [52] | 71.7 | 95.6 | 91.9 | 83.1 | 97.1 | 84.2 | 43.4 | 74.7 | 69.9 | 84.9 | 82.2 | 79.9 |
| Co-CoOp [51] | 71.0 | 95.2 | 93.3 | 71.6 | 87.8 | 87.2 | 31.2 | 72.2 | 63.0 | 73.3 | 78.1 | 74.9 |
| MaPLe [21] | 72.3 | 96.0 | 92.8 | 83.6 | 97.0 | 85.3 | 48.4 | 75.5 | 71.3 | 92.3 | 85.0 | 81.8 |
| PLOT [3] | 72.6 | 96.0 | 93.6 | 84.6 | 97.6 | 87.1 | 46.7 | 76.0 | 71.4 | 92.0 | 85.3 | 82.1 |
| PromptSRC [22] | 73.2 | 96.1 | 93.7 | 85.8 | 97.6 | 86.5 | 50.8 | 77.2 | 72.7 | 92.4 | 86.5 | 82.9 |
| LoCoOp [29] | 71.5 | 94.9 | 92.4 | 79.8 | 96.3 | 84.7 | 40.7 | 74.2 | 69.5 | 86.1 | 81.6 | 79.2 |
| ProDA [26] | 71.9 | 95.5 | 93.5 | 79.8 | 96.8 | 86.8 | 40.2 | 75.7 | 70.9 | 85.1 | 83.3 | 80.0 |
| GalLoP | 75.1 | 96.7 | 94.1 | 89.2 | 98.8 | 86.5 | 58.3 | 77.2 | 75.5 | 90.1 | 86.9 | 84.4 |

Implementation details.

We experiment with both ResNet-50 and ViT-B/16 CLIP models. When not specified, we use ViT-B/16. We train for 50 epochs on ImageNet and 200 epochs on the other datasets with SGD, a learning rate of 0.002 decayed using cosine annealing, and a weight decay of 0.01, following the setting of [52]. Unless specified otherwise, we train the models using 16 shots. Our base parameters for GalLoP are as follows: $m=4$ global prompts with a dropout of 75% (in practice we keep a single prompt for each image), and $n=4$ local prompts with scales $k_1=10$ and $\Delta_k=10$ for ViT-B/16, and $k_1=5$ and $\Delta_k=5$ for ResNet-50 as there are fewer local patches. We keep $\tau$ fixed from CLIP.
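For readers unfamiliar with the schedule, the learning-rate decay mentioned above can be sketched as follows. This is the textbook cosine annealing formula, not code from the paper: the function name, the per-epoch granularity, the final value `min_lr=0.0` and the absence of warm-up are all assumptions.

```python
import math


def cosine_annealed_lr(epoch, total_epochs, base_lr=0.002, min_lr=0.0):
    """Standard cosine annealing from base_lr (epoch 0) to min_lr (last epoch)."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# 50 epochs as for ImageNet; the learning rate halves at mid-training.
for epoch in (0, 25, 50):
    print(epoch, cosine_annealed_lr(epoch, total_epochs=50))
```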

Baselines.

We compare GalLoP to recent prompt learning methods, including the single-prompt learning methods CoOp and Co-CoOp and the multi-prompt learning methods MaPLe, ProDA, PLOT and PromptSRC. We distinguish two versions of PromptSRC: the version designed for accuracy and the version designed for domain generalization. We also include OOD detection specific methods such as LoCoOp and LSN.

4.1 Main in-distribution results.

In Tab. 1, we compare GalLoP with a ViT-B/16 backbone on a suite of 11 datasets, a standard benchmark for prompt learning methods. On average, GalLoP outperforms previous methods by a large margin, with +1.5pt compared to PromptSRC, the next best performing method. Furthermore, GalLoP performs well on most datasets, achieving state-of-the-art results among prompt learning methods. For instance, on the large-scale ImageNet dataset, it outperforms PLOT by +2.5pt and PromptSRC by +1.9pt. On some datasets, e.g. FGVC Aircraft, GalLoP outperforms the next best method by a large margin, with +7.5pt compared to PromptSRC.

Figure 5: Results on ImageNet (a) with different few-shot settings and (b) with ResNet-50.

We then compare GalLoP in Fig. 5(a) to other prompt learning methods in different few-shot settings on ImageNet. GalLoP performs well in all configurations, outperforming the very competitive PromptSRC in every setting. Finally, in Fig. 5(b), we show that GalLoP works well with a ResNet-50, outperforming PLOT and CoOp by +3.1pt. Note that, unlike other methods, e.g. MaPLe and PromptSRC, GalLoP is amenable to both convolutional and transformer vision backbones. Detailed results for ResNet-50 can be found in the supplementary material B.2.

4.2 Robustness results.

In this section, we compare the robustness performances of GalLoP vs. other prompt learning methods, see Fig. 6, on domain generalization and OOD detection. For both benchmarks, models are trained on ImageNet (16 shots).

Figure 6: GalLoP robustness performances. GalLoP achieves strong performances on (a) domain generalization and (b) OOD detection, while outperforming prompt learning methods on top-1 accuracy.

Domain generalization results. We compare in Fig. 6(a) the domain generalization performances of GalLoP and other prompt learning methods. After being trained on ImageNet (16 shots), the models are evaluated on top-1 accuracy on different domains with the same classes as ImageNet, i.e. ImageNet-V2 [36], ImageNet-Sketch [46], ImageNet-A [19] and ImageNet-R [16]. GalLoP outperforms the version of PromptSRC designed for domain generalization by +0.5pt on average, while outperforming it by +4.9pt on ImageNet. This illustrates the trade-off made by PromptSRC between top-1 accuracy and domain generalization. Indeed, GalLoP also outperforms the version of PromptSRC designed for ImageNet accuracy, by +1.9pt on ImageNet and +1.5pt on average in domain generalization. GalLoP thus achieves the best trade-off between top-1 performances and domain generalization. The detailed results can be found in supplementary material B.4.

Results on OOD detection. In OOD detection, the models must distinguish between in-distribution examples (ImageNet test set) and examples from different OOD datasets, namely iNaturalist [43], SUN [47], Places [49] and Textures [5], a standard benchmark in the OOD detection literature. We plot in Fig. 6(b) the average results on the ImageNet OOD benchmark of GalLoP and other prompt learning methods, measured in FPR95 (lower is better, $\downarrow$). GalLoP outperforms traditional prompt learning methods, e.g. -3pt FPR95 vs. CoOp, as well as dedicated OOD detection methods, e.g. -1.4pt FPR95 vs. LoCoOp or -2.9pt FPR95 vs. LSN. Meanwhile, GalLoP also outperforms both LSN and LoCoOp by a large margin in top-1 accuracy, i.e. +3.2pt and +3.6pt respectively. The detailed results can be found in the supplementary material B.5.
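For reference, FPR95 is the false positive rate on OOD samples at the score threshold that retains 95% of the in-distribution samples. The sketch below shows one common way to compute it; the function name is hypothetical, and it assumes that a higher score means "more in-distribution", which is the usual convention but is not stated in this section.

```python
def fpr_at_95_tpr(id_scores, ood_scores):
    """Fraction of OOD samples accepted when 95% of ID samples are accepted.

    Assumes higher scores indicate in-distribution samples.
    """
    ranked = sorted(id_scores, reverse=True)
    # Smallest number of ID samples reaching a 95% true positive rate,
    # i.e. ceil(0.95 * N) computed with integers to avoid float rounding.
    n_keep = (95 * len(ranked) + 99) // 100
    threshold = ranked[n_keep - 1]
    accepted_ood = sum(1 for score in ood_scores if score >= threshold)
    return accepted_ood / len(ood_scores)


# Toy example: 100 ID scores (1..100) and 4 OOD scores.
print(fpr_at_95_tpr(list(range(1, 101)), [1, 2, 50, 90]))  # 0.5
```

In the toy example the threshold is 6 (the 95th highest ID score), so the OOD scores 50 and 90 are wrongly accepted, giving an FPR95 of 0.5.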

4.3 Ablation studies.

In this section, we investigate the design choices of GalLoP. We first show how GalLoP leverages the complementarity of strong global and local prompts to boost performances (Tab. 2). We then demonstrate the benefit of sparsity and local alignment (Fig. 7). Finally, we show the impact of our choices when learning multiple prompts for both global and local features (Fig. 8).

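Table 2: Ablation studies for the different components of our GalLoP.

| Method | Top-1 | DG | FPR95 $\downarrow$ | AUC |
|---|---|---|---|---|
| CLIP$_{\text{Global}}$ | 66.6 | 57.2 | 42.8 | 90.8 |
| CLIP$_{\text{Local}}$ | 12.5 | 9.49 | 73.3 | 73.7 |
| CLIP$_{\text{GL}}$ | 61.1 | 49.3 | 35.5 | 90.8 |
| CoOp$_{\text{Global}}$ | 71.4 | 59.2 | 39.1 | 91.1 |
| CoOp$_{\text{Local}}$ | 41.2 | 30.1 | 65.2 | 78.3 |
| CoOp$_{\text{GL}}$ | 69.5 | 55.6 | 33.7 | 90.5 |
| GalLoP$_{\text{Global}}$ | 72.0 | 60.4 | 37.0 | 91.7 |
| GalLoP$_{\text{Local}}$ | 70.9 | 54.1 | 36.0 | 90.1 |
| GalLoP | 75.1 | 61.3 | 27.3 | 93.2 |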

Combining global and local features. In Tab. 2, we show that leveraging global and local features requires some important design choices. Indeed, we experiment with a baseline using CoOp on local features (“CoOp$_{\text{Local}}$”), learning a single prompt, without sparsity and without alignment. This baseline already outperforms zero-shot local features (CLIP$_{\text{Local}}$), with +28.7pt top-1. However, its combination with a standard CoOp$_{\text{Global}}$, i.e. “CoOp$_{\text{GL}}$”, is detrimental to final performances, with -1.9pt top-1 and -3.6pt DG compared to CoOp$_{\text{Global}}$. On the other hand, GalLoP enjoys a boost in performances on all metrics when combining the learned global (GalLoP$_{\text{Global}}$) and local (GalLoP$_{\text{Local}}$) prompts. The top-1 performances of GalLoP increase by +3.1pt compared to GalLoP$_{\text{Global}}$. Similarly, on OOD detection, GalLoP has a decrease of -8.7pt FPR95 compared to GalLoP$_{\text{Local}}$ (from 36.0 to 27.3). Tab. 2 illustrates how the resulting performances of GalLoP, in both accuracy and robustness, come from the complementarity of the local and global features.

Figure 7: Impact of our sparsity choice for three regimes, zero-shot CLIP, learning a local prompt (“w/o linear”) and aligning our linear projection with a local prompt (“w. linear”).

The need for sparsity. In Fig. 7, we show how sparsity in the selection of local features achieves higher performances than attending to every local feature, for three regimes: zero-shot CLIP (“zero-shot”), learning a local prompt (“w/o linear”), and aligning a local prompt with our linear projection (“w. linear”). In these three regimes, the gain of the best reported sparsity level over using all local features is, respectively, +18.4pt, +17.6pt, and +8.5pt. Furthermore, when aligning a local prompt and the linear layer, our sparsity strategy works for a wide range of $k$, with performances above 69pt between $k=5$ and $k=50$. This shows the robustness to the choice of $k$. Finally, learning a local prompt significantly boosts the performances of the local features, e.g. +27.9pt for $k=10$, and aligning with a linear projection further boosts performances, with +10pt for $k=10$ compared to learning the prompt only. Fig. 7 thus shows the benefit of both enforcing sparsity when using local features and further aligning the local features with a local prompt.

Global prompt learning with prompt dropout. Fig. 8(a) shows how prompt dropout allows multiple prompts to be learned efficiently for the global features. We display the top-1 accuracy when using more and more prompts, with (“w.”) or without (“w/o”) prompt dropout. Without prompt dropout, adding more prompts does not result in better performances. For example, performances with 6 prompts decrease compared to using a single prompt. This is due to limited diversity among the learned prompts. In comparison, adding more prompts is always beneficial when using prompt dropout.
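The selection step of prompt dropout can be sketched as below. This is only an illustration under stated assumptions: the function name and the rounding rule are hypothetical, at least one prompt is always kept per image, and how the mask enters the loss (e.g. only the kept prompts receive a gradient from that image) is not shown.

```python
import random


def prompt_dropout_mask(batch_size, num_prompts, drop_rate, rng):
    """For each image of the batch, pick which prompts are trained on it.

    Returns a batch_size x num_prompts list of booleans (True = prompt kept).
    """
    num_kept = max(1, round(num_prompts * (1.0 - drop_rate)))
    mask = []
    for _ in range(batch_size):
        kept = set(rng.sample(range(num_prompts), num_kept))
        mask.append([index in kept for index in range(num_prompts)])
    return mask


# 4 global prompts with 75% dropout: a single prompt is kept for each image.
for row in prompt_dropout_mask(3, 4, 0.75, random.Random(0)):
    print(row)
```

Since different images are routed to different prompts, each prompt is trained on a different random subset of the data, which is the source of diversity discussed above.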

Figure 8: Impact of our design choices on learning global and local prompts. (a) Impact of prompt dropout when learning multiple global prompts. (b) Impact of multiple scales when learning local prompts, with $k_1=10$, $\Delta_k=10$.

Local prompt learning at multiple scales. In Fig. 8(b), we show the benefit of our multiscale approach. We experiment with various numbers of scales, i.e. from 1 to 6 scales with $k_1=10$ and $\Delta_k=10$, and report the top-1 accuracy. We observe a steady increase from 1 scale to 4 scales (+1pt). Performances stabilize afterward for 5 and 6 scales. Fig. 8(b) shows that learning at different scales is beneficial, but also that GalLoP is not too sensitive to the choice of the number of prompts. Furthermore, learning at different scales also reduces the need to select an optimal $k$, although we show in Fig. 7 that performances are stable with respect to $k$.

4.4 Qualitative study.

We conduct in this section a qualitative study of GalLoP, by comparing it to CLIP on Fig. 9, and visualizing its different scales on Fig. 10. We show other qualitative results in supplementary material B.6.

Comparison to CLIP. In Fig. 9, we compare GalLoP and CLIP local features. We can observe that CLIP’s local features are not discriminative and do not allow images to be classified correctly, as already observed in Sec. 4.3. On the other hand, GalLoP classifies the images correctly, even with a single scale. We can also observe that GalLoP accurately segments the object of interest when using all its scales.

Visualizing multiple scales. Finally, we show in Fig. 10 the different regions each of the local prompts attends to. We can see that scale #1 focuses on the most discriminative features, i.e. the head and tail of the “Ring tailed lemur”. Each subsequent scale progressively attends to different parts of the body, leading to an accurate prediction.

Figure 9: Qualitative comparison of CLIP and GalLoP. From left to right: the original image with its ground truth, CLIP local wrong prediction, one scale ($k=10$) of GalLoP with correct prediction, and GalLoP multiscale, resulting in correct prediction and segmentation.
Figure 10: GalLoP multiscale visualization. Regions observed by the different prompts of GalLoP for a “Ring tailed lemur”.

5 Conclusion

This paper introduces GalLoP, a new prompt learning method that leverages both global and local visual representations. The key features of GalLoP are the strong discriminability of its local representations and its capacity to produce diverse predictions from both local and global prompts. Extensive experiments show that GalLoP outperforms previous prompt learning methods on top-1 accuracy on average over 11 datasets, that it works in different few-shot settings, and that it works with both convolutional and transformer vision backbones. We show in ablation studies the benefit of the design choices that make GalLoP work, i.e. complementarity between local and global prompts, sparsity and enhanced alignment, and encouraging diversity. Finally, we conduct a qualitative study to show what local prompts focus on when classifying an image. Future work includes learning the local feature alignment on a large vision-language dataset.

Acknowledgements

This work was done under grants from the DIAMELEX ANR program (ANR-20-CE45-0026) and the AHEAD ANR program (ANR-20-THIA-0002). It was granted access to the HPC resources of IDRIS under the allocation AD011012645R1 and AD011013370R1 made by GENCI.

References

  • [1] Agnolucci, L., Baldrati, A., Todino, F., Becattini, F., Bertini, M., Del Bimbo, A.: Eco: Ensembling context optimization for vision-language models. In: ICCV (2023)
  • [2] Bossard, L., Guillaumin, M., Van Gool, L.: Food-101 – mining discriminative components with random forests. In: ECCV (2014)
  • [3] Chen, G., Yao, W., Song, X., Li, X., Rao, Y., Zhang, K.: Plot: Prompt learning with optimal transport for vision-language models. In: The Eleventh International Conference on Learning Representations (2023)
  • [4] Cho, J., Nam, G., Kim, S., Yang, H., Kwak, S.: Promptstyler: Prompt-driven style generation for source-free domain generalization. In: IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023. pp. 15656–15666. IEEE (2023)
  • [5] Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., Vedaldi, A.: Describing textures in the wild. In: Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) (2014)
  • [6] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. pp. 248–255. IEEE (2009)
  • [7] Dong, X., Bao, J., Zheng, Y., Zhang, T., Chen, D., Yang, H., Zeng, M., Zhang, W., Yuan, L., Chen, D., et al.: Maskclip: Masked self-distillation advances contrastive language-image pretraining. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10995–11005 (2023)
  • [8] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  • [9] Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. Computer Vision and Pattern Recognition Workshop (2004)
  • [10] Gal, Y., Ghahramani, Z.: Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In: International Conference on Machine Learning. pp. 1050–1059. PMLR (2016)
  • [11] Gao, P., Geng, S., Zhang, R., Ma, T., Fang, R., Zhang, Y., Li, H., Qiao, Y.: Clip-adapter: Better vision-language models with feature adapters. International Journal of Computer Vision 132(2), 581–595 (2024)
  • [12] Gondal, M.W., Gast, J., Ruiz, I.A., Droste, R., Macri, T., Kumar, S., Staudigl, L.: Domain aligned clip for few-shot classification. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 5721–5730 (2024)
  • [13] Goyal, S., Kumar, A., Garg, S., Kolter, Z., Raghunathan, A.: Finetune like you pretrain: Improved finetuning of zero-shot vision models. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023. pp. 19338–19347. IEEE (2023)
  • [14] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385 (2015)
  • [15] Helber, P., Bischke, B., Dengel, A., Borth, D.: Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification (2017)
  • [16] Hendrycks, D., Basart, S., Mu, N., Kadavath, S., Wang, F., Dorundo, E., Desai, R., Zhu, T., Parajuli, S., Guo, M., Song, D., Steinhardt, J., Gilmer, J.: The many faces of robustness: A critical analysis of out-of-distribution generalization. ICCV (2021)
  • [17] Hendrycks, D., Gimpel, K.: A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136 (2016)
  • [18] Hendrycks, D., Mazeika, M., Dietterich, T.: Deep anomaly detection with outlier exposure. arXiv preprint arXiv:1812.04606 (2018)
  • [19] Hendrycks, D., Zhao, K., Basart, S., Steinhardt, J., Song, D.: Natural adversarial examples. CVPR (2021)
  • [20] Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q., Sung, Y.H., Li, Z., Duerig, T.: Scaling up visual and vision-language representation learning with noisy text supervision. In: International conference on machine learning. pp. 4904–4916. PMLR (2021)
  • [21] Khattak, M.U., Rasheed, H., Maaz, M., Khan, S., Khan, F.S.: Maple: Multi-modal prompt learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19113–19122 (2023)
  • [22] Khattak, M.U., Wasim, S.T., Naseer, M., Khan, S., Yang, M.H., Khan, F.S.: Self-regulating prompts: Foundational model adaptation without forgetting. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 15190–15200 (2023)
  • [23] Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3d object representations for fine-grained categorization. In: 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13). Sydney, Australia (2013)
  • [24] Lafon, M., Ramzi, E., Rambour, C., Thome, N.: Hybrid energy based model in the feature space for out-of-distribution detection. In: International Conference on Machine Learning. pp. 18250–18268. PMLR (2023)
  • [25] Lee, K., Lee, K., Lee, H., Shin, J.: A simple unified framework for detecting out-of-distribution samples and adversarial attacks. Advances in neural information processing systems 31 (2018)
  • [26] Lu, Y., Liu, J., Zhang, Y., Liu, Y., Tian, X.: Prompt distribution learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5206–5215 (2022)
  • [27] Maji, S., Kannala, J., Rahtu, E., Blaschko, M., Vedaldi, A.: Fine-grained visual classification of aircraft. Tech. rep. (2013)
  • [28] Ming, Y., Cai, Z., Gu, J., Sun, Y., Li, W., Li, Y.: Delving into out-of-distribution detection with vision-language representations. Advances in Neural Information Processing Systems 35, 35087–35102 (2022)
  • [29] Miyai, A., Yu, Q., Irie, G., Aizawa, K.: Locoop: Few-shot out-of-distribution detection via prompt learning. NeurIPS 36 (2023)
  • [30] Miyai, A., Yu, Q., Irie, G., Aizawa, K.: Zero-shot in-distribution detection in multi-object settings using vision-language foundation models. CoRR (2023)
  • [31] Nie, J., Zhang, Y., Fang, Z., Liu, T., Han, B., Tian, X.: Out-of-distribution detection with negative prompts. In: The Twelfth International Conference on Learning Representations (2024)
  • [32] Nilsback, M.E., Zisserman, A.: Automated flower classification over a large number of classes. In: Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing (Dec 2008)
  • [33] Parisot, S., Yang, Y., McDonagh, S.: Learning to name classes for vision and language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 23477–23486 (2023)
  • [34] Parkhi, O.M., Vedaldi, A., Zisserman, A., Jawahar, C.V.: Cats and dogs. In: IEEE Conference on Computer Vision and Pattern Recognition (2012)
  • [35] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
  • [36] Recht, B., Roelofs, R., Schmidt, L., Shankar, V.: Do imagenet classifiers generalize to imagenet? In: International conference on machine learning. pp. 5389–5400. PMLR (2019)
  • [37] Sehwag, V., Chiang, M., Mittal, P.: SSD: A unified framework for self-supervised outlier detection. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021 (2021)
  • [38] Shu, Y., Guo, X., Wu, J., Wang, X., Wang, J., Long, M.: Clipood: Generalizing CLIP to out-of-distributions. In: ICML (2023)
  • [39] Soomro, K., Zamir, A.R., Shah, M.: Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012)
  • [40] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1), 1929–1958 (2014)
  • [41] Sun, X., Hu, P., Saenko, K.: Dualcoop: Fast adaptation to multi-label recognition with limited annotations. Advances in Neural Information Processing Systems 35, 30569–30582 (2022)
  • [42] Sun, Y., Ming, Y., Zhu, X., Li, Y.: Out-of-distribution detection with deep nearest neighbors. In: International Conference on Machine Learning. pp. 20827–20840. PMLR (2022)
  • [43] Van Horn, G., Mac Aodha, O., Song, Y., Cui, Y., Sun, C., Shepard, A., Adam, H., Perona, P., Belongie, S.: The inaturalist species classification and detection dataset. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 8769–8778 (2018)
  • [44] Villani, C., et al.: Optimal transport: old and new, vol. 338. Springer (2009)
  • [45] Wang, F., Li, M., Lin, X., Lv, H., Schwing, A., Ji, H.: Learning to decompose visual features with latent textual prompts. In: The Eleventh International Conference on Learning Representations (2022)
  • [46] Wang, H., Ge, S., Lipton, Z., Xing, E.P.: Learning robust global representations by penalizing local predictive power. In: Advances in Neural Information Processing Systems. pp. 10506–10518 (2019)
  • [47] Xiao, J., Hays, J., Ehinger, K.A., Oliva, A., Torralba, A.: Sun database: Large-scale scene recognition from abbey to zoo. In: CVPR (2010)
  • [48] Zhang, R., Zhang, W., Fang, R., Gao, P., Li, K., Dai, J., Qiao, Y., Li, H.: Tip-adapter: Training-free adaption of CLIP for few-shot classification. In: ECCV. pp. 493–510 (2022)
  • [49] Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., Torralba, A.: Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (2017)
  • [50] Zhou, C., Loy, C.C., Dai, B.: Extract free dense labels from CLIP. In: Avidan, S., Brostow, G.J., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel. Lecture Notes in Computer Science, vol. 13688, pp. 696–712. Springer (2022)
  • [51] Zhou, K., Yang, J., Loy, C.C., Liu, Z.: Conditional prompt learning for vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16816–16825 (2022)
  • [52] Zhou, K., Yang, J., Loy, C.C., Liu, Z.: Learning to prompt for vision-language models. International Journal of Computer Vision 130(9), 2337–2348 (2022)

A Additional details on method.

In this section, we give additional details about GalLoP. In Sec. A.1, we describe how the local features are extracted from CLIP’s vision encoder, for both ResNet and ViT architectures. In Sec. A.2, we describe the inference procedure in GalLoP as well as the GL-MCM score [30] which we use for OOD detection. Finally, we discuss in Sec. A.3 the use of an additional explicit diversity loss to train GalLoP.

A.1 CLIP’s local visual features.

To obtain the visual local features from CLIP we follow previous works [50, 41, 30], which we describe in the following.

ViT backbone. When the vision encoder is a ViT, its output is composed of the class token embedding, $\bm{z}_{\text{cls}}$, and a set of $L$ local features $\mathcal{Z}_l = (\bm{z}^l_1, ..., \bm{z}^l_L)$. The global visual representation used in CLIP is the class token embedding, i.e. $\bm{z}_g = \bm{z}_{\text{cls}}$. However, the local features after the last transformer block are of low quality, as only the class token receives a supervision signal during training. Hence, prior studies [50, 41, 30] have recommended taking the visual local features from the penultimate transformer block and forwarding them through the last transformer block without using the self-attention mechanism.

Specifically, we have $\forall i \in \{1, ..., L\}$:

$$\bm{z}^l_i = \bm{z}^l_i + v(\bm{z}^l_i) + f\big(\bm{z}^l_i + v(\bm{z}^l_i)\big),$$

where $v(\cdot)$ denotes the linear projection used to compute the values in the self-attention module and $f(\cdot)$ is the feed-forward network of the last transformer block.
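The update above can be sketched in a few lines of Python. This is a toy illustration rather than CLIP's actual code: `value_proj` and `feed_forward` are hypothetical callables standing in for $v(\cdot)$ and $f(\cdot)$, and any layer normalization or output projection present in a real transformer block is left out, as it is in the equation.

```python
def add(a, b):
    """Element-wise sum of two equally sized vectors."""
    return [x + y for x, y in zip(a, b)]


def local_features(patches, value_proj, feed_forward):
    """Apply z + v(z) + f(z + v(z)) to each penultimate-block patch feature.

    Every patch is processed independently: no attention mixes the patches.
    """
    outputs = []
    for z in patches:
        h = add(z, value_proj(z))  # residual around the value projection
        outputs.append(add(h, feed_forward(h)))  # residual around the FFN
    return outputs


# Toy stand-ins for v and f, for illustration only.
double = lambda z: [2.0 * x for x in z]
plus_one = lambda h: [x + 1.0 for x in h]
print(local_features([[1.0, 2.0]], double, plus_one))  # [[7.0, 13.0]]
```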

ResNet backbone. When the vision encoder is a ResNet, it outputs a feature map containing $L$ local patches $\mathcal{Z}_l = (\bm{z}^l_1, ..., \bm{z}^l_L)$. Then, the global visual feature, $\bm{z}_g$, is obtained using a self-attention pooling module:

$$\bm{z}_g = \sum_i \text{softmax}\left(\frac{q(\overline{\bm{z}^l})~k(\bm{z}^l_i)^T}{\sqrt{d}}\right) \cdot v(\bm{z}^l_i),$$

where $d$ is the feature dimension, $\overline{\bm{z}^l} = \frac{1}{L}\sum_{i=1}^{L} \bm{z}^l_i$ is the average-pooled feature used as unique query, and $q(\cdot)$, $k(\cdot)$, $v(\cdot)$ denote the query, key and value projections, respectively. To obtain useful visual local features, it is then sufficient to use the values of the local features without the attention mechanism, i.e. $\bm{z}^l_i = v(\bm{z}^l_i)$.
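A minimal single-head sketch of this pooling is given below, returning both the global feature and the value-projected local features. The function names and the plain-list matrices `wq`, `wk`, `wv` are hypothetical, and details that an actual attention-pooling layer may include (multiple heads, biases, positional embeddings, an output projection) are omitted, as in the equation above.

```python
import math


def matvec(weight, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def attention_pool(patches, wq, wk, wv):
    """Return (z_g, local_features) for a list of L patch features."""
    mean = [sum(column) / len(patches) for column in zip(*patches)]
    query = matvec(wq, mean)  # the average-pooled feature is the only query
    keys = [matvec(wk, z) for z in patches]
    values = [matvec(wv, z) for z in patches]
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    z_g = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    # The local features are simply the values, without attention weighting.
    return z_g, values


identity = [[1.0, 0.0], [0.0, 1.0]]
z_g, local = attention_pool([[1.0, 0.0], [0.0, 1.0]], identity, identity, identity)
print(z_g, local)
```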

A.2 Details on GalLoP’s inference.

In this section, we give more details on our inference procedure. As described in Sec. 3.2, GalLoP is trained by summing the global and multiscale losses, associated with the global and local prompts, respectively. Therefore, we naturally adopt an “ensembling-style” inference strategy: we average the similarities obtained with each prompt to obtain a final similarity, $\text{sim}(\bm{z},~\bm{t}_{c})$, for each class $y_{c}$.

Specifically, writing $\bm{z}=[\bm{z}_{g},~\mathcal{Z}_{l}]$, we compute:

$$\text{sim}(\bm{z},~\bm{t}_{c})=\frac{1}{n}\sum_{i=1}^{n}\langle\bm{z}_{g},~\bm{t}_{c}(\bm{p^{g}}_{i})\rangle~+~\frac{1}{m}\sum_{j=1}^{m}\text{sim}_{\text{top-}k}(\mathcal{Z}_{l},~\bm{t}_{c}(\bm{p}^{l}_{j})),$$

where $\text{sim}_{\text{top-}k}(\mathcal{Z}_{l},~\bm{t}_{c}(\bm{p}^{l}_{j}))$ is defined in Eq. (2) of the main paper. Once this final similarity is computed, we use Eq. (1) of the main paper to compute the probability of class $y_{c}$.
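A minimal sketch of this ensembling step is given below. It rests on several assumptions, since Eq. (1) and Eq. (2) are not reproduced in this appendix: the top-$k$ similarity is taken to be the mean of the $k$ largest local similarities for a class, each local prompt may use its own $k$ (multiscale), Eq. (1) is taken to be a softmax over classes with temperature, and all features are L2-normalized. All function and argument names are hypothetical.

```python
import numpy as np

def topk_mean_similarity(local_feats, text_feats, k):
    """local_feats: (L, d), text_feats: (C, d).
    Assumed form of sim_top-k: mean of the k largest local similarities per class."""
    sims = local_feats @ text_feats.T      # (L, C) similarity of each region to each class
    topk = np.sort(sims, axis=0)[-k:]      # k largest values in each class column
    return topk.mean(axis=0)               # (C,)

def gallop_similarity(z_global, local_feats, global_text, local_text, ks):
    """global_text: (n, C, d), local_text: (m, C, d), ks: one k per local prompt."""
    # Average over the n global prompts of <z_g, t_c(p_i^g)>.
    global_term = np.mean([t @ z_global for t in global_text], axis=0)
    # Average over the m local prompts of the top-k local similarity.
    local_term = np.mean(
        [topk_mean_similarity(local_feats, t, k) for t, k in zip(local_text, ks)],
        axis=0,
    )
    return global_term + local_term        # (C,) final similarity per class

def class_probabilities(similarity, temperature):
    # Assumed form of Eq. (1): softmax over classes of similarity / temperature.
    logits = similarity / temperature
    e = np.exp(logits - logits.max())
    return e / e.sum()
```

With $k=1$ the local term reduces to the maximum local similarity, and with $k=L$ it becomes a plain average over all regions.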

To perform out-of-distribution detection with GalLoP, we use the GL-MCM score [30], which relies on both global and local information. The idea behind the MCM score [28] and the GL-MCM score [30] is to perform maximum concept matching, a natural extension of the maximum class probability (MCP) score [17], which is a widely used baseline within the OOD community [25, 37, 42, 24].

Formally, the GL-MCM score is expressed as:

$$S_{\text{GL-MCM}}~=~S_{\text{G-MCM}}~+~S_{\text{L-MCM}},$$

where

$$S_{\text{G-MCM}}~=~\max_{c}~\frac{\exp\left(\frac{1}{n}\sum_{i=1}^{n}\langle\bm{z}_{g},~\bm{t}_{c}(\bm{p^{g}}_{i})\rangle~/~\tau\right)}{\sum_{c'}\exp\left(\frac{1}{n}\sum_{i=1}^{n}\langle\bm{z}_{g},~\bm{t}_{c'}(\bm{p^{g}}_{i})\rangle~/~\tau\right)},$$

$$S_{\text{L-MCM}}~=~\max_{c,~i}~\frac{\exp\left(\frac{1}{m}\sum_{j=1}^{m}\langle\bm{z}^{l}_{i},~\bm{t}_{c}(\bm{p}^{l}_{j})\rangle~/~\tau\right)}{\sum_{c'}\exp\left(\frac{1}{m}\sum_{j=1}^{m}\langle\bm{z}^{l}_{i},~\bm{t}_{c'}(\bm{p}^{l}_{j})\rangle~/~\tau\right)}.$$
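The two terms of the score can be sketched as below. This is an illustrative implementation under stated assumptions, not the released code: features are assumed L2-normalized, the temperature $\tau$ is passed explicitly because its value is not given here, and the function and argument names are hypothetical.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(logits - logits.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gl_mcm_score(z_global, local_feats, global_text, local_text, temperature):
    """z_global: (d,), local_feats: (L, d),
    global_text: (n, C, d), local_text: (m, C, d)."""
    # Global term: prompt-averaged similarity, softmax over classes, max over c.
    global_sims = np.einsum("ncd,d->nc", global_text, z_global).mean(axis=0)  # (C,)
    s_global = softmax(global_sims / temperature).max()
    # Local term: prompt-averaged similarity per region, softmax over classes,
    # then max over both the class c and the region i.
    local_sims = np.einsum("mcd,ld->mlc", local_text, local_feats).mean(axis=0)  # (L, C)
    s_local = softmax(local_sims / temperature, axis=-1).max()
    return float(s_global + s_local)
```

Each term is a maximum softmax probability, so with $C$ classes the score lies between $2/C$ (uniform predictions) and $2$; higher values indicate a more in-distribution input.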

A.3 Diversity loss.

Previous works on prompt ensembling have explored the use of an explicit loss term encouraging the semantic orthogonality between prompts to increase their diversity [26, 4]. This loss is expressed as:

$$\mathcal{L}_{\text{div.}}(\mathcal{P})=\frac{1}{N\cdot(N-1)}\sum_{i=1}^{N}\sum_{j=i+1}^{N}|\langle\bm{t}_{i},\bm{t}_{j}\rangle|,$$

where $\mathcal{P}$ is a set of $N$ prompts and $\forall i\in\{1,\cdots,N\},~\bm{t}_{i}$ are the textual representations of the prompts without incorporating class names. The strength of the diversity loss is controlled with a hyper-parameter $\lambda_{\text{div.}}$.

We experimented with optimizing GalLoP with the following loss: $\mathcal{L}_{\text{total}}(\mathcal{P},~\bm{\theta})+\lambda_{\text{div.}}\cdot\mathcal{L}_{\text{div.}}(\mathcal{P})$. In Fig. 11, we show that training GalLoP with $\mathcal{L}_{\text{div.}}$ does not improve top-1 accuracy, even when increasing $\lambda_{\text{div.}}$. As a result, we did not include $\mathcal{L}_{\text{div.}}$ in GalLoP, as it did not lead to significant improvement in either accuracy or robustness, while introducing an extra hyperparameter, $\lambda_{\text{div.}}$.
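For concreteness, the diversity term and the combined objective can be sketched as follows. The sketch assumes the prompt embeddings are L2-normalized row vectors; `diversity_loss` and `total_loss` are hypothetical names, and the task loss is passed in as a plain number.

```python
import numpy as np

def diversity_loss(prompt_feats):
    """prompt_feats: (N, d) text embeddings of the prompts without class names."""
    n = prompt_feats.shape[0]
    gram = np.abs(prompt_feats @ prompt_feats.T)   # |<t_i, t_j>| for all pairs
    pair_sum = np.triu(gram, k=1).sum()            # keep only the pairs with i < j
    return float(pair_sum / (n * (n - 1)))         # normalization as in the formula

def total_loss(task_loss, prompt_feats, lambda_div):
    # Task loss plus the weighted diversity penalty.
    return task_loss + lambda_div * diversity_loss(prompt_feats)
```

The sum runs over the $N(N-1)/2$ unordered pairs while the normalization is $N(N-1)$, so under this formula the penalty is $0$ for mutually orthogonal prompts and at most $0.5$ for identical unit-norm prompts.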

Figure 11: Impact of $\lambda_{\text{div.}}$.

B Additional experimental results.

In this section, we give additional experimental results for GalLoP. In Sec. B.1, we report more results for the few-shot settings on the suite of 11 datasets. In Sec. B.2, we give results of GalLoP when using a ResNet-50 CLIP backbone. In Sec. B.3, we compare GalLoP to other few-shot learning methods. In Sec. B.4 and Sec. B.5, we give the detailed results for the ImageNet-1k domain generalization and out-of-distribution detection benchmarks, respectively. Finally, we show additional qualitative results in Sec. B.6.

Additional implementation details.

In this section, we give more implementation details of GalLoP. Tab. 3 shows the hyperparameters used to train GalLoP on ImageNet in the 16-shots setting. We use the same data augmentation as CoOp [52].

Table 3: Hyperparameters to train GalLoP on ImageNet (16 shots) with ViT-B/16 backbone.
Hyperparameters Value
batch size 128
learning rate 0.002
lr-scheduler CosineAnnealingLR
epochs 50
optimizer SGD
weight decay 0.01
momentum 0.9
local prompts 4
global prompts 4
tokens per prompt 4
prompt init “A photo of a”
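
The settings of Tab. 3 can be collected in a small configuration, together with the closed form of a cosine-annealed learning rate. This is a sketch only: the dictionary keys are hypothetical names, and the schedule assumes one scheduler step per epoch, a minimum learning rate of 0 and no warm-up, none of which is stated in the table.

```python
import math

# Values copied from Tab. 3; the key names are illustrative.
CONFIG = {
    "batch_size": 128,
    "learning_rate": 0.002,
    "lr_scheduler": "CosineAnnealingLR",
    "epochs": 50,
    "optimizer": "SGD",
    "weight_decay": 0.01,
    "momentum": 0.9,
    "num_local_prompts": 4,
    "num_global_prompts": 4,
    "tokens_per_prompt": 4,
    "prompt_init": "A photo of a",
}

def cosine_annealing_lr(epoch, base_lr, total_epochs, min_lr=0.0):
    # Closed form of cosine annealing from base_lr down to min_lr (assumed 0 here).
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)

if __name__ == "__main__":
    for epoch in (0, 25, 50):
        lr = cosine_annealing_lr(epoch, CONFIG["learning_rate"], CONFIG["epochs"])
        print(epoch, lr)
```

Under these assumptions the learning rate starts at 0.002, is halved at epoch 25 and decays to 0 at epoch 50.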

B.1 Full few shot results.

In this section, we give the detailed results for the different few-shot settings. We report the top-1 accuracy of GalLoP on each dataset of the few-shot learning benchmark introduced in [52]. Tab. 4 shows that, on average over the suite of 11 datasets, GalLoP outperforms the other prompt learning baselines for every number of shots. Specifically, GalLoP consistently outperforms the second-best method, PromptSRC, by +0.5pt with 1 shot, +1.1pt with 2 shots, +0.8pt with 4 shots, +1.5pt with 8 shots and +1.6pt with 16 shots. The results for each of the 11 datasets are plotted in Fig. 12.

Table 4: Averaged few-shot results on the suite of 11 datasets with ViT-B/16 backbone.
Method 0-shot 1-shot 2-shots 4-shots 8-shots 16-shots
CLIP 64.9 - - - - -
CoOp - 67.6 70.6 74.0 77.0 79.9
MaPLe - 69.3 72.6 75.8 78.9 81.8
PLOT - 70.7 74.0 76.9 79.6 82.1
PromptSRC - 72.3 75.3 78.3 80.7 82.9
GalLoP - 72.8 76.4 79.1 82.2 84.5

Figure 12: Few-shot learning results of GalLoP on the 11 datasets with the ViT-B/16 backbone.

B.2 Detailed results for ResNet-50.

In this section, we give detailed results for GalLoP when trained with a ResNet-50 backbone. Tab. 5 shows that GalLoP outperforms the other ResNet-50-compatible prompt learning methods on all datasets except Food101. Specifically, GalLoP achieves 77.3% average accuracy on the suite of 11 datasets, outperforming PLOT by +3.4pt and CoOp by +3.9pt. Note that the second-best method on ViT-B/16, PromptSRC [22], is not compatible with convolutional backbones such as ResNet-50.

Table 5: Top-1 accuracy with a ResNet-50 backbone in the 16-shots setting. Comparison of GalLoP to other prompt learning methods on the suite of 11 datasets.

| Method | ImageNet [6] | Caltech101 [9] | OxfordPets [34] | Cars [23] | Flowers102 [32] | Food101 [2] | Aircraft [27] | SUN397 [47] | DTD [5] | EuroSAT [15] | UCF101 [39] | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP | 58.1 | 84.1 | 82.7 | 55.8 | 66.0 | 75.0 | 17.0 | 57.1 | 42.9 | 36.3 | 57.9 | 57.5 |
| Linear Probe | 55.9 | 90.6 | 76.4 | 70.1 | 95.0 | 70.2 | 36.4 | 67.2 | 64.0 | 82.8 | 73.7 | 71.1 |
| CoOp | 63.0 | 91.8 | 87.0 | 73.4 | 94.5 | 74.7 | 31.3 | 69.3 | 63.6 | 83.5 | 75.7 | 73.4 |
| Co-CoOp | 62.9 | 90.2 | 88.3 | 61.6 | 78.3 | 80.0 | 21.3 | 67.3 | 56.2 | 70.1 | 71.1 | 67.9 |
| PLOT | 63.0 | 92.2 | 87.2 | 72.8 | 94.8 | 77.1 | 31.5 | 70.0 | 65.6 | 82.2 | 77.3 | 73.9 |
| GalLoP | 66.1 | 92.8 | 89.3 | 79.3 | 96.7 | 76.5 | 41.6 | 72.2 | 67.6 | 87.6 | 80.4 | 77.3 |

B.3 GalLoP vs. other few-shots learning methods.

In this section, we compare GalLoP to other types of few-shot learning methods. We compare GalLoP against the standard fine-tuning of all the parameters of CLIP’s vision and text encoders, as well as FLYP [13], a more recent variant that fine-tunes on downstream datasets with the same contrastive objective as CLIP. We also consider CLIPOOD [38], which only trains the visual encoder. Furthermore, we include adapters, e.g. the recent CLIP-Adapter [11], which uses residual adapters on both the visual and textual representations. Finally, we compare against cache-based methods such as Tip-Adapter / Tip-Adapter-F [48] and DAC-V / DAC-VT [12].

Tab. 6 shows the performance of GalLoP vs. the other few-shot learning methods in the 16-shots setting using a ViT-B/16 backbone. GalLoP obtains better top-1 accuracy than the recent fine-tuning method FLYP, even though FLYP fine-tunes about $\times 250$ more parameters than GalLoP (149.7M vs. 0.6M). Furthermore, GalLoP outperforms the best cache-based method, DAC-VT, by +0.5pt, while having about half the number of parameters. Also, when compared to DAC-V, which has the same number of parameters, GalLoP obtains +2.1pt in top-1 accuracy.

Table 6: Comparison of GalLoP vs. other few-shot learning methods on ViT-B/16 in the 16-shots setting.

| Method | Top-1 | # params ($\times 10^{6}$) |
|---|---|---|
| Zero-Shot CLIP | 68.6 | 0 |
| Tip-Adapter [48] | 70.8 | 0 |
| Full fine-tuning [13] | 73.1 | 149.7 |
| CLIPOOD [38] | 71.6 | 86.7 |
| FLYP [13] | 74.9 | 149.7 |
| CLIP-Adapter [11] | 71.1 | 0.2 |
| Tip-Adapter-F [48] | 73.7 | 16.4 |
| DAC-V [12] | 73.0 | 0.6 |
| DAC-VT [12] | 74.6 | 1.1 |
| GalLoP | 75.1 | 0.6 |

B.4 Detailed domain generalization results.

In this section, we give the detailed results for the ImageNet domain generalization benchmark. We compare the performance of GalLoP with that of several prompt learning methods. Each method is trained on ImageNet with 16 shots per class and is evaluated in terms of top-1 accuracy on four variants of ImageNet, i.e. ImageNet-V2 [36], ImageNet-Sketch [46], ImageNet-A [19] and ImageNet-R [16]. Tab. 7 shows that GalLoP outperforms previous prompt learning methods on average over the four ImageNet variants, with +0.6pt top-1 accuracy vs. PromptSRC. More specifically, we obtain better results on ImageNet-V2, with +1.8pt with respect to the second-best method, and results comparable to PromptSRC on ImageNet-Sketch and ImageNet-R.

B.5 Detailed OOD detection results.

In this section, we give the detailed results of GalLoP for OOD detection. We use the OOD detection benchmark from [28], where ImageNet-1k is the in-distribution (ID) dataset, and iNaturalist [43], SUN [47], Places [49] and Textures [5] are used as OOD datasets. We report the results using the FPR95 ($\downarrow$) and AUC ($\uparrow$) metrics, two standard metrics in the OOD detection community. The FPR95 is the false positive rate, i.e. the fraction of OOD images accepted as ID, at the threshold for which 95% of the ID images are correctly classified as ID. The AUC is the area under the receiver operating characteristic (ROC) curve. Tab. 8 shows that GalLoP obtains a better averaged FPR95 than the other prompt learning methods, with -1.4pt vs. LoCoOp, while achieving 93.2 averaged AUC, the second-best result after LoCoOp (93.5 averaged AUC).
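Both metrics can be computed from per-image detection scores, as in the sketch below. It assumes that a higher score means "more in-distribution" (as with the GL-MCM score) and uses a simple pairwise formulation of the AUC, which is quadratic in the number of samples; benchmark code would typically rely on a library routine instead. The function names are hypothetical.

```python
import numpy as np

def fpr_at_95_tpr(id_scores, ood_scores):
    """Fraction of OOD samples accepted as ID when about 95% of ID samples are accepted."""
    id_scores = np.asarray(id_scores, dtype=float)
    ood_scores = np.asarray(ood_scores, dtype=float)
    # Threshold at the 5th percentile of ID scores: about 95% of ID scores lie above it.
    threshold = np.percentile(id_scores, 5)
    return float(np.mean(ood_scores >= threshold))

def auroc(id_scores, ood_scores):
    """Probability that a random ID sample outscores a random OOD sample (ties count 1/2)."""
    id_col = np.asarray(id_scores, dtype=float)[:, None]
    ood_row = np.asarray(ood_scores, dtype=float)[None, :]
    wins = (id_col > ood_row).mean()
    ties = (id_col == ood_row).mean()
    return float(wins + 0.5 * ties)
```

Perfectly separated scores give an FPR95 of 0 and an AUC of 1, whereas identical ID and OOD score distributions give an AUC of 0.5.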

Table 7: Domain generalization from ImageNet with ViT-B/16 backbone. Prompt learning methods are trained on ImageNet (source) and evaluated on datasets with domain shifts (targets). “(re-impl.)” denotes results based on our re-implementation.

| Method | Source: ImageNet | Target: -V2 [36] | Target: -S [46] | Target: -A [19] | Target: -R [16] | Target: Avg. |
|---|---|---|---|---|---|---|
| CLIP | 66.7 | 60.8 | 46.2 | 47.8 | 74.0 | 57.2 |
| CoOp | 71.7 | 64.6 | 47.9 | 49.9 | 75.1 | 59.4 |
| Co-CoOp | 71.0 | 64.1 | 48.8 | 50.6 | 76.2 | 59.9 |
| MaPLe | 70.7 | 64.1 | 49.2 | 50.9 | 77.0 | 60.3 |
| PLOT | 72.6 | 64.9 | 46.8 | 48.0 | 73.9 | 58.4 |
| PromptSRC | 71.3 | 64.4 | 49.6 | 50.9 | 77.8 | 60.7 |
| PromptSRC (re-impl.) | 73.2 | 65.7 | 49.1 | 47.6 | 76.9 | 59.8 |
| LoCoOp | 71.5 | 64.7 | 47.4 | 49.8 | 75.0 | 57.5 |
| ProDA | 71.9 | 64.5 | 48.6 | 50.7 | 76.3 | 60.0 |
| GalLoP | 75.1 | 67.5 | 49.5 | 50.3 | 77.8 | 61.3 |

Table 8: OOD detection with ViT-B/16 as backbone. CoOp and LoCoOp results are reported from [29]. CoCoOp and LSN results are reported from [31]. “(re-impl.)” denotes results based on our re-implementation. For PLOT, we use their released checkpoint and evaluate its OOD detection results ourselves.

| Method | iNat [43] FPR95$\downarrow$ | iNat [43] AUC$\uparrow$ | SUN [47] FPR95$\downarrow$ | SUN [47] AUC$\uparrow$ | Places [49] FPR95$\downarrow$ | Places [49] AUC$\uparrow$ | Textures [5] FPR95$\downarrow$ | Textures [5] AUC$\uparrow$ | Average FPR95$\downarrow$ | Average AUC$\uparrow$ | Top-1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MCM | 30.9 | 94.6 | 37.7 | 92.6 | 44.8 | 89.8 | 57.9 | 86.1 | 42.8 | 90.8 | 66.7 |
| GL-MCM | 15.2 | 96.7 | 30.4 | 93.1 | 38.9 | 89.9 | 57.9 | 83.6 | 35.5 | 90.8 | 66.7 |
| PLOT | 15.9 | 96.6 | 33.7 | 92.8 | 38.2 | 91.0 | 39.2 | 90.2 | 31.8 | 92.7 | 72.6 |
| PromptSRC | 28.8 | 93.9 | 35.9 | 92.6 | 42.4 | 90.0 | 46.9 | 88.9 | 38.5 | 91.4 | 71.3 |
| PromptSRC (re-impl.) | 20.6 | 95.7 | 30.1 | 93.7 | 38.0 | 91.1 | 46.0 | 89.0 | 33.7 | 92.4 | 73.2 |
| ProDA | 32.4 | 93.2 | 35.7 | 92.4 | 42.6 | 90.0 | 46.2 | 89.3 | 39.2 | 91.2 | 71.9 |
| CoOp$_{\text{MCM}}$ | 28.0 | 94.4 | 37.0 | 92.3 | 43.0 | 89.7 | 39.3 | 91.2 | 36.8 | 91.9 | 71.7 |
| CoOp$_{\text{GL}}$ | 14.6 | 96.6 | 28.5 | 92.7 | 36.5 | 90.0 | 43.1 | 88.0 | 30.7 | 91.8 | 71.7 |
| CoCoOp | 30.7 | 94.7 | 31.2 | 93.2 | 38.8 | 90.6 | 53.8 | 87.9 | 38.6 | 91.6 | 71.0 |
| LoCoOp$_{\text{MCM}}$ | 23.1 | 95.5 | 32.7 | 93.4 | 39.9 | 90.6 | 40.2 | 91.3 | 34.0 | 92.7 | 71.5 |
| LoCoOp$_{\text{GL}}$ | 16.1 | 96.9 | 23.4 | 95.1 | 32.9 | 92.0 | 42.3 | 90.2 | 28.7 | 93.5 | 71.5 |
| LSN+CoOp | 23.5 | 95.5 | 29.8 | 93.5 | 36.4 | 90.9 | 38.2 | 89.5 | 32.0 | 92.3 | 72.9 |
| LSN+CoCoOp | 21.6 | 95.8 | 26.3 | 94.4 | 34.5 | 91.3 | 38.5 | 90.4 | 30.2 | 93.0 | 71.9 |
| GalLoP | 13.7 | 97.1 | 24.9 | 94.0 | 32.5 | 91.3 | 38.4 | 90.4 | 27.3 | 93.2 | 75.1 |

B.6 Additional qualitative results.

Finally, we display additional qualitative results of GalLoP$_{\text{Local}}$ in Fig. 13.

Figure 13: Additional qualitative results for GalLoP. From left to right: the original image with its ground truth, the wrong prediction of CLIP local, one scale ($k=10$) of GalLoP$_{\text{Local}}$ with the correct prediction, and the multiscale GalLoP$_{\text{Local}}$.