Mining Open Semantics from CLIP: A Relation Transition Perspective for Few-Shot Learning

Cilin Yan
Beihang University
[email protected]
&Haochen Wang
University of Amsterdam
[email protected]
&Xiaolong Jiang, Yao Hu, Xu Tang
Xiaohongshu
[email protected] &Guoliang Kang11footnotemark: 1
Beihang University
[email protected]
&Efstratios Gavves
University of Amsterdam
[email protected]
Abstract

Contrastive Vision-Language Pre-training (CLIP) demonstrates impressive zero-shot capability. The key to improve the adaptation of CLIP to downstream task with few exemplars lies in how to effectively model and transfer the useful knowledge embedded in CLIP. Previous work mines the knowledge typically based on the limited visual samples and close-set semantics (i.e., within target category set of downstream task). However, the aligned CLIP image/text encoders contain abundant relationships between visual features and almost infinite open semantics, which may benefit the few-shot learning but remains unexplored. In this paper, we propose to mine open semantics as anchors to perform a relation transition from image-anchor relationship to image-target relationship to make predictions. Specifically, we adopt a transformer module which takes the visual feature as "Query", the text features of the anchors as "Key" and the similarity matrix between the text features of anchor and target classes as "Value". In this way, the output of such a transformer module represents the relationship between the image and target categories, i.e., the classification predictions. To avoid manually selecting the open semantics, we make the [CLASS] token of input text embedding learnable. We conduct extensive experiments on eleven representative classification datasets. The results show that our method performs favorably against previous state-of-the-arts considering few-shot classification settings.

1 Introduction

Recent years have witnessed the rising of large-scale vision-language pre-trained models radford2021learning ; jia2021scaling ; yao2021filip ; li2022fine ; lee2022uniclip ; gao2022pyramidclip ; zhou2022non . Among those visio-language models, CLIP radford2021learning is a widely-adopted representative due to its superior zero-shot capability on downstream tasks wang2024learn . CLIP jointly trains the text and image encoders with image-text pairing supervision. Benefiting from the shared image and text feature space, CLIP can be directly adopted to recognize images from novel categories by examining the degrees of alignment between image features and text features of novel categories. Despite the amazing zero-shot performance of CLIP, there remains huge potential to improve the adaptation ability of CLIP if a few exemplars from the downstream visual task are accessible during training.

There are a few works that investigate how to improve the few-shot adaptation performance of CLIP radford2021learning . The key to improve the adaptation of CLIP lies in how to extract and transfer the useful knowledge of CLIP in terms of specific downstream tasks. Most previous works extract the transferable knowledge only with limited visual samples and close-set definitions of categories (i.e., the category set of downstream task), including prompt tuning zhou2022learning ; zhou2022conditional , classifier fine-tuning he2022synthetic , or image-image relation modeling zhang2022tip , etc. However, as we know, CLIP is trained with large-scale open datasets and thus contains abundant relationships across almost infinite open semantics. How to extract and utilize such semantic relation knowledge to improve the few-shot learning of CLIP remains unexplored.

Refer to caption
Figure 1: Transit Image-Anchor Relationship to Image-Target Relationship via Anchor-Target Relation Matrix.

As shown in Fig. 1, we aim to recognize samples from target categories (green), i.e., "tiger-cat", "chinchilla", "anteater", and "lemur". We manually select some other categories as anchors (blue), i.e., "tiger", "cat", "kangaroo", and "pig". In order to recognize the relation between image samples and target categories, we may first compute the similarities between image samples and the anchors to denote the image-anchor relation. We may also model the anchor-target relation by computing the similarities between anchors and target categories, as shown in Fig. 1. Through the transition of anchor-target relation matrix, the image-target relations can be obtained. A reasonable prior is that the image-target relations should keep consistent after relation transition, e.g., an image of "chinchilla" should be classified into the "chinchilla" category, no matter whether relation transition is performed.

In this paper, we propose to mine open semantics as anchors to perform a relation transition from image-anchor relation to image-target relation to facilitating few-shot learning. Inspired by the illustration of Fig. 1, we design the relation transition module (RTM) as a transformer decoder architecture which takes the visual feature as "Query", the text features of the anchors as "Key" and the similarity matrix between the text features of anchors and target classes as "Value". In this way, the output of RTM represents the relationship between the image and target categories, i.e., the classification predictions. To avoid manually selecting the open semantics, we make the [CLASS] token of input text embedding learnable. During training, we freeze the CLIP encoders and impose cross-entropy loss with labeled visual samples to update the learnable [CLASS] token and the transformer module. Via the transition of open semantics, we expect richer semantic relationships can be assembled to reduce the prediction error. We name our framework as Relation Transition with Open Semantics (RTOS). We conduct extensive experiments on eleven few-shot benchmarks and show that our method performs favorably against previous state-of-the-arts.

In a nutshell, our contributions can be summarized as follows

  • We propose a new perspective which aims to utilize the abundant semantic knowledge encoded in CLIP to benefit the few-shot learning task, i.e., mining open semantics as anchors to perform the relation transition. Via the transition of open semantics, we expect richer semantic relationships can be assembled to reduce the prediction error.

  • We propose to learn the open semantics via a learnable semantic token of text input and use a transformer module to perform relation transition from image-anchor relation to image-target relation. With such designs, we avoid manually selecting anchors and make semantic relation modeling more adapted to the downstream task.

  • Extensive experiments demonstrate our RTOS performs favorably against previous state-of-the-arts, e.g., on Flowers102 datasets, we achieve 80.8%, outperforming previous state-of-the-art method by 7.3%. Moreover, ablations verify the effectiveness and necessity of each of our designs.

2 Related Work

2.1 Traditional Few-shot Learning

In the domain of traditional few-shot learning, researchers commonly curate multiple few-shot train and test sets, each comprising a limited number of training samples along with their corresponding test samples. Earlier research on few-shot learning primarily focused on generative models that employed intricate iterative inference methodologies fei2006one ; lake2011one . The success of deep learning-based methods in data-rich environments krizhevsky2017imagenet ; he2016deep ; simonyan2014very has led to a growing interest in adapting these approaches to handle few-shot learning scenarios. To this end, various successful methods for few-shot learning have been proposed and leveraged techniques like meta-learning, metric learning, transfer learning, and transductive learning. Meta-learning-based approaches finn2017model ; ravi2017optimization involve training an auxiliary parameterization net that is capable of learning how to parameterize a feed-forward classification problem in terms of few-shot sample set. Metric-learning-based approaches bateni2020improved ; snell2017prototypical ; vinyals2016matching ; sung2018learning aim to learn a set of projection functions such that when represented in this embedding, images are easy to recognize using simple nearest neighbor or linear classifiers. Transfer-learning-based methods hariharan2017low ; qi2018low typically work by pre-training a deep neural network on a source dataset with abundant labeled examples, and fine-tuning the network on a few-shot learning task with limited labeled examples in the target dataset to improve performance. Transductive-learning-based methods dhillon2019baseline ; joachims1999transductive ; liu2018learning typically work by leveraging the unlabeled examples in the target dataset to improve the few-shot learning performance, by exploiting the similarities and relationships between unlabeled and labeled examples.

2.2 VLM-based Few-shot Learning

Recently, AI has grown towards the dominant paradigm of learning foundation models bommasani2021opportunities from large-scale web data. The development of Vision-Language Pre-trained Models (VLM) radford2021learning ; jia2021scaling ; yao2021filip ; li2022fine ; lee2022uniclip ; gao2022pyramidclip ; zhou2022non has been particularly rapid, thanks to the low cost of collecting image-text pairs from the web. With the emergence of Vision-Language Models, a new few-shot evaluation protocol has been developed, as implemented by recent works on few-shot adaptation with CLIP radford2021learning . In this new protocol, the meta-training phase is replaced with pre-trained CLIP models, and the official test splits of each dataset are utilized as the test sets. Prompt tuning zhou2022learning ; zhou2022conditional and adapter zhang2022tip ; gao2021clip have emerged as representative parameter-efficient learning paradigms that adapt VLM to downstream tasks. Prompt tuning of CLIP radford2021learning ; zhou2022conditional ; lu2022prompt ; xing2022class ; zhu2022prompt ; sun2022dualcoop draws inspiration from the successful implementation of prefix-tuning in language models deng2022rlprompt ; gao2020making ; haviv2021bertese ; jiang2020can , specifically aiming to extract broader text features. Similarly, CLIP-Adapter gao2021clip and Tip-Adapter zhang2022tip draw inspiration from parameter-efficient fine-tuning methods houlsby2019parameter ; jia2022visual ; zhang2020side that optimize lightweight MLPs while kee** the encoder frozen. These approaches mine the knowledge typically based on the limited visual samples and close-set semantics. However, the aligned CLIP image/text encoders contain abundant relationships between visual features and almost infinite open semantics, which may benefit the few-shot learning but remains unexplored. In this paper, we focus on mining open semantics as anchors to perform a relation transition from image-anchor relationship to image-target relationship to make predictions.

3 Method

Refer to caption
Figure 2: The Framework of RTOS. (a) Anchor-target Relation Modelling: Input with target categories and anchors, CLIP text encoder is adopted to extract target features tarsubscript𝑡𝑎𝑟\mathcal{F}_{tar}caligraphic_F start_POSTSUBSCRIPT italic_t italic_a italic_r end_POSTSUBSCRIPT and anchor features prosubscript𝑝𝑟𝑜\mathcal{F}_{pro}caligraphic_F start_POSTSUBSCRIPT italic_p italic_r italic_o end_POSTSUBSCRIPT, which are used to model anchor-target relation matrix \mathcal{R}caligraphic_R between target categories and anchors. (b) Relation Transition: Input a test image \mathcal{I}caligraphic_I, CLIP image encoder is adopted to extract the image visual feature. Next, the Relation Transition Module (RTM) ΦΦ\Phiroman_Φ is applied to transition image-anchor relation to image-target relation. CLIP classifier is initialized using target features tarsubscript𝑡𝑎𝑟\mathcal{F}_{tar}caligraphic_F start_POSTSUBSCRIPT italic_t italic_a italic_r end_POSTSUBSCRIPT. The knowledge acquired from relation transition is then integrated with CLIP’s pre-trained knowledge, enabling accurate prediction.

The overall framework of Relation Transition with Open Semantics (RTOS) we proposed is shown in Fig. 2. Given an image to recognize, we obtain its feature through CLIP’s image encoder. We construct two groups of text inputs. One group corresponds to the target classes, which consists of manually designed prompts and different target class names; the other one corresponds to the anchors, which consists of manually designed prompts and different [CLASS] tokens. We make the [CLASS] token of anchor text input learnable to mine open semantics from CLIP. Then the anchor-target relation matrix is constructed by computing similarities between anchor text features and target text features (Sec. 3.1). In our framework, we design a relation transition module (RTM) based on the transformer architecture which takes image features as Query, anchor text features as Key, and anchor-target relation matrix as Value. The output of RTM denotes the relations between image and target classes. Finally, we combine the output of RTM with the output of CLIP’s zero-shot classifier to produce the final prediction (Sec. 3.2).

3.1 Anchor-target Relation Modelling

In this paper, our primary task focuses on zero-shot and few-shot image classification, where we aim to classify images into their respective categories with limited or no training examples available for the target categories. The target categories serve as a set of labels for the classification task. Our primary goal is to mine these open semantics to improve the CLIP’s classification performance. We use anchors as cues derived from the open semantics within CLIP. Anchors serve as a bridge to connect and transition image features to their corresponding target categories in image classification task. To be more specific, anchors can take different forms depending on their origin and semantics. For instance, anchors can be category names sampled from cross-dataset category list, which carry explicit semantic information. Alternatively, they may be randomly initialized class tokens that do not hold any clear semantic information. We model the anchor-target relation, which is used to transition the image-anchor relation to image-target relation.

The anchor-target relation is modeled using the CLIP text encoder. Following zhou2022learning , we define the target categories prompts 𝒯tarsubscript𝒯𝑡𝑎𝑟\mathcal{T}_{tar}caligraphic_T start_POSTSUBSCRIPT italic_t italic_a italic_r end_POSTSUBSCRIPT and the anchors prompts 𝒯ancsubscript𝒯𝑎𝑛𝑐\mathcal{T}_{anc}caligraphic_T start_POSTSUBSCRIPT italic_a italic_n italic_c end_POSTSUBSCRIPT given to the text encoder as follows:

𝒯tar,isubscript𝒯𝑡𝑎𝑟𝑖\displaystyle\mathcal{T}_{tar,i}\;caligraphic_T start_POSTSUBSCRIPT italic_t italic_a italic_r , italic_i end_POSTSUBSCRIPT =[𝒗1,𝒗2,,𝒗M,𝒔tar,i]absentsubscript𝒗1subscript𝒗2subscript𝒗𝑀subscript𝒔𝑡𝑎𝑟𝑖\displaystyle=[\boldsymbol{v}_{1},\boldsymbol{v}_{2},...,\boldsymbol{v}_{M},% \boldsymbol{s}_{tar,i}\;]= [ bold_italic_v start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , bold_italic_v start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , bold_italic_v start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT , bold_italic_s start_POSTSUBSCRIPT italic_t italic_a italic_r , italic_i end_POSTSUBSCRIPT ] (1)
𝒯anc,jsubscript𝒯𝑎𝑛𝑐𝑗\displaystyle\mathcal{T}_{anc,j}caligraphic_T start_POSTSUBSCRIPT italic_a italic_n italic_c , italic_j end_POSTSUBSCRIPT =[𝒗1,𝒗2,,𝒗M,𝒔anc,j]absentsubscript𝒗1subscript𝒗2subscript𝒗𝑀subscript𝒔𝑎𝑛𝑐𝑗\displaystyle=[\boldsymbol{v}_{1},\boldsymbol{v}_{2},...,\boldsymbol{v}_{M},% \boldsymbol{s}_{anc,j}]= [ bold_italic_v start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , bold_italic_v start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , bold_italic_v start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT , bold_italic_s start_POSTSUBSCRIPT italic_a italic_n italic_c , italic_j end_POSTSUBSCRIPT ] (2)

where i{1,2,,Ctar}𝑖12subscript𝐶𝑡𝑎𝑟i\in\{1,2,...,C_{tar}\}italic_i ∈ { 1 , 2 , … , italic_C start_POSTSUBSCRIPT italic_t italic_a italic_r end_POSTSUBSCRIPT } is the target categories index, Ctarsubscript𝐶𝑡𝑎𝑟C_{tar}italic_C start_POSTSUBSCRIPT italic_t italic_a italic_r end_POSTSUBSCRIPT denotes the number of target categories, 𝒔tar,isubscript𝒔𝑡𝑎𝑟𝑖\boldsymbol{s}_{tar,i}bold_italic_s start_POSTSUBSCRIPT italic_t italic_a italic_r , italic_i end_POSTSUBSCRIPT denotes word embedding of the i𝑖iitalic_i-th target categories name sisubscripts𝑖\mathrm{s}_{i}roman_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, j{1,2,,Canc}𝑗12subscript𝐶𝑎𝑛𝑐j\in\{1,2,...,C_{anc}\}italic_j ∈ { 1 , 2 , … , italic_C start_POSTSUBSCRIPT italic_a italic_n italic_c end_POSTSUBSCRIPT } is the anchors index, Cancsubscript𝐶𝑎𝑛𝑐C_{anc}italic_C start_POSTSUBSCRIPT italic_a italic_n italic_c end_POSTSUBSCRIPT denotes the number of anchors, the 𝒔anc,jsubscript𝒔𝑎𝑛𝑐𝑗\boldsymbol{s}_{anc,j}bold_italic_s start_POSTSUBSCRIPT italic_a italic_n italic_c , italic_j end_POSTSUBSCRIPT denotes word embedding of the j𝑗jitalic_j-th anchors name sjsubscripts𝑗\mathrm{s}_{j}roman_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT. The anchors are initialized by cross-dataset category list or randomly initialized in the experiments. [𝒗1,𝒗2,,𝒗M]subscript𝒗1subscript𝒗2subscript𝒗𝑀[\boldsymbol{v}_{1},\boldsymbol{v}_{2},...,\boldsymbol{v}_{M}][ bold_italic_v start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , bold_italic_v start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , bold_italic_v start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ] denotes word embedding of the prompt sentences prefix (e.g. "A photo of {}.").

After obtaining the text feature, as shown in Fig. 2 (a), by forwarding the target categories prompts 𝒯tarsubscript𝒯𝑡𝑎𝑟\mathcal{T}_{tar}caligraphic_T start_POSTSUBSCRIPT italic_t italic_a italic_r end_POSTSUBSCRIPT and the anchors prompts 𝒯ancsubscript𝒯𝑎𝑛𝑐\mathcal{T}_{anc}caligraphic_T start_POSTSUBSCRIPT italic_a italic_n italic_c end_POSTSUBSCRIPT to the CLIP text encoder, we can obtain the target L2 normalized features tarCtar×Dsubscript𝑡𝑎𝑟superscriptsubscript𝐶𝑡𝑎𝑟𝐷\mathcal{F}_{tar}\in\mathbb{R}^{C_{tar}\times D}caligraphic_F start_POSTSUBSCRIPT italic_t italic_a italic_r end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_C start_POSTSUBSCRIPT italic_t italic_a italic_r end_POSTSUBSCRIPT × italic_D end_POSTSUPERSCRIPT and the anchor L2 normalized features ancCanc×Dsubscript𝑎𝑛𝑐superscriptsubscript𝐶𝑎𝑛𝑐𝐷\mathcal{F}_{anc}\in\mathbb{R}^{C_{anc}\times D}caligraphic_F start_POSTSUBSCRIPT italic_a italic_n italic_c end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_C start_POSTSUBSCRIPT italic_a italic_n italic_c end_POSTSUBSCRIPT × italic_D end_POSTSUPERSCRIPT. Where D𝐷Ditalic_D denotes the feature dimension of the CLIP’s visual-language feature space (e.g. D=512𝐷512D=512italic_D = 512 for ResNet50 he2016deep backbone in CLIP). The features of target categories and anchors are used to build anchor-target relation matrix Ctar×Cancsuperscriptsubscript𝐶𝑡𝑎𝑟subscript𝐶𝑎𝑛𝑐\mathcal{R}\in\mathbb{R}^{C_{tar}\times C_{anc}}caligraphic_R ∈ blackboard_R start_POSTSUPERSCRIPT italic_C start_POSTSUBSCRIPT italic_t italic_a italic_r end_POSTSUBSCRIPT × italic_C start_POSTSUBSCRIPT italic_a italic_n italic_c end_POSTSUBSCRIPT end_POSTSUPERSCRIPT to model the anchor-target relation. The anchor-target relation matrix \mathcal{R}caligraphic_R is defined in Eq. 3.

i,j=exp(cos(tar,i,anc,j)/τ)j=1Cancexp(cos(tar,i,anc,j)/τ)subscript𝑖𝑗subscript𝑡𝑎𝑟𝑖subscript𝑎𝑛𝑐𝑗𝜏superscriptsubscript𝑗1subscript𝐶𝑎𝑛𝑐subscript𝑡𝑎𝑟𝑖subscript𝑎𝑛𝑐𝑗𝜏\mathcal{R}_{i,j}=\frac{\exp(\cos(\mathcal{F}_{tar,i},\mathcal{F}_{anc,j})/% \tau)}{\sum\nolimits_{j=1}^{C_{anc}}\exp(\cos(\mathcal{F}_{tar,i},\mathcal{F}_% {anc,j})/\tau)}caligraphic_R start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT = divide start_ARG roman_exp ( roman_cos ( caligraphic_F start_POSTSUBSCRIPT italic_t italic_a italic_r , italic_i end_POSTSUBSCRIPT , caligraphic_F start_POSTSUBSCRIPT italic_a italic_n italic_c , italic_j end_POSTSUBSCRIPT ) / italic_τ ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_C start_POSTSUBSCRIPT italic_a italic_n italic_c end_POSTSUBSCRIPT end_POSTSUPERSCRIPT roman_exp ( roman_cos ( caligraphic_F start_POSTSUBSCRIPT italic_t italic_a italic_r , italic_i end_POSTSUBSCRIPT , caligraphic_F start_POSTSUBSCRIPT italic_a italic_n italic_c , italic_j end_POSTSUBSCRIPT ) / italic_τ ) end_ARG (3)

3.2 Relation Transition

Upon establishing anchor-target relation, we employ the Relation Transition Module (RTM) ΦΦ\Phiroman_Φ to transform the image-anchor relation into the image-target relation for classification prediction. This process essentially involves mining the open semantics embedded within the CLIP framework, which in turn ensures better representation and recognition of the target categories. As shown in Fig. 2 (b), by forwarding the test image H×W×3superscript𝐻𝑊3\mathcal{I}\in\mathbb{R}^{H\times W\times 3}caligraphic_I ∈ blackboard_R start_POSTSUPERSCRIPT italic_H × italic_W × 3 end_POSTSUPERSCRIPT to the CLIP image encoder, we can obtain the test image L2 normalized feature ftestDsubscript𝑓𝑡𝑒𝑠𝑡superscript𝐷f_{test}\in\mathbb{R}^{D}italic_f start_POSTSUBSCRIPT italic_t italic_e italic_s italic_t end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_D end_POSTSUPERSCRIPT. The Relation Transition Module (RTM) ΦΦ\Phiroman_Φ is a cross-attention module, which accepts the image visual feature ftestsubscript𝑓𝑡𝑒𝑠𝑡f_{test}italic_f start_POSTSUBSCRIPT italic_t italic_e italic_s italic_t end_POSTSUBSCRIPT as "Query", anchor features ancsubscript𝑎𝑛𝑐\mathcal{F}_{anc}caligraphic_F start_POSTSUBSCRIPT italic_a italic_n italic_c end_POSTSUBSCRIPT as "Key", and the anchor-target relation matrix \mathcal{R}caligraphic_R as "Value" to transition image-anchor relation 𝒫ancCancsubscript𝒫𝑎𝑛𝑐superscriptsubscript𝐶𝑎𝑛𝑐\mathcal{P}_{anc}\in\mathbb{R}^{C_{anc}}caligraphic_P start_POSTSUBSCRIPT italic_a italic_n italic_c end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_C start_POSTSUBSCRIPT italic_a italic_n italic_c end_POSTSUBSCRIPT end_POSTSUPERSCRIPT to image-target relation 𝒫tarCtarsubscript𝒫𝑡𝑎𝑟superscriptsubscript𝐶𝑡𝑎𝑟\mathcal{P}_{tar}{\color[rgb]{1,0,0}\in\mathbb{R}^{C_{tar}}}caligraphic_P start_POSTSUBSCRIPT italic_t italic_a italic_r end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_C start_POSTSUBSCRIPT italic_t italic_a italic_r end_POSTSUBSCRIPT end_POSTSUPERSCRIPT, defined as 𝒫tar=Φ(ftest,anc,)subscript𝒫𝑡𝑎𝑟Φsubscript𝑓𝑡𝑒𝑠𝑡subscript𝑎𝑛𝑐\mathcal{P}_{tar}=\Phi(f_{test},\mathcal{F}_{anc},\mathcal{R})caligraphic_P start_POSTSUBSCRIPT italic_t italic_a italic_r end_POSTSUBSCRIPT = roman_Φ ( italic_f start_POSTSUBSCRIPT italic_t italic_e italic_s italic_t end_POSTSUBSCRIPT , caligraphic_F start_POSTSUBSCRIPT italic_a italic_n italic_c end_POSTSUBSCRIPT , caligraphic_R ).

Based on the reasonable prior that the image-target relations remain consistent through the relation transition process, it is possible to perform image classification prediction in the zero-shot setting by utilizing the relationships between images, target categories, and anchors. To accomplish this, we first map the images and target categories onto the anchor space. Next, we determine the image-target relations within this mapped space. Based on this assumption, the anchor-target relation \mathcal{R}caligraphic_R acts as the map** for target categories onto anchors. Similarly, we can define the map** of an image onto anchors through Eq. 4.

𝒫anc,j=exp(cos(ftest,anc,j)/τ)j=1Cancexp(cos(ftest,anc,j)/τ)subscript𝒫𝑎𝑛𝑐𝑗subscript𝑓𝑡𝑒𝑠𝑡subscript𝑎𝑛𝑐𝑗𝜏superscriptsubscript𝑗1subscript𝐶𝑎𝑛𝑐subscript𝑓𝑡𝑒𝑠𝑡subscript𝑎𝑛𝑐𝑗𝜏\mathcal{P}_{anc,j}=\frac{\exp(\cos(f_{test},\mathcal{F}_{anc,j})/\tau)}{\sum_% {j=1}^{C_{anc}}\exp(\cos(f_{test},\mathcal{F}_{anc,j})/\tau)}caligraphic_P start_POSTSUBSCRIPT italic_a italic_n italic_c , italic_j end_POSTSUBSCRIPT = divide start_ARG roman_exp ( roman_cos ( italic_f start_POSTSUBSCRIPT italic_t italic_e italic_s italic_t end_POSTSUBSCRIPT , caligraphic_F start_POSTSUBSCRIPT italic_a italic_n italic_c , italic_j end_POSTSUBSCRIPT ) / italic_τ ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_C start_POSTSUBSCRIPT italic_a italic_n italic_c end_POSTSUBSCRIPT end_POSTSUPERSCRIPT roman_exp ( roman_cos ( italic_f start_POSTSUBSCRIPT italic_t italic_e italic_s italic_t end_POSTSUBSCRIPT , caligraphic_F start_POSTSUBSCRIPT italic_a italic_n italic_c , italic_j end_POSTSUBSCRIPT ) / italic_τ ) end_ARG (4)

Then, the Relation Transition Module (RTM) performs a relation transition from image-anchor relation 𝒫ancsubscript𝒫𝑎𝑛𝑐\mathcal{P}_{anc}caligraphic_P start_POSTSUBSCRIPT italic_a italic_n italic_c end_POSTSUBSCRIPT to image-target relation 𝒫tarsubscript𝒫𝑡𝑎𝑟\mathcal{P}_{tar}caligraphic_P start_POSTSUBSCRIPT italic_t italic_a italic_r end_POSTSUBSCRIPT. This transition is defined as

𝒫tar,i=exp(cos(𝒫anc,j)/τ)i=1Ctarexp(cos(𝒫anc,i)/τ).subscript𝒫𝑡𝑎𝑟𝑖subscript𝒫𝑎𝑛𝑐subscript𝑗superscript𝜏superscriptsubscript𝑖1subscript𝐶𝑡𝑎𝑟subscript𝒫𝑎𝑛𝑐subscript𝑖superscript𝜏\mathcal{P}_{tar,i}=\frac{\exp(\cos(\mathcal{P}_{anc},\mathcal{R}_{j})/\tau^{% \prime})}{\sum_{i=1}^{C_{tar}}\exp(\cos(\mathcal{P}_{anc},\mathcal{R}_{i})/% \tau^{\prime})}.caligraphic_P start_POSTSUBSCRIPT italic_t italic_a italic_r , italic_i end_POSTSUBSCRIPT = divide start_ARG roman_exp ( roman_cos ( caligraphic_P start_POSTSUBSCRIPT italic_a italic_n italic_c end_POSTSUBSCRIPT , caligraphic_R start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) / italic_τ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_C start_POSTSUBSCRIPT italic_t italic_a italic_r end_POSTSUBSCRIPT end_POSTSUPERSCRIPT roman_exp ( roman_cos ( caligraphic_P start_POSTSUBSCRIPT italic_a italic_n italic_c end_POSTSUBSCRIPT , caligraphic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) / italic_τ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) end_ARG . (5)

By following this approach, we are able to make use of the open semantics within the pre-trained CLIP framework, which allows for better representation and recognition of the target categories, ultimately leading to improved performance in image classification tasks.

When we have a limited amount of data available for training, it is possible that the manually designed relation transitions are not optimal. Therefore, we employ the Transformer model, allowing the network to learn more effective relation transition strategies autonomously. We use the transformer architecture as RTM ΦΦ\Phiroman_Φ. The image-target relation is obtained by

𝒫tar=Transformer(ftest,anc,).subscript𝒫𝑡𝑎𝑟Transformersubscript𝑓𝑡𝑒𝑠𝑡subscript𝑎𝑛𝑐\mathcal{P}_{tar}=\text{Transformer}(f_{test},\mathcal{F}_{anc},\mathcal{R}).caligraphic_P start_POSTSUBSCRIPT italic_t italic_a italic_r end_POSTSUBSCRIPT = Transformer ( italic_f start_POSTSUBSCRIPT italic_t italic_e italic_s italic_t end_POSTSUBSCRIPT , caligraphic_F start_POSTSUBSCRIPT italic_a italic_n italic_c end_POSTSUBSCRIPT , caligraphic_R ) . (6)

Then, the relation-based prediction 𝒫tarsubscript𝒫𝑡𝑎𝑟\mathcal{P}_{tar}caligraphic_P start_POSTSUBSCRIPT italic_t italic_a italic_r end_POSTSUBSCRIPT works in conjunction with CLIP’s zero-shot prediction 𝒫clipsubscript𝒫𝑐𝑙𝑖𝑝\mathcal{P}_{clip}caligraphic_P start_POSTSUBSCRIPT italic_c italic_l italic_i italic_p end_POSTSUBSCRIPT to produce the final prediction 𝒫finalsubscript𝒫𝑓𝑖𝑛𝑎𝑙\mathcal{P}_{final}caligraphic_P start_POSTSUBSCRIPT italic_f italic_i italic_n italic_a italic_l end_POSTSUBSCRIPT.

𝒫final=𝒫clip+α𝒫tarsubscript𝒫𝑓𝑖𝑛𝑎𝑙subscript𝒫𝑐𝑙𝑖𝑝𝛼subscript𝒫𝑡𝑎𝑟\mathcal{P}_{final}=\mathcal{P}_{clip}+\alpha\mathcal{P}_{tar}caligraphic_P start_POSTSUBSCRIPT italic_f italic_i italic_n italic_a italic_l end_POSTSUBSCRIPT = caligraphic_P start_POSTSUBSCRIPT italic_c italic_l italic_i italic_p end_POSTSUBSCRIPT + italic_α caligraphic_P start_POSTSUBSCRIPT italic_t italic_a italic_r end_POSTSUBSCRIPT (7)

Using this method, we harness anchors as open semantics, which enable the effective relation transition from image-anchor relation to image-target relation.

3.3 Training and Inference

Zero-shot Setting. In the zero-shot setting, our method does not require any training, and we refer to it as ZS-RTOS. First, we manually select anchors from cross-dataset category list. Then, we build the anchor-target relation matrix \mathcal{R}caligraphic_R to model anchor-target relationship. The anchor-target relation matrix only needs to be created once, which means that we only need to perform one forward pass of the CLIP text encoder. Once the anchor-target relation is established, the CLIP text encoder can be discarded. Next, the Relation Transition Module ΦΦ\Phiroman_Φ is applied to convert the image-anchor relation 𝒫ancsubscript𝒫𝑎𝑛𝑐\mathcal{P}_{anc}caligraphic_P start_POSTSUBSCRIPT italic_a italic_n italic_c end_POSTSUBSCRIPT to the image-target relation 𝒫tarsubscript𝒫𝑡𝑎𝑟\mathcal{P}_{tar}caligraphic_P start_POSTSUBSCRIPT italic_t italic_a italic_r end_POSTSUBSCRIPT. Finally, the relation-based prediction 𝒫tarsubscript𝒫𝑡𝑎𝑟\mathcal{P}_{tar}caligraphic_P start_POSTSUBSCRIPT italic_t italic_a italic_r end_POSTSUBSCRIPT collaborates with CLIP’s zero-shot prediction 𝒫zssubscript𝒫𝑧𝑠\mathcal{P}_{zs}caligraphic_P start_POSTSUBSCRIPT italic_z italic_s end_POSTSUBSCRIPT to produce the final prediction 𝒫finalsubscript𝒫𝑓𝑖𝑛𝑎𝑙\mathcal{P}_{final}caligraphic_P start_POSTSUBSCRIPT italic_f italic_i italic_n italic_a italic_l end_POSTSUBSCRIPT.

Few-shot Setting. ZS-RTOS offers a substantial boost to CLIP by incorporating fresh open semantics relation transition knowledge. Moreover, RTOS remains compatible with situations that involve training data. In the few-shot scenario, a limited amount of data can aid RTOS in acquiring superior anchors, thereby elevating its performance even further. RTOS focuses on updating the anchor embeddings (which can be pre-designed or randomly initialized) to learn anchor classes that provide significant performance improvements to CLIP. For few-shot setting, we use cross entropy loss to update RTOS. With the updates made to RTOS, it attains state-of-the-art performance on 11 widely adopted datasets.

In addition to the relation transition method based on the consistency prior described above, we also explored a relation transition approach based on the total probability formula. In this approach, the anchor-target relationship is defined in Eq. 8, and the relation transition can be expressed as 𝒫tar=𝒫ancsubscript𝒫𝑡𝑎𝑟subscript𝒫𝑎𝑛𝑐\mathcal{P}_{tar}=\mathcal{R}\mathcal{P}_{anc}caligraphic_P start_POSTSUBSCRIPT italic_t italic_a italic_r end_POSTSUBSCRIPT = caligraphic_R caligraphic_P start_POSTSUBSCRIPT italic_a italic_n italic_c end_POSTSUBSCRIPT.

i,j=exp(cos(tar,i,anc,j)/τ)i=1Ctarexp(cos(tar,i,anc,j)/τ)subscript𝑖𝑗subscript𝑡𝑎𝑟𝑖subscript𝑎𝑛𝑐𝑗𝜏superscriptsubscript𝑖1subscript𝐶𝑡𝑎𝑟subscript𝑡𝑎𝑟𝑖subscript𝑎𝑛𝑐𝑗𝜏\mathcal{R}_{i,j}=\frac{\exp(\cos(\mathcal{F}_{tar,i},\mathcal{F}_{anc,j})/% \tau)}{\sum\nolimits_{i=1}^{C_{tar}}\exp(\cos(\mathcal{F}_{tar,i},\mathcal{F}_% {anc,j})/\tau)}caligraphic_R start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT = divide start_ARG roman_exp ( roman_cos ( caligraphic_F start_POSTSUBSCRIPT italic_t italic_a italic_r , italic_i end_POSTSUBSCRIPT , caligraphic_F start_POSTSUBSCRIPT italic_a italic_n italic_c , italic_j end_POSTSUBSCRIPT ) / italic_τ ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_C start_POSTSUBSCRIPT italic_t italic_a italic_r end_POSTSUBSCRIPT end_POSTSUPERSCRIPT roman_exp ( roman_cos ( caligraphic_F start_POSTSUBSCRIPT italic_t italic_a italic_r , italic_i end_POSTSUBSCRIPT , caligraphic_F start_POSTSUBSCRIPT italic_a italic_n italic_c , italic_j end_POSTSUBSCRIPT ) / italic_τ ) end_ARG (8)

Moreover, we also investigated the image-image relation method. In this approach, we use the training image features as the input keys for the Relation Transition Model (RTM), and the one-hot labels of the training images serve as the RTM’s values. The performance of the methods based on the total probability formula and image-image relation is evaluated in Experiment Section.

4 Experiments

Method

Average

EuroSAT

Caltech101

DTD

UCF101

StanfordCars

OxfordPets

SUN397

Flowers102

FGVCAircraft

ImageNet

Food101

CLIP 58.9 37.5 85.9 42.2 61.4 55.7 85.8 58.5 66.0 17.1 60.3 77.3
Random 59.0 36.2 86.1 42.9 61.7 55.9 85.9 58.9 66.2 17.2 60.4 77.3
Selected 59.4 39.3 87.5 43.3 62.0 55.8 85.8 58.9 66.1 17.2 60.4 77.4
Table 1: Accuracy of zero-shot learning on the 11 datasets. The "Average" denotes the average accuracy over the 11 datasets. The "Random" denotes the accuracy of method with randomly initialized anchor class embeddings. The "Selected" denotes the accuracy of method with manually selected anchors. The class names of anchors are chosen from the class definitions in the other 10 datasets.
Method Number of shots
1 2 4 8 16
✧ Linear probe CLIP zhou2022learning 36.7 47.6 57.2 65.0 71.1
✧ CoOp zhou2022learning 59.6 62.3 66.8 69.9 73.4
✧ CLIP-Adapter gao2021clip 62.7 65.5 68.6 71.3 74.4
✧ ProGrad zhu2022prompt 62.6 64.9 68.5 71.4 73.9
✧ Tip-Adapter zhang2022tip 62.3 64.6 66.5 68.5 70.3
✧ Tip-Adapter-F zhang2022tip 64.6 66.7 69.7 72.4 75.8
✧ RTOS 65.5 67.4 70.3 72.7 75.8
✧ RTOS-IR 66.1 68.2 71.1 73.6 76.6
✦ Synthetic he2022synthetic 60.5 62.9 67.0 69.9 73.1
✦ RTOS 62.4 64.0 66.9 69.2 72.5
✦ RTOS-IR 62.7 64.6 67.6 70.2 73.5
Table 2: Comparison with previous few-shot methods. ✧ denotes the average accuracy of 11 datasets, ✦ denotes the average accuracy of 8 datasets reported in he2022synthetic . Our methods outperform existing state-of-the-art methods in few-shot classification settings.

4.1 Experimental Setups

Datasets Following CoOp zhou2022learning , we conduct experiments for RTOS on 11 widely-used image classification datasets: ImageNet deng2009imagenet , StandfordCars krause20133d , UCF101 soomro2012ucf101 , Caltech101 fei2004learning , Flowers102 nilsback2008automated , SUN397 xiao2010sun , DTD cimpoi2014describing , EuroSAT helber2019eurosat , FGVCAircraft maji2013fine , OxfordPets parkhi2012cats , and Food101 bossard2014food . These datasets constitute a comprehensive benchmark, which covers a diverse set of vision tasks including the classification of generic objects, scenes, actions, and fine-grained categories, as well as specialized tasks like recognizing textures and satellite imagery.

Implementation Details We follow the data preprocessing protocol in CLIP radford2021learning , which is composed of random crop**, resizing, and random horizontal flip. Following Tip-Adapter zhang2022tip , we adopt prompt ensembling for experiments on ImageNet and use single handcrafted prompt on the other 10 datasets. For the CLIP radford2021learning backbone, we utilize ResNet-50 he2016deep as the visual encoder. We obtain the pre-trained weights of both encoders from radford2021learning and freeze them during training. The batch size is set to 256. We adopt the AdamW kingma2014adam optimizer with learning rate set to 0.00001 and a cosine scheduler. Following Tip-Adapter zhang2022tip , we train 100 epochs on the EuroSAT dataset and 20 epochs on the other 10 datasets. We set τ𝜏\tauitalic_τ and τsuperscript𝜏\tau^{\prime}italic_τ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT to 0.01. We set hype-parameter α𝛼\alphaitalic_α following Tip-Adapter zhang2022tip . For the zero-shot setting, we directly test the model’s performance on the full test set. Besides, following radford2021learning , we consider the 1-, 2-, 4-, 8-, 16-shot settings, where we utilize 1, 2, 4, 8, 16 labeled samples to train the model and then evaluate the trained model on the full test set.

4.2 Comparison under Zero-Shot Setting

We conduct experiments with different anchor selection ways under zero-shot setting, and compare them to the zero-shot CLIP baseline in Tab. 1. The average accuracy across the 11 datasets is reported for comparison. As shown in Tab. 1, without any training and extra annotations, selecting anchors to perform relation transition performs favorably against the zero-shot CLIP baseline. Specifically, the method with manually selected anchors outperforms CLIP by 0.5%. Notably, on the EuroSAT and Caltech101 datasets, the zero-shot accuracy of ours outperforms CLIP by 1.8% and 1.6% respectively. These improvements justify that with carefully selected anchors, performing the relation transition benefits the adaption of CLIP to downstream tasks. However, method with randomly selected anchors barely yields accuracy improvement, which inspires us to determine the effective anchors in a learnable way.

Alias Knowledge Number of shots
Cons. Prob. Image. 1 2 4 8 16
RTOS-C 65.0 67.0 70.0 72.4 75.4
RTOS-P 63.3 63.9 65.2 67.4 71.6
RTOS 65.5 67.4 70.3 72.7 75.8
- 65.9 68.0 70.9 73.5 76.6
RTOS-IR 66.1 68.2 71.1 73.6 76.6
Table 3: Different configuration of our methods. Average accuracy of the 11 datasets. Cons. denotes the approach based on consistency prior, Prob. denotes the approach based on the total probability formula. Image. denotes the approach based on image-image relation. The methods highlighted in gray represent the approaches we compare in the experimental Sec. 4.3.

4.3 Comparison with SOTA few-shot methods

Refer to caption
Figure 3: Accuracy of zero-shot and few-shot learning on the 11 datasets. Overall, Zero-shot RTOS improves Zero-shot CLIP without any data. RTOS consistently surpasses all previous start-of-the-art methods by efficiently fine-tuning the anchor class embeddings.

We compare our method to previous state-of-the-art methods including Zero-shot CLIP radford2021learning , Linear-probe CLIP radford2021learning , CoOp zhou2022learning , CLIP-Adapter gao2021clip , ProGrad zhu2022prompt , Tip-Adapter zhang2022tip , and Synthetic he2022synthetic in Tab. 2 and Fig. 3. With several annotated images to train the [Class] tokens of anchors, the RTOS achieves new state-of-the-arts on average. In particular, on the EuroSAT dataset, RTOS surpasses Tip-Adapter-F by 8.4% and 3.9% in 1-shot and 2-shot settings. After integrating with the image-image relations (see Tab. 4.3), the RTOS-IR outperforms all existing methods in all few-shot settings. For example, RTOS-IR obviously outperforms Tip-Adapter-F by 1.2% and 0.8% in 8-shot and 16-shot settings. The results demonstrate that mining open semantic relations from CLIP greatly benefits few-shot learning, which justifies the design of our framework.

4.4 Ablation Studies

Refer to caption
Figure 4: Ablation on the relation between target categories and anchors. (a) illustrates the performance gain in accuracy of CLIP on the Caltech101 dataset when using different datasets’ categories as anchors. (b) show association heatmaps between different anchors and target categories. (c) show association heatmaps before and after training.

Effect of different choices of anchors. In this section, we investigate which type of anchor will benefit the adaptation of CLIP. We conduct experiments with our ZS-RTOS framework on the Caltech101 fei2004learning dataset. We use all 11 datasets’ categories as anchors and evaluate corresponding performance gains. As shown in Fig. 4 (a), different instantiations of anchors yield different zero-shot accuracy. For example, using categories from DTD cimpoi2014describing dataset as anchors achieves 1.79% improvement compared to the zero-shot CLIP baseline, while using categories from Caltech101 itself or from Flowers102 dataset as anchors don’t bring any performance improvement.

We also examine the anchor-target relation matrix in Fig. 4 (b) to investigate the powerful pattern which may benefit the relation transition and final prediction. As we observed, if the marginal distribution of target classes (which can be obtained by averaging all the rows of the anchor-target relation matrix) is more balanced, the performance gain will be larger. For example, the marginal distribution of target classes is more balanced for DTD (1.79% gain) than for Flowers102 (no gain). Moreover, as shown in Fig. 4 (c), the relation matrix between learned anchors and target classes exhibits similar pattern to that using DTD categories as anchors, verifying that our learnable way mines beneficial open semantics to facilitate effective relation transition.

Ablation Studies on RTOS-C
shots 0 1 2 4 8 16
human-designed 87.5 89.6 90.3 90.9 92.1 92.4
(a) Init. Method rand. initialized 86.1 89.8 90.3 91.3 92.9 93.8
text features - 89.6 90.4 91.0 92.2 92.6
(b) Tuned Params. class embeddings - 89.8 90.3 91.3 92.9 93.8
direct - 88.4 89.1 89.8 90.9 91.6
(c) Relation Transition Module transformer - 89.8 90.3 91.3 92.9 93.8
Table 4: Classification Accuracy on Caltect101. (a) Ablation studies on different initialization methods. (b) Ablation studies on different finetuned parameters. (c) Ablation studies on directly using Eq. 4 and Eq. 5 or using transformer as relation transition modele. The methods marked in gray indicate the configurations adopted by the RTOS.

Effect of different initialization ways of anchors. Tab. 4 (a) shows the zero-shot (0) and few-shot (1-16) performance on Caltect101 dataset under two different anchor class initialization methods. The human-designed method achieves 87.5% under the zero-shot setting, outperforming the random initialization method by 0.6%. On the contrary, under different few-shot settings, the random initialization method outperforms the human-designed method. That is because the random initialization of anchor embeddings allows for more flexible learning of task-beneficial semantics from available data. On the contrary, using manually designed categories as anchors may restrict such learning and adaptation processes and thus achieve sub-optimal results.

Effect of different parameters to tune. Tab. 4 (b) presents the classification accuracy on the Caltect101 dataset for various finetuned parameters. The results indicate that fine-tuning anchor class embeddings is a more effective method compared to directly fine-tuning text features. Directly fine-tuning text features may have a risk of overfitting to the training data.

Effect of learnable relation transition. The results in Tab. 4 (c) demonstrate that employing a transformer to learn different relation transitions outperforms the direct relation transition via multiplication between image-anchor relation vector and anchor-target relation matrix, showing the learnable way can better depict the relation transition process.

Shots Number of Anchors
5 10 20 40 60 80 100 150 200
1 88.2 89.3 88.9 88.9 89.0 89.6 88.7 88.7 88.5
2 89.3 89.2 89.5 89.3 89.5 89.7 89.4 89.3 89.7
4 90.4 90.8 90.1 90.1 90.4 91.0 90.8 90.8 90.6
8 90.2 90.8 91.0 91.4 91.2 92.0 91.5 91.9 91.4
16 90.9 92.0 92.7 92.8 92.6 92.9 93.0 92.8 93.1
Table 5: Ablation studies on the number of anchors on Caltech101 dataset.

Effect of anchor numbers. In this section, we investigate the influence of the number of anchor classes over the few-shot performance of RTOS-C. As shown in Tab. 5, when the number of anchor categories is smaller than the number of target categories, RTOS-C brings negligible improvements. When the number of anchor categories is set to 80, RTOS-C achieves the best performance.

5 Conclusions

In this paper, we aim to utilize the abundant semantic knowledge encoded in CLIP to benefit the few-shot learning task by mining open semantics as anchors to perform the relation transition. To this end, we propose RTOS, which learns the open semantics via a learnable semantic token of text input and uses a transformer module to perform relation transition from image-anchor relation to image-target relation. Extensive experiments verify the effectiveness of our proposed method.

Broader Impact and Limitations Our method will not introduce bias but it may be impacted by the bias contained in CLIP. Our method mines semantic knowledge from CLIP to benefit the few-shot learning. There still remain other useful priors which may benefit the CLIP-based few-shot learning to be investigated in the future.

References

  • (1) A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning, pp. 8748–8763, PMLR, 2021.
  • (2) C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in International Conference on Machine Learning, pp. 4904–4916, PMLR, 2021.
  • (3) L. Yao, R. Huang, L. Hou, G. Lu, M. Niu, H. Xu, X. Liang, Z. Li, X. Jiang, and C. Xu, “Filip: fine-grained interactive language-image pre-training,” arXiv preprint arXiv:2111.07783, 2021.
  • (4) J. Li, X. He, L. Wei, L. Qian, L. Zhu, L. Xie, Y. Zhuang, Q. Tian, and S. Tang, “Fine-grained semantically aligned vision-language pre-training,” Advances in Neural Information Processing Systems, 2022.
  • (5) J. Lee, J. Kim, H. Shon, B. Kim, S. H. Kim, H. Lee, and J. Kim, “Uniclip: Unified framework for contrastive language-image pre-training,” Advances in Neural Information Processing Systems, 2022.
  • (6) Y. Gao, J. Liu, Z. Xu, J. Zhang, K. Li, R. Ji, and C. Shen, “Pyramidclip: Hierarchical feature alignment for vision-language model pretraining,” Advances in Neural Information Processing Systems, vol. 35, pp. 35959–35970, 2022.
  • (7) J. Zhou, L. Dong, Z. Gan, L. Wang, and F. Wei, “Non-contrastive learning meets language-image pre-training,” arXiv preprint arXiv:2210.09304, 2022.
  • (8) J. Wang and G. Kang, “Learn to rectify the bias of clip for unsupervised semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4102–4112, 2024.
  • (9) K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Learning to prompt for vision-language models,” International Journal of Computer Vision, vol. 130, no. 9, pp. 2337–2348, 2022.
  • (10) K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Conditional prompt learning for vision-language models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16816–16825, 2022.
  • (11) R. He, S. Sun, X. Yu, C. Xue, W. Zhang, P. Torr, S. Bai, and X. Qi, “Is synthetic data from generative models ready for image recognition?,” arXiv preprint arXiv:2210.07574, 2022.
  • (12) R. Zhang, W. Zhang, R. Fang, P. Gao, K. Li, J. Dai, Y. Qiao, and H. Li, “Tip-adapter: Training-free adaption of clip for few-shot classification,” in Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 493–510, Springer, 2022.
  • (13) L. Fei-Fei, R. Fergus, and P. Perona, “One-shot learning of object categories,” IEEE transactions on pattern analysis and machine intelligence, vol. 28, no. 4, pp. 594–611, 2006.
  • (14) B. Lake, R. Salakhutdinov, J. Gross, and J. Tenenbaum, “One shot learning of simple visual concepts,” in Proceedings of the annual meeting of the cognitive science society, 2011.
  • (15) A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Communications of the ACM, vol. 60, no. 6, pp. 84–90, 2017.
  • (16) K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
  • (17) K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • (18) C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in International conference on machine learning, pp. 1126–1135, PMLR, 2017.
  • (19) S. Ravi and H. Larochelle, “Optimization as a model for few-shot learning,” in International conference on learning representations, 2017.
  • (20) P. Bateni, R. Goyal, V. Masrani, F. Wood, and L. Sigal, “Improved few-shot visual classification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14493–14502, 2020.
  • (21) J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot learning,” Advances in neural information processing systems, vol. 30, 2017.
  • (22) O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al., “Matching networks for one shot learning,” Advances in neural information processing systems, vol. 29, 2016.
  • (23) F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales, “Learning to compare: Relation network for few-shot learning,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1199–1208, 2018.
  • (24) B. Hariharan and R. Girshick, “Low-shot visual recognition by shrinking and hallucinating features,” in Proceedings of the IEEE international conference on computer vision, pp. 3018–3027, 2017.
  • (25) H. Qi, M. Brown, and D. G. Lowe, “Low-shot learning with imprinted weights,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5822–5830, 2018.
  • (26) G. S. Dhillon, P. Chaudhari, A. Ravichandran, and S. Soatto, “A baseline for few-shot image classification,” arXiv preprint arXiv:1909.02729, 2019.
  • (27) T. Joachims et al., “Transductive inference for text classification using support vector machines,” in Icml, vol. 99, pp. 200–209, 1999.
  • (28) Y. Liu, J. Lee, M. Park, S. Kim, E. Yang, S. J. Hwang, and Y. Yang, “Learning to propagate labels: Transductive propagation network for few-shot learning,” arXiv preprint arXiv:1805.10002, 2018.
  • (29) R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, et al., “On the opportunities and risks of foundation models,” arXiv preprint arXiv:2108.07258, 2021.
  • (30) P. Gao, S. Geng, R. Zhang, T. Ma, R. Fang, Y. Zhang, H. Li, and Y. Qiao, “Clip-adapter: Better vision-language models with feature adapters,” arXiv preprint arXiv:2110.04544, 2021.
  • (31) Y. Lu, J. Liu, Y. Zhang, Y. Liu, and X. Tian, “Prompt distribution learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5206–5215, 2022.
  • (32) Y. Xing, Q. Wu, D. Cheng, S. Zhang, G. Liang, and Y. Zhang, “Class-aware visual prompt tuning for vision-language pre-trained model,” arXiv preprint arXiv:2208.08340, 2022.
  • (33) B. Zhu, Y. Niu, Y. Han, Y. Wu, and H. Zhang, “Prompt-aligned gradient for prompt tuning,” arXiv preprint arXiv:2205.14865, 2022.
  • (34) X. Sun, P. Hu, and K. Saenko, “Dualcoop: Fast adaptation to multi-label recognition with limited annotations,” arXiv preprint arXiv:2206.09541, 2022.
  • (35) M. Deng, J. Wang, C.-P. Hsieh, Y. Wang, H. Guo, T. Shu, M. Song, E. P. Xing, and Z. Hu, “Rlprompt: Optimizing discrete text prompts with reinforcement learning,” arXiv preprint arXiv:2205.12548, 2022.
  • (36) T. Gao, A. Fisch, and D. Chen, “Making pre-trained language models better few-shot learners,” arXiv preprint arXiv:2012.15723, 2020.
  • (37) A. Haviv, J. Berant, and A. Globerson, “Bertese: Learning to speak to bert,” arXiv preprint arXiv:2103.05327, 2021.
  • (38) Z. Jiang, F. F. Xu, J. Araki, and G. Neubig, “How can we know what language models know?,” Transactions of the Association for Computational Linguistics, vol. 8, pp. 423–438, 2020.
  • (39) N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly, “Parameter-efficient transfer learning for nlp,” in International Conference on Machine Learning, pp. 2790–2799, PMLR, 2019.
  • (40) M. Jia, L. Tang, B.-C. Chen, C. Cardie, S. Belongie, B. Hariharan, and S.-N. Lim, “Visual prompt tuning,” in Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXIII, pp. 709–727, Springer, 2022.
  • (41) J. O. Zhang, A. Sax, A. Zamir, L. Guibas, and J. Malik, “Side-tuning: a baseline for network adaptation via additive side networks,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16, pp. 698–714, Springer, 2020.
  • (42) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255, Ieee, 2009.
  • (43) J. Krause, M. Stark, J. Deng, and L. Fei-Fei, “3d object representations for fine-grained categorization,” in Proceedings of the IEEE international conference on computer vision workshops, pp. 554–561, 2013.
  • (44) K. Soomro, A. R. Zamir, and M. Shah, “Ucf101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012.
  • (45) L. Fei-Fei, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories,” in 2004 conference on computer vision and pattern recognition workshop, pp. 178–178, IEEE, 2004.
  • (46) M.-E. Nilsback and A. Zisserman, “Automated flower classification over a large number of classes,” in 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729, IEEE, 2008.
  • (47) J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “Sun database: Large-scale scene recognition from abbey to zoo,” in 2010 IEEE computer society conference on computer vision and pattern recognition, pp. 3485–3492, IEEE, 2010.
  • (48) M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi, “Describing textures in the wild,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3606–3613, 2014.
  • (49) P. Helber, B. Bischke, A. Dengel, and D. Borth, “Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 12, no. 7, pp. 2217–2226, 2019.
  • (50) S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi, “Fine-grained visual classification of aircraft,” arXiv preprint arXiv:1306.5151, 2013.
  • (51) O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. Jawahar, “Cats and dogs,” in 2012 IEEE conference on computer vision and pattern recognition, pp. 3498–3505, IEEE, 2012.
  • (52) L. Bossard, M. Guillaumin, and L. Van Gool, “Food-101–mining discriminative components with random forests,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI 13, pp. 446–461, Springer, 2014.
  • (53) D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.