License: CC BY 4.0
arXiv:2402.10099v1 [cs.CV] 15 Feb 2024

Any-Shift Prompting for Generalization over Distributions

    Zehao Xiao1        Jiayi Shen1        Mohammad Mahdi Derakhshani1
    Shengcai Liao2        Cees G. M. Snoek1
   1University of Amsterdam    2Core42
Abstract

Image-language models with prompt learning have shown remarkable advances in numerous downstream vision tasks. Nevertheless, conventional prompt learning methods overfit their training distribution and lose the generalization ability on test distributions. To improve generalization across various distribution shifts, we propose any-shift prompting: a general probabilistic inference framework that considers the relationship between training and test distributions during prompt learning. We explicitly connect training and test distributions in the latent space by constructing training and test prompts in a hierarchical architecture. Within this framework, the test prompt exploits the distribution relationships to guide the generalization of the CLIP image-language model from training to any test distribution. To effectively encode the distribution information and their relationships, we further introduce a transformer inference network with a pseudo-shift training mechanism. The network generates the tailored test prompt with both training and test information in a feedforward pass, avoiding extra training costs at test time. Extensive experiments on twenty-three datasets demonstrate the effectiveness of any-shift prompting on the generalization over various distribution shifts.

1 Introduction

Figure 1: Any-shift prompting. (a) Various distribution shifts in real-world applications. (b) We propose any-shift prompting that aggregates training and test information for jointly handling individual distribution shifts and their combinations.

Recent image-language foundation models like CLIP [55] show remarkable advances in various computer vision tasks. Benefiting from pre-training on large datasets of image-text pairs, these models perform well when adapted to downstream tasks by manual prompts [40, 51, 59, 56] and prompt learning [87, 86]. However, conventional prompt learning approaches struggle to handle distribution shifts in downstream tasks [65, 9]. The learned prompts usually overfit their training data, leading to performance degradation on unseen test distributions.

To improve the generalization of prompt learning, recent methods introduce uncertainty into the learnable prompt [9] or fine-tune the prompt on each test sample with extra unsupervised optimization [65, 61]. Nevertheless, these methods do not explicitly consider the relationships between the training and test distributions of the downstream tasks. In real-world applications, distribution shifts are usually complex and unpredictable: models may encounter different distribution shifts (Figure 1 (a)) and even their combinations. Hence, we deem it crucial to explore the relationships between training and test distributions for the generalization of prompting across different distribution shifts. To this end, we make three contributions in this paper.

First, we propose any-shift prompting, a general probabilistic inference framework that can explore distribution relationships in prompt learning. Specifically, we introduce probabilistic training and test prompts in a hierarchical architecture to explicitly connect the training and test distributions. Within this framework, the test prompt encodes the test information and the relationships of the training and test distributions, thereby improving the generalization ability on various test distributions (Figure 1 (b)).

Second, we propose a pseudo-shift training mechanism, where the hierarchical probabilistic model learns the ability to encode distribution relationships by simulating distribution shifts. Consequently, at test time, our method generalizes to any specific distribution by generating a tailored prompt on the fly in just one feedforward process, without the need for re-learning or fine-tuning.

Third, to effectively and comprehensively encode the distribution information and their relationships, we design a transformer inference network for prompt generation. The transformer takes test information of both image and label space features, as well as the training prompts, as inputs. It then aggregates the training and test information and their relationships into the test-specific prompt. The test prompt is utilized to guide both the feature extraction and classification processes to generate test-specific features and classifiers, which bolsters robust predictions across distribution shifts.

We validate our method through extensive experiments on twenty-three benchmarks with various distribution shifts, including covariate shift, label shift, conditional shift, concept shift, and even joint shift. The results demonstrate the effectiveness of the proposed method on generalization across various distribution shifts.

2 Preliminary

We propose any-shift prompting based on CLIP [55] to handle various distribution shifts in a general way. Here we provide the technical background on CLIP as well as definitions of distribution shifts considered.

CLIP model. Contrastive Language-Image Pre-training (CLIP) [55] consists of an image encoder $f_{\Phi_I}(\mathbf{x})$ and a text encoder $f_{\Phi_T}(\mathbf{l})$, which are trained by a contrastive loss on a large dataset of image-language pairs $(\mathbf{x}, \mathbf{l})$. For a downstream classification task with an input image $\mathbf{x}$ and a set of class names $\mathcal{Y} = \{c_i\}_{i=1}^{C}$, the image feature is extracted by $\mathbf{z} = f_{\Phi_I}(\mathbf{x})$ and the classifiers are composed of a set of text features $\{\mathbf{t}_i\}_{i=1}^{C}$, where $\mathbf{t}_i = f_{\Phi_T}(\mathbf{l}_i)$. Here, $\mathbf{l}_i$ is a manually crafted prompt that describes the corresponding class name $c_i$, e.g., "an image of a [class]." Thus, the prediction function of the CLIP model for downstream tasks without fine-tuning is formulated as:

$$p(\mathbf{y}|\mathbf{x},\mathcal{Y}) = \mathrm{softmax}(\mathbf{z}^{\top}\mathbf{t}). \tag{1}$$

This enables the pre-trained CLIP model to perform zero-shot classification in various downstream tasks.
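As a minimal sketch of eq. (1), the snippet below scores a toy image feature against toy text features and applies a softmax. The feature values are made up for illustration, and the L2 normalization and `temperature` argument are additions that mirror common CLIP practice rather than anything stated in eq. (1).

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length.
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_predict(z, text_features, temperature=1.0):
    """Class probabilities softmax(z^T t), following eq. (1).

    z: image feature; text_features: one feature t_i per class prompt l_i.
    Normalization and temperature are assumptions added for this sketch.
    """
    z = l2_normalize(z)
    logits = [
        sum(a * b for a, b in zip(z, l2_normalize(t))) / temperature
        for t in text_features
    ]
    return softmax(logits)

# Toy stand-ins for the encoder outputs f_{Phi_I}(x) and f_{Phi_T}(l_i).
z = [0.9, 0.1, 0.0]
t = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
probs = zero_shot_predict(z, t, temperature=0.01)
predicted_class = probs.index(max(probs))
```

In a real pipeline the two feature lists would come from the frozen CLIP encoders; nothing here is trained.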

Distribution shifts. A data distribution is generally denoted as $p(\mathbf{x},\mathbf{y})$, which is a joint distribution of the input data $\mathbf{x}$ and the label $\mathbf{y}$. The models are usually trained on a training distribution $p(\mathbf{x}_s,\mathbf{y}_s)$ and then deployed on test distributions $p(\mathbf{x}_t,\mathbf{y}_t)$. In real-world applications, differences between the training and test distributions are known as the joint distribution shift:

$$p(\mathbf{x}_s,\mathbf{y}_s) \neq p(\mathbf{x}_t,\mathbf{y}_t). \tag{2}$$

Joint distribution shift       $p(\mathbf{x}_s,\mathbf{y}_s) \neq p(\mathbf{x}_t,\mathbf{y}_t)$
Partial distribution shifts
  Covariate shift              $p(\mathbf{x}_s) \neq p(\mathbf{x}_t)$    $p(\mathbf{y}_s|\mathbf{x}_s) = p(\mathbf{y}_t|\mathbf{x}_t)$
  Label shift                  $p(\mathbf{y}_s) \neq p(\mathbf{y}_t)$    $p(\mathbf{x}_s|\mathbf{y}_s) = p(\mathbf{x}_t|\mathbf{y}_t)$
  Concept shift                $p(\mathbf{x}_s) = p(\mathbf{x}_t)$       $p(\mathbf{y}_s|\mathbf{x}_s) \neq p(\mathbf{y}_t|\mathbf{x}_t)$
  Conditional shift            $p(\mathbf{y}_s) = p(\mathbf{y}_t)$       $p(\mathbf{x}_s|\mathbf{y}_s) \neq p(\mathbf{x}_t|\mathbf{y}_t)$
Table 1: Common distribution shifts. The joint distribution shift is usually decomposed into four partial shifts, which are investigated individually in the literature. By contrast, we focus in this paper on various shifts and even consider their combinations.

Common distribution shifts in the literature. Due to the joint distribution shift, the performance of the trained model degrades on the test data [71, 39], sometimes significantly so. Since the joint distribution shift is complex, previous methods limit the scope of the problem and simplify the joint distribution shift to different partial distribution shifts. From a Bayesian perspective, the joint distribution is decomposed into $p(\mathbf{x},\mathbf{y}) = p(\mathbf{x})\,p(\mathbf{y}|\mathbf{x}) = p(\mathbf{y})\,p(\mathbf{x}|\mathbf{y})$. According to the different components in the decomposition, we summarize the partial distribution shifts into four different definitions in Table 1 and detail them one by one.

Covariate shift [68, 34, 63] assumes the distribution shift occurs only in the input space $p(\mathbf{x})$ while the distribution of labels given the input features, $p(\mathbf{y}|\mathbf{x})$, remains the same, e.g., under image corruptions [24] or changing image styles [34, 54]. Covariate shift is widely investigated by domain generalization [85, 34, 77] and domain adaptation methods [71, 39]. Label shift focuses on the opposite problem, where the label distributions $p(\mathbf{y})$ are different but the label-conditional distributions $p(\mathbf{x}|\mathbf{y})$ are the same [69, 58]. Previous methods generate datasets with a uniform label distribution $p(\mathbf{y})$ during training and different distributions at test time [22, 2, 74]. The classification of unknown classes can be treated as a special and more severe case of label shift [41, 64, 86], where $p(\mathbf{y}) = 0$ for the unknown classes. Concept shift treats the input distribution $p(\mathbf{x})$ as the same while the conditional distributions $p(\mathbf{y}|\mathbf{x})$ differ, indicating different annotation methods for the same data distribution [42]. Conditional shift assumes the label distribution is the same while the conditional distributions $p(\mathbf{x}|\mathbf{y})$ differ [41, 81, 18], where different classes can have their own shift protocols on the input data, e.g., sub-population problems [62, 32].
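The definitions in Table 1 can be checked numerically on a small discrete example. The sketch below builds two toy joint distributions over binary inputs and labels that share the same $p(\mathbf{y}|\mathbf{x})$ but differ in $p(\mathbf{x})$, i.e., a covariate shift, and reports which components of the two decompositions differ. All probabilities and helper names are illustrative assumptions.

```python
def marginal(joint, axis):
    # Sum a joint table {(x, y): p} over the other variable.
    out = {}
    for (x, y), p in joint.items():
        key = x if axis == 0 else y
        out[key] = out.get(key, 0.0) + p
    return out

def conditional(joint, given_axis):
    # p(other | given), keyed by (x, y); assumes no zero marginals.
    m = marginal(joint, given_axis)
    return {
        (x, y): p / m[x if given_axis == 0 else y]
        for (x, y), p in joint.items()
    }

def differs(a, b, tol=1e-9):
    # True if two tables with the same keys disagree anywhere.
    return any(abs(a[k] - b[k]) > tol for k in a)

def compare(p_s, p_t):
    # Which components of p(x)p(y|x) = p(y)p(x|y) change from train to test?
    return {
        "p(x)": differs(marginal(p_s, 0), marginal(p_t, 0)),
        "p(y)": differs(marginal(p_s, 1), marginal(p_t, 1)),
        "p(y|x)": differs(conditional(p_s, 0), conditional(p_t, 0)),
        "p(x|y)": differs(conditional(p_s, 1), conditional(p_t, 1)),
    }

def make_joint(p_x, p_y_given_x):
    # Build p(x, y) = p(x) p(y|x) from its two factors.
    return {(x, y): p_x[x] * p_y_given_x[x][y] for x in p_x for y in p_y_given_x[x]}

# Same labeling rule p(y|x), different input marginals p(x): covariate shift.
labeling = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
train = make_joint({0: 0.5, 1: 0.5}, labeling)
test = make_joint({0: 0.8, 1: 0.2}, labeling)
report = compare(train, test)
```

Here `report` flags $p(\mathbf{x})$ as changed and $p(\mathbf{y}|\mathbf{x})$ as unchanged, matching the covariate-shift row. It also flags $p(\mathbf{y})$ and $p(\mathbf{x}|\mathbf{y})$ as changed, which illustrates that a shift that is "partial" under one decomposition generally alters both factors of the other.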

Distribution shifts in this paper. Conventional prompting methods [87, 86] learn the prompt on the training distribution of the downstream task, which makes the prompt prone to overfitting and vulnerable to the above shifts [9, 65]. Moreover, in real-world scenarios, all distribution shifts may happen unpredictably, and even simultaneously. Hence, we propose to encode test information and the training-test relationships for generalization over distributions. Our method is not designed for specific partial distribution shifts. Instead, it is proposed to handle various shifts, even when they occur simultaneously.

3 Any-Shift Prompting

3.1 Prompt modeling

We propose any-shift prompting, a general probabilistic inference framework to explore distribution relationships. Specifically, we introduce training and test prompts as latent variables in a hierarchical architecture. The graphical model of our method is provided in Figure 2.

Training prompt. The intuitive idea of adapting the CLIP model is to inject the downstream training data $\mathcal{D}_s$ in a training prompt for prediction (eq. (1)). $\mathcal{D}_s$ consists of training input-output pairs sampled from the distribution $p(\mathbf{x}_s,\mathbf{y}_s)$. The predictive function of CLIP for the test distribution $p(\mathbf{x}_t,\mathbf{y}_t)$ is then formulated as:

$$p_{\Phi}(\mathbf{y}_t|\mathbf{x}_t,\mathcal{Y}_t,\mathcal{D}_s) \propto p_{\Phi}(\mathbf{y}_t|\mathbf{x}_t,\mathbf{v}_s,\mathcal{Y}_t)\,p(\mathbf{v}_s|\mathcal{D}_s), \tag{3}$$

where $\Phi$ denotes the frozen parameters of the image and text encoders of the CLIP model. Here $\mathbf{v}_s$ is the training prompt that encodes the training downstream task information, which improves the performance of the CLIP model on the training distribution. However, the prompt $\mathbf{v}_s$ usually overfits the training data, which may not benefit and may even harm the prediction on the unseen test distribution due to the distribution shifts at test time.

Probabilistic test prompt. To generalize across distribution shifts in downstream tasks at test time, we further introduce a probabilistic test prompt within a hierarchical Bayes framework to encode the information of test distributions. Specifically, the test prompt $\mathbf{v}_t$ is inferred from the training prompt $\mathbf{v}_s$ and the accessible test information, i.e., a test image $\mathbf{x}_t$ and the class names $\mathcal{Y}_t$. To build the connections between the training and test prompts, we take the training prompt $\mathbf{v}_s$ as a condition for the generation of the test prompt. This enables the method to generate the test prompt across different shifts by considering the relationships between training and test distributions and exploring relevant training information. By introducing $\mathbf{v}_t$, the CLIP prediction function is formulated as:

$$\begin{aligned} &p_{\Phi,\bm{\theta}}(\mathbf{y}_t|\mathbf{x}_t,\mathcal{Y}_t,\mathcal{D}_s) \\ &= \iint p_{\Phi}(\mathbf{y}_t|\mathbf{x}_t,\mathbf{v}_t,\mathcal{Y}_t)\,p_{\bm{\theta}}(\mathbf{v}_t|\mathbf{v}_s,\mathbf{x}_t,\mathcal{Y}_t)\,p(\mathbf{v}_s|\mathcal{D}_s)\,d\mathbf{v}_t\,d\mathbf{v}_s, \end{aligned} \tag{4}$$

where $\bm{\theta}$ denotes the learnable inference network for the test prompt. With the probabilistic test prompt, we provide a general way to incorporate the training and test information, as well as their relationships, into the prediction of the CLIP model, enabling it to generalize on any test distribution.
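One way to read eq. (4) is as ancestral sampling: draw a training prompt, draw a test prompt conditioned on it and on the test input, predict, and average. The sketch below does this with Monte Carlo samples. The two functions `toy_test_prompt_prior` and `toy_likelihood` are hypothetical stand-ins for $p_{\bm{\theta}}(\mathbf{v}_t|\mathbf{v}_s,\mathbf{x}_t,\mathcal{Y}_t)$ and $p_{\Phi}(\mathbf{y}_t|\mathbf{x}_t,\mathbf{v}_t,\mathcal{Y}_t)$, chosen only so the example runs. They are not the transformer inference network or the CLIP encoders, and the diagonal-Gaussian form is an assumption.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_gaussian(mean, std, rng):
    # Draw one sample from a diagonal Gaussian.
    return [rng.gauss(m, s) for m, s in zip(mean, std)]

def toy_test_prompt_prior(v_s, x_t, class_feats):
    # Hypothetical p_theta(v_t | v_s, x_t, Y_t): the mean mixes the training
    # prompt, the test image feature, and the average class feature.
    dim = len(v_s)
    class_mean = [sum(t[d] for t in class_feats) / len(class_feats) for d in range(dim)]
    mean = [(v_s[d] + x_t[d] + class_mean[d]) / 3.0 for d in range(dim)]
    return mean, [0.1] * dim

def toy_likelihood(x_t, v_t, class_feats):
    # Hypothetical p_Phi(y_t | x_t, v_t, Y_t): the test prompt shifts both the
    # image feature and every class feature before the softmax.
    z = [a + b for a, b in zip(x_t, v_t)]
    logits = [sum(zi * (ti + vi) for zi, ti, vi in zip(z, t, v_t)) for t in class_feats]
    return softmax(logits)

def predict(x_t, class_feats, vs_mean, vs_std, n_s=8, n_t=8, seed=0):
    """Monte Carlo estimate of the double integral in eq. (4)."""
    rng = random.Random(seed)
    total = [0.0] * len(class_feats)
    for _ in range(n_s):
        v_s = sample_gaussian(vs_mean, vs_std, rng)            # v_s ~ p(v_s | D_s)
        vt_mean, vt_std = toy_test_prompt_prior(v_s, x_t, class_feats)
        for _ in range(n_t):
            v_t = sample_gaussian(vt_mean, vt_std, rng)        # v_t ~ p_theta(v_t | ...)
            probs = toy_likelihood(x_t, v_t, class_feats)
            total = [a + p for a, p in zip(total, probs)]
    return [a / (n_s * n_t) for a in total]

x_t = [1.0, 0.0]
class_feats = [[1.0, 0.0], [0.0, 1.0]]
probs = predict(x_t, class_feats, vs_mean=[0.0, 0.0], vs_std=[0.05, 0.05])
```

The nesting of the two loops mirrors the hierarchy of the model: the test prompt is always sampled conditional on a specific training prompt.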

Figure 2: Graphical model for any-shift prompting. We introduce probabilistic training and test prompts in a hierarchical inference framework to explore distribution relationships.

Variational test prompt. To optimize the model for generating the probabilistic test prompt in eq. (4), we use variational inference to approximate the true posterior $p(\mathbf{v}_t,\mathbf{v}_s|\mathcal{D}_t,\mathcal{Y}_t,\mathcal{D}_s)$ with a variational posterior that is factorized as:

$$q_{\bm{\theta}}(\mathbf{v}_t,\mathbf{v}_s|\mathcal{D}_t,\mathcal{Y}_t,\mathcal{D}_s) = q_{\bm{\theta}}(\mathbf{v}_t|\mathbf{v}_s,\mathcal{D}_t,\mathcal{Y}_t)\,p(\mathbf{v}_s|\mathcal{D}_s), \tag{5}$$

where $\mathcal{D}_t$ consists of test input-output pairs sampled from the test distribution $p(\mathbf{x}_t,\mathbf{y}_t)$. The variational posterior of the test prompt shares the same inference model $\bm{\theta}$ with its prior. By integrating eq. (5) into eq. (4), we derive the evidence lower bound (ELBO) of the predictive function as:

$$\begin{aligned} \log p_{\Phi,\bm{\theta}}(\mathbf{y}_t|\mathbf{x}_t,\mathcal{Y}_t,\mathcal{D}_s) \geq\; &\mathbb{E}_{q_{\bm{\theta}}(\mathbf{v}_t,\mathbf{v}_s)}\big[\log p_{\Phi}(\mathbf{y}_t|\mathbf{x}_t,\mathbf{v}_t,\mathcal{Y}_t)\big] \\ &- \mathbb{D}_{\mathrm{KL}}\big[q_{\bm{\theta}}(\mathbf{v}_t|\mathbf{v}_s,\mathcal{D}_t,\mathcal{Y}_t)\,||\,p_{\bm{\theta}}(\mathbf{v}_t|\mathbf{v}_s,\mathbf{x}_t,\mathcal{Y}_t)\big]. \end{aligned} \tag{6}$$

The variational posterior of the test prompt $q_{\bm{\theta}}(\mathbf{v}_t)$ encodes more input-output information of the test distribution and their relationships, yielding a more representative test prompt for better generalization on the test distributions. We provide the step-by-step derivations in the supplemental material.
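For orientation, the bound in eq. (6) can be obtained with a standard Jensen's-inequality argument. The sketch below is a reconstruction from the factorizations in eqs. (4) and (5), not a copy of the supplemental derivation; $q$ abbreviates $q_{\bm{\theta}}(\mathbf{v}_t,\mathbf{v}_s|\mathcal{D}_t,\mathcal{Y}_t,\mathcal{D}_s)$. It writes out the expectation over $\mathbf{v}_s$ around the KL term, which eq. (6) leaves implicit.

```latex
\begin{aligned}
\log p_{\Phi,\bm{\theta}}(\mathbf{y}_t|\mathbf{x}_t,\mathcal{Y}_t,\mathcal{D}_s)
&= \log \iint q\,
   \frac{p_{\Phi}(\mathbf{y}_t|\mathbf{x}_t,\mathbf{v}_t,\mathcal{Y}_t)\,
         p_{\bm{\theta}}(\mathbf{v}_t|\mathbf{v}_s,\mathbf{x}_t,\mathcal{Y}_t)\,
         p(\mathbf{v}_s|\mathcal{D}_s)}
        {q_{\bm{\theta}}(\mathbf{v}_t|\mathbf{v}_s,\mathcal{D}_t,\mathcal{Y}_t)\,
         p(\mathbf{v}_s|\mathcal{D}_s)}
   \,d\mathbf{v}_t\,d\mathbf{v}_s \\
&\geq \mathbb{E}_{q}\big[\log p_{\Phi}(\mathbf{y}_t|\mathbf{x}_t,\mathbf{v}_t,\mathcal{Y}_t)\big]
   + \mathbb{E}_{q}\left[\log
     \frac{p_{\bm{\theta}}(\mathbf{v}_t|\mathbf{v}_s,\mathbf{x}_t,\mathcal{Y}_t)}
          {q_{\bm{\theta}}(\mathbf{v}_t|\mathbf{v}_s,\mathcal{D}_t,\mathcal{Y}_t)}\right] \\
&= \mathbb{E}_{q}\big[\log p_{\Phi}(\mathbf{y}_t|\mathbf{x}_t,\mathbf{v}_t,\mathcal{Y}_t)\big]
   - \mathbb{E}_{p(\mathbf{v}_s|\mathcal{D}_s)}\Big[
     \mathbb{D}_{\mathrm{KL}}\big[q_{\bm{\theta}}(\mathbf{v}_t|\mathbf{v}_s,\mathcal{D}_t,\mathcal{Y}_t)\,||\,
     p_{\bm{\theta}}(\mathbf{v}_t|\mathbf{v}_s,\mathbf{x}_t,\mathcal{Y}_t)\big]\Big].
\end{aligned}
```

The first line multiplies and divides the integrand of eq. (4) by the variational posterior of eq. (5). The second applies Jensen's inequality to the concave logarithm, and the shared factor $p(\mathbf{v}_s|\mathcal{D}_s)$ cancels inside the ratio.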

Notably, the variational posteriors and ELBO are intractable since large numbers of test samples and their ground truth labels in $\mathcal{D}_t$ are usually unavailable at test time. Thus, in the next section, we propose a pseudo-shift training setup to approximate the ELBO for any-shift prompting.

Figure 3: Transformer inference network of the pseudo-test prompt. The prior (a) of the pseudo-test prompt is inferred by aggregating the pseudo-training prompt, a single image, and all class names of the pseudo-test distribution. The posterior (b) is inferred from the shared pseudo-training prompt, a batch of pseudo-test images, and corresponding class names. Therefore, the posterior incorporates more pseudo-test information and relationships and guides the prior to learn the same knowledge by KL divergence. The image and text encoders of CLIP are frozen. Only the shared transformer, pseudo-training prompt distribution, and MLP networks are trainable, saving training costs.

3.2 Training and inference

Pseudo-shift training mechanism. To approximate the intractable ELBO in eq. (6), we develop a pseudo-shift training mechanism. Specifically, the mini-batch data in the current iteration is treated as the pseudo-test data $\mathcal{D}_{t'}$ from the pseudo-test distribution $p(\mathbf{x}_{t'},\mathbf{y}_{t'})$. Likewise, the mini-batches in previous iterations are treated as the pseudo-training data $\mathcal{D}_{s'}$ from the pseudo-training distribution $p(\mathbf{x}_{s'},\mathbf{y}_{s'})$. In this case, the ground truth labels of the pseudo-test data are available during training. We then approximate the ELBO and obtain the optimization function for any-shift prompting as:

$$\begin{aligned} \mathcal{L} = &-\mathbb{E}_{q_{\bm{\theta}}(\mathbf{v}_{t'},\mathbf{v}_{s'})}\big[\log p_{\Phi}(\mathbf{y}_{t'}|\mathbf{x}_{t'},\mathbf{v}_{t'},\mathcal{Y}_{t'})\big] \\ &+ \mathbb{D}_{\mathrm{KL}}\big[q_{\bm{\theta}}(\mathbf{v}_{t'}|\mathbf{v}_{s'},\mathcal{D}_{t'},\mathcal{Y}_{t'})\,||\,p_{\bm{\theta}}(\mathbf{v}_{t'}|\mathbf{v}_{s'},\mathbf{x}_{t'},\mathcal{Y}_{t'})\big], \end{aligned} \tag{7}$$

where $\mathbf{v}_{t'}$ and $\mathbf{v}_{s'}$ denote the pseudo-test and pseudo-training prompts, respectively. In practice, we assume the prompts follow the standard Gaussian distributions. The negative log-likelihood in eq. (7) is implemented by a cross-entropy loss. The mini-batch training mechanism mimics the distribution shifts and trains the any-shift prompting to handle the distribution shifts during training, where the model never accesses any test data. Minimizing the KL terms encourages the prior to implicitly learn more comprehensive pseudo-test information from the variational posterior, which aggregates more data information together with the ground truth labels.
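The objective in eq. (7) can be sketched for a single pseudo-test sample as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prior and posterior are taken to be diagonal Gaussians, $\sigma$ is treated as a standard deviation, and the expectation is approximated with one sample, so it reduces to a single cross-entropy term on logits computed with a prompt drawn from the posterior. All function names are hypothetical.

```python
import math


def cross_entropy(logits, label):
    """Negative log-likelihood of `label` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_norm - logits[label]


def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL[q || p] for diagonal Gaussians, summed over dimensions.

    Assumption: sigma_* are standard deviations and strictly positive.
    """
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += math.log(sp / sq) + (sq**2 + (mq - mp) ** 2) / (2 * sp**2) - 0.5
    return total


def any_shift_loss(logits, label, mu_q, sigma_q, mu_p, sigma_p):
    """One-sample estimate of eq. (7): cross-entropy plus KL[q || p]."""
    return cross_entropy(logits, label) + kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p)
```

The KL term vanishes when the prior matches the posterior, which is the direction the training signal pushes the prior: toward what the posterior infers from a full labeled pseudo-test batch.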

Transformer inference network. The pseudo-test prompt in eq. (7) is inferred from the pseudo-training information in $\mathbf{v}_{s'}$, the pseudo-test image $\mathbf{x}_{t'}$, and the class names $\mathcal{Y}_{t'}$. To better aggregate the different information sources and consider their relationships, we introduce a transformer inference network to generate the pseudo-test prompt.

In our model, the prior $p_{\bm{\theta}}(\mathbf{v}_{t'} | \mathbf{v}_{s'}, \mathbf{x}_{t'}, \mathcal{Y}_{t'})$ and variational posterior $q_{\bm{\theta}}(\mathbf{v}_{t'} | \mathbf{v}_{s'}, \mathcal{D}_{t'}, \mathcal{Y}_{t'})$ of the pseudo-test prompt share the same inference network to encode the different conditions. Compared with the prior, the variational posterior has access to one batch of pseudo-test images with the corresponding ground-truth labels. Figure 3 illustrates the deployment of the shared transformer inference network. In the following, we provide the detailed inference of the prior and variational posterior.

As shown in Figure 3 (a), the prior of the pseudo-test prompt is generated by the pseudo-training prompt $\mathbf{v}_{s'}$, the pseudo-test image $\mathbf{x}_{t'}$, and class names $\mathcal{Y}_{t'}$. Specifically, we sample a pseudo-training prompt $\mathbf{v}_{s'}^{(j)}$ from a Gaussian distribution $\mathcal{N}(\mathbf{v}_{s'}; \mu_{s'}, \sigma_{s'})$ by the reparameterization trick [30]. The mean $\mu_{s'}$ and variance $\sigma_{s'}$ are two sets of parameters trained with the pseudo-training data $\mathcal{D}_{s'}$ in the previous iterations.
The pseudo-test image is fed into the fixed CLIP image encoder to get the image feature $f_{\Phi_I}(\mathbf{x}_{t'})$. The class names of the pseudo-test distribution are processed by the fixed text encoder to extract the textual features $f_{\Phi_T}(\mathcal{Y}_{t'})$. After the pre-processing, we take the sampled pseudo-training prompt, pseudo-test image feature, and textual features as input tokens of our transformer inference network to generate the prior of the pseudo-test prompt:

$$
[\widetilde{\mathbf{v}}_{t'}^{p}; \cdot; \cdot] = \texttt{Trans}([\mathbf{v}_{s'}^{(j)}; f_{\Phi_I}(\mathbf{x}_{t'}); f_{\Phi_T}(\mathcal{Y}_{t'})]), \tag{8}
$$
$$
\mu_{t'}^{p} = \texttt{MLP}_{\mu}(\widetilde{\mathbf{v}}_{t'}^{p}), \qquad \sigma_{t'}^{p} = \texttt{MLP}_{\sigma}(\widetilde{\mathbf{v}}_{t'}^{p}), \tag{9}
$$
$$
p_{\bm{\theta}}(\mathbf{v}_{t'} | \mathbf{v}_{s'}, \mathbf{x}_{t'}, \mathcal{Y}_{t'}) = \mathcal{N}(\mathbf{v}_{t'}; \mu_{t'}^{p}, \sigma_{t'}^{p}). \tag{10}
$$

The prior of the pseudo-test prompt follows the Gaussian distribution in eq. (10), whose mean and variance are obtained by two MLP networks on the output of the transformer $\widetilde{\mathbf{v}}_{t'}^{p}$.
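The data flow of eqs. (8)-(10) can be sketched with NumPy as follows. This is a rough illustration only: a single-head self-attention layer stands in for the 2-layer transformer, single linear maps stand in for the two MLPs, a softplus keeps $\sigma$ positive, and random vectors replace the frozen CLIP features. The width `D`, the parameter names, and the function names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # hypothetical feature width; real CLIP features are much wider


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention, a stand-in for Trans in eq. (8)."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return tokens + weights @ v  # residual connection


def infer_prior(v_s, image_feat, class_feats, params):
    """Map the tokens [v_s; f_I(x); f_T(Y)] to the prior mean and sigma."""
    tokens = np.vstack([v_s[None, :], image_feat[None, :], class_feats])
    out = self_attention(tokens, params["w_q"], params["w_k"], params["w_v"])
    v_tilde = out[0]  # keep only the output at the prompt position
    mu = v_tilde @ params["w_mu"]
    # Softplus keeps sigma positive; this choice is an assumption.
    sigma = np.log1p(np.exp(v_tilde @ params["w_sigma"]))
    return mu, sigma


params = {name: rng.normal(scale=0.1, size=(D, D))
          for name in ("w_q", "w_k", "w_v", "w_mu", "w_sigma")}
v_s = rng.normal(size=D)               # sampled pseudo-training prompt
image_feat = rng.normal(size=D)        # stand-in for the CLIP image feature
class_feats = rng.normal(size=(5, D))  # stand-in for 5 class-name features

mu_p, sigma_p = infer_prior(v_s, image_feat, class_feats, params)
v_t = mu_p + sigma_p * rng.normal(size=D)  # reparameterized prior sample
```

Because the network is shared, the variational posterior would reuse the same parameters and differ only in its input tokens: a batch of image features and their label features instead of one image and the full class vocabulary.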

In Figure 3 (b), with the pseudo-test data $\mathcal{D}_{t'}$, the variational posterior learns more distribution information as well as the relations between inputs and outputs. To be clearer, we rewrite the variational posterior $q_{\bm{\theta}}(\mathbf{v}_{t'} | \mathbf{v}_{s'}, \mathcal{D}_{t'}, \mathcal{Y}_{t'})$ as $q_{\bm{\theta}}(\mathbf{v}_{t'} | \mathbf{v}_{s'}, X_{t'}, Y_{t'})$, where $X_{t'}$ contains a batch of pseudo-test images in $\mathcal{D}_{t'}$ and $Y_{t'}$ consists of the ground truth class names of $X_{t'}$ in $\mathcal{Y}_{t'}$. Hence, the shared transformer takes all image features and their corresponding label features as input tokens to infer the variational posterior:

$$
[\widetilde{\mathbf{v}}_{t'}^{q}; \cdot; \cdot] = \texttt{Trans}([\mathbf{v}_{s'}^{(j)}; f_{\Phi_I}(X_{t'}); f_{\Phi_T}(Y_{t'})]), \tag{11}
$$
$$
\mu_{t'}^{q} = \texttt{MLP}_{\mu}(\widetilde{\mathbf{v}}_{t'}^{q}), \qquad \sigma_{t'}^{q} = \texttt{MLP}_{\sigma}(\widetilde{\mathbf{v}}_{t'}^{q}), \tag{12}
$$
$$
q_{\bm{\theta}}(\mathbf{v}_{t'} | \mathbf{v}_{s'}, \mathcal{D}_{t'}, \mathcal{Y}_{t'}) = \mathcal{N}(\mathbf{v}_{t'}; \mu_{t'}^{q}, \sigma_{t'}^{q}). \tag{13}
$$

With the inferred pseudo-test prompt, we take its samples from the variational posterior as the input tokens for both image and text encoders of CLIP to make predictions during training. Thus, although the encoders are fixed, the image and textual features are generalized by utilizing the distribution information in the prompts during the feature extraction and classification procedure, enabling the method to handle different distribution shifts.

Prediction. At test time, we make predictions on each test image $\mathbf{x}_{t}$ with the test prompt generated by the transformer inference network. Since the test data and labels in $\mathcal{D}_{t}$ are unavailable, the variational posterior becomes intractable. Thus, we sample the test prompt $\mathbf{v}_{t}^{(i)}$ from the prior distribution $p_{\bm{\theta}}(\mathbf{v}_{t} | \mathbf{v}_{s}^{(j)}, \mathbf{x}_{t}, \mathcal{Y}_{t})$, where $\mathbf{v}_{s}^{(j)}$ is a sample of the training prompt following $p(\mathbf{v}_{s} | \mathcal{D}_{s})$. $\mathbf{v}_{t}^{(i)}$ is then introduced into both the image and text encoders of the CLIP model for generalization and prediction as:

$$
p_{\Phi}(\mathbf{y}_{t} | \mathbf{x}_{t}, \mathcal{Y}_{t}, \mathcal{D}_{s}) = \frac{1}{N_{t}} \frac{1}{N_{s}} \sum_{i=1}^{N_{t}} \sum_{j=1}^{N_{s}} p_{\Phi}(\mathbf{y}_{t} | \mathbf{x}_{t}, \mathbf{v}_{t}^{(i)}, \mathcal{Y}_{t}), \tag{14}
$$
$$
\mathbf{v}_{t}^{(i)} \sim p_{\bm{\theta}}(\mathbf{v}_{t} | \mathbf{v}_{s}^{(j)}, \mathbf{x}_{t}, \mathcal{Y}_{t}), \qquad \mathbf{v}_{s}^{(j)} \sim p(\mathbf{v}_{s} | \mathcal{D}_{s}).
$$

Although the test data and their labels are not available at test time, the information in each test sample and all class names in the vocabulary of the test task are available to infer the prior of the test prompt. The ability to encode test information from a single test image and the class vocabulary is learned during training by minimizing the KL divergence between the prior and posterior. Note the CLIP image encoder and text encoder are always frozen. Only the test prompt changes for different test distributions by aggregating the training and test information in each test sample $\mathbf{x}_{t}$ and the class names $\mathcal{Y}_{t}$. In this case, we utilize the original generalization ability of CLIP to generate the test prompt for generalization on downstream tasks across various distribution shifts.
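The Monte Carlo average in eq. (14) can be sketched as a pair of nested sampling loops. This is an illustrative outline, not the released code: the three callables are hypothetical stand-ins for $p(\mathbf{v}_{s} | \mathcal{D}_{s})$, the prior $p_{\bm{\theta}}(\mathbf{v}_{t} | \mathbf{v}_{s}, \mathbf{x}_{t}, \mathcal{Y}_{t})$, and the prompted CLIP classifier, and the toy scalars at the bottom exist only so the sketch runs end to end.

```python
import math
import random


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(x_t, class_names, sample_train_prompt, sample_test_prompt,
            class_logits, n_s=2, n_t=3):
    """Monte Carlo estimate of eq. (14) for one test image."""
    probs = [0.0] * len(class_names)
    for _ in range(n_s):
        v_s = sample_train_prompt()  # v_s ~ p(v_s | D_s)
        for _ in range(n_t):
            # v_t ~ p(v_t | v_s, x_t, Y_t), drawn from the prior only
            v_t = sample_test_prompt(v_s, x_t, class_names)
            step = softmax(class_logits(x_t, v_t, class_names))
            for c, p in enumerate(step):
                probs[c] += p
    return [p / (n_s * n_t) for p in probs]


# Toy stand-ins; none of this is the real model.
rng = random.Random(0)
classes = ["dog", "cat", "horse"]
toy_probs = predict(
    x_t=0.5,
    class_names=classes,
    sample_train_prompt=lambda: rng.gauss(0.0, 1.0),
    sample_test_prompt=lambda v_s, x, names: v_s + x + rng.gauss(0.0, 0.1),
    class_logits=lambda x, v_t, names: [v_t * (i + 1) for i in range(len(names))],
)
```

Every step is a forward pass, so no gradient is computed at test time, which matches the claim that the test prompt is generated without test-time optimization.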

4 Related Work

Prompt learning. Image-language foundation models such as CLIP [55] and ALIGN [28] achieve significant advances in various downstream tasks. To adapt the foundation models to downstream tasks, adapter [16] and prompt learning methods [33, 37, 87] are proposed. Zhou et al. [87] propose a learnable prompt as the input of the language model in CLIP. To avoid forgetting the original knowledge in the CLIP model, Zhu et al. [88] and Yao et al. [79] guide prompt learning with hand-crafted prompts. Instead of generating prompts for the language model, Bahng et al. [3] introduce prompting of the image model. Khattak et al. [29] learn a joint prompt for both image and language encoders. Zhou et al. [86] introduce the imaging conditions into the language prompt to enhance the generalization ability of zero-shot performance. To further improve the generalization ability, Derakhshani et al. [9] propose Bayesian prompt learning, which considers the uncertainty in the learned prompts for zero-shot generalization. Shu et al. [65] and Hassan et al. [61] fine-tune the prompt at test time to a specific distribution. We also improve the generalization of prompt learning. Different from previous methods that consider uncertainty or fine-tune the prompt for specific distributions, we propose any-shift prompting that explicitly explores distribution information and relationships within a hierarchical probabilistic framework. The method generates the test-specific prompt on the fly for any test distribution.

Distribution shift generalization. Domain generalization [47, 36, 85, 21] and domain adaptation [44, 73, 38, 80] are the most widely investigated methods for handling distribution shifts. Some domain generalization methods learn models that are invariant across the training distributions [1, 77, 46], assuming that this invariance carries over to the test distributions. To further improve the generalization ability, some methods [35, 11, 4] introduce meta-learning in domain generalization to mimic domain shifts during training. In this paper, we also simulate the distribution shift by a pseudo-shift training mechanism, which uses different mini-batches as distributions. To better utilize the test information for generalization without accessing the test data during training, Sun et al. [68] and Wang et al. [71] propose test-time adaptation, which fine-tunes the trained model on test data with self-supervised losses. Many subsequent methods [82, 49, 19, 50, 43] follow this approach because it generalizes well under covariate shift. Test-time adaptation has also been studied through other mechanisms, such as normalization statistics re-estimation [63, 39] and classifier adjustment [27, 78, 84]. Most of these methods focus on covariate shift [71, 78, 19, 12], such as changes in image style [34, 54] and corruptions [24]. Some other methods work on conditional shift [41, 81, 18, 17] or label shift [81, 69, 52, 17]. We also utilize the test information for generalization, but without any test-time optimization. Different from the previous methods, we explicitly bridge the training and test information and explore their relationships to address various distribution shifts in a general way.

5 Experiments

| Method | PACS | VLCS | Office-Home | DomainNet | ImageNet-V2 | ImageNet-S | ImageNet-A | ImageNet-R |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Prompting without test-time optimization | | | | | | | | |
| CLIP [55] | 96.13 | 81.43 | 80.35 | 54.08 | 60.83 | 46.15 | 47.77 | 73.96 |
| CLIP-D [55] | 96.65 | 80.70 | 81.51 | 56.24 | - | - | - | - |
| CoOp [87] | 96.45 | 82.51 | 82.12 | 58.82 | 64.20 | 47.99 | 49.71 | 75.21 |
| CoCoOp [86] | 97.00 | 83.89 | 82.77 | 59.43 | 64.07 | 48.75 | 50.63 | 76.18 |
| DPL [83] | 97.07 | 83.99 | 83.00 | 59.86 | - | - | - | - |
| BPL [9] | - | - | - | - | 64.23 | 49.20 | 51.33 | 77.00 |
| This paper | 98.16 ± 0.4 | 86.54 ± 0.4 | 85.16 ± 0.6 | 60.93 ± 0.6 | 64.53 ± 0.2 | 49.80 ± 0.5 | 51.52 ± 0.6 | 77.56 ± 0.4 |
| Prompting with test-time optimization | | | | | | | | |
| TPT [65] | 97.25 | 84.33 | 83.45 | 59.90 | 63.45 | 47.94 | 54.77 | 77.06 |
| CoOp + TPT [65] | 97.85 | 85.06 | 84.32 | 60.65 | 66.83 | 49.29 | 57.95 | 77.27 |
| CoCoOp + TPT [65] | 97.95 | 85.55 | 84.54 | 60.44 | 64.85 | 48.47 | 58.47 | 78.65 |
| This paper + TPT | 98.47 ± 0.4 | 86.98 ± 0.4 | 86.00 ± 0.8 | 61.75 ± 0.8 | 67.08 ± 0.6 | 50.83 ± 0.6 | 58.05 ± 0.5 | 79.23 ± 0.5 |
Table 2: Covariate shift comparison. The experiments are conducted on eight domain generalization datasets, with average classification accuracy reported. Any-shift prompting achieves the best results compared with the original CLIP and other prompt learning methods, which demonstrates the generalization ability of our method on covariate shift. When combined with TPT's test-time optimization, prompting methods in general, as well as our method, improve further.

Twenty-three datasets. To demonstrate the generalization ability of any-shift prompting, we evaluate the method on datasets with different distribution shifts. For covariate shift, we conduct experiments on the common domain generalization datasets, PACS [34], Office-Home [70], VLCS [14], and DomainNet [54], which contain images from different domains such as image styles. We also evaluate the model on covariate shifts of ImageNet [8] following Zhou et al. [86], where the model is trained on ImageNet with 16-shot images and evaluated on the variants ImageNet-V2 [57], ImageNet-(S)ketch [72], ImageNet-A [26], and ImageNet-R [25]. For label shift, we follow the base-to-new class generalization from Zhou et al. [87], with 11 datasets that cover various tasks: ImageNet [8], Caltech101 [15], OxfordPets [53], StanfordCars [31], Flowers102 [48], Food101 [5], FGVCAircraft [45], SUN397 [76], DTD [7], EuroSAT [23], and UCF101 [67]. For concept shift, we build an ImageNet-Superclass dataset, where we evaluate the ImageNet-trained model on the super-classes in [62]. For conditional shift, we evaluate on the sub-population datasets Living-17 and Entity-30 [62], where the training and test distributions consist of the same classes with different subpopulations. To evaluate our method on the combination of different distribution shifts, we follow the open-domain generalization setting [66] on the Office-Home dataset, which contains four domains: Art, Clipart, Product, and Real-world. We refer to it as Open-Office-Home, which combines covariate shift and label shift. The detailed settings are provided in the supplemental materials.

Implementation details. Our model consists of the pretrained image and language encoders of CLIP [55], and the proposed transformer inference network to generate the test prompt. We use the ViT-B/16 [10] as the image encoder following [86, 9]. The pretrained image and language encoders of CLIP are frozen during training and inference. To generate the prior and variational posterior of the prompt, we use a 2-layer transformer in the inference network. As shown in Figure 3, the inputs of the transformer include the training prompt, the image features, and the class-name features. The distribution of the training prompts consists of two trainable vectors as the mean and variance respectively. The class-name tokens are generated by the hand-crafted tokens "an image of a [class]". The transformer also contains two kinds of trainable position embeddings to indicate the image and language tokens. The introduced prompts are sampled from the corresponding distributions by the reparameterization trick [30]. More detailed implementations and hyperparameters are provided in the supplemental materials.
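Two of the details above, the hand-crafted class-name template and the reparameterized sampling of a prompt from a trainable mean and variance, can be sketched in a few lines. This is a hedged illustration: the function names are hypothetical, and storing the variance as a log-variance is an assumption made here to keep the standard deviation positive, not something the paper specifies.

```python
import math
import random


def class_prompts(class_names, template="an image of a {}"):
    """Fill the hand-crafted template with each class name."""
    return [template.format(name) for name in class_names]


def sample_prompt(mean, log_var, rng):
    """Reparameterization trick: v = mean + std * eps, with eps ~ N(0, 1).

    Assumption: the variance vector is parameterized as a log-variance.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


prompts = class_prompts(["dog", "guitar"])
v_s = sample_prompt(mean=[0.0, 1.0, -1.0], log_var=[0.0, 0.0, 0.0],
                    rng=random.Random(0))
```

Writing the sample as a deterministic function of the parameters and an independent noise draw is what lets gradients reach the mean and variance vectors during training.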

5.1 Results on various distribution shifts

Covariate shift. We conduct experiments on eight domain generalization datasets with covariate shift. The averaged results of classification accuracy for each dataset are provided in Table 2. We follow the leave-one-out protocol [34] for evaluation on the first four datasets, where the model evaluated on each test domain is trained on the other domains. The detailed results on each test domain are provided in the supplemental materials. For the last four datasets, we evaluate the same ImageNet-pretrained model on them individually. Our method outperforms the other prompt learning methods CoOp, CoCoOp, and DPL on all eight datasets. Note that the comparisons with the other prompt learning methods are fair since we generate the test prompt and make predictions in a single feedforward pass, without any optimization or backpropagation at test time. The proposed method also performs better on seven of the eight datasets compared with the test-time tuning method TPT, securing the second position on ImageNet-A. Moreover, since the proposed method learns the prompt and transformer network only during training, it can also be combined with test-time optimization. Then we obtain even better results, which are also competitive on ImageNet-A, indicating the effectiveness of any-shift prompting on covariate shift.
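The leave-one-out protocol used for the first four datasets can be sketched as a simple split generator. The domain list is the Office-Home one named in the text; the helper names are hypothetical, and averaging the held-out accuracies with equal weight per domain is an assumption about how the per-dataset numbers are aggregated.

```python
def leave_one_domain_out(domains):
    """Yield (training domains, held-out test domain) pairs."""
    for held_out in domains:
        train = [d for d in domains if d != held_out]
        yield train, held_out


def average_accuracy(per_domain_accuracy):
    """Unweighted mean of the held-out accuracies for one dataset."""
    return sum(per_domain_accuracy.values()) / len(per_domain_accuracy)


office_home = ["Art", "Clipart", "Product", "Real-world"]
splits = list(leave_one_domain_out(office_home))
```

Each of the four splits trains one model on three domains and tests it on the fourth, so a test domain is never seen during training.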

(a) Average over 11 datasets.
Base New H
CLIP 69.34 74.22 71.70
CoOp 82.69 63.22 71.66
CoCoOp 80.47 71.69 75.83
BPL 80.10 74.94 77.43
MaPLe 82.28 75.14 78.55
This paper 82.36 76.30 79.21
(b) ImageNet.
Base New H
CLIP 72.43 68.14 70.22
CoOp 76.47 67.88 71.92
CoCoOp 75.98 70.43 73.10
BPL - 70.93 -
MaPLe 76.66 70.54 73.47
This paper 76.63 71.33 73.88
(c) Caltech101.
Base New H
CLIP 96.84 94.00 95.40
CoOp 98.00 89.81 93.73
CoCoOp 97.96 93.81 95.84
BPL - 94.93 -
MaPLe 97.74 94.36 96.02
This paper 98.28 94.27 96.23
(d) OxfordPets.
Base New H
CLIP 91.17 97.26 94.12
CoOp 93.67 95.29 94.47
CoCoOp 95.20 97.69 96.43
BPL - 98.00 -
MaPLe 95.43 97.76 96.58
This paper 95.78 97.80 96.78
(e) StanfordCars.
Base New H
CLIP 63.37 74.89 68.65
CoOp 78.12 60.40 68.13
CoCoOp 70.49 73.59 72.01
BPL - 73.23 -
MaPLe 72.94 74.00 73.47
This paper 73.05 75.83 74.41
(f) Flowers102.
Base New H
CLIP 72.08 77.80 74.83
CoOp 97.60 59.67 74.06
CoCoOp 94.87 71.75 81.71
BPL - 70.40 -
MaPLe 95.92 72.46 82.56
This paper 96.50 76.20 85.16
(g) Food101.
Base New H
CLIP 90.10 91.22 90.66
CoOp 88.33 82.26 85.19
CoCoOp 90.70 91.29 90.99
BPL - 92.13 -
MaPLe 90.71 92.05 91.38
This paper 90.87 91.35 91.11
(h) FGVCAircraft.
Base New H
CLIP 27.19 36.29 31.09
CoOp 40.44 22.30 28.75
CoCoOp 33.41 23.71 27.74
BPL - 35.00 -
MaPLe 37.44 35.61 36.50
This paper 37.10 35.70 36.39
(i) SUN397.
Base New H
CLIP 69.36 75.35 72.23
CoOp 80.60 65.89 72.51
CoCoOp 79.74 76.86 78.27
BPL - 77.87 -
MaPLe 80.82 78.70 79.75
This paper 80.50 78.50 79.48
(j) DTD.
Base New H
CLIP 53.24 59.90 56.37
CoOp 79.44 41.18 54.24
CoCoOp 77.01 56.00 64.85
BPL - 60.80 -
MaPLe 80.36 59.18 68.16
This paper 79.63 61.98 69.71
(k) EuroSAT.
Base New H
CLIP 56.48 64.05 60.03
CoOp 92.19 54.74 68.69
CoCoOp 87.49 60.04 71.21
BPL - 75.30 -
MaPLe 94.07 73.23 82.35
This paper 93.07 77.63 84.65
(l) UCF101.
Base New H
CLIP 70.53 77.50 73.85
CoOp 84.69 56.05 67.46
CoCoOp 82.33 73.45 77.64
BPL - 75.77 -
MaPLe 83.00 78.66 80.77
This paper 84.60 78.70 81.54
Table 3: Label shift comparison. The models are trained on the base classes with 16 shots and evaluated on both the base and new classes. We bold the best results and underline the runner-up. H denotes the Harmonic mean [75]. Our method performs well on both base and new classes, therefore achieving the best overall Harmonic mean, demonstrating the generalization ability across label shifts.

Label shift. We conduct the experiments on label shift following the base-to-new class generalization setting of Zhou et al. [86]. The results on eleven datasets and the averaged performance are provided in Table 3. Since our any-shift prompts encode both training and test information, as well as their relationships, the method performs well on both base and new classes and therefore achieves the best harmonic mean averaged over the eleven datasets. Compared with the original CLIP model, the proposed method achieves better performance on the base classes, showing good adaptation to the downstream tasks with the training information. Compared with the other prompt learning methods CoOp [87], CoCoOp [86], BPL [9], and MaPLe [29], our method performs best on the new classes on seven of the eleven datasets and is competitive on the other four. This demonstrates the ability of the method to handle label shift by incorporating the distribution information and their relationships.
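The harmonic mean H reported in Table 3 can be reproduced from the base and new accuracies. The short sketch below uses the standard harmonic mean formula H = 2 · Base · New / (Base + New); the function name is hypothetical and the input values are copied from Table 3(a).

```python
def harmonic_mean(base_acc, new_acc):
    """Harmonic mean of base-class and new-class accuracy (both in percent)."""
    if base_acc + new_acc == 0:
        return 0.0
    return 2 * base_acc * new_acc / (base_acc + new_acc)


# Average over 11 datasets, values copied from Table 3(a).
h_clip = harmonic_mean(69.34, 74.22)
h_ours = harmonic_mean(82.36, 76.30)
print(f"CLIP H = {h_clip:.2f}, this paper H = {h_ours:.2f}")
```

The harmonic mean penalizes an imbalance between the two accuracies, which is why a method with high base accuracy but low new-class accuracy does not obtain a high H.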

Concept shift. For concept shift, we conduct experiments on the introduced ImageNet-Superclass dataset, where the same images are assigned different annotations. Specifically, we evaluate the ImageNet-trained model on the validation set with the superclass annotations. As shown in Table 4, the other prompt learning methods perform similarly to the original CLIP. By contrast, our method improves the accuracy of CLIP by about 1.9 percentage points (from 69.23 to 71.12), indicating its ability to handle concept shift.
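To make the evaluation setting concrete, the sketch below scores superclass predictions against ground truth obtained by relabeling fine-grained annotations. The mapping, the class names, and the function name are hypothetical and serve only as an illustration; they are not the annotations of ImageNet-Superclass.

```python
# Hypothetical fine-to-superclass mapping, for illustration only.
FINE_TO_SUPER = {"wolf": "carnivore", "fox": "carnivore", "crab": "crustacean", "ant": "insect"}


def superclass_accuracy(predicted_superclasses, fine_labels, mapping):
    """Accuracy (in percent) when the target is the superclass of each fine label."""
    targets = [mapping[label] for label in fine_labels]
    correct = sum(p == t for p, t in zip(predicted_superclasses, targets))
    return 100.0 * correct / len(targets)


# The images (and their fine labels) stay the same; only the annotation changes.
fine_labels = ["wolf", "fox", "crab", "ant"]
predictions = ["carnivore", "carnivore", "insect", "insect"]
accuracy = superclass_accuracy(predictions, fine_labels, FINE_TO_SUPER)
```

In this toy case three of the four predictions match the relabeled targets.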

Concept Shift Conditional Shift
Method ImageNet-Superclass Living-17 Entity-30
CLIP† 69.23 86.94 67.95
CoOp† 69.35 87.11 78.02
CoCoOp† 69.77 87.24 79.52
This paper 71.12 ± 0.6 88.41 ± 0.3 81.74 ± 0.4
Table 4: Concept shift and conditional shift comparison. The results of the compared methods are based on the author-provided code since the prompt learning methods do not provide results on these shifts.

Conditional shift. We also conduct experiments on two datasets with conditional shift, Living-17 and Entity-30. The results are reported in Table 4. The prompt learning methods perform similarly to CLIP on Living-17 while achieving larger improvements on Entity-30. A possible reason is that the class names of Living-17 (e.g., wolf, fox) are more specific than those of Entity-30 (e.g., crustacean, carnivore, insect), which reveals the importance of adapting the original CLIP model to downstream tasks in specific scenarios. Moreover, our method outperforms the conventional prompt learning methods CoOp and CoCoOp on both datasets, demonstrating the effectiveness of any-shift prompting for conditional shift.

Method Art Clipart Product Real Mean
CLIP† 79.32 67.70 86.93 87.46 80.35
CLIP-D† 80.47 68.83 87.93 88.80 81.51
CoOp† 80.50 69.05 88.26 89.01 81.71
CoCoOp† 80.93 69.51 88.85 89.32 82.19
This paper 83.40 ± 0.8 72.53 ± 0.5 91.24 ± 0.6 90.84 ± 0.3 84.50 ± 0.4
Table 5: Multiple shifts comparison on Open-Office-Home, including both covariate and label shifts. The results of other methods are based on the author-provided code.

Joint distribution shift. In Table 5, we report the results on Open-Office-Home for joint distribution shifts. Following Shu et al. [66], we assign data from different subsets of classes to the training domains and evaluate the model on the test domain with both seen and unseen classes. Therefore, the model encounters covariate and label shifts jointly. As shown in Table 5, the CLIP-based zero-shot methods achieve the same performance as in the closed-set generalization setting (Table 2) since they are kept frozen. The prompt learning methods perform slightly worse than in the closed-set setting. Our method outperforms the others on all test domains, showing its ability to handle joint distribution shifts.

Overall, our method achieves good performance on covariate, label, concept, conditional, and even joint shifts, demonstrating the effectiveness of handling various distribution shifts by considering the distribution information and their relationship with any-shift prompting.

5.2 Ablation studies

Figure 4: Effectiveness of training and test prompts. The test prompt in the proposed any-shift prompting achieves good generalization on both seen and unseen classes, indicating its ability to handle different shifts jointly.

Effectiveness of training and test prompts. To investigate the benefits of the training and test prompts of any-shift prompting, we evaluate our method with the training and test prompts separately. The experiments are conducted on Open-Office-Home with joint distribution shift. We compare the prompts with the original CLIP model as well as CoOp and CoCoOp in Figure 4, and provide the accuracy on all classes, seen classes, and unseen classes, respectively. CoOp and CoCoOp show better performance on seen classes under covariate shift but struggle on the unseen classes, where both covariate shift and label shift exist. The training prompt in our method encounters the same problem: it encodes the training information of the seen classes but also tends to overfit the training distribution. Its performance is slightly better since it considers uncertainty in the prompt. By contrast, the test prompt in our method encodes the test information together with the relationships between the training and test distributions. This enables the method to generalize well across different shifts, leading to higher performance on both seen classes (covariate shift) and unseen classes (both covariate shift and label shift).
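The three accuracies discussed above (all, seen, and unseen classes) can be computed by partitioning test samples according to whether their ground-truth class appeared during training. The following is a minimal sketch with hypothetical class names and a hypothetical helper `split_accuracy`, not the authors' evaluation code.

```python
def split_accuracy(predictions, labels, seen_classes):
    """Return accuracy (in percent) on all, seen-class, and unseen-class samples."""
    def accuracy(pairs):
        if not pairs:
            return float("nan")  # no samples in this partition
        return 100.0 * sum(p == y for p, y in pairs) / len(pairs)

    pairs = list(zip(predictions, labels))
    seen = [(p, y) for p, y in pairs if y in seen_classes]
    unseen = [(p, y) for p, y in pairs if y not in seen_classes]
    return {"all": accuracy(pairs), "seen": accuracy(seen), "unseen": accuracy(unseen)}


# Toy example with hypothetical class names.
labels = ["chair", "chair", "lamp", "mug"]
predictions = ["chair", "lamp", "lamp", "chair"]
result = split_accuracy(predictions, labels, seen_classes={"chair", "lamp"})
```

Reporting the seen and unseen partitions separately makes it visible when a prompt overfits the training classes, since its unseen-class accuracy drops even if the overall accuracy looks reasonable.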

Figure 5: Visualization of generalization effect on the image and text features before and after generalization. Different colors denote different classes. The image and text features with the same categories get closer after generalization by our method, leading to more accurate predictions.
Training prompt $\mathbf{v}_s$ | Test text feature of $\mathcal{Y}_t$ | Test image feature of $\mathbf{x}_t$ | Accuracy
82.62
82.67
83.11
83.63
84.50
Table 6: Benefits of training and test information in any-shift prompt. The experiments are conducted across the joint shifts on Open-Office-Home. Both training and test information in the prompt benefit the method across joint shifts.

Visualization of generalization effect. To further show the benefits of generalization with our method, we visualize the image and text features before and after generalization by any-shift prompting. The experiments are conducted on the "Art" domain of Open-Office-Home. The image and text features before generalization are generated by the fixed CLIP image and language encoders, respectively. As shown in Figure 5, after generalization by any-shift prompting, the image features move closer to the text features of the corresponding ground-truth labels, which leads to more accurate predictions.
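Why closer image and text features translate into better predictions can be seen from a generic CLIP-style classifier, which assigns an image to the class whose text feature has the highest cosine similarity. The sketch below is a toy illustration of that scoring rule only; the 3-dimensional features, class names, and function names are hypothetical, and it does not reproduce how any-shift prompting modifies the features.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def predict(image_feature, text_features):
    """Return the class whose text feature is most similar to the image feature."""
    return max(text_features, key=lambda name: cosine_similarity(image_feature, text_features[name]))


# Hypothetical 3-dimensional features (assumption: real CLIP features are far larger).
text_features = {"clock": [1.0, 0.0, 0.0], "chair": [0.0, 1.0, 0.0]}
before = [0.6, 0.7, 0.1]  # feature of a clock image that lies closer to "chair"
after = [0.9, 0.3, 0.1]   # the same image after its feature moved toward "clock"
print(predict(before, text_features), predict(after, text_features))
```

In this toy case the prediction changes from the wrong class to the correct one once the image feature lies closer to the matching text feature.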

Benefits of training and test information in any-shift prompt. To show the benefits of considering different information in the test prompt, we conduct experiments on Open-Office-Home, which contains both covariate and label shifts. As shown in Table 6, using only the training prompt achieves better performance than CLIP (80.35), and we obtain similar results with only test text features or only test image features. Of the two, the information from the test images yields the larger improvement. A possible reason is that the test images include more unseen information in this setting. The test prompt generated from both image and text information further improves generalization to the test distribution, indicating the importance of considering test information for generalization. Moreover, including the training prompt provides the relationships and shift information between the training and test distributions, leading to the best performance.

6 Conclusion

We propose any-shift prompting to adapt the large image-language model CLIP to downstream tasks while enhancing its generalization ability across different distribution shifts at test time. The proposed method bridges the training and test distributions under a hierarchical probabilistic framework, which generates a specific prompt for each test sample by encoding the distribution information and relationships of the training and test distributions. Once trained, the model generates the test-specific prompt for any distribution shift in a single feedforward pass, without any fine-tuning or backpropagation. The test prompt generalizes both the image and language encoders of CLIP to the specific test distribution. Experiments on various distribution shifts, including covariate shift, label shift, conditional shift, concept shift, and joint shift, demonstrate the effectiveness of the proposed method for generalization to any test distribution.

Acknowledgment

This work is financially supported by the Inception Institute of Artificial Intelligence, the University of Amsterdam, and the allowance Top consortia for Knowledge and Innovation (TKIs) from the Netherlands Ministry of Economic Affairs and Climate Policy.

References

  • Arjovsky et al. [2019] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.
  • Azizzadenesheli et al. [2019] Kamyar Azizzadenesheli, Anqi Liu, Fanny Yang, and Animashree Anandkumar. Regularized learning for domain adaptation under label shifts. arXiv preprint arXiv:1903.09734, 2019.
  • Bahng et al. [2022] Hyojin Bahng, Ali Jahanian, Swami Sankaranarayanan, and Phillip Isola. Exploring visual prompts for adapting large-scale models. arXiv preprint arXiv:2203.17274, 2022.
  • Balaji et al. [2018] Yogesh Balaji, Swami Sankaranarayanan, and Rama Chellappa. MetaReg: Towards domain generalization using meta-regularization. In Advances in Neural Information Processing Systems, pages 998–1008, 2018.
  • Bossard et al. [2014] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. In European Conference on Computer Vision, pages 446–461. Springer, 2014.
  • Choi et al. [2010] Myung Jin Choi, Joseph J Lim, Antonio Torralba, and Alan S Willsky. Exploiting hierarchical context on a large database of object categories. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 129–136. IEEE, 2010.
  • Cimpoi et al. [2014] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3606–3613, 2014.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
  • Derakhshani et al. [2023] Mohammad Mahdi Derakhshani, Enrique Sanchez, Adrian Bulat, Victor G Turrisi da Costa, Cees GM Snoek, Georgios Tzimiropoulos, and Brais Martinez. Bayesian prompt learning for image-language model generalization. In IEEE International Conference on Computer Vision, pages 15237–15246, 2023.
  • Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020.
  • Dou et al. [2019] Qi Dou, Daniel C Castro, Konstantinos Kamnitsas, and Ben Glocker. Domain generalization via model-agnostic learning of semantic features. In Advances in Neural Information Processing Systems, 2019.
  • Dubey et al. [2021] Abhimanyu Dubey, Vignesh Ramanathan, Alex Pentland, and Dhruv Mahajan. Adaptive methods for real-world domain generalization. In IEEE Conference on Computer Vision and Pattern Recognition, pages 14340–14349, 2021.
  • Everingham et al. [2010] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
  • Fang et al. [2013] Yuming Fang, Weisi Lin, Zhenzhong Chen, Chia-Ming Tsai, and Chia-Wen Lin. A video saliency detection model in compressed domain. IEEE Transactions on Circuits and Systems for Video Technology, 24(1):27–38, 2013.
  • Fei-Fei et al. [2004] Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In IEEE Conference on Computer Vision and Pattern Recognition Workshop, pages 178–178. IEEE, 2004.
  • Gao et al. [2023] Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters. International Journal of Computer Vision, pages 1–15, 2023.
  • Garg et al. [2023] Saurabh Garg, Nick Erickson, James Sharpnack, Alex Smola, Sivaraman Balakrishnan, and Zachary Chase Lipton. Rlsbench: Domain adaptation under relaxed label shift. In International Conference on Machine Learning, pages 10879–10928. PMLR, 2023.
  • Gong et al. [2016] Mingming Gong, Kun Zhang, Tongliang Liu, Dacheng Tao, Clark Glymour, and Bernhard Schölkopf. Domain adaptation with conditional transferable components. In International Conference on Machine Learning, pages 2839–2848. PMLR, 2016.
  • Goyal et al. [2022] Sachin Goyal, Mingjie Sun, Aditi Raghunathan, and Zico Kolter. Test-time adaptation via conjugate pseudo-labels. In Advances in Neural Information Processing Systems, 2022.
  • Griffin et al. [2007] Gregory Griffin, Alex Holub, and Pietro Perona. Caltech-256 object category dataset. 2007.
  • Gulrajani and Lopez-Paz [2020] Ishaan Gulrajani and David Lopez-Paz. In search of lost domain generalization. In International Conference on Learning Representations, 2020.
  • Guo et al. [2020] Jiaxian Guo, Mingming Gong, Tongliang Liu, Kun Zhang, and Dacheng Tao. Ltf: A label transformation framework for correcting label shift. In International Conference on Machine Learning, pages 3843–3853. PMLR, 2020.
  • Helber et al. [2019] Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019.
  • Hendrycks and Dietterich [2019] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261, 2019.
  • Hendrycks et al. [2021a] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In IEEE International Conference on Computer Vision, pages 8340–8349, 2021a.
  • Hendrycks et al. [2021b] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In IEEE Conference on Computer Vision and Pattern Recognition, pages 15262–15271, 2021b.
  • Iwasawa et al. [2021] Yusuke Iwasawa et al. Test-time classifier adjustment module for model-agnostic domain generalization. In Advances in Neural Information Processing Systems, 2021.
  • Jia et al. [2021] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021.
  • Khattak et al. [2023] Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Maple: Multi-modal prompt learning. In IEEE Conference on Computer Vision and Pattern Recognition, pages 19113–19122, 2023.
  • Kingma and Welling [2013] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Krause et al. [2013] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In IEEE International Conference on Computer Vision Workshops, pages 554–561, 2013.
  • Lee et al. [2023] Yoonho Lee, Annie S Chen, Fahim Tajwar, Ananya Kumar, Huaxiu Yao, Percy Liang, and Chelsea Finn. Surgical fine-tuning improves adaptation to distribution shifts. In International Conference on Learning Representations, 2023.
  • Lester et al. [2021] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.
  • Li et al. [2017] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. Deeper, broader and artier domain generalization. In IEEE International Conference on Computer Vision, pages 5542–5550, 2017.
  • Li et al. [2018a] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy Hospedales. Learning to generalize: Meta-learning for domain generalization. In AAAI Conference on Artificial Intelligence, 2018a.
  • Li et al. [2018b] Haoliang Li, Sinno Jialin Pan, Shiqi Wang, and Alex C Kot. Domain generalization with adversarial feature learning. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5400–5409, 2018b.
  • Li and Liang [2021] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021.
  • Liang et al. [2020] Jian Liang, Dapeng Hu, and Jiashi Feng. Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation. In International Conference on Machine Learning, pages 6028–6039. PMLR, 2020.
  • Lim et al. [2023] Hyesu Lim, Byeonggeun Kim, Jaegul Choo, and Sungha Choi. Ttn: A domain-shift aware batch normalization in test-time adaptation. In International Conference on Learning Representations, 2023.
  • Liu et al. [2023] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35, 2023.
  • Liu et al. [2021a] Xiaofeng Liu, Zhenhua Guo, Site Li, Fangxu Xing, Jane You, C-C Jay Kuo, Georges El Fakhri, and Jonghye Woo. Adversarial unsupervised domain adaptation with conditional and label shift: Infer, align and iterate. In IEEE International Conference on Computer Vision, pages 10367–10376, 2021a.
  • Liu et al. [2022] Xiaofeng Liu, Chaehwa Yoo, Fangxu Xing, Hyejin Oh, Georges El Fakhri, Je-Won Kang, Jonghye Woo, et al. Deep unsupervised domain adaptation: A review of recent advances and perspectives. APSIPA Transactions on Signal and Information Processing, 11(1), 2022.
  • Liu et al. [2021b] Yuejiang Liu, Parth Kothari, Bastien van Delft, Baptiste Bellot-Gurlet, Taylor Mordan, and Alexandre Alahi. Ttt++: When does self-supervised test-time training fail or thrive? In Advances in Neural Information Processing Systems, 2021b.
  • Long et al. [2015] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning, pages 97–105. PMLR, 2015.
  • Maji et al. [2013] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
  • Motiian et al. [2017] Saeid Motiian, Marco Piccirilli, Donald A Adjeroh, and Gianfranco Doretto. Unified deep supervised domain adaptation and generalization. In IEEE International Conference on Computer Vision, pages 5715–5725, 2017.
  • Muandet et al. [2013] Krikamol Muandet, David Balduzzi, and Bernhard Schölkopf. Domain generalization via invariant feature representation. In International Conference on Machine Learning, pages 10–18. PMLR, 2013.
  • Nilsback and Zisserman [2008] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian conference on computer vision, graphics & image processing, pages 722–729. IEEE, 2008.
  • Niu et al. [2022] Shuaicheng Niu, Jiaxiang Wu, Yifan Zhang, Yaofo Chen, Shijian Zheng, Peilin Zhao, and Mingkui Tan. Efficient test-time model adaptation without forgetting. In International Conference on Machine Learning, pages 16888–16905. PMLR, 2022.
  • Niu et al. [2023] Shuaicheng Niu, Jiaxiang Wu, Yifan Zhang, Zhiquan Wen, Yaofo Chen, Peilin Zhao, and Mingkui Tan. Towards stable test-time adaptation in dynamic wild world. In International Conference on Learning Representations, 2023.
  • Novack et al. [2023] Zachary Novack, Julian McAuley, Zachary Chase Lipton, and Saurabh Garg. Chils: Zero-shot image classification with hierarchical label sets. In International Conference on Machine Learning, pages 26342–26362. PMLR, 2023.
  • Park et al. [2023] Sunghyun Park, Seunghan Yang, Jaegul Choo, and Sungrack Yun. Label shift adapter for test-time adaptation under covariate and label shifts. In IEEE International Conference on Computer Vision, pages 16421–16431, 2023.
  • Parkhi et al. [2012] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3498–3505. IEEE, 2012.
  • Peng et al. [2019] Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. In IEEE International Conference on Computer Vision, pages 1406–1415, 2019.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  • Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  • Recht et al. [2019] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? In International Conference on Machine Learning, pages 5389–5400. PMLR, 2019.
  • Roberts et al. [2022] Manley Roberts, Pranav Mani, Saurabh Garg, and Zachary Lipton. Unsupervised learning under latent label shift. In Advances in Neural Information Processing Systems, pages 18763–18778, 2022.
  • Roth et al. [2023] Karsten Roth, Jae Myung Kim, A Koepke, Oriol Vinyals, Cordelia Schmid, and Zeynep Akata. Waffling around for performance: Visual classification with random words and broad concepts. arXiv preprint arXiv:2306.07282, 2023.
  • Russell et al. [2008] Bryan C Russell, Antonio Torralba, Kevin P Murphy, and William T Freeman. Labelme: a database and web-based tool for image annotation. International Journal of Computer Vision, 77(1-3):157–173, 2008.
  • Samadh et al. [2023] Jameel Hassan Abdul Samadh, Hanan Gani, Noor Hazim Hussein, Muhammad Uzair Khattak, Muzammal Naseer, Fahad Khan, and Salman Khan. Align your prompts: Test-time prompting with distribution alignment for zero-shot generalization. In Advances in Neural Information Processing Systems, 2023.
  • Santurkar et al. [2020] Shibani Santurkar, Dimitris Tsipras, and Aleksander Madry. Breeds: Benchmarks for subpopulation shift. arXiv preprint arXiv:2008.04859, 2020.
  • Schneider et al. [2020] Steffen Schneider, Evgenia Rusak, Luisa Eck, Oliver Bringmann, Wieland Brendel, and Matthias Bethge. Improving robustness against common corruptions by covariate shift adaptation. In Advances in Neural Information Processing Systems, pages 11539–11551, 2020.
  • Shen et al. [2022] Jiayi Shen, Zehao Xiao, Xiantong Zhen, Cees Snoek, and Marcel Worring. Association graph learning for multi-task classification with category shifts. In Advances in Neural Information Processing Systems, pages 4503–4516, 2022.
  • Shu et al. [2022] Manli Shu, Weili Nie, De-An Huang, Zhiding Yu, Tom Goldstein, Anima Anandkumar, and Chaowei Xiao. Test-time prompt tuning for zero-shot generalization in vision-language models. In Advances in Neural Information Processing Systems, pages 14274–14289, 2022.
  • Shu et al. [2021] Yang Shu, Zhangjie Cao, Chenyu Wang, Jianmin Wang, and Mingsheng Long. Open domain generalization with domain-augmented meta-learning. In IEEE Conference on Computer Vision and Pattern Recognition, pages 9624–9633, 2021.
  • Soomro et al. [2012] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
  • Sun et al. [2020] Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. In International Conference on Machine Learning, pages 9229–9248. PMLR, 2020.
  • Tachet des Combes et al. [2020] Remi Tachet des Combes, Han Zhao, Yu-Xiang Wang, and Geoffrey J Gordon. Domain adaptation with conditional distribution matching and generalized label shift. In Advances in Neural Information Processing Systems, pages 19276–19289, 2020.
  • Venkateswara et al. [2017] Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5018–5027, 2017.
  • Wang et al. [2021] Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization. In International Conference on Learning Representations, 2021.
  • Wang et al. [2019] Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. In Advances in Neural Information Processing Systems, 2019.
  • Wang and Deng [2018] Mei Wang and Weihong Deng. Deep visual domain adaptation: A survey. Neurocomputing, 312:135–153, 2018.
  • Wu et al. [2021] Ruihan Wu, Chuan Guo, Yi Su, and Kilian Q Weinberger. Online adaptation to label distribution shift. In Advances in Neural Information Processing Systems, pages 11340–11351, 2021.
  • Xian et al. [2018] Yongqin Xian, Christoph H Lampert, Bernt Schiele, and Zeynep Akata. Zero-shot learning—a comprehensive evaluation of the good, the bad and the ugly. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(9):2251–2265, 2018.
  • Xiao et al. [2010] Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 3485–3492. IEEE, 2010.
  • Xiao et al. [2021] Zehao Xiao, Jiayi Shen, Xiantong Zhen, Ling Shao, and Cees G M Snoek. A bit more bayesian: Domain-invariant learning with uncertainty. In International Conference on Machine Learning. PMLR, 2021.
  • Xiao et al. [2022] Zehao Xiao, Xiantong Zhen, Ling Shao, and Cees G M Snoek. Learning to generalize across domains on single test samples. In International Conference on Learning Representations, 2022.
  • Yao et al. [2023] Hantao Yao, Rui Zhang, and Changsheng Xu. Visual-language prompt tuning with knowledge-guided context optimization. In IEEE Conference on Computer Vision and Pattern Recognition, pages 6757–6767, 2023.
  • Yi et al. [2023] Li Yi, Gezheng Xu, Pengcheng Xu, Jiaqi Li, Ruizhi Pu, Charles Ling, A Ian McLeod, and Boyu Wang. When source-free domain adaptation meets learning with noisy labels. In International Conference on Learning Representations, 2023.
  • Zhang et al. [2013] Kun Zhang, Bernhard Schölkopf, Krikamol Muandet, and Zhikun Wang. Domain adaptation under target and conditional shift. In International Conference on Machine Learning, pages 819–827. PMLR, 2013.
  • Zhang et al. [2022] Marvin Zhang, Sergey Levine, and Chelsea Finn. Memo: Test time robustness via adaptation and augmentation. In Advances in Neural Information Processing Systems, pages 38629–38642, 2022.
  • Zhang et al. [2021] Xin Zhang, Shixiang Shane Gu, Yutaka Matsuo, and Yusuke Iwasawa. Domain prompt learning for efficiently adapting clip to unseen domains. arXiv e-prints, pages arXiv–2111, 2021.
  • Zhang et al. [2023] Yifan Zhang, Xue Wang, Kexin Jin, Kun Yuan, Zhang Zhang, Liang Wang, Rong Jin, and Tieniu Tan. Adanpc: Exploring non-parametric classifier for test-time adaptation. In International Conference on Machine Learning, pages 41647–41676. PMLR, 2023.
  • Zhou et al. [2022a] Kaiyang Zhou, Ziwei Liu, Yu Qiao, Tao Xiang, and Chen Change Loy. Domain generalization: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022a.
  • Zhou et al. [2022b] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In IEEE Conference on Computer Vision and Pattern Recognition, pages 16816–16825, 2022b.
  • Zhou et al. [2022c] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022c.
  • Zhu et al. [2023] Beier Zhu, Yulei Niu, Yucheng Han, Yue Wu, and Hanwang Zhang. Prompt-aligned gradient for prompt tuning. In IEEE International Conference on Computer Vision, pages 15659–15669, 2023.

Appendix A Derivations of any-shift prompting

In the main paper, we provide the modeling of our any-shift prompting. Here we provide further derivations of the optimizations of the prior and posterior distributions.

To model the information of training and test distributions and their relationships, we propose any-shift prompting within a hierarchical framework. We introduce training and test prompts as latent variables in the hierarchical probabilistic architecture; the prediction function of the CLIP model is then formulated as:

\begin{equation}
\begin{aligned}
p_{\Phi,\bm{\theta}}(\mathbf{y}_{t}|\mathbf{x}_{t},\mathcal{Y}_{t},\mathcal{D}_{s})
&= \int\int p(\mathbf{y}_{t},\mathbf{v}_{t},\mathbf{v}_{s}|\mathbf{x}_{t},\mathcal{Y}_{t},\mathbf{x}_{s},\mathbf{y}_{s},\mathcal{Y}_{s})\, d\mathbf{v}_{t}\, d\mathbf{v}_{s} \\
&= \int\int p(\mathbf{y}_{t}|\mathbf{x}_{t},\mathbf{v}_{t},\mathcal{Y}_{t})\, p(\mathbf{v}_{t},\mathbf{v}_{s}|\mathbf{x}_{t},\mathcal{Y}_{t},\mathcal{D}_{s})\, d\mathbf{v}_{t}\, d\mathbf{v}_{s} \\
&= \int\int p_{\Phi}(\mathbf{y}_{t}|\mathbf{x}_{t},\mathbf{v}_{t},\mathcal{Y}_{t})\, p_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}_{s},\mathbf{x}_{t},\mathcal{Y}_{t})\, p(\mathbf{v}_{s}|\mathcal{D}_{s})\, d\mathbf{v}_{t}\, d\mathbf{v}_{s},
\end{aligned}
\tag{15}
\end{equation}

where the prior distribution of the training and test prompts is factorized as

\begin{equation}
p(\mathbf{v}_{t},\mathbf{v}_{s}|\mathbf{x}_{t},\mathcal{Y}_{t},\mathcal{D}_{s}) = p_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}_{s},\mathbf{x}_{t},\mathcal{Y}_{t})\, p(\mathbf{v}_{s}|\mathcal{D}_{s}).
\tag{16}
\end{equation}

$p(\mathbf{v}_{s}|\mathcal{D}_{s})$ is learned from the training data $\mathcal{D}_{s}$ sampled from the training distribution $p(\mathbf{x}_{s},\mathbf{y}_{s})$. $p_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}_{s},\mathbf{x}_{t},\mathcal{Y}_{t})$ is the prior over the test prompt, which aggregates both training information from $\mathbf{v}_{s}$ and test information from the test image $\mathbf{x}_{t}$ and class names $\mathcal{Y}_{t}$. The test prompt exploits the relationships between training and test distributions through the transformer inference network $\bm{\theta}$. $\mathbf{v}_{t}$ is then fed into the frozen image and text encoders $\Phi=\{\Phi_{I},\Phi_{T}\}$ to generalize the CLIP model to the test data.
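The factorization in eq. (16) suggests estimating the integral in eq. (15) by ancestral sampling: draw the training prompt first, then the test prompt conditioned on it, and average the resulting class probabilities. The following is a minimal sketch under stated assumptions, not the authors' implementation: both prompts are taken to be diagonal Gaussians, and `toy_test_prior` and `toy_classifier` are hypothetical stand-ins for the transformer inference network and the prompted CLIP model.

```python
import math
import random


def sample_gaussian(mean, std, rng):
    """Draw one sample from a diagonal Gaussian (assumed prompt distribution)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]


def softmax(logits):
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predictive(x_t, class_embeds, vs_mean, vs_std, test_prior, classifier,
               n_samples=8, seed=0):
    """Monte Carlo estimate of p(y_t | x_t, Y_t, D_s) following eqs. (15)-(16)."""
    rng = random.Random(seed)
    probs = [0.0] * len(class_embeds)
    for _ in range(n_samples):
        # v_s ~ p(v_s | D_s): the learned training prompt distribution.
        v_s = sample_gaussian(vs_mean, vs_std, rng)
        # v_t ~ p_theta(v_t | v_s, x_t, Y_t): the test prompt prior.
        vt_mean, vt_std = test_prior(v_s, x_t, class_embeds)
        v_t = sample_gaussian(vt_mean, vt_std, rng)
        # p_Phi(y_t | x_t, v_t, Y_t): class probabilities given the test prompt.
        p = classifier(x_t, v_t, class_embeds)
        probs = [acc + value / n_samples for acc, value in zip(probs, p)]
    return probs


def toy_test_prior(v_s, x_t, class_embeds):
    """Hypothetical stand-in for the inference network (class_embeds unused here)."""
    mean = [0.5 * (a + b) for a, b in zip(v_s, x_t)]
    return mean, [0.1] * len(mean)


def toy_classifier(x_t, v_t, class_embeds):
    """Hypothetical stand-in for prompted CLIP: dot-product logits, then softmax."""
    feature = [a + b for a, b in zip(x_t, v_t)]
    logits = [sum(f * c for f, c in zip(feature, emb)) for emb in class_embeds]
    return softmax(logits)


probs = predictive(
    x_t=[1.0, 0.0],
    class_embeds=[[1.0, 0.0], [0.0, 1.0]],
    vs_mean=[0.0, 0.0],
    vs_std=[1.0, 1.0],
    test_prior=toy_test_prior,
    classifier=toy_classifier,
)
```

Because each sampled prediction is a valid distribution over classes, their average is one as well; more samples would reduce the variance of the estimate at a proportional compute cost.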

To optimize the model for generating the probabilistic training and test prompts, we further introduce variational inference to approximate the true posterior $p(\mathbf{v}_{t},\mathbf{v}_{s}|\mathcal{D}_{t},\mathcal{Y}_{t},\mathcal{D}_{s})$ of the latent variables in eq. (15). The variational posterior is factorized as:

\begin{equation}
q_{\bm{\theta}}(\mathbf{v}_{t},\mathbf{v}_{s}|\mathcal{D}_{t},\mathcal{Y}_{t},\mathcal{D}_{s}) = q_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}_{s},\mathcal{D}_{t},\mathcal{Y}_{t})\, p(\mathbf{v}_{s}|\mathcal{D}_{s}),
\tag{17}
\end{equation}

where $\mathcal{D}_{t}$ consists of test input-output pairs sampled from the test distribution $p(\mathbf{x}_{t},\mathbf{y}_{t})$. The variational posterior shares the same inference model $\bm{\theta}$ with the prior distribution. By integrating eq. (17) into eq. (15), the evidence lower bound (ELBO) of the log-likelihood $\log p_{\Phi,\bm{\theta}}(\mathbf{y}_{t}|\mathbf{x}_{t},\mathcal{Y}_{t},\mathcal{D}_{s})$ is derived as:

\begin{equation}
\begin{aligned}
\log p_{\Phi,\bm{\theta}}(\mathbf{y}_{t}|\mathbf{x}_{t},\mathcal{Y}_{t},\mathcal{D}_{s})
&= \log \int\int p(\mathbf{y}_{t}|\mathbf{x}_{t},\mathbf{v}_{t},\mathcal{Y}_{t})\, p(\mathbf{v}_{t},\mathbf{v}_{s}|\mathbf{x}_{t},\mathcal{Y}_{t},\mathcal{D}_{s})\, d\mathbf{v}_{t}\, d\mathbf{v}_{s} \\
&= \log \int\int p(\mathbf{y}_{t}|\mathbf{x}_{t},\mathbf{v}_{t},\mathcal{Y}_{t})\, q_{\bm{\theta}}(\mathbf{v}_{t},\mathbf{v}_{s}|\mathcal{D}_{t},\mathcal{Y}_{t},\mathcal{D}_{s})\, \frac{p(\mathbf{v}_{t},\mathbf{v}_{s}|\mathbf{x}_{t},\mathcal{Y}_{t},\mathcal{D}_{s})}{q_{\bm{\theta}}(\mathbf{v}_{t},\mathbf{v}_{s}|\mathcal{D}_{t},\mathcal{Y}_{t},\mathcal{D}_{s})}\, d\mathbf{v}_{t}\, d\mathbf{v}_{s} \\
&= \log \int\int p(\mathbf{y}_{t}|\mathbf{x}_{t},\mathbf{v}_{t},\mathcal{Y}_{t})\, q_{\bm{\theta}}(\mathbf{v}_{t},\mathbf{v}_{s}|\mathcal{D}_{t},\mathcal{Y}_{t},\mathcal{D}_{s})\, \frac{p_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}_{s},\mathbf{x}_{t},\mathcal{Y}_{t})\, p(\mathbf{v}_{s}|\mathcal{D}_{s})}{q_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}_{s},\mathcal{D}_{t},\mathcal{Y}_{t})\, p(\mathbf{v}_{s}|\mathcal{D}_{s})}\, d\mathbf{v}_{t}\, d\mathbf{v}_{s} \\
&= \log \int\int p(\mathbf{y}_{t}|\mathbf{x}_{t},\mathbf{v}_{t},\mathcal{Y}_{t})\, q_{\bm{\theta}}(\mathbf{v}_{t},\mathbf{v}_{s}|\mathcal{D}_{t},\mathcal{Y}_{t},\mathcal{D}_{s})\, \frac{p_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}_{s},\mathbf{x}_{t},\mathcal{Y}_{t})}{q_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}_{s},\mathcal{D}_{t},\mathcal{Y}_{t})}\, d\mathbf{v}_{t}\, d\mathbf{v}_{s} \\
&\geq \mathbb{E}_{q_{\bm{\theta}}(\mathbf{v}_{t},\mathbf{v}_{s})}\big[\log p_{\Phi}(\mathbf{y}_{t}|\mathbf{x}_{t},\mathbf{v}_{t},\mathcal{Y}_{t})\big] \\
&\quad - \mathbb{D}_{\mathrm{KL}}\big[q_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}_{s},\mathcal{D}_{t},\mathcal{Y}_{t})\,||\,p_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}_{s},\mathbf{x}_{t},\mathcal{Y}_{t})\big],
\end{aligned}
\tag{18}
\end{equation}

where the expectation of the log-likelihood is calculated on the variational posterior distribution $q_{\bm{\theta}}(\mathbf{v}_{t},\mathbf{v}_{s}|\mathcal{D}_{t},\mathcal{Y}_{t},\mathcal{D}_{s})$.
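The inequality in eq. (18) follows from Jensen's inequality applied to the concave logarithm. A sketch of the omitted step is given below, where $q$ is our shorthand (not the paper's notation) for the joint variational posterior $q_{\bm{\theta}}(\mathbf{v}_{t},\mathbf{v}_{s}|\mathcal{D}_{t},\mathcal{Y}_{t},\mathcal{D}_{s})$ of eq. (17):

```latex
\begin{align*}
\log \mathbb{E}_{q}\!\left[
  p(\mathbf{y}_{t}|\mathbf{x}_{t},\mathbf{v}_{t},\mathcal{Y}_{t})\,
  \frac{p_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}_{s},\mathbf{x}_{t},\mathcal{Y}_{t})}
       {q_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}_{s},\mathcal{D}_{t},\mathcal{Y}_{t})}
\right]
&\geq \mathbb{E}_{q}\!\left[\log p(\mathbf{y}_{t}|\mathbf{x}_{t},\mathbf{v}_{t},\mathcal{Y}_{t})\right]
 + \mathbb{E}_{q}\!\left[\log
  \frac{p_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}_{s},\mathbf{x}_{t},\mathcal{Y}_{t})}
       {q_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}_{s},\mathcal{D}_{t},\mathcal{Y}_{t})}\right] \\
&= \mathbb{E}_{q}\!\left[\log p(\mathbf{y}_{t}|\mathbf{x}_{t},\mathbf{v}_{t},\mathcal{Y}_{t})\right]
 - \mathbb{E}_{p(\mathbf{v}_{s}|\mathcal{D}_{s})}\!\left[
   \mathbb{D}_{\mathrm{KL}}\big[
     q_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}_{s},\mathcal{D}_{t},\mathcal{Y}_{t})
     \,||\,
     p_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}_{s},\mathbf{x}_{t},\mathcal{Y}_{t})
   \big]\right].
\end{align*}
```

Under this reading, the KL term in eq. (18) is averaged over $\mathbf{v}_{s}\sim p(\mathbf{v}_{s}|\mathcal{D}_{s})$, since both conditionals depend on $\mathbf{v}_{s}$; eq. (18) appears to leave this outer expectation implicit.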

Our goal is to maximize the log-likelihood of the test data $\log p_{\Phi,\bm{\theta}}(\mathbf{y}_{t}|\mathbf{x}_{t},\mathcal{Y}_{t},\mathcal{D}_{s})$. We do so by maximizing the ELBO in eq. (18), which is equivalent to minimizing the corresponding upper bound on the negative log-likelihood. The loss function for optimizing any-shift prompting is therefore the right-hand side of:

\begin{equation}
\begin{aligned}
-\log p_{\Phi,\bm{\theta}}(\mathbf{y}_{t}|\mathbf{x}_{t},\mathcal{Y}_{t},\mathcal{D}_{s})
&\leq \mathbb{E}_{q_{\bm{\theta}}(\mathbf{v}_{t},\mathbf{v}_{s})}\big[-\log p_{\Phi}(\mathbf{y}_{t}|\mathbf{x}_{t},\mathbf{v}_{t},\mathcal{Y}_{t})\big] \\
&\quad + \mathbb{D}_{\mathrm{KL}}\big[q_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}_{s},\mathcal{D}_{t},\mathcal{Y}_{t})\,||\,p_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}_{s},\mathbf{x}_{t},\mathcal{Y}_{t})\big].
\end{aligned}
\tag{19}
\end{equation}
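To make the two terms of eq. (19) concrete, the sketch below computes a single-sample estimate of the bound: a cross-entropy term plus a KL term between the variational posterior and the prior of the test prompt. It assumes both distributions are diagonal Gaussians, for which the KL divergence has a closed form; the function names and the toy inputs are illustrative and not taken from the paper.

```python
import math


def kl_diag_gaussians(mu_q, std_q, mu_p, std_p):
    """KL[q || p] between two diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, std_q, mu_p, std_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5
    return total


def any_shift_loss(class_probs, label, mu_q, std_q, mu_p, std_p):
    """Single-sample estimate of the right-hand side of eq. (19).

    class_probs: p_Phi(y_t | x_t, v_t, Y_t), computed with v_t sampled from
                 the variational posterior (assumed to be done by the caller).
    label:       index of the ground-truth class.
    """
    nll = -math.log(class_probs[label])
    kl = kl_diag_gaussians(mu_q, std_q, mu_p, std_p)
    return nll + kl


# Toy values: when posterior and prior coincide, only the NLL term remains.
loss_same = any_shift_loss([0.7, 0.2, 0.1], 0, [0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
# Shifting the posterior mean by 1 in one unit-variance dimension adds a KL of 0.5.
kl_shifted = kl_diag_gaussians([1.0], [1.0], [0.0], [1.0])
```

The KL term pulls the prior, which only sees $\mathbf{x}_{t}$ and $\mathcal{Y}_{t}$, toward the posterior, which also sees labels; this is what would allow the prior to be used alone at test time.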

Appendix B Details of setting and implementations

B.1 Details of datasets and settings

Covariate shift. We conduct the covariate-shift experiments in two settings: multiple training distributions and a single training distribution. The experiments on multiple training distributions use the domain generalization datasets PACS, VLCS, Office-Home, and DomainNet, each of which contains multiple domains of images with the same label space. PACS [34] includes images of 7 classes from four domains: photo, art-painting, cartoon, and sketch. VLCS [14] consists of images of 5 classes from four datasets: Pascal-VOC2007 [13], LabelMe [60], Caltech101 [20], and SUN [6]. Office-Home also contains four domains (art, clipart, product, and real-world), but its images come from 65 categories, many more than in PACS and VLCS. DomainNet is larger still, with images from six domains (clipart, infograph, painting, quickdraw, real, and sketch) and 345 categories. We follow the "leave-one-out protocol" [34] on these datasets: one domain is selected as the test distribution, and the remaining domains are treated as the training distributions. The model is trained on the training distributions and evaluated on the test one. We treat each domain as the test distribution in turn and report the results averaged over all test distributions in Table 2 in the main paper. The detailed results for each test distribution are reported in the following section.
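The leave-one-out protocol described above can be sketched as a simple loop over domains. This is an illustrative outline only; the helper names are ours, and the accuracies in the usage note are toy values rather than results from the paper.

```python
def leave_one_domain_out(domains):
    """Yield (training_domains, test_domain) pairs, holding out each domain once."""
    for held_out in domains:
        training = [d for d in domains if d != held_out]
        yield training, held_out


def average_accuracy(per_domain_accuracy):
    """Average the per-test-domain accuracies, as done for the summary tables."""
    return sum(per_domain_accuracy.values()) / len(per_domain_accuracy)


pacs_domains = ["photo", "art-painting", "cartoon", "sketch"]
splits = list(leave_one_domain_out(pacs_domains))
# splits[0] trains on art-painting, cartoon, and sketch, and tests on photo.
```

For instance, `average_accuracy({"photo": 80.0, "sketch": 90.0})` returns `85.0`; in a real run the dictionary would hold one accuracy per held-out domain.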

The experiments on a single training distribution follow the domain generalization setting in Zhou et al. [86], where the model is trained on ImageNet (1,000 categories) and evaluated on four variants with the same label space: ImageNet-V2 [57], ImageNet-(S)ketch [72], ImageNet-A [26], and ImageNet-R [25]. Most of the above datasets have shifts in the images, i.e., in the marginal input distribution $p(\mathbf{x})$. Therefore, we use these datasets to evaluate our method under covariate shift.

Label shift. We conduct the experiments on label shift following the base-to-new classification setting in Zhou et al. [87]. In this case, the distribution shift occurs in the marginal output distribution $p(\mathbf{y})$, where the "new" classes have $p(\mathbf{y}_{c})=0$ during training. We use eleven benchmarks with label shift. The benchmarks include the general classification datasets ImageNet [8] and Caltech101 [15]; the fine-grained classification datasets OxfordPets [53], StanfordCars [31], Flowers102 [48], Food101 [5], and FGVCAircraft [45]; the scene recognition dataset SUN397 [76]; the action recognition dataset UCF101 [67]; the texture classification dataset DTD [7]; and the satellite image recognition dataset EuroSAT [23]. We follow the same base-new class split and evaluation set as Zhou et al. [86].
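As a rough illustration of the base-to-new setting, the sketch below splits a class list into base and new classes and checks that the new classes have zero probability mass in the training labels. The equal-halves rule is an assumption made for this example; the exact split should be taken from the protocol of Zhou et al. that the experiments follow. The class names and labels are toy data.

```python
def split_base_new(class_names):
    """Split classes into base (first half) and new (second half).

    Assumption for illustration: an equal split in list order.
    """
    half = (len(class_names) + 1) // 2
    return class_names[:half], class_names[half:]


def label_marginal(labels, class_names):
    """Empirical marginal p(y) over class_names from a list of labels."""
    counts = {name: 0 for name in class_names}
    for label in labels:
        counts[label] += 1
    return {name: counts[name] / len(labels) for name in class_names}


classes = ["cat", "dog", "car", "plane"]  # toy class names
base, new = split_base_new(classes)

# Training data contains base classes only, so p(y_c) = 0 for every new class.
train_labels = ["cat", "dog", "dog", "cat"]
p_train = label_marginal(train_labels, classes)
```

At test time the model is evaluated on the new classes as well, which is where the shift in $p(\mathbf{y})$ appears.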

Concept shift. We approximate the concept shift by relabeling the ImageNet dataset with the superclasses in [62]. The model is trained on the original classes and evaluated on the superclasses. In this case, the marginal input distribution $p(\mathbf{x})$ is the same, while the conditional distributions $p(\mathbf{y}|\mathbf{x})$ are different between training and test data.

Conditional shift. For conditional shift, we evaluate the proposed method on two subpopulation datasets, Living-17 and Entity-30 [62], which contain images of 17 animal categories and images of 30 entities, respectively. We follow the training and test split in [17], where the training and test distributions have the same overall classes but contain different subpopulations of those classes. In this case, the marginal output distributions $p(\mathbf{y})$ of training and test data are the same, while the input distributions within each category change, i.e., the class-conditional distributions $p(\mathbf{x}|\mathbf{y})$ are different. Therefore, we treat the setting as conditional shift.

Joint shift. To evaluate the proposed method on joint shift, we conduct experiments on Office-Home under the open domain generalization setting [66], which we refer to as Open-Office-Home. We split the label space of the 65 classes so that different domains have different label spaces. The split of classes is shown in Table 7. Therefore, there are both covariate shift and label shift between the training and test distributions, which we treat as joint shift on $p(\mathbf{x},\mathbf{y})$.

| Domains | Classes |
|---|---|
| Source 1 | 0 - 2, 3 - 8, 9 - 14, 21 - 31 |
| Source 2 | 0 - 2, 3 - 8, 15 - 20, 32 - 42 |
| Source 3 | 0 - 2, 9 - 14, 15 - 20, 43 - 53 |
| Target | 0 - 64 |

Table 7: Classes split for joint distribution shifts on Open-Office-Home. We use the numbers to denote the class names. The setting contains both covariate and label shifts, leading to joint shifts on $p(\mathbf{x},\mathbf{y})$.
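The class ranges in Table 7 can be checked mechanically to see how the label spaces overlap. The short sketch below encodes the table as inclusive ranges (the variable and function names are ours) and derives which target classes never appear in any source domain.

```python
def expand(ranges):
    """Turn a list of inclusive (low, high) ranges into a set of class indices."""
    classes = set()
    for low, high in ranges:
        classes.update(range(low, high + 1))
    return classes


# Class indices per domain, copied from Table 7.
SOURCES = {
    "Source 1": expand([(0, 2), (3, 8), (9, 14), (21, 31)]),
    "Source 2": expand([(0, 2), (3, 8), (15, 20), (32, 42)]),
    "Source 3": expand([(0, 2), (9, 14), (15, 20), (43, 53)]),
}
TARGET = expand([(0, 64)])

seen = set().union(*SOURCES.values())         # classes present in at least one source
unseen = TARGET - seen                        # target classes absent from all sources
shared = set.intersection(*SOURCES.values())  # classes present in every source

print(len(seen), sorted(unseen), sorted(shared))
# prints: 54 [54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64] [0, 1, 2]
```

Under this split, each source domain covers 26 classes, only classes 0 to 2 are common to all three sources, and classes 54 to 64 occur only in the target domain, which is the label-shift component of the joint shift.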

B.2 Implementations and hyperparameters

For all experiments, we train and evaluate the model on a single NVIDIA V100 GPU. We use the same backbone and transformer inference network for all datasets. The backbone is the frozen CLIP model with ViT-B/16 as the image encoder. The transformer inference network consists of a 2-layer transformer and 2 MLP layers that generate the distribution of the test prompt. There are also two trainable vectors serving as the mean and variance of the probabilistic training prompt, and trainable position embeddings for the image and text features, respectively. The sampled test prompt is then fed into both the image and text encoders to generalize the features and classifiers. We provide an illustration in Figure 6. Note that the test prompt is used as tokens of the image and text encoders. To match the size of the encoder inputs, we use two linear layers to project the test prompt into the image patch and text embedding spaces, respectively.

Figure 6: Overall framework of generating the any-shift prompt and generalizing the CLIP model.

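| Hyperparameter | ImageNet | Caltech101 | OxfordPets | StanfordCars | Flowers102 | Food101 | FGVC | SUN397 | DTD | EuroSAT | UCF101 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Learning rate | $2e{-}3$ | $2e{-}3$ | $2e{-}3$ | $2e{-}3$ | $2e{-}3$ | $2e{-}3$ | $2e{-}3$ | $2e{-}3$ | $2e{-}3$ | $2e{-}3$ | $2e{-}3$ |
| Optimizer | SGD | SGD | SGD | SGD | SGD | SGD | SGD | SGD | SGD | SGD | SGD |
| Batch size | 1 | 4 | 8 | 6 | 4 | 4 | 4 | 2 | 8 | 10 | 4 |
| Epochs | 10 | 30 | 30 | 30 | 30 | 30 | 30 | 30 | 30 | 30 | 30 |

Table 8: Dataset-specific hyperparameters for label shift datasets and ImageNet-based datasets. The ImageNet-based covariate shift, label shift, and concept shift datasets use the same hyperparameters.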

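| Hyperparameter | PACS | VLCS | Office-Home | Open-Office-Home | DomainNet | Living-17 | Entity-30 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Learning rate | $5e{-}4$ | $5e{-}4$ | $5e{-}4$ | $5e{-}4$ | $5e{-}4$ | $5e{-}4$ | $5e{-}4$ |
| Optimizer | Adam | Adam | Adam | Adam | Adam | Adam | Adam |
| Training iterations | 3,000 iterations | 3,000 iterations | 3,000 iterations | 3,000 iterations | 10,000 iterations | 30 epochs | 30 epochs |
| Batch size | 32 | 32 | 8 | 8 | 2 | 32 | 16 |

Table 9: Dataset-specific batch sizes for common domain generalization datasets and conditional shift datasets.
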
| Method | Iterations | Art | Clipart | Product | Real | Mean |
| --- | --- | --- | --- | --- | --- | --- |
| CLIP baseline | - | 79.32 | 67.70 | 86.93 | 87.46 | 80.35 |
| Transformer adapter | 20,000 | 78.76 | 64.62 | 87.98 | 84.83 | 79.05 |
| Any-shift prompt | 3,000 | 83.40 | 72.53 | 91.24 | 90.84 | 84.50 |

Table 10: Benefits of generalization with any-shift prompting (accuracy per test domain). Directly training a transformer as an adapter of the image and textual features still easily leads to overfitting. By aggregating the training, test, and relationship information into the prompt, any-shift prompting achieves better generalization.

| Inference network | Art | Clipart | Product | Real | Mean |
| --- | --- | --- | --- | --- | --- |
| CLIP baseline | 79.32 | 67.70 | 86.93 | 87.46 | 80.35 |
| Averaging | 82.27 | 70.91 | 89.95 | 89.66 | 83.20 |
| MLP | 82.48 | 71.09 | 90.18 | 89.73 | 83.37 |
| Transformer | 83.40 | 72.53 | 91.24 | 90.84 | 84.50 |

Table 11: Ablations on the aggregation methods. The transformer inference network performs best since it better encodes the relationships between different information.

| Method | ImageNet (source) | Caltech101 | OxfordPets | StanfordCars | Flowers102 | Food101 | FGVCAircraft | SUN397 | DTD | EuroSAT | UCF101 | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CoOp [87] | 71.51 | 93.70 | 89.14 | 64.51 | 68.71 | 85.30 | 18.47 | 64.15 | 41.92 | 46.39 | 66.55 | 63.88 |
| CoCoOp [86] | 71.02 | 94.43 | 90.14 | 65.32 | 71.88 | 86.06 | 22.94 | 67.36 | 45.73 | 45.37 | 68.21 | 65.74 |
| TPT [65] | 68.98 | 94.16 | 87.79 | 66.87 | 68.98 | 84.67 | 24.78 | 65.50 | 47.75 | 42.44 | 68.04 | 65.10 |
| BPL [9] | 70.70 | 93.67 | 90.63 | 65.00 | 70.90 | 86.30 | 24.93 | 67.47 | 46.10 | 45.87 | 68.67 | 65.95 |
| MaPLe [29] | 70.72 | 93.53 | 90.49 | 65.57 | 72.23 | 86.20 | 24.74 | 67.01 | 46.49 | 48.06 | 68.69 | 66.30 |
| This paper | 71.05 | 94.57 | 90.79 | 66.90 | 72.30 | 86.17 | 25.16 | 67.32 | 47.35 | 50.25 | 69.52 | 67.03 |

Table 12: Comparison of prompt learning methods in the cross-dataset transfer setting. ImageNet is the source dataset and the other 10 datasets are targets. Our method achieves the best average performance over the 10 target datasets.

Beyond the architecture and settings shared by all datasets, we also provide the dataset-specific hyperparameters. Batch size varies per dataset (Tables 8 and 9). For the label shift experiments (eleven datasets) and the other ImageNet-based experiments (ImageNet-based covariate shift and concept shift), we use the same learning rate of $2e{-}3$ as Zhou et al. [86] with SGD. The dataset-specific batch sizes and epochs are provided in Table 8. For the covariate shift datasets PACS, VLCS, Office-Home, and DomainNet and the joint shift dataset Open-Office-Home, we train the model with the Adam optimizer at a learning rate of $5e{-}4$ for 3,000 iterations, unless Table 9 lists a different iteration count. For the conditional shift datasets Living-17 and Entity-30, we use the same learning rate of $5e{-}4$ and the Adam optimizer for 30 epochs. The details are shown in Table 9.

Appendix C More ablations and comparisons

Benefits of generalization with prompts

In our any-shift prompting, we generate the test prompt by aggregating the training information and the test information with a transformer inference network. The test information comes from the image and textual features of the CLIP model. Instead of generating a prompt for the CLIP model, another way to achieve generalization is to adapt the image and textual features directly with the transformer network and make predictions from the adapted features. To show the benefits of generalization with our any-shift prompting, we conduct an experiment that adapts the image and textual features using the same transformer inference network, which we refer to as "Transformer adapter". The experimental results on Open-Office-Home are reported in Table 10. On average, the transformer adapter performs even worse than the CLIP baseline (79.05 vs. 80.35 mean accuracy), since it still easily overfits the training distribution. Moreover, the transformer adapter requires far more training (20,000 iterations) than any-shift prompting (3,000 iterations). The results demonstrate both the effectiveness and efficiency of our any-shift prompting for generalization across distribution shifts.

| Method | Photo | Art | Cartoon | Sketch | Mean |
| --- | --- | --- | --- | --- | --- |
| CLIP | 99.94 | 97.41 | 98.98 | 88.19 | 96.13 |
| CLIP-D | 99.94 | 97.61 | 99.02 | 90.03 | 96.65 |
| CoOp | 99.70 | 97.56 | 98.59 | 89.95 | 96.45 |
| CoCoOp | 99.94 | 98.09 | 99.19 | 90.77 | 97.00 |
| TPT | 99.82 | 97.68 | 98.92 | 92.58 | 97.25 |
| This paper | 99.94 | 98.86 | 99.32 | 94.53 | 98.16 ± 0.4 |

Table 13: Detailed comparisons on PACS with covariate shift.

| Method | VOC | LabelMe | Caltech | SUN | Mean |
| --- | --- | --- | --- | --- | --- |
| CLIP | 84.32 | 68.26 | 98.61 | 74.52 | 81.43 |
| CLIP-D | 82.60 | 68.76 | 98.76 | 72.68 | 80.70 |
| CoOp | 85.86 | 68.51 | 98.94 | 76.72 | 82.51 |
| CoCoOp | 86.03 | 70.45 | 99.12 | 77.96 | 83.39 |
| TPT | 86.20 | 71.05 | 99.46 | 80.60 | 84.33 |
| This paper | 88.14 | 72.65 | 100.00 | 85.37 | 86.54 ± 0.4 |

Table 14: Detailed comparisons on VLCS with covariate shift.

| Method | Art | Clipart | Product | Real | Mean |
| --- | --- | --- | --- | --- | --- |
| CLIP | 79.32 | 67.70 | 86.93 | 87.46 | 80.35 |
| CLIP-D | 80.47 | 68.83 | 87.93 | 88.80 | 81.51 |
| CoOp | 80.99 | 69.52 | 88.69 | 89.28 | 82.12 |
| CoCoOp | 81.78 | 70.09 | 89.32 | 89.89 | 82.77 |
| TPT | 82.45 | 71.18 | 90.03 | 90.15 | 83.45 |
| This paper | 83.70 | 73.00 | 92.50 | 91.44 | 85.16 ± 0.6 |

Table 15: Detailed comparisons on Office-Home.

| Method | Clipart | Painting | Real | Infograph | Quickdraw | Sketch | Mean |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CLIP | 68.12 | 56.18 | 78.82 | 46.36 | 14.32 | 60.69 | 54.08 |
| CLIP-D | 70.83 | 58.02 | 80.52 | 48.85 | 16.39 | 62.84 | 56.24 |
| CoOp | 74.39 | 61.18 | 83.26 | 51.88 | 16.67 | 65.52 | 58.82 |
| CoCoOp | 74.82 | 61.56 | 83.98 | 52.68 | 17.47 | 66.10 | 59.43 |
| TPT | 75.09 | 62.77 | 84.67 | 52.65 | 17.28 | 66.98 | 59.90 |
| This paper | 76.08 | 66.62 | 85.03 | 52.56 | 18.05 | 67.26 | 60.93 ± 0.4 |

Table 16: Detailed comparisons on DomainNet.

Benefits of the transformer inference network. We also conduct experiments on Open-Office-Home with different methods for aggregating the training and test information. As a first alternative, we generate the test prompt by directly averaging the training prompts, the test image feature, and the textual features. As a second alternative, we replace the transformer network with an MLP network that generates the test prompt from the averaged features. As shown in Table 11, the transformer inference network achieves the best performance, demonstrating the benefit of modeling the relationships between the different sources of information during aggregation.
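The averaging baseline can be pictured as an element-wise mean over all available tokens. The sketch below is a simplified illustration under stated assumptions: all vectors share one dimension, every token is weighted equally, and the feature values are made up. The name `average_tokens` is hypothetical.

```python
def average_tokens(tokens):
    """Element-wise mean of equally sized feature vectors."""
    if not tokens:
        raise ValueError("need at least one token")
    dim = len(tokens[0])
    if any(len(t) != dim for t in tokens):
        raise ValueError("all tokens must have the same dimension")
    return [sum(column) / len(tokens) for column in zip(*tokens)]


# Toy 3-dimensional features; real CLIP features are much larger.
training_prompt = [1.0, 0.0, 2.0]
image_feature = [0.0, 3.0, 2.0]
text_features = [[2.0, 0.0, 2.0], [1.0, 1.0, 2.0]]  # one vector per class name

test_prompt = average_tokens([training_prompt, image_feature, *text_features])
```

Averaging treats every token identically and ignores how the tokens relate to each other, which is the limitation the transformer inference network is meant to address.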

Comparison on cross-dataset shift. Following Zhou et al. [86], we conduct experiments in the cross-dataset setting, where the model trained on ImageNet is evaluated on the other 10 datasets shown in Table 12. In this case, the distribution shift differs from one test dataset to another. Compared with the other prompt learning methods, e.g., CoOp [87], CoCoOp [86], BPL [9], and MaPLe [29], and the test-time tuning method TPT [65], our method shows improvement on 8 of the 10 datasets, as well as on the averaged result.
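The per-dataset comparison can be reproduced from the numbers in Table 12. The sketch below restricts the comparison to the four prompt learning baselines (CoOp, CoCoOp, BPL, MaPLe) and leaves out the test-time tuning method TPT; that restriction is an assumption of this example, not something stated in the text. Values are the target-dataset accuracies copied from Table 12.

```python
TARGETS = ["Caltech101", "OxfordPets", "StanfordCars", "Flowers102", "Food101",
           "FGVCAircraft", "SUN397", "DTD", "EuroSAT", "UCF101"]

# Target-dataset accuracies from Table 12, in the order of TARGETS.
RESULTS = {
    "CoOp": [93.70, 89.14, 64.51, 68.71, 85.30, 18.47, 64.15, 41.92, 46.39, 66.55],
    "CoCoOp": [94.43, 90.14, 65.32, 71.88, 86.06, 22.94, 67.36, 45.73, 45.37, 68.21],
    "BPL": [93.67, 90.63, 65.00, 70.90, 86.30, 24.93, 67.47, 46.10, 45.87, 68.67],
    "MaPLe": [93.53, 90.49, 65.57, 72.23, 86.20, 24.74, 67.01, 46.49, 48.06, 68.69],
    "This paper": [94.57, 90.79, 66.90, 72.30, 86.17, 25.16, 67.32, 47.35, 50.25, 69.52],
}

ours = RESULTS["This paper"]
baselines = [scores for name, scores in RESULTS.items() if name != "This paper"]

# Datasets where the proposed method beats every listed baseline.
wins = [t for i, t in enumerate(TARGETS) if all(ours[i] > b[i] for b in baselines)]
# Datasets where at least one baseline is better.
losses = [t for t in TARGETS if t not in wins]

# Average over the 10 target datasets, as in the last column of Table 12.
mean_ours = round(sum(ours) / len(ours), 2)
```

With these baselines the method is best on 8 of the 10 target datasets, the exceptions being Food101 and SUN397, and the recomputed average matches the reported 67.03.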

Detailed results on covariate shift. We also report detailed comparisons for each test distribution on the four covariate shift datasets. The results on PACS, VLCS, Office-Home, and DomainNet are provided in Tables 13, 14, 15, and 16, respectively. Our method achieves the best performance on most of the test distributions.

Inference efficiency. Our method generates the test prompt and makes predictions in a single feedforward pass. Its inference time per iteration on a single V100 GPU (0.13s) is therefore only slightly higher than that of other prompt tuning methods such as CoOp (0.10s) and CoCoOp (0.11s), and lower than that of TPT (0.25s), which performs one optimization step at test time.
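The relative cost implied by these timings is simple arithmetic. The short sketch below uses only the four per-iteration times quoted above; the dictionary and variable names are illustrative.

```python
# Inference time per iteration in seconds, as reported above.
TIMES = {"CoOp": 0.10, "CoCoOp": 0.11, "TPT": 0.25, "Any-shift prompting": 0.13}

ours = TIMES["Any-shift prompting"]

# Ratio of our time to each other method's time (values above 1 mean slower).
relative = {
    name: round(ours / seconds, 2)
    for name, seconds in TIMES.items()
    if name != "Any-shift prompting"
}
```

The ratios are about 1.3 relative to CoOp, 1.18 relative to CoCoOp, and 0.52 relative to TPT, so any-shift prompting takes roughly half the time of TPT per iteration.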