Any-Shift Prompting for Generalization over Distributions

    Zehao Xiao¹        Jiayi Shen¹        Mohammad Mahdi Derakhshani¹
    Shengcai Liao²        Cees G. M. Snoek¹
   ¹University of Amsterdam    ²Core42

Abstract

Image-language models with prompt learning have shown remarkable advances in numerous downstream vision tasks. Nevertheless, conventional prompt learning methods overfit their training distribution and lose the generalization ability on test distributions. To improve generalization across various distribution shifts, we propose any-shift prompting: a general probabilistic inference framework that considers the relationship between training and test distributions during prompt learning. We explicitly connect training and test distributions in the latent space by constructing training and test prompts in a hierarchical architecture. Within this framework, the test prompt exploits the distribution relationships to guide the generalization of the CLIP image-language model from training to any test distribution. To effectively encode the distribution information and their relationships, we further introduce a transformer inference network with a pseudo-shift training mechanism. The network generates the tailored test prompt with both training and test information in a feedforward pass, avoiding extra training costs at test time. Extensive experiments on twenty-three datasets demonstrate the effectiveness of any-shift prompting on the generalization over various distribution shifts.

1 Introduction

Figure 1: Any-shift prompting. (a) Various distribution shifts in real-world applications. (b) We propose any-shift prompting that aggregates training and test information for jointly handling individual distribution shifts and their combinations.

Recent image-language foundation models like CLIP [55] show remarkable advances in various computer vision tasks. Benefiting from large image-text pairing datasets for pre-training, these models perform well when adapting to downstream tasks by manual prompts [40, 51, 59, 56] and prompt learning [87, 86]. However, it is difficult for conventional prompt learning approaches to handle distribution shifts in downstream tasks [65, 9]. The learned prompts usually overfit their training data, leading to performance degradation on unseen test distributions.

To improve generalization of prompt learning, recent methods introduce uncertainty into the learnable prompt [9] or fine-tune the prompt on each test sample with extra unsupervised optimizations [65, 61]. Nevertheless, these methods do not explicitly consider the relationships between training and test distributions of the downstream tasks. In real-world applications, distribution shifts are usually complex and unpredictable: models may encounter different distribution shifts (Figure 1 (a)), and even their combinations. Hence, we deem it crucial to explore the relationships between training and test distributions for the generalization of prompting across different distribution shifts. To this end, we make three contributions in this paper.

First, we propose any-shift prompting, a general probabilistic inference framework that can explore distribution relationships in prompt learning. Specifically, we introduce probabilistic training and test prompts in a hierarchical architecture to explicitly connect the training and test distributions. Within this framework, the test prompt encodes the test information and the relationships of the training and test distributions, thereby improving the generalization ability on various test distributions (Figure 1 (b)).

Second, we propose a pseudo-shift training mechanism, where the hierarchical probabilistic model learns the ability to encode distribution relationships by simulating distribution shifts. Consequently, at test time, our method generalizes to any specific distribution by generating a tailored prompt on the fly in just one feedforward process, without the need for re-learning or fine-tuning.

Third, to effectively and comprehensively encode the distribution information and their relationships, we design a transformer inference network for prompt generation. The transformer takes test information of both image and label space features, as well as the training prompts, as inputs. It then aggregates the training and test information and their relationships into the test-specific prompt. The test prompt is utilized to guide both the feature extraction and classification processes to generate test-specific features and classifiers, which bolsters robust predictions across distribution shifts.

We validate our method through extensive experiments on twenty-three benchmarks with various distribution shifts, including covariate shift, label shift, conditional shift, concept shift, and even joint shift. The results demonstrate the effectiveness of the proposed method on generalization across various distribution shifts.

2 Preliminary

We propose any-shift prompting based on CLIP [55] to handle various distribution shifts in a general way. Here we provide the technical background on CLIP as well as definitions of distribution shifts considered.

CLIP model. Contrastive Language-Image Pre-training (CLIP) [55] consists of an image encoder $f_{\Phi_{I}}(\mathbf{x})$ and a text encoder $f_{\Phi_{T}}(\mathbf{l})$ , which are trained by a contrastive loss on a large dataset of image-language ( $\mathbf{x},\mathbf{l}$ ) pairs. For a downstream classification task with an input image $\mathbf{x}$ and a set of class names $\mathcal{Y}{=}\{c_{i}\}_{i{=}1}^{C}$ , the image feature is extracted by $\mathbf{z}{=}f_{\Phi_{I}}(\mathbf{x})$ and the classifiers are composed of a set of text features $\{\mathbf{t}_{i}\}_{i{=}1}^{C}$ , where $\mathbf{t}_{i}{=}f_{\Phi_{T}}(\mathbf{l}_{i})$ . Here, $\mathbf{l}_{i}$ is a manually crafted prompt to describe the corresponding class name $c_{i}$ , e.g., ``an image of a [class].'' Thus, the prediction function of the CLIP model for downstream tasks without fine-tuning is formulated as:

p(\mathbf{y}|\mathbf{x},\mathcal{Y}){=}\mathrm{softmax}(\mathbf{z}^{\top}\mathbf{t}).

(1)

This enables the pre-trained CLIP model to perform zero-shot classification in various downstream tasks.
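As an illustration of eq. (1), the following minimal NumPy sketch computes class probabilities from one image feature and a set of text features. It is not the CLIP implementation: the random vectors stand in for encoder outputs, and the L2 normalization and the `scale` factor are assumptions that do not appear in eq. (1).

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def zero_shot_predict(z, t, scale=1.0):
    """Eq. (1): probabilities from image feature z (d,) and text features t (C, d).

    `scale` is a hypothetical temperature factor, not part of eq. (1).
    """
    # Assumption: features are L2-normalized so that z^T t is a cosine similarity.
    z = z / np.linalg.norm(z)
    t = t / np.linalg.norm(t, axis=-1, keepdims=True)
    return softmax(scale * (t @ z))

rng = np.random.default_rng(0)
t = rng.normal(size=(3, 8))            # stand-ins for f_{Phi_T}(l_i), one row per class
z = t[1] + 0.1 * rng.normal(size=8)    # stand-in for f_{Phi_I}(x), close to class 1
probs = zero_shot_predict(z, t, scale=10.0)
print(probs.argmax(), probs.sum())
```

Because the toy image feature is built as a small perturbation of the second text feature, the predicted class is index 1.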

Distribution shifts. A data distribution is generally denoted as $p(\mathbf{x},\mathbf{y})$ , which is a joint distribution of the input data $\mathbf{x}$ and the label $\mathbf{y}$ . The models are usually trained on a training distribution $p(\mathbf{x}_{s},\mathbf{y}_{s})$ and then deployed on test distributions $p(\mathbf{x}_{t},\mathbf{y}_{t})$ . In real-world applications, differences between the training and test distributions are known as the joint distribution shift:

p(\mathbf{x}_{s},\mathbf{y}_{s})\neq p(\mathbf{x}_{t},\mathbf{y}_{t}).

(2)

Joint distribution shift	$p(\mathbf{x}_{s},\mathbf{y}_{s})\neq p(\mathbf{x}_{t},\mathbf{y}_{t})$
Partial distribution shifts
Covariate shift	$p(\mathbf{x}_{s})\neq p(\mathbf{x}_{t})$	$p(\mathbf{y}_{s}|\mathbf{x}_{s})=p(\mathbf{y}_{t}|\mathbf{x}_{t})$
Label shift	$p(\mathbf{y}_{s})\neq p(\mathbf{y}_{t})$	$p(\mathbf{x}_{s}|\mathbf{y}_{s})=p(\mathbf{x}_{t}|\mathbf{y}_{t})$
Concept shift	$p(\mathbf{x}_{s})=p(\mathbf{x}_{t})$	$p(\mathbf{y}_{s}|\mathbf{x}_{s})\neq p(\mathbf{y}_{t}|\mathbf{x}_{t})$
Conditional shift	$p(\mathbf{y}_{s})=p(\mathbf{y}_{t})$	$p(\mathbf{x}_{s}|\mathbf{y}_{s})\neq p(\mathbf{x}_{t}|\mathbf{y}_{t})$

Table 1: Common distribution shifts. The joint distribution shift is usually decomposed into four partial shifts, which are investigated individually in the literature. By contrast, we focus in this paper on various shifts and even consider their combinations.

Common distribution shifts in the literature. Due to the joint distribution shift, the performance of the trained model degrades on the test data [71, 39], sometimes significantly so. Since the joint distribution shift is complex, previous methods limit the scope of the problem and simplify the joint distribution shift to different partial distribution shifts. From a Bayesian perspective, the joint distribution is decomposed into $p(\mathbf{x},\mathbf{y}){=}p(\mathbf{x})p(\mathbf{y}|\mathbf{x}){=}p(\mathbf{y})p(\mathbf{x}|\mathbf{y})$ . According to the different components in the decomposition, we summarize the partial distribution shifts into four different definitions in Table 1 and detail them one by one.

Covariate shift [68, 34, 63] assumes the distribution shifts occur only in the input space $p(\mathbf{x})$ while the label distributions given the input features $p(\mathbf{y}|\mathbf{x})$ remain the same, e.g., under image corruptions [24] or changing image styles [34, 54]. Covariate shift is widely investigated by domain generalization [85, 34, 77] and domain adaptation methods [71, 39]. Label shift focuses on the opposite problem, where the label distributions $p(\mathbf{y})$ are different, but the label-conditional distributions $p(\mathbf{x}|\mathbf{y})$ are the same [69, 58]. Previous methods generate datasets with a uniform distribution $p(\mathbf{y})$ during training and different distributions at test time [22, 2, 74]. The classification of unknown classes can be treated as a specific and more severe case of label shift [41, 64, 86], where $p(\mathbf{y}){=}0$ for the unknown classes. Concept shift assumes the input distribution $p(\mathbf{x})$ stays the same while the conditional distributions $p(\mathbf{y}|\mathbf{x})$ are different, indicating different annotation methods for the same data distribution [42]. Conditional shift assumes the label distribution is the same while the conditional distributions $p(\mathbf{x}|\mathbf{y})$ are different [41, 81, 18], where different classes can have their own shift protocols on the input data, e.g., sub-population problems [62, 32].
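To make the four definitions in Table 1 concrete, the sketch below checks them on small discrete joint distributions stored as tables `p[x, y]`. The function name `partial_shifts`, the tolerance, and the toy numbers are illustrative assumptions and are not taken from the paper; the check also assumes every marginal is strictly positive.

```python
import numpy as np

def partial_shifts(p_s, p_t, tol=1e-9):
    """Report which conditions of Table 1 hold for two joint tables p[x, y]."""
    p_s, p_t = np.asarray(p_s, float), np.asarray(p_t, float)
    px_s, px_t = p_s.sum(axis=1), p_t.sum(axis=1)   # marginals p(x)
    py_s, py_t = p_s.sum(axis=0), p_t.sum(axis=0)   # marginals p(y)
    same_px = np.allclose(px_s, px_t, atol=tol)
    same_py = np.allclose(py_s, py_t, atol=tol)
    # Conditionals p(y|x) and p(x|y), obtained by dividing by the marginals.
    same_y_given_x = np.allclose(p_s / px_s[:, None], p_t / px_t[:, None], atol=tol)
    same_x_given_y = np.allclose(p_s / py_s[None, :], p_t / py_t[None, :], atol=tol)
    return {
        "joint": not np.allclose(p_s, p_t, atol=tol),
        "covariate": (not same_px) and same_y_given_x,
        "label": (not same_py) and same_x_given_y,
        "concept": same_px and (not same_y_given_x),
        "conditional": same_py and (not same_x_given_y),
    }

# Toy example: p(y|x) is shared, only the input marginal p(x) changes.
y_given_x = np.array([[0.9, 0.1], [0.2, 0.8]])
p_s = np.array([0.5, 0.5])[:, None] * y_given_x
p_t = np.array([0.8, 0.2])[:, None] * y_given_x
result = partial_shifts(p_s, p_t)
print(result)
```

In this toy case only the joint and covariate conditions hold: changing $p(\mathbf{x})$ also changes $p(\mathbf{y})$ and $p(\mathbf{x}|\mathbf{y})$, so neither the label-shift nor the conditional-shift definition is satisfied.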

Distribution shifts in this paper. Conventional prompting methods [87, 86] learn the prompt on the training distribution of the downstream task, which makes the prompt prone to overfitting and vulnerable to the above shifts [9, 65]. Moreover, in real-world scenarios, all distribution shifts may happen unpredictably, and even simultaneously. Hence, we propose to encode test information and the training-test relationships for generalization over distributions. Our method is not designed for specific partial distribution shifts. Instead, it is proposed to handle various shifts, even when they occur simultaneously.

3 Any-Shift Prompting

3.1 Prompt modeling

We propose any-shift prompting, a general probabilistic inference framework to explore distribution relationships. Specifically, we introduce training and test prompts as latent variables in a hierarchical architecture. The graphical model of our method is provided in Figure 2.

Training prompt. The intuitive idea of adapting the CLIP model is to inject the downstream training data $\mathcal{D}_{s}$ into a training prompt for prediction (eq. (1)). $\mathcal{D}_{s}$ consists of training input-output pairs sampled from the distribution $p(\mathbf{x}_{s},\mathbf{y}_{s})$ . The predictive function of CLIP for the test distribution $p(\mathbf{x}_{t},\mathbf{y}_{t})$ is then formulated as:

p_{\Phi}(\mathbf{y}_{t}|\mathbf{x}_{t},\mathcal{Y}_{t},\mathcal{D}_{s})\propto p_{\Phi}(\mathbf{y}_{t}|\mathbf{x}_{t},\mathbf{v}_{s},\mathcal{Y}_{t})p(\mathbf{v}_{s}|\mathcal{D}_{s}),

(3)

where $\Phi$ denotes the frozen parameters of the image and text encoders of the CLIP model. Here $\mathbf{v}_{s}$ is the training prompt that encodes the training downstream task information, which improves the performance of the CLIP model on the training distribution. However, the prompt $\mathbf{v}_{s}$ usually overfits the training data, which may not benefit and even harm the prediction on the unseen test distribution due to the distribution shifts at test time.

Probabilistic test prompt.

To generalize across distribution shifts in downstream tasks at test time, we further introduce a probabilistic test prompt within a hierarchical Bayes framework to encode the information of test distributions. Specifically, the test prompt $\mathbf{v}_{t}$ is inferred from the training prompt $\mathbf{v}_{s}$ and the accessible test information, i.e., a test image $\mathbf{x}_{t}$ and the class names $\mathcal{Y}_{t}$ . To build the connections between the training and test prompts, we take the training prompt $\mathbf{v}_{s}$ as a condition for the generation of the test prompt. This enables the method to generate the test prompt across different shifts by considering the relationships between training and test distributions and exploring relevant training information. By introducing $\mathbf{v}_{t}$ , the CLIP prediction function is formulated as:

p_{\Phi,\bm{\theta}}(\mathbf{y}_{t}|\mathbf{x}_{t},\mathcal{Y}_{t},\mathcal{D}_{s})=\int\int p_{\Phi}(\mathbf{y}_{t}|\mathbf{x}_{t},\mathbf{v}_{t},\mathcal{Y}_{t})p_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}_{s},\mathbf{x}_{t},\mathcal{Y}_{t})p(\mathbf{v}_{s}|\mathcal{D}_{s})d\mathbf{v}_{t}d\mathbf{v}_{s},

(4)

where $\bm{\theta}$ denotes the learnable inference network for the test prompt. With the probabilistic test prompt, we provide a general way to incorporate the training and test information, as well as their relationships, into the prediction of the CLIP model, enabling it to generalize on any test distribution.

Variational test prompt. To optimize the model for generating the probabilistic test prompt in eq. (4), we use variational inference to approximate the true posterior $p(\mathbf{v}_{t},\mathbf{v}_{s}|\mathcal{D}_{t},\mathcal{Y}_{t},\mathcal{D}_{s})$ with a variational posterior that is factorized as:

q_{\bm{\theta}}(\mathbf{v}_{t},\mathbf{v}_{s}|\mathcal{D}_{t},\mathcal{Y}_{t},\mathcal{D}_{s}){=}q_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}_{s},\mathcal{D}_{t},\mathcal{Y}_{t})p(\mathbf{v}_{s}|\mathcal{D}_{s}),

(5)

where $\mathcal{D}_{t}$ consists of test input-output pairs sampled from the test distribution $p(\mathbf{x}_{t},\mathbf{y}_{t})$ . The variational posterior of the test prompt shares the same inference model $\bm{\theta}$ with its prior. By integrating eq. (5) into eq. (4), we derive the evidence lower bound (ELBO) of the predictive function as:

\log p_{\Phi,\bm{\theta}}(\mathbf{y}_{t}|\mathbf{x}_{t},\mathcal{Y}_{t},\mathcal{D}_{s})\geq\mathbb{E}_{q_{\bm{\theta}}(\mathbf{v}_{t},\mathbf{v}_{s})}\big[\log p_{\Phi}(\mathbf{y}_{t}|\mathbf{x}_{t},\mathbf{v}_{t},\mathcal{Y}_{t})\big]-\mathbb{D}_{\mathrm{KL}}\big[q_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}_{s},\mathcal{D}_{t},\mathcal{Y}_{t})||p_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}_{s},\mathbf{x}_{t},\mathcal{Y}_{t})\big].

(6)

The variational posterior of the test prompt $q_{\bm{\theta}}(\mathbf{v}_{t})$ encodes more input-output information of the test distribution and their relationships, yielding a more representative test prompt for better generalization on the test distributions. We provide the step-by-step derivations in the supplemental material.
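For readers who want the main step without consulting the supplemental material, the bound in eq. (6) can be sketched as follows. This is a standard Jensen-inequality argument written under the factorization in eq. (5), and it is not necessarily the exact derivation of the supplement. One assumption of this sketch: the KL term is written inside an explicit expectation over $\mathbf{v}_{s}$, which coincides with the form of eq. (6) when a single sample of $\mathbf{v}_{s}$ is used.

```latex
\begin{aligned}
\log p_{\Phi,\bm{\theta}}(\mathbf{y}_{t}|\mathbf{x}_{t},\mathcal{Y}_{t},\mathcal{D}_{s})
&= \log \int\int p_{\Phi}(\mathbf{y}_{t}|\mathbf{x}_{t},\mathbf{v}_{t},\mathcal{Y}_{t})\,
   p_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}_{s},\mathbf{x}_{t},\mathcal{Y}_{t})\,
   p(\mathbf{v}_{s}|\mathcal{D}_{s})\, d\mathbf{v}_{t}\, d\mathbf{v}_{s} \\
&= \log \mathbb{E}_{q_{\bm{\theta}}(\mathbf{v}_{t},\mathbf{v}_{s})}
   \left[ p_{\Phi}(\mathbf{y}_{t}|\mathbf{x}_{t},\mathbf{v}_{t},\mathcal{Y}_{t})\,
   \frac{p_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}_{s},\mathbf{x}_{t},\mathcal{Y}_{t})}
        {q_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}_{s},\mathcal{D}_{t},\mathcal{Y}_{t})} \right] \\
&\geq \mathbb{E}_{q_{\bm{\theta}}(\mathbf{v}_{t},\mathbf{v}_{s})}
   \big[ \log p_{\Phi}(\mathbf{y}_{t}|\mathbf{x}_{t},\mathbf{v}_{t},\mathcal{Y}_{t}) \big]
   - \mathbb{E}_{p(\mathbf{v}_{s}|\mathcal{D}_{s})}
   \Big[ \mathbb{D}_{\mathrm{KL}}\big[ q_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}_{s},\mathcal{D}_{t},\mathcal{Y}_{t})
   \,||\, p_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}_{s},\mathbf{x}_{t},\mathcal{Y}_{t}) \big] \Big].
\end{aligned}
```

The second line multiplies and divides by the variational posterior of eq. (5); the factor $p(\mathbf{v}_{s}|\mathcal{D}_{s})$ appears in both the numerator and the denominator and cancels. The third line applies Jensen's inequality to the concave logarithm.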

Notably, the variational posteriors and ELBO are intractable since large numbers of test samples and their ground truth labels in $\mathcal{D}_{t}$ are usually unavailable at test time. Thus, in the next section, we propose a pseudo-shift training setup to approximate the ELBO for any-shift prompting.

3.2 Training and inference

Pseudo-shift training mechanism.

To approximate the intractable ELBO in eq. (6), we develop a pseudo-shift training mechanism. Specifically, the mini-batch data in the current iteration is treated as the pseudo-test data $\mathcal{D}_{t^{\prime}}$ from the pseudo-test distribution $p(\mathbf{x}_{t^{\prime}},\mathbf{y}_{t^{\prime}})$ . Likewise, the mini-batches in previous iterations are treated as the pseudo-training data $\mathcal{D}_{s^{\prime}}$ from the pseudo-training distribution $p(\mathbf{x}_{s^{\prime}},\mathbf{y}_{s^{\prime}})$ . In this case, the ground truth labels of the pseudo-test data are available during training. We then approximate the ELBO and obtain the optimization function for any-shift prompting as:

\mathcal{L}{=}-\mathbb{E}_{q_{\bm{\theta}}(\mathbf{v}_{t^{\prime}},\mathbf{v}_{s^{\prime}})}\big[\log p_{\Phi}(\mathbf{y}_{t^{\prime}}|\mathbf{x}_{t^{\prime}},\mathbf{v}_{t^{\prime}},\mathcal{Y}_{t^{\prime}})\big]+\mathbb{D}_{\mathrm{KL}}\big[q_{\bm{\theta}}(\mathbf{v}_{t^{\prime}}|\mathbf{v}_{s^{\prime}},\mathcal{D}_{t^{\prime}},\mathcal{Y}_{t^{\prime}})||p_{\bm{\theta}}(\mathbf{v}_{t^{\prime}}|\mathbf{v}_{s^{\prime}},\mathbf{x}_{t^{\prime}},\mathcal{Y}_{t^{\prime}})\big],

(7)

where $\mathbf{v}_{t^{\prime}}$ and $\mathbf{v}_{s^{\prime}}$ denote the pseudo-test and pseudo-training prompts, respectively. In practice, we assume the prompts follow the standard Gaussian distributions. The negative log-likelihood in eq. (7) is implemented by a cross-entropy loss. The mini-batch training mechanism mimics distribution shifts during training, so any-shift prompting learns to handle them without the model ever accessing any test data. Minimizing the KL term encourages the prior to implicitly learn more comprehensive pseudo-test information from the variational posterior, which aggregates more data information together with the ground truth labels.
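The two terms of eq. (7) can be sketched in NumPy as a cross-entropy term plus a closed-form KL divergence between diagonal Gaussians. This is a sketch under stated assumptions, not the authors' code: `sigma` is treated as a standard deviation, `kl_weight` is a hypothetical balancing factor that does not appear in eq. (7), and the logits stand in for the CLIP outputs conditioned on a sampled pseudo-test prompt.

```python
import numpy as np

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL[N(mu_q, diag(sigma_q^2)) || N(mu_p, diag(sigma_p^2))], sigma = std."""
    var_q, var_p = sigma_q ** 2, sigma_p ** 2
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

def cross_entropy(logits, labels):
    """Mean negative log-likelihood of integer labels under softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def any_shift_loss(logits, labels, mu_q, sigma_q, mu_p, sigma_p, kl_weight=1.0):
    """Eq. (7): NLL of pseudo-test labels plus KL(posterior || prior).

    `kl_weight` is a hypothetical knob for illustration only.
    """
    nll = cross_entropy(logits, labels)
    kl = gaussian_kl(mu_q, sigma_q, mu_p, sigma_p)
    return nll + kl_weight * kl

# Toy check: uniform logits over 4 classes and unit-variance Gaussians
# whose means differ by 1 in each of 3 dimensions.
logits = np.zeros((2, 4))
labels = np.array([0, 3])
loss = any_shift_loss(logits, labels,
                      mu_q=np.zeros(3), sigma_q=np.ones(3),
                      mu_p=np.ones(3), sigma_p=np.ones(3))
print(loss)
```

In the toy check the cross-entropy equals $\log 4$ and the KL term equals $1.5$, so the printed loss is their sum.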

Transformer inference network. The pseudo-test prompt in eq. (7) is inferred from: the pseudo-training information in $\mathbf{v}_{s^{\prime}}$ , the pseudo-test image $\mathbf{x}_{t^{\prime}}$ , and the class names $\mathcal{Y}_{t^{\prime}}$ . To better aggregate the different information sources and consider their relationships, we introduce a transformer inference network to generate the pseudo-test prompt.

In our model, the prior $p_{\bm{\theta}}(\mathbf{v}_{t^{\prime}}|\mathbf{v}_{s^{\prime}},\mathbf{x}_{t^{\prime}},\mathcal{Y}_{t^{\prime}})$ and variational posterior $q_{\bm{\theta}}(\mathbf{v}_{t^{\prime}}|\mathbf{v}_{s^{\prime}},\mathcal{D}_{t^{\prime}},\mathcal{Y}_{t^{\prime}})$ of the pseudo-test prompt share the same inference network to encode the different conditions. Compared with the prior, the variational posterior has access to one batch of pseudo-test images with the corresponding ground-truth labels. Figure 3 illustrates the deployment of the shared transformer inference network. In the following, we provide the detailed inference of the prior and variational posterior.

As shown in Figure 3 (a), the prior of the pseudo-test prompt is generated from the pseudo-training prompt $\mathbf{v}_{s^{\prime}}$ , the pseudo-test image $\mathbf{x}_{t^{\prime}}$ , and class names $\mathcal{Y}_{t^{\prime}}$ . Specifically, we sample a pseudo-training prompt $\mathbf{v}_{s^{\prime}}^{(j)}$ from a Gaussian distribution $\mathcal{N}(\mathbf{v}_{s^{\prime}};\mu_{s^{\prime}},\sigma_{s^{\prime}})$ by the reparameterization trick [30]. The mean $\mu_{s^{\prime}}$ and variance $\sigma_{s^{\prime}}$ are two sets of parameters trained with the pseudo-training data $\mathcal{D}_{s^{\prime}}$ in the previous iterations. The pseudo-test image is fed into the fixed CLIP image encoder to get the image feature $f_{\Phi_{I}}(\mathbf{x}_{t^{\prime}})$ . The class names of the pseudo-test distribution are processed by the fixed text encoder to extract the textual features $f_{\Phi_{T}}(\mathcal{Y}_{t^{\prime}})$ . After the pre-processing, we take the sampled pseudo-training prompt, pseudo-test image feature, and textual features as input tokens of our transformer inference network to generate the prior of the pseudo-test prompt:

[\widetilde{\mathbf{v}}_{t^{\prime}}^{p};\cdot;\cdot]=\texttt{Trans}([\mathbf{v}^{(j)}_{s^{\prime}};f_{\Phi_{I}}(\mathbf{x}_{t^{\prime}});f_{\Phi_{T}}(\mathcal{Y}_{t^{\prime}})]),

(8)

\mu_{t^{\prime}}^{p}=\texttt{MLP}_{\mu}(\widetilde{\mathbf{v}}_{t^{\prime}}^{p}),\quad\sigma_{t^{\prime}}^{p}=\texttt{MLP}_{\sigma}(\widetilde{\mathbf{v}}_{t^{\prime}}^{p}),

(9)

p_{\bm{\theta}}(\mathbf{v}_{t^{\prime}}|\mathbf{v}_{s^{\prime}},\mathbf{x}_{t^{\prime}},\mathcal{Y}_{t^{\prime}})=\mathcal{N}(\mathbf{v}_{t^{\prime}};\mu_{t^{\prime}}^{p},\sigma_{t^{\prime}}^{p}).

(10)

The prior of the pseudo-test prompt follows the Gaussian distribution in eq. (10), whose mean and variance are obtained by two MLP networks on the output of the transformer $\widetilde{\mathbf{v}}_{t^{\prime}}^{p}$ .

In Figure 3 (b), with the pseudo-test data $\mathcal{D}_{t^{\prime}}$ , the variational posterior learns more distribution information as well as the relations between inputs and outputs. For clarity, we rewrite the variational posterior $q_{\bm{\theta}}(\mathbf{v}_{t^{\prime}}|\mathbf{v}_{s^{\prime}},\mathcal{D}_{t^{\prime}},\mathcal{Y}_{t^{\prime}})$ as $q_{\bm{\theta}}(\mathbf{v}_{t^{\prime}}|\mathbf{v}_{s^{\prime}},X_{t^{\prime}},Y_{t^{\prime}})$ , where $X_{t^{\prime}}$ contains a batch of pseudo-test images in $\mathcal{D}_{t^{\prime}}$ and $Y_{t^{\prime}}$ consists of the ground truth class names of $X_{t^{\prime}}$ in $\mathcal{Y}_{t^{\prime}}$ . Hence, the shared transformer takes all image features and their corresponding label features as input tokens to infer the variational posterior:

[\widetilde{\mathbf{v}}_{t^{\prime}}^{q};\cdot;\cdot]=\texttt{Trans}([\mathbf{v}^{(j)}_{s^{\prime}};f_{\Phi_{I}}(X_{t^{\prime}});f_{\Phi_{T}}(Y_{t^{\prime}})]),

(11)

\mu_{t^{\prime}}^{q}=\texttt{MLP}_{\mu}(\widetilde{\mathbf{v}}_{t^{\prime}}^{q}),\quad\sigma_{t^{\prime}}^{q}=\texttt{MLP}_{\sigma}(\widetilde{\mathbf{v}}_{t^{\prime}}^{q}),

(12)

q_{\bm{\theta}}(\mathbf{v}_{t^{\prime}}|\mathbf{v}_{s^{\prime}},\mathcal{D}_{t^{\prime}},\mathcal{Y}_{t^{\prime}})=\mathcal{N}(\mathbf{v}_{t^{\prime}};\mu_{t^{\prime}}^{q},\sigma_{t^{\prime}}^{q}).

(13)

With the inferred pseudo-test prompt, we take its samples from the variational posterior as the input tokens for both image and text encoders of CLIP to make predictions during training. Thus, although the encoders are fixed, the image and textual features are generalized by utilizing the distribution information in the prompts during the feature extraction and classification procedure, enabling the method to handle different distribution shifts.
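The shared inference of the prior (eqs. (8)-(10)) and the variational posterior (eqs. (11)-(13)) can be sketched with a toy network. Everything below is a simplified stand-in rather than the model described in this paper: a single single-head self-attention layer replaces the transformer, linear maps replace the two MLPs, a softplus is assumed to keep $\sigma$ positive, position embeddings are omitted, $\sigma$ is treated as a standard deviation, and random vectors replace the frozen CLIP features. The class `PromptInference` and the width `D` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # hypothetical feature width

def self_attention(tokens, w_q, w_k, w_v):
    """One single-head self-attention layer over tokens of shape (n, D)."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(tokens.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)
    return tokens + attn @ v  # residual connection

class PromptInference:
    """Toy stand-in for the shared inference network theta."""

    def __init__(self, dim, rng):
        init = lambda: rng.normal(scale=dim ** -0.5, size=(dim, dim))
        self.w_q, self.w_k, self.w_v = init(), init(), init()
        self.w_mu, self.w_sigma = init(), init()

    def __call__(self, v_s, image_feats, text_feats):
        # Token sequence [v_s; image features; text features], as in eqs. (8) and (11).
        tokens = np.vstack([v_s[None, :], image_feats, text_feats])
        v_tilde = self_attention(tokens, self.w_q, self.w_k, self.w_v)[0]
        mu = v_tilde @ self.w_mu
        sigma = np.log1p(np.exp(v_tilde @ self.w_sigma))  # softplus, sigma > 0
        return mu, sigma

def reparameterize(mu, sigma, rng):
    """Sample mu + sigma * eps with eps ~ N(0, I)."""
    return mu + sigma * rng.normal(size=mu.shape)

net = PromptInference(D, rng)
v_s = reparameterize(np.zeros(D), np.ones(D), rng)   # sampled pseudo-training prompt
class_feats = rng.normal(size=(5, D))                # stand-in text features, 5 classes
batch_imgs = rng.normal(size=(4, D))                 # stand-in image features
batch_labels = np.array([0, 2, 2, 4])

# Prior: one pseudo-test image plus all class-name features.
mu_p, sigma_p = net(v_s, batch_imgs[:1], class_feats)
# Posterior: the whole batch plus the features of its ground-truth class names.
mu_q, sigma_q = net(v_s, batch_imgs, class_feats[batch_labels])
v_t = reparameterize(mu_q, sigma_q, rng)             # pseudo-test prompt sample
```

The only difference between the two calls is the set of input tokens, which mirrors how the prior and posterior share one network while conditioning on different amounts of pseudo-test information.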

Prediction. At test time, we make predictions on each test image $\mathbf{x}_{t}$ with the test prompt generated by the transformer inference network. Since the test data and labels in $\mathcal{D}_{t}$ are unavailable, the variational posterior becomes intractable. Thus, we sample the test prompt $\mathbf{v}_{t}^{(i)}$ from the prior distribution $p_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}_{s}^{(j)},\mathbf{x}_{t},\mathcal{Y}_{t})$ , where $\mathbf{v}_{s}^{(j)}$ is a sample of the training prompt following $p(\mathbf{v}_{s}|\mathcal{D}_{s})$ . $\mathbf{v}_{t}^{(i)}$ is then introduced into both the image and text encoders of the CLIP model for generalization and prediction as:

p_{\Phi}(\mathbf{y}_{t}|\mathbf{x}_{t},\mathcal{Y}_{t},\mathcal{D}_{s})=\frac{1}{N_{t}}\frac{1}{N_{s}}\sum_{i=1}^{N_{t}}\sum_{j=1}^{N_{s}}p_{\Phi}(\mathbf{y}_{t}|\mathbf{x}_{t},\mathbf{v}_{t}^{(i)},\mathcal{Y}_{t}),\quad\mathbf{v}_{t}^{(i)}\sim p_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}^{(j)}_{s},\mathbf{x}_{t},\mathcal{Y}_{t}),\quad\mathbf{v}_{s}^{(j)}\sim p(\mathbf{v}_{s}|\mathcal{D}_{s}).

(14)

Although the test data and their labels are not available at test time, the information in each test sample and all class names in the vocabulary of the test task are available to infer the prior of the test prompt. The ability to encode test information from a single test image and the class vocabulary is learned during training by minimizing the KL divergence between the prior and posterior. Note the CLIP image encoder and text encoder are always frozen. Only the test prompt changes for different test distributions by aggregating the training and test information in each test sample $\mathbf{x}_{t}$ and the class names $\mathcal{Y}_{t}$ . In this case, we utilize the original generalization ability of CLIP to generate the test prompt for generalization on downstream tasks across various distribution shifts.
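The Monte Carlo average in eq. (14) can be sketched as a nested sampling loop. The helper names `prior_fn` and `classify_fn` are hypothetical placeholders for the transformer inference network and the prompted CLIP model; the toy versions used here are arbitrary functions chosen only so that the example runs, and $\sigma$ is again treated as a standard deviation.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def predict(x_feat, class_feats, mu_s, sigma_s, prior_fn, classify_fn,
            n_s=2, n_t=3, rng=None):
    """Eq. (14): average predictions over sampled training and test prompts."""
    if rng is None:
        rng = np.random.default_rng()
    total = np.zeros(len(class_feats))
    for _ in range(n_s):
        # v_s^(j) ~ p(v_s | D_s), drawn with the reparameterization trick
        v_s = mu_s + sigma_s * rng.normal(size=mu_s.shape)
        mu_t, sigma_t = prior_fn(v_s, x_feat, class_feats)
        for _ in range(n_t):
            # v_t^(i) ~ p_theta(v_t | v_s^(j), x_t, Y_t)
            v_t = mu_t + sigma_t * rng.normal(size=mu_t.shape)
            total += classify_fn(x_feat, v_t, class_feats)
    return total / (n_s * n_t)

# Hypothetical stand-ins so the sketch runs without CLIP.
def toy_prior(v_s, x_feat, class_feats):
    return 0.5 * (v_s + x_feat), 0.1 * np.ones_like(v_s)

def toy_classifier(x_feat, v_t, class_feats):
    return softmax(class_feats @ (x_feat + v_t))

rng = np.random.default_rng(0)
dim, n_classes = 8, 4
probs = predict(rng.normal(size=dim), rng.normal(size=(n_classes, dim)),
                np.zeros(dim), np.ones(dim), toy_prior, toy_classifier, rng=rng)
print(probs)
```

Since each sampled prediction is a probability vector, their average is one as well, and no gradient step is taken anywhere in the loop.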

4 Related Work

Prompt learning. Image-language foundation models such as CLIP [55] and ALIGN [28] achieve significant advances in various downstream tasks. To adapt the foundation models to downstream tasks, adapter [16] and prompt learning methods [33, 37, 87] are proposed. Zhou et al. [87] propose a learnable prompt as the input of the language model in CLIP. To avoid forgetting the original knowledge in the CLIP model, Zhu et al. [88] and Yao et al. [79] guide prompt learning with hand-crafted prompts. Instead of generating prompts for the language model, Bahng et al. [3] introduce prompting of the image model. Khattak et al. [29] learn a joint prompt for both image and language encoders. Zhou et al. [86] introduce the imaging conditions into the language prompt to enhance the generalization ability of zero-shot performance. To further improve the generalization ability, Derakhshani et al. [9] propose Bayesian prompt learning, which considers the uncertainty in the learned prompts for zero-shot generalization. Shu et al. [65] and Hassan et al. [61] fine-tune the prompt at test time to a specific distribution. We also improve the generalization of prompt learning. Different from previous methods that consider uncertainty or fine-tune the prompt for specific distributions, we propose any-shift prompting that explicitly explores distribution information and relationships within a hierarchical probabilistic framework. The method generates the test-specific prompt on the fly for any test distribution.

Distribution shift generalization. Domain generalization [47, 36, 85, 21] and domain adaptation [44, 73, 38, 80] are the most widely investigated methods for handling distribution shifts. Some domain generalization methods train invariant models on the training distributions [1, 77, 46], which are assumed to remain invariant on the test distributions. To further improve the generalization ability, some methods [35, 11, 4] introduce meta-learning in domain generalization to mimic domain shifts during training. In this paper, we also simulate the distribution shift by a pseudo-shift training mechanism, which uses different mini-batches as distributions. To better utilize the test information for generalization without accessing the test data during training, Sun et al. [68] and Wang et al. [71] propose test-time adaptation, which fine-tunes the trained model on test data with self-supervised losses. This approach has been adopted by many subsequent methods [82, 49, 19, 50, 43] due to its good generalization ability on covariate shift. In addition, test-time adaptation is also investigated with other techniques such as normalization statistics re-estimation [63, 39] and classifier adjustment [27, 78, 84]. Most of these methods focus on covariate shift [71, 78, 19, 12], such as changes of image styles [34, 54] and corruptions [24]. Some other methods work on the conditional shift [41, 81, 18, 17] or label shift [81, 69, 52, 17]. We also utilize the test information for generalization, but without any test-time optimization. Different from the previous methods, we explicitly bridge the training and test information and explore their relationships to address various distribution shifts in a general way.

5 Experiments

Method	PACS	VLCS	Office-Home	DomainNet	ImageNet-V2	ImageNet-S	ImageNet-A	ImageNet-R
Prompting without test-time optimization
CLIP [55]	96.13	81.43	80.35	54.08	60.83	46.15	47.77	73.96
CLIP-D [55]	96.65	80.70	81.51	56.24	-	-	-	-
CoOp [87]	96.45	82.51	82.12	58.82	64.20	47.99	49.71	75.21
CoCoOp [86]	97.00	83.89	82.77	59.43	64.07	48.75	50.63	76.18
DPL [83]	97.07	83.99	83.00	59.86	-	-	-	-
BPL [9]	-	-	-	-	64.23	49.20	51.33	77.00
This paper	98.16 $\pm$ 0.4	86.54 $\pm$ 0.4	85.16 $\pm$ 0.6	60.93 $\pm$ 0.6	64.53 $\pm$ 0.2	49.80 $\pm$ 0.5	51.52 $\pm$ 0.6	77.56 $\pm$ 0.4
Prompting with test-time optimization
TPT [65]	97.25	84.33	83.45	59.90	63.45	47.94	54.77	77.06
CoOp + TPT [65]	97.85	85.06	84.32	60.65	66.83	49.29	57.95	77.27
CoCoOp + TPT [65]	97.95	85.55	84.54	60.44	64.85	48.47	58.47	78.65
This paper + TPT	98.47 $\pm$ 0.4	86.98 $\pm$ 0.4	86.00 $\pm$ 0.8	61.75 $\pm$ 0.8	67.08 $\pm$ 0.6	50.83 $\pm$ 0.6	58.05 $\pm$ 0.5	79.23 $\pm$ 0.5

Table 2: Covariate shift comparison. The experiments are conducted on eight domain generalization datasets, with average classification accuracy reported. Any-shift prompting achieves the best results compared with the original CLIP and other prompt learning methods, which demonstrates the generalization ability of our method on covariate shift. When combined with TPT's test-time optimization, prompting methods in general, as well as our method, improve further.

Twenty-three datasets.

To demonstrate the generalization ability of any-shift prompting, we evaluate the method on datasets with different distribution shifts. For covariate shift, we conduct experiments on the common domain generalization datasets, PACS [34], Office-Home [70], VLCS [14], and DomainNet [54], which contain images from different domains such as image styles. We also evaluate the model on covariate shifts of ImageNet [8] following Zhou et al. [86], where the model is trained on ImageNet with 16-shot images and evaluated on other variants ImageNet-V2 [57], ImageNet-(S)ketch [72], ImageNet-A [26], and ImageNet-R [25]. For label shift, we follow the base-to-new class generalization from Zhou et al. [87], with 11 datasets that cover various tasks: ImageNet [8], Caltech101 [15], OxfordPets [53], StanfordCars [31], Flowers102 [48], Food101 [5], FGVCAircraft [45], SUN397 [76], DTD [7], EuroSAT [23], and UCF101 [67]. For concept shift, we build an ImageNet-Superclass dataset, where we evaluate the ImageNet-trained model on super-classes in [62]. For conditional shift, we evaluate on the sub-population datasets Living-17 and Entity-30 [62], where the training and test distributions consist of the same classes with different subpopulations. To evaluate our method on the combination of different distribution shifts, we follow the open-domain generalization setting [66] on the Office-Home dataset, which contains four domains, Art, Clipart, Product, and Real-world. We refer to it as Open-Office-Home, which combines covariate shift and label shift. The detailed settings are provided in the supplemental materials.

Implementation details. Our model consists of the pretrained image and language encoders of CLIP [55], and the proposed transformer inference network to generate the test prompt. We use the ViT-B/16 [10] as the image encoder following [86, 9]. The pretrained image and language encoders of CLIP are frozen during training and inference. To generate the prior and variational posterior of the prompt, we use a 2-layer transformer in the inference network. As shown in Figure 3, the inputs of the transformer include the training prompt, the image features, and the class-name features. The distribution of the training prompts consists of two trainable vectors as the mean and variance respectively. The class-name tokens are generated by the hand-crafted tokens ``an image of a [class]''. The transformer also contains two kinds of trainable position embeddings to indicate the image and language tokens. The introduced prompts are sampled from the corresponding distributions by the reparameterization trick [30]. More detailed implementations and hyperparameters are provided in the supplemental materials.

5.1 Results on various distribution shifts

Covariate shift. We conduct experiments on eight domain generalization datasets with covariate shift. The average classification accuracy for each dataset is provided in Table 2. We follow the leave-one-out protocol [34] for evaluation on the first four datasets, where the model evaluated on each test domain is trained on the other domains. The detailed results on each test domain are provided in the supplemental materials. For the last four datasets, we evaluate the same ImageNet-trained model on each of them individually. Our method outperforms the other prompt learning methods CoOp, CoCoOp, and DPL on all eight datasets. Note that the comparisons with the other prompt learning methods are fair since we generate the test prompt and make predictions in a single feedforward pass, without any optimization or backpropagation at test time. The proposed method also performs better than the test-time tuning method TPT on seven of the eight datasets and ranks second on ImageNet-A. Moreover, since the proposed method learns the prompt and transformer network only during training, it can also be combined with test-time optimization. This combination yields even better results, which are also competitive on ImageNet-A, indicating the effectiveness of any-shift prompting on covariate shift.
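The leave-one-out protocol used for the first four datasets can be sketched as follows. The helper name `leave_one_domain_out` is hypothetical, and the domain names are those of Office-Home as listed in the experimental setup; the sketch only enumerates the splits and does not train anything.

```python
def leave_one_domain_out(domains):
    """Yield (train_domains, test_domain) pairs, holding out each domain once."""
    for held_out in domains:
        train = [d for d in domains if d != held_out]
        yield train, held_out

office_home = ["Art", "Clipart", "Product", "Real-world"]
splits = list(leave_one_domain_out(office_home))
for train, test in splits:
    print(f"train on {train}, test on {test}")
```

The per-dataset number is then the mean of the accuracies obtained on the held-out domains.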

(a) Average over 11 datasets.

	Base	New	H
CLIP	69.34	74.22	71.70
CoOp	82.69	63.22	71.66
CoCoOp	80.47	71.69	75.83
BPL	80.10	74.94	77.43
MaPLe	82.28	75.14	78.55
This paper	82.36	76.30	79.21

(b) ImageNet.

	Base	New	H
CLIP	72.43	68.14	70.22
CoOp	76.47	67.88	71.92
CoCoOp	75.98	70.43	73.10
BPL	-	70.93	-
MaPLe	76.66	70.54	73.47
This paper	76.63	71.33	73.88

(c) Caltech101.

	Base	New	H
CLIP	96.84	94.00	95.40
CoOp	98.00	89.81	93.73
CoCoOp	97.96	93.81	95.84
BPL	-	94.93	-
MaPLe	97.74	94.36	96.02
This paper	98.28	94.27	96.23

(d) OxfordPets.

	Base	New	H
CLIP	91.17	97.26	94.12
CoOp	93.67	95.29	94.47
CoCoOp	95.20	97.69	96.43
BPL	-	98.00	-
MaPLe	95.43	97.76	96.58
This paper	95.78	97.80	96.78

(e) StanfordCars.

	Base	New	H
CLIP	63.37	74.89	68.65
CoOp	78.12	60.40	68.13
CoCoOp	70.49	73.59	72.01
BPL	-	73.23	-
MaPLe	72.94	74.00	73.47
This paper	73.05	75.83	74.41

(f) Flowers102.

	Base	New	H
CLIP	72.08	77.80	74.83
CoOp	97.60	59.67	74.06
CoCoOp	94.87	71.75	81.71
BPL	-	70.40	-
MaPLe	95.92	72.46	82.56
This paper	96.50	76.20	85.16

(g) Food101.

	Base	New	H
CLIP	90.10	91.22	90.66
CoOp	88.33	82.26	85.19
CoCoOp	90.70	91.29	90.99
BPL	-	92.13	-
MaPLe	90.71	92.05	91.38
This paper	90.87	91.35	91.11

(h) FGVCAircraft.

	Base	New	H
CLIP	27.19	36.29	31.09
CoOp	40.44	22.30	28.75
CoCoOp	33.41	23.71	27.74
BPL	-	35.00	-
MaPLe	37.44	35.61	36.50
This paper	37.10	35.70	36.39

(i) SUN397.

	Base	New	H
CLIP	69.36	75.35	72.23
CoOp	80.60	65.89	72.51
CoCoOp	79.74	76.86	78.27
BPL	-	77.87	-
MaPLe	80.82	78.70	79.75
This paper	80.50	78.50	79.48

(j) DTD.

	Base	New	H
CLIP	53.24	59.90	56.37
CoOp	79.44	41.18	54.24
CoCoOp	77.01	56.00	64.85
BPL	-	60.80	-
MaPLe	80.36	59.18	68.16
This paper	79.63	61.98	69.71

(k) EuroSAT.

	Base	New	H
CLIP	56.48	64.05	60.03
CoOp	92.19	54.74	68.69
CoCoOp	87.49	60.04	71.21
BPL	-	75.30	-
MaPLe	94.07	73.23	82.35
This paper	93.07	77.63	84.65

(l) UCF101.

	Base	New	H
CLIP	70.53	77.50	73.85
CoOp	84.69	56.05	67.46
CoCoOp	82.33	73.45	77.64
BPL	-	75.77	-
MaPLe	83.00	78.66	80.77
This paper	84.60	78.70	81.54

Table 3: Label shift comparison. The models are trained on the base classes with 16 shots and evaluated on both the base and new classes. We bold the best results and underline the runner-up. H denotes the Harmonic mean [75]. Our method performs well on both base and new classes, therefore achieving the best overall Harmonic mean, demonstrating the generalization ability across label shifts.

Label shift. We conduct the experiments on label shift following the base-to-new class generalization setting in Zhou et al. [86]. The results on eleven datasets and the averaged performance are provided in Table 3. Since our any-shift prompts encode both training and test information, as well as their relationships, our method performs well on both base and new classes, therefore achieving the best overall Harmonic mean on the eleven datasets. Compared with the original CLIP model, the proposed method achieves better performance on the base classes, showing good adaptation to the downstream tasks with the training information. Compared with the other prompt learning methods CoOp [87], CoCoOp [86], BPL [9], and MaPLe [29], our method performs best on the new classes on seven of the eleven datasets and is competitive on the other four. This demonstrates the ability of the method to handle label shift by incorporating the distribution information and their relationships.
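For reference, the H column of Table 3 is the harmonic mean of the base-class and new-class accuracies. A minimal sketch, with a hypothetical function name, that recomputes it from the averaged rows of Table 3(a):

```python
def harmonic_mean(base_acc, new_acc):
    """Harmonic mean of base-class and new-class accuracy (both in percent)."""
    if base_acc + new_acc == 0:
        return 0.0
    return 2 * base_acc * new_acc / (base_acc + new_acc)

# Averaged rows of Table 3(a): (Base, New).
ours = harmonic_mean(82.36, 76.30)
clip = harmonic_mean(69.34, 74.22)
print(f"This paper: {ours:.2f}, CLIP: {clip:.2f}")
```

Because the harmonic mean is dominated by the smaller of the two values, a method such as CoOp that gains on base classes but loses on new classes is penalized relative to a balanced one.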

Concept shift. For concept shift, we conduct experiments on the introduced ImageNet-Superclass dataset, where the same images are assigned different annotations. To do so, we evaluate the ImageNet-trained model on the validation set with the superclass annotations. As shown in Table 4, the prompt learning methods achieve performance similar to the original CLIP. By contrast, our method improves over CLIP by about 2 percentage points (71.12 vs. 69.23), indicating the ability to handle concept shift.

Method	ImageNet-Superclass (concept shift)	Living-17 (conditional shift)	Entity-30 (conditional shift)
CLIP†	69.23	86.94	67.95
CoOp†	69.35	87.11	78.02
CoCoOp†	69.77	87.24	79.52
This paper	71.12 $\pm$ 0.6	88.41 $\pm$ 0.3	81.74 $\pm$ 0.4

Table 4: Concept shift and conditional shift comparison. The results of the compared methods are based on the author-provided code since the prompt learning methods do not provide results on these shifts.

Conditional shift. We also conduct experiments on two datasets with conditional shift. The results are also reported in Table 4. The prompt learning methods perform similarly to CLIP on Living-17 while achieving much larger improvements on Entity-30. A possible reason is that the class names of Living-17 (e.g., wolf, fox) are more specific than those of Entity-30 (e.g., crustacean, carnivore, insect), which reveals the importance of adapting the original CLIP model to downstream tasks in specific scenarios. Moreover, our method outperforms the conventional prompt learning methods CoOp and CoCoOp on both datasets, demonstrating the effectiveness of any-shift prompting for conditional shift.

Method	Art	Clipart	Product	Real	Mean
CLIP†	79.32	67.70	86.93	87.46	80.35
CLIP-D†	80.47	68.83	87.93	88.80	81.51
CoOp†	80.50	69.05	88.26	89.01	81.71
CoCoOp†	80.93	69.51	88.85	89.32	82.19
This paper	83.40 $\pm$ 0.8	72.53 $\pm$ 0.5	91.24 $\pm$ 0.6	90.84 $\pm$ 0.3	84.50 $\pm$ 0.4

Table 5: Multiple shifts comparison on Open-Office-Home, including both covariate and label shifts. The results of other methods are based on the author-provided code.

Joint distribution shift. In Table 5, we report the results on Open-Office-Home for the joint distribution shifts. Following Shu et al. [66], each training domain contains data from a different subset of the classes, and we evaluate the model on the test domain with both seen and unseen classes. Therefore, the model encounters covariate and label shifts jointly. As shown in Table 5, the CLIP-based zero-shot methods keep the same performance as in the closed-set generalization setting (Table 2) since they are kept frozen. The prompt learning methods perform slightly worse than in the closed-set setting. Our method outperforms the others on all test domains, showing the ability to handle joint distribution shifts.

Overall, our method achieves good performance on covariate, label, concept, conditional, and even joint shifts, demonstrating the effectiveness of handling various distribution shifts by considering the distribution information and their relationship with any-shift prompting.

5.2 Ablation studies

Effectiveness of training and test prompts. To investigate the benefits of the training and test prompts of any-shift prompting, we evaluate our method with the training and test prompts separately. The experiments are conducted on Open-Office-Home with joint distribution shift. We compare the prompts with the original CLIP model as well as CoOp and CoCoOp in Figure 4, and provide the accuracy on all classes, seen classes, and unseen classes, respectively. CoOp and CoCoOp show better performance on seen classes under covariate shift but struggle on the unseen classes, where both covariate shift and label shift exist. The training prompt in our method encounters the same problem: it encodes the training information of the seen classes but also tends to overfit the training distribution. Its performance is slightly better since it considers uncertainty in the prompt. By contrast, the test prompt in our method encodes the test information together with the relationships between the training and test distributions. This enables the method to generalize well across different shifts, leading to higher performance on both seen classes (covariate shift) and unseen classes (both covariate shift and label shift).
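The three accuracies reported in this ablation (all, seen, and unseen classes) can be computed by partitioning the test samples according to whether their ground-truth class appeared during training. The sketch below is illustrative only: the function name and the toy labels are hypothetical and do not come from the paper.

```python
def split_accuracy(labels, preds, seen_classes):
    """Return accuracy (in percent) on all, seen-class, and unseen-class samples."""
    def acc(pairs):
        if not pairs:
            return float("nan")  # no samples in this group
        return 100.0 * sum(y == p for y, p in pairs) / len(pairs)

    pairs = list(zip(labels, preds))
    seen = [(y, p) for y, p in pairs if y in seen_classes]
    unseen = [(y, p) for y, p in pairs if y not in seen_classes]
    return {"all": acc(pairs), "seen": acc(seen), "unseen": acc(unseen)}

# Toy example: classes 0 and 1 were seen in training, class 2 was not.
labels = [0, 0, 1, 2, 2]
preds = [0, 1, 1, 2, 0]
result = split_accuracy(labels, preds, seen_classes={0, 1})
print(result)
```

Samples from seen classes in the test domain are affected by covariate shift only, whereas samples from unseen classes are affected by both covariate and label shift.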

Training prompt $\mathbf{v}_{s}$	Test text feature of $\mathcal{Y}_{t}$	Test image feature of $\mathbf{x}_{t}$	Accuracy
✓			82.62
	✓		82.67
		✓	83.11
	✓	✓	83.63
✓	✓	✓	84.50

Table 6: Benefits of training and test information in any-shift prompt. The experiments are conducted across the joint shifts on Open-Office-Home. Both training and test information in the prompt benefit the method across joint shifts.

Visualization of generalization effect. To further show the benefits of generalization with our method, we visualize the image and text features before and after generalization by any-shift prompting. The experiments are conducted on the "Art" domain under Open-Office-Home. The image and text features before generalization are generated by the frozen CLIP image and language encoders, respectively. As shown in Figure 5, after generalization by any-shift prompting, the image features get closer to the text features of the corresponding ground-truth labels, which leads to more accurate predictions.

Benefits of training and test information in any-shift prompt. To show the benefits of considering different information in the test prompt, we conduct experiments on Open-Office-Home, which contains both covariate and label shifts. As shown in Table 6, using only the training prompt achieves better performance than CLIP (80.35), and we obtain similar results with only test text features or only test image features. Of the two, the information from the test images yields the larger improvement, possibly because the test images carry more unseen information in this setting. The test prompt generated from both image and text information further improves the generalization to test distributions, indicating the importance of considering test information for generalization. Moreover, including the training prompt provides the relationships and shift information between the training and test distributions in the prompt, leading to the best performance.

6 Conclusion

We propose any-shift prompting to adapt the large image-language model (CLIP) to downstream tasks while enhancing the generalization ability across different distribution shifts at test time. The proposed method bridges the training and test distributions under a hierarchical probabilistic framework, which generates a specific prompt for each test sample by encoding the distribution information and relationships of the training and test distributions. Once trained, the model generates the test-specific prompt across any distribution shift in a single feedforward pass without any fine-tuning or backpropagation. The test prompt generalizes both the image and language encoders of CLIP to the specific test distribution. Experiments on various distribution shifts, including covariate shift, label shift, conditional shift, concept shift, and joint shift, demonstrate the effectiveness of the proposed method for generalization to any test distribution.

Acknowledgment

This work is financially supported by the Inception Institute of Artificial Intelligence, the University of Amsterdam, and the allowance Top consortia for Knowledge and Innovation (TKIs) from the Netherlands Ministry of Economic Affairs and Climate Policy.

References

Arjovsky et al. [2019] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.
Azizzadenesheli et al. [2019] Kamyar Azizzadenesheli, Anqi Liu, Fanny Yang, and Animashree Anandkumar. Regularized learning for domain adaptation under label shifts. arXiv preprint arXiv:1903.09734, 2019.
Bahng et al. [2022] Hyojin Bahng, Ali Jahanian, Swami Sankaranarayanan, and Phillip Isola. Exploring visual prompts for adapting large-scale models. arXiv preprint arXiv:2203.17274, 2022.
Balaji et al. [2018] Yogesh Balaji, Swami Sankaranarayanan, and Rama Chellappa. MetaReg: Towards domain generalization using meta-regularization. In Advances in Neural Information Processing Systems, pages 998–1008, 2018.
Bossard et al. [2014] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. In European Conference on Computer Vision, pages 446–461. Springer, 2014.
Choi et al. [2010] Myung Jin Choi, Joseph J Lim, Antonio Torralba, and Alan S Willsky. Exploiting hierarchical context on a large database of object categories. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 129–136. IEEE, 2010.
Cimpoi et al. [2014] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3606–3613, 2014.
Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
Derakhshani et al. [2023] Mohammad Mahdi Derakhshani, Enrique Sanchez, Adrian Bulat, Victor G Turrisi da Costa, Cees GM Snoek, Georgios Tzimiropoulos, and Brais Martinez. Bayesian prompt learning for image-language model generalization. In IEEE International Conference on Computer Vision, pages 15237–15246, 2023.
Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020.
Dou et al. [2019] Qi Dou, Daniel C Castro, Konstantinos Kamnitsas, and Ben Glocker. Domain generalization via model-agnostic learning of semantic features. In Advances in Neural Information Processing Systems, 2019.
Dubey et al. [2021] Abhimanyu Dubey, Vignesh Ramanathan, Alex Pentland, and Dhruv Mahajan. Adaptive methods for real-world domain generalization. In IEEE Conference on Computer Vision and Pattern Recognition, pages 14340–14349, 2021.
Everingham et al. [2010] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
Fang et al. [2013] Yuming Fang, Weisi Lin, Zhenzhong Chen, Chia-Ming Tsai, and Chia-Wen Lin. A video saliency detection model in compressed domain. IEEE Transactions on Circuits and Systems for Video Technology, 24(1):27–38, 2013.
Fei-Fei et al. [2004] Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In IEEE Conference on Computer Vision and Pattern Recognition Workshop, pages 178–178. IEEE, 2004.
Gao et al. [2023] Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters. International Journal of Computer Vision, pages 1–15, 2023.
Garg et al. [2023] Saurabh Garg, Nick Erickson, James Sharpnack, Alex Smola, Sivaraman Balakrishnan, and Zachary Chase Lipton. Rlsbench: Domain adaptation under relaxed label shift. In International Conference on Machine Learning, pages 10879–10928. PMLR, 2023.
Gong et al. [2016] Mingming Gong, Kun Zhang, Tongliang Liu, Dacheng Tao, Clark Glymour, and Bernhard Schölkopf. Domain adaptation with conditional transferable components. In International Conference on Machine Learning, pages 2839–2848. PMLR, 2016.
Goyal et al. [2022] Sachin Goyal, Mingjie Sun, Aditi Raghunathan, and Zico Kolter. Test-time adaptation via conjugate pseudo-labels. In Advances in Neural Information Processing Systems, 2022.
Griffin et al. [2007] Gregory Griffin, Alex Holub, and Pietro Perona. Caltech-256 object category dataset. 2007.
Gulrajani and Lopez-Paz [2020] Ishaan Gulrajani and David Lopez-Paz. In search of lost domain generalization. In International Conference on Learning Representations, 2020.
Guo et al. [2020] Jiaxian Guo, Mingming Gong, Tongliang Liu, Kun Zhang, and Dacheng Tao. Ltf: A label transformation framework for correcting label shift. In International Conference on Machine Learning, pages 3843–3853. PMLR, 2020.
Helber et al. [2019] Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019.
Hendrycks and Dietterich [2019] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261, 2019.
Hendrycks et al. [2021a] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In IEEE International Conference on Computer Vision, pages 8340–8349, 2021a.
Hendrycks et al. [2021b] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In IEEE Conference on Computer Vision and Pattern Recognition, pages 15262–15271, 2021b.
Iwasawa et al. [2021] Yusuke Iwasawa et al. Test-time classifier adjustment module for model-agnostic domain generalization. In Advances in Neural Information Processing Systems, 2021.
Jia et al. [2021] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021.
Khattak et al. [2023] Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Maple: Multi-modal prompt learning. In IEEE Conference on Computer Vision and Pattern Recognition, pages 19113–19122, 2023.
Kingma and Welling [2013] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
Krause et al. [2013] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In IEEE International Conference on Computer Vision Workshops, pages 554–561, 2013.
Lee et al. [2023] Yoonho Lee, Annie S Chen, Fahim Tajwar, Ananya Kumar, Huaxiu Yao, Percy Liang, and Chelsea Finn. Surgical fine-tuning improves adaptation to distribution shifts. In International Conference on Learning Representations, 2023.
Lester et al. [2021] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.
Li et al. [2017] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. Deeper, broader and artier domain generalization. In IEEE International Conference on Computer Vision, pages 5542–5550, 2017.
Li et al. [2018a] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy Hospedales. Learning to generalize: Meta-learning for domain generalization. In AAAI Conference on Artificial Intelligence, 2018a.
Li et al. [2018b] Haoliang Li, Sinno Jialin Pan, Shiqi Wang, and Alex C Kot. Domain generalization with adversarial feature learning. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5400–5409, 2018b.
Li and Liang [2021] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021.
Liang et al. [2020] Jian Liang, Dapeng Hu, and Jiashi Feng. Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation. In International Conference on Machine Learning, pages 6028–6039. PMLR, 2020.
Lim et al. [2023] Hyesu Lim, Byeonggeun Kim, Jaegul Choo, and Sungha Choi. Ttn: A domain-shift aware batch normalization in test-time adaptation. In International Conference on Learning Representations, 2023.
Liu et al. [2023] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35, 2023.
Liu et al. [2021a] Xiaofeng Liu, Zhenhua Guo, Site Li, Fangxu Xing, Jane You, C-C Jay Kuo, Georges El Fakhri, and Jonghye Woo. Adversarial unsupervised domain adaptation with conditional and label shift: Infer, align and iterate. In IEEE International Conference on Computer Vision, pages 10367–10376, 2021a.
Liu et al. [2022] Xiaofeng Liu, Chaehwa Yoo, Fangxu Xing, Hyejin Oh, Georges El Fakhri, Je-Won Kang, Jonghye Woo, et al. Deep unsupervised domain adaptation: A review of recent advances and perspectives. APSIPA Transactions on Signal and Information Processing, 11(1), 2022.
Liu et al. [2021b] Yuejiang Liu, Parth Kothari, Bastien van Delft, Baptiste Bellot-Gurlet, Taylor Mordan, and Alexandre Alahi. Ttt++: When does self-supervised test-time training fail or thrive? In Advances in Neural Information Processing Systems, 2021b.
Long et al. [2015] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning, pages 97–105. PMLR, 2015.
Maji et al. [2013] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
Motiian et al. [2017] Saeid Motiian, Marco Piccirilli, Donald A Adjeroh, and Gianfranco Doretto. Unified deep supervised domain adaptation and generalization. In IEEE International Conference on Computer Vision, pages 5715–5725, 2017.
Muandet et al. [2013] Krikamol Muandet, David Balduzzi, and Bernhard Schölkopf. Domain generalization via invariant feature representation. In International Conference on Machine Learning, pages 10–18. PMLR, 2013.
Nilsback and Zisserman [2008] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian conference on computer vision, graphics & image processing, pages 722–729. IEEE, 2008.
Niu et al. [2022] Shuaicheng Niu, Jiaxiang Wu, Yifan Zhang, Yaofo Chen, Shijian Zheng, Peilin Zhao, and Mingkui Tan. Efficient test-time model adaptation without forgetting. In International Conference on Machine Learning, pages 16888–16905. PMLR, 2022.
Niu et al. [2023] Shuaicheng Niu, Jiaxiang Wu, Yifan Zhang, Zhiquan Wen, Yaofo Chen, Peilin Zhao, and Mingkui Tan. Towards stable test-time adaptation in dynamic wild world. In International Conference on Learning Representations, 2023.
Novack et al. [2023] Zachary Novack, Julian McAuley, Zachary Chase Lipton, and Saurabh Garg. Chils: Zero-shot image classification with hierarchical label sets. In International Conference on Machine Learning, pages 26342–26362. PMLR, 2023.
Park et al. [2023] Sunghyun Park, Seunghan Yang, Jaegul Choo, and Sungrack Yun. Label shift adapter for test-time adaptation under covariate and label shifts. In IEEE International Conference on Computer Vision, pages 16421–16431, 2023.
Parkhi et al. [2012] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3498–3505. IEEE, 2012.
Peng et al. [2019] Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. In IEEE International Conference on Computer Vision, pages 1406–1415, 2019.
Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
Recht et al. [2019] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? In International Conference on Machine Learning, pages 5389–5400. PMLR, 2019.
Roberts et al. [2022] Manley Roberts, Pranav Mani, Saurabh Garg, and Zachary Lipton. Unsupervised learning under latent label shift. In Advances in Neural Information Processing Systems, pages 18763–18778, 2022.
Roth et al. [2023] Karsten Roth, Jae Myung Kim, A Koepke, Oriol Vinyals, Cordelia Schmid, and Zeynep Akata. Waffling around for performance: Visual classification with random words and broad concepts. arXiv preprint arXiv:2306.07282, 2023.
Russell et al. [2008] Bryan C Russell, Antonio Torralba, Kevin P Murphy, and William T Freeman. Labelme: a database and web-based tool for image annotation. International Journal of Computer Vision, 77(1-3):157–173, 2008.
Samadh et al. [2023] Jameel Hassan Abdul Samadh, Hanan Gani, Noor Hazim Hussein, Muhammad Uzair Khattak, Muzammal Naseer, Fahad Khan, and Salman Khan. Align your prompts: Test-time prompting with distribution alignment for zero-shot generalization. In Advances in Neural Information Processing Systems, 2023.
Santurkar et al. [2020] Shibani Santurkar, Dimitris Tsipras, and Aleksander Madry. Breeds: Benchmarks for subpopulation shift. arXiv preprint arXiv:2008.04859, 2020.
Schneider et al. [2020] Steffen Schneider, Evgenia Rusak, Luisa Eck, Oliver Bringmann, Wieland Brendel, and Matthias Bethge. Improving robustness against common corruptions by covariate shift adaptation. In Advances in Neural Information Processing Systems, pages 11539–11551, 2020.
Shen et al. [2022] Jiayi Shen, Zehao Xiao, Xiantong Zhen, Cees Snoek, and Marcel Worring. Association graph learning for multi-task classification with category shifts. In Advances in Neural Information Processing Systems, pages 4503–4516, 2022.
Shu et al. [2022] Manli Shu, Weili Nie, De-An Huang, Zhiding Yu, Tom Goldstein, Anima Anandkumar, and Chaowei Xiao. Test-time prompt tuning for zero-shot generalization in vision-language models. In Advances in Neural Information Processing Systems, pages 14274–14289, 2022.
Shu et al. [2021] Yang Shu, Zhangjie Cao, Chenyu Wang, Jianmin Wang, and Mingsheng Long. Open domain generalization with domain-augmented meta-learning. In IEEE Conference on Computer Vision and Pattern Recognition, pages 9624–9633, 2021.
Soomro et al. [2012] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
Sun et al. [2020] Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. In International Conference on Machine Learning, pages 9229–9248. PMLR, 2020.
Tachet des Combes et al. [2020] Remi Tachet des Combes, Han Zhao, Yu-Xiang Wang, and Geoffrey J Gordon. Domain adaptation with conditional distribution matching and generalized label shift. In Advances in Neural Information Processing Systems, pages 19276–19289, 2020.
Venkateswara et al. [2017] Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5018–5027, 2017.
Wang et al. [2021] Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization. In International Conference on Learning Representations, 2021.
Wang et al. [2019] Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. In Advances in Neural Information Processing Systems, 2019.
Wang and Deng [2018] Mei Wang and Weihong Deng. Deep visual domain adaptation: A survey. Neurocomputing, 312:135–153, 2018.
Wu et al. [2021] Ruihan Wu, Chuan Guo, Yi Su, and Kilian Q Weinberger. Online adaptation to label distribution shift. In Advances in Neural Information Processing Systems, pages 11340–11351, 2021.
Xian et al. [2018] Yongqin Xian, Christoph H Lampert, Bernt Schiele, and Zeynep Akata. Zero-shot learning—a comprehensive evaluation of the good, the bad and the ugly. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(9):2251–2265, 2018.
Xiao et al. [2010] Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 3485–3492. IEEE, 2010.
Xiao et al. [2021] Zehao Xiao, Jiayi Shen, Xiantong Zhen, Ling Shao, and Cees G M Snoek. A bit more bayesian: Domain-invariant learning with uncertainty. In International Conference on Machine Learning. PMLR, 2021.
Xiao et al. [2022] Zehao Xiao, Xiantong Zhen, Ling Shao, and Cees G M Snoek. Learning to generalize across domains on single test samples. In International Conference on Learning Representations, 2022.
Yao et al. [2023] Hantao Yao, Rui Zhang, and Changsheng Xu. Visual-language prompt tuning with knowledge-guided context optimization. In IEEE Conference on Computer Vision and Pattern Recognition, pages 6757–6767, 2023.
Yi et al. [2023] Li Yi, Gezheng Xu, Pengcheng Xu, Jiaqi Li, Ruizhi Pu, Charles Ling, A Ian McLeod, and Boyu Wang. When source-free domain adaptation meets learning with noisy labels. In International Conference on Learning Representations, 2023.
Zhang et al. [2013] Kun Zhang, Bernhard Schölkopf, Krikamol Muandet, and Zhikun Wang. Domain adaptation under target and conditional shift. In International Conference on Machine Learning, pages 819–827. PMLR, 2013.
Zhang et al. [2022] Marvin Zhang, Sergey Levine, and Chelsea Finn. Memo: Test time robustness via adaptation and augmentation. In Advances in Neural Information Processing Systems, pages 38629–38642, 2022.
Zhang et al. [2021] Xin Zhang, Shixiang Shane Gu, Yutaka Matsuo, and Yusuke Iwasawa. Domain prompt learning for efficiently adapting clip to unseen domains. arXiv e-prints, pages arXiv–2111, 2021.
Zhang et al. [2023] Yifan Zhang, Xue Wang, Kexin Jin, Kun Yuan, Zhang Zhang, Liang Wang, Rong Jin, and Tieniu Tan. Adanpc: Exploring non-parametric classifier for test-time adaptation. In International Conference on Machine Learning, pages 41647–41676. PMLR, 2023.
Zhou et al. [2022a] Kaiyang Zhou, Ziwei Liu, Yu Qiao, Tao Xiang, and Chen Change Loy. Domain generalization: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022a.
Zhou et al. [2022b] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In IEEE Conference on Computer Vision and Pattern Recognition, pages 16816–16825, 2022b.
Zhou et al. [2022c] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022c.
Zhu et al. [2023] Beier Zhu, Yulei Niu, Yucheng Han, Yue Wu, and Hanwang Zhang. Prompt-aligned gradient for prompt tuning. In IEEE International Conference on Computer Vision, pages 15659–15669, 2023.

Appendix A Derivations of any-shift prompting

In the main paper, we provide the modeling of our any-shift prompting. Here we provide further derivations of the optimizations of the prior and posterior distributions.

To model the information of training and test distributions and their relationships, we propose any-shift prompting within a hierarchical framework. We introduce training and test prompts as latent variables in the hierarchical probabilistic architecture. The prediction function of the CLIP model is then formulated as:

		$\displaystyle p_{\Phi,\bm{\theta}}(\mathbf{y}_{t}|\mathbf{x}_{t},\mathcal{Y}_{t},\mathcal{D}_{s})$		(15)
		$\displaystyle=\int\int p(\mathbf{y}_{t},\mathbf{v}_{t},\mathbf{v}_{s}|\mathbf{x}_{t},\mathcal{Y}_{t},\mathbf{x}_{s},\mathbf{y}_{s},\mathcal{Y}_{s})\,d\mathbf{v}_{t}\,d\mathbf{v}_{s}$
		$\displaystyle=\int\int p(\mathbf{y}_{t}|\mathbf{x}_{t},\mathbf{v}_{t},\mathcal{Y}_{t})\,p(\mathbf{v}_{t},\mathbf{v}_{s}|\mathbf{x}_{t},\mathcal{Y}_{t},\mathcal{D}_{s})\,d\mathbf{v}_{t}\,d\mathbf{v}_{s}$
		$\displaystyle=\int\int p_{\Phi}(\mathbf{y}_{t}|\mathbf{x}_{t},\mathbf{v}_{t},\mathcal{Y}_{t})\,p_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}_{s},\mathbf{x}_{t},\mathcal{Y}_{t})\,p(\mathbf{v}_{s}|\mathcal{D}_{s})\,d\mathbf{v}_{t}\,d\mathbf{v}_{s},$

where the prior distribution of the training and test prompts is factorized as

		$\displaystyle p(\mathbf{v}_{t},\mathbf{v}_{s}|\mathbf{x}_{t},\mathcal{Y}_{t},\mathcal{D}_{s})=p_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}_{s},\mathbf{x}_{t},\mathcal{Y}_{t})\,p(\mathbf{v}_{s}|\mathcal{D}_{s}).$		(16)

$p(\mathbf{v}_{s}|\mathcal{D}_{s})$ is the distribution of the training prompt, learned from the training data $\mathcal{D}_{s}$ sampled from the training distribution $p(\mathbf{x}_{s},\mathbf{y}_{s})$. $p_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}_{s},\mathbf{x}_{t},\mathcal{Y}_{t})$ is the distribution of the test prompt, which aggregates both training information from $\mathbf{v}_{s}$ and test information from the test image $\mathbf{x}_{t}$ and class names $\mathcal{Y}_{t}$. The test prompt exploits the relationships between training and test distributions through the transformer inference network $\bm{\theta}$. The sampled $\mathbf{v}_{t}$ is then fed into the frozen image and text encoders $\Phi=\{\Phi_{I},\Phi_{T}\}$ to generalize the CLIP model to the test data.
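
As a rough illustration of how the double integral in eq. (15) might be approximated, the sketch below uses ancestral sampling: it draws a training prompt, draws a test prompt conditioned on it, and averages the resulting class probabilities. Everything in it is an assumption made for illustration only: the diagonal Gaussian form of both prompt distributions, the toy functions `infer_test_prompt` and `classify` (hypothetical stand-ins for the transformer inference network $\bm{\theta}$ and the frozen CLIP encoders $\Phi$), and all dimensions and sample counts.

```python
import numpy as np

def sample_gaussian(mean, log_var, rng):
    # Reparameterized draw from a diagonal Gaussian (assumed prompt distribution).
    return mean + np.exp(0.5 * log_var) * rng.standard_normal(mean.shape)

def infer_test_prompt(v_s, x_t, class_feats):
    # Hypothetical stand-in for p_theta(v_t | v_s, x_t, Y_t):
    # returns the mean and log-variance of the test prompt.
    mean = (v_s + x_t + class_feats.mean(axis=0)) / 3.0
    log_var = np.full_like(mean, -4.0)
    return mean, log_var

def classify(x_t, v_t, class_feats):
    # Hypothetical stand-in for p_Phi(y_t | x_t, v_t, Y_t):
    # softmax over similarities between prompted image and class features.
    logits = (class_feats + v_t) @ (x_t + v_t)
    logits = logits - logits.max()
    probs = np.exp(logits)
    return probs / probs.sum()

def predict(x_t, class_feats, mu_s, log_var_s, n_s=4, n_t=4, seed=0):
    # Monte Carlo estimate of the marginal predictive distribution.
    rng = np.random.default_rng(seed)
    total = np.zeros(class_feats.shape[0])
    for _ in range(n_s):
        v_s = sample_gaussian(mu_s, log_var_s, rng)       # v_s ~ p(v_s | D_s)
        mean_t, log_var_t = infer_test_prompt(v_s, x_t, class_feats)
        for _ in range(n_t):
            v_t = sample_gaussian(mean_t, log_var_t, rng)  # v_t ~ p_theta(v_t | v_s, x_t, Y_t)
            total += classify(x_t, v_t, class_feats)
    return total / (n_s * n_t)

# Toy inputs: one 8-dimensional test feature and five class features.
rng = np.random.default_rng(1)
x_t = rng.standard_normal(8)
class_feats = rng.standard_normal((5, 8))
probs = predict(x_t, class_feats, mu_s=np.zeros(8), log_var_s=np.zeros(8))
print(probs)
```

Because each inner call returns a normalized distribution, the averaged output is itself a valid distribution over the test classes.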

To optimize the model for generating the probabilistic training and test prompts, we further introduce a variational posterior to approximate the true posterior $p(\mathbf{v}_{t},\mathbf{v}_{s}|\mathcal{D}_{t},\mathcal{Y}_{t},\mathcal{D}_{s})$ of the latent variables in eq. (15). The variational posterior is factorized as:

		$\displaystyle q_{\bm{\theta}}(\mathbf{v}_{t},\mathbf{v}_{s}|\mathcal{D}_{t},\mathcal{Y}_{t},\mathcal{D}_{s})=q_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}_{s},\mathcal{D}_{t},\mathcal{Y}_{t})\,p(\mathbf{v}_{s}|\mathcal{D}_{s}),$		(17)

where $\mathcal{D}_{t}$ consists of test input-output pairs sampled from the test distribution $p(\mathbf{x}_{t},\mathbf{y}_{t})$. The variational posterior shares the same inference model $\bm{\theta}$ with the prior distribution. By integrating eq. (17) into eq. (15), the evidence lower bound (ELBO) of the log-likelihood $\log p_{\Phi,\bm{\theta}}(\mathbf{y}_{t}|\mathbf{x}_{t},\mathcal{Y}_{t},\mathcal{D}_{s})$ is derived as:

		$\displaystyle\log p_{\Phi,\bm{\theta}}(\mathbf{y}_{t}|\mathbf{x}_{t},\mathcal{Y}_{t},\mathcal{D}_{s})$		(18)
		$\displaystyle=\log\int\int p(\mathbf{y}_{t}|\mathbf{x}_{t},\mathbf{v}_{t},\mathcal{Y}_{t})\,p(\mathbf{v}_{t},\mathbf{v}_{s}|\mathbf{x}_{t},\mathcal{Y}_{t},\mathcal{D}_{s})\,d\mathbf{v}_{t}\,d\mathbf{v}_{s}$
		$\displaystyle=\log\int\int p(\mathbf{y}_{t}|\mathbf{x}_{t},\mathbf{v}_{t},\mathcal{Y}_{t})\,q_{\bm{\theta}}(\mathbf{v}_{t},\mathbf{v}_{s}|\mathcal{D}_{t},\mathcal{Y}_{t},\mathcal{D}_{s})$
		$\displaystyle\quad\frac{p(\mathbf{v}_{t},\mathbf{v}_{s}|\mathbf{x}_{t},\mathcal{Y}_{t},\mathcal{D}_{s})}{q_{\bm{\theta}}(\mathbf{v}_{t},\mathbf{v}_{s}|\mathcal{D}_{t},\mathcal{Y}_{t},\mathcal{D}_{s})}\,d\mathbf{v}_{t}\,d\mathbf{v}_{s}$
		$\displaystyle=\log\int\int p(\mathbf{y}_{t}|\mathbf{x}_{t},\mathbf{v}_{t},\mathcal{Y}_{t})\,q_{\bm{\theta}}(\mathbf{v}_{t},\mathbf{v}_{s}|\mathcal{D}_{t},\mathcal{Y}_{t},\mathcal{D}_{s})$
		$\displaystyle\quad\frac{p_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}_{s},\mathbf{x}_{t},\mathcal{Y}_{t})\,p(\mathbf{v}_{s}|\mathcal{D}_{s})}{q_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}_{s},\mathcal{D}_{t},\mathcal{Y}_{t})\,p(\mathbf{v}_{s}|\mathcal{D}_{s})}\,d\mathbf{v}_{t}\,d\mathbf{v}_{s}$
		$\displaystyle=\log\int\int p(\mathbf{y}_{t}|\mathbf{x}_{t},\mathbf{v}_{t},\mathcal{Y}_{t})\,q_{\bm{\theta}}(\mathbf{v}_{t},\mathbf{v}_{s}|\mathcal{D}_{t},\mathcal{Y}_{t},\mathcal{D}_{s})$
		$\displaystyle\quad\frac{p_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}_{s},\mathbf{x}_{t},\mathcal{Y}_{t})}{q_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}_{s},\mathcal{D}_{t},\mathcal{Y}_{t})}\,d\mathbf{v}_{t}\,d\mathbf{v}_{s}$
		$\displaystyle\geq\mathbb{E}_{q_{\bm{\theta}}(\mathbf{v}_{t},\mathbf{v}_{s})}\big[\log p_{\Phi}(\mathbf{y}_{t}|\mathbf{x}_{t},\mathbf{v}_{t},\mathcal{Y}_{t})\big]$
		$\displaystyle\quad-\mathbb{D}_{\mathrm{KL}}\big[q_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}_{s},\mathcal{D}_{t},\mathcal{Y}_{t})\,\|\,p_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}_{s},\mathbf{x}_{t},\mathcal{Y}_{t})\big],$

where the expectation of the log-likelihood is calculated on the variational posterior distribution $q_{\bm{\theta}}(\mathbf{v}_{t},\mathbf{v}_{s}|\mathcal{D}_{t},\mathcal{Y}_{t},\mathcal{D}_{s})$.

Our goal is to maximize the log-likelihood of the test data $\log p_{\Phi,\bm{\theta}}(\mathbf{y}_{t}|\mathbf{x}_{t},\mathcal{Y}_{t},\mathcal{D}_{s})$. Since the log-likelihood itself is intractable, we maximize the ELBO in eq. (18) instead, which is equivalent to minimizing an upper bound on the negative log-likelihood. The loss function for optimizing our any-shift prompting is therefore the right-hand side of:

		$\displaystyle-\log p_{\Phi,\bm{\theta}}(\mathbf{y}_{t}|\mathbf{x}_{t},\mathcal{Y}_{t},\mathcal{D}_{s})$		(19)
		$\displaystyle\leq\mathbb{E}_{q_{\bm{\theta}}(\mathbf{v}_{t},\mathbf{v}_{s})}\big[-\log p_{\Phi}(\mathbf{y}_{t}|\mathbf{x}_{t},\mathbf{v}_{t},\mathcal{Y}_{t})\big]$
		$\displaystyle\quad+\mathbb{D}_{\mathrm{KL}}\big[q_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}_{s},\mathcal{D}_{t},\mathcal{Y}_{t})\,\|\,p_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}_{s},\mathbf{x}_{t},\mathcal{Y}_{t})\big].$
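
To make the two terms of the bound in eq. (19) concrete, the following minimal sketch computes a negative ELBO as a cross-entropy term plus a KL term. It assumes, purely for illustration, that both the variational posterior and the prior over the test prompt are diagonal Gaussians parameterized by a mean and a log-variance, and that the expectation is approximated by averaging over a few sampled prompts. The function names are hypothetical and do not come from any released code.

```python
import numpy as np

def kl_diag_gaussians(mu_q, log_var_q, mu_p, log_var_p):
    # KL[ N(mu_q, diag(exp(log_var_q))) || N(mu_p, diag(exp(log_var_p))) ]
    var_q, var_p = np.exp(log_var_q), np.exp(log_var_p)
    return 0.5 * np.sum(
        log_var_p - log_var_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

def negative_elbo(log_probs_true_class, mu_q, log_var_q, mu_p, log_var_p):
    # log_probs_true_class: log p(y_t | x_t, v_t, Y_t) of the correct class,
    # one entry per test prompt sampled from the variational posterior.
    nll = -np.mean(log_probs_true_class)
    return nll + kl_diag_gaussians(mu_q, log_var_q, mu_p, log_var_p)

# Toy example: posterior and prior differ by 1 in a single unit-variance dimension.
kl_value = kl_diag_gaussians(np.array([1.0]), np.zeros(1), np.zeros(1), np.zeros(1))
loss_value = negative_elbo(np.log([0.5, 0.5]), np.zeros(3), np.zeros(3), np.zeros(3), np.zeros(3))
print(kl_value, loss_value)
```

When the posterior and prior coincide, the KL term vanishes and the loss reduces to the usual cross-entropy on the sampled prompts.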

Appendix B Details of setting and implementations

B.1 Details of datasets and settings

Covariate shift. We conduct the experiments on covariate shift in two settings: multiple training distributions and a single training distribution. The experiments on multiple training distributions are conducted on the domain generalization datasets PACS, VLCS, Office-Home, and DomainNet, each of which contains multiple domains of images with the same label space. PACS [34] includes images of 7 classes from four different domains: photo, art-painting, cartoon, and sketch. VLCS [14] consists of images of 5 classes from four different datasets: Pascal-VOC2007 [13], LabelMe [60], Caltech101 [20], and SUN [6]. Office-Home also contains four domains (art, clipart, product, and real-world), but its images come from 65 categories, far more than in PACS and VLCS. DomainNet is larger still, with images from six domains (clipart, infograph, painting, quickdraw, real, and sketch) and 345 categories. We follow the ``leave-one-out protocol'' [34] on these datasets, where we select one domain as the test distribution and treat the other domains as the training distributions. The model is trained on the training distributions and evaluated on the test one. We treat each domain as the test distribution in turn and report the results averaged over all test distributions in Table 2 in the main paper. The detailed results for each test distribution are reported in the following section.
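
The leave-one-out protocol described above can be sketched in a few lines. The snippet is only an illustration of how the splits and the averaged result are formed; the helper names are hypothetical, and the accuracies used in the example are the per-domain PACS results of our method from Table 13.

```python
def leave_one_domain_out(domains):
    # Each domain serves once as the test distribution;
    # all remaining domains form the training distributions.
    splits = []
    for test_domain in domains:
        train_domains = [d for d in domains if d != test_domain]
        splits.append((train_domains, test_domain))
    return splits

def mean_accuracy(per_domain_accuracy):
    # Average of the accuracies obtained on each held-out domain.
    return sum(per_domain_accuracy.values()) / len(per_domain_accuracy)

pacs = ["photo", "art-painting", "cartoon", "sketch"]
pacs_splits = leave_one_domain_out(pacs)
for train_domains, test_domain in pacs_splits:
    print("train on", train_domains, "-> test on", test_domain)

# Per-domain accuracies of our method on PACS (Table 13).
pacs_mean = mean_accuracy(
    {"photo": 99.94, "art-painting": 98.86, "cartoon": 99.32, "sketch": 94.53}
)
print(round(pacs_mean, 2))
```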

The experiments on a single training distribution follow the domain generalization setting in Zhou et al. [86], where the model is trained on ImageNet (1,000 categories) and evaluated on four variants with the same label space: ImageNet-V2 [57], ImageNet-(S)ketch [72], ImageNet-A [26], and ImageNet-R [25]. Most of the above datasets have shifts in the images, i.e., in the marginal input distribution $p(\mathbf{x})$. Therefore, we use these datasets to evaluate our method under covariate shift.

Label shift. We conduct the experiments on label shift following the base-to-new classification setting in Zhou et al. [87]. In this case, the distribution shifts occur in the marginal output distribution $p(\mathbf{y})$, where the ``new'' classes have $p(\mathbf{y}_{c}){=}0$ during training. We use eleven benchmarks with label shift. The benchmarks include the general classification datasets ImageNet [8] and Caltech101 [15]; the fine-grained classification datasets OxfordPets [53], StanfordCars [31], Flowers102 [48], Food101 [5], and FGVCAircraft [45]; the scene recognition dataset SUN397 [76]; the action recognition dataset UCF101 [67]; the texture classification dataset DTD [7]; and the satellite image recognition dataset EuroSAT [23]. We follow the same base-new class split and evaluation set as Zhou et al. [86].

Concept shift. We approximate the concept shift by relabeling the ImageNet dataset with the superclasses in [62]. The model is trained on the original classes and evaluated on the superclasses. In this case, the marginal input distribution $p(\mathbf{x})$ is the same while the conditional distributions $p(\mathbf{y}|\mathbf{x})$ are different between training and test data.

Conditional shift. For conditional shift, we evaluate the proposed method on two subpopulation datasets, Living-17 and Entity-30 [62], which contain images of 17 animal categories and images of 30 entities, respectively. We follow the training and test split in [17], where the training and test distributions have the same overall classes but contain different subpopulations of those classes. In this case, the marginal output distributions $p(\mathbf{y})$ of training and test data are the same, while the input distributions are changed according to different categories, i.e., $p(\mathbf{x}|\mathbf{y})$ are different. Therefore, we treat the setting as conditional shift.

Joint shift. To evaluate the proposed method on joint shift, we conduct experiments on Office-Home under the open domain generalization setting [66], which we refer to as Open-Office-Home. We split the label space of the 65 classes and make various label spaces across different domains. The split of classes is shown in Table 7. Therefore, there are both covariate shift and label shift between the training and test distributions, which we treat as the joint shift on $p(\mathbf{x},\mathbf{y})$ .

Domains	Classes
Source 1	0 - 2, 3 - 8, 9 - 14, 21 - 31
Source 2	0 - 2, 3 - 8, 15 - 20, 32 - 42
Source 3	0 - 2, 9 - 14, 15 - 20, 43 - 53
Target	0 - 64

Table 7: Class split for joint distribution shifts on Open-Office-Home. We use numbers to denote the class names. The setting contains both covariate and label shifts, leading to joint shifts on $p(\mathbf{x},\mathbf{y})$.
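
The class ranges in Table 7 can be expanded programmatically to check how the label spaces overlap. The short script below is only a convenience sketch: the ranges are copied from Table 7, while the helper name `expand` and the variable names are illustrative.

```python
def expand(ranges):
    # Turn inclusive (low, high) class-index ranges into a set of class indices.
    classes = set()
    for low, high in ranges:
        classes.update(range(low, high + 1))
    return classes

# Class ranges copied from Table 7.
splits = {
    "Source 1": expand([(0, 2), (3, 8), (9, 14), (21, 31)]),
    "Source 2": expand([(0, 2), (3, 8), (15, 20), (32, 42)]),
    "Source 3": expand([(0, 2), (9, 14), (15, 20), (43, 53)]),
    "Target": expand([(0, 64)]),
}

sources = [splits[name] for name in ("Source 1", "Source 2", "Source 3")]
seen = set().union(*sources)          # classes present in at least one source domain
shared = set.intersection(*sources)   # classes present in every source domain
unseen = splits["Target"] - seen      # target classes never seen during training

print(len(seen), sorted(shared), sorted(unseen))
```

Under this split, each source domain covers 26 classes, only classes 0 to 2 appear in all three source domains, and classes 54 to 64 appear only in the target domain.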

B.2 Implementations and hyperparameters

For all experiments, we train and evaluate the model on a single NVIDIA V100 GPU. We use the same backbone and transformer inference network for all datasets. The backbone is the frozen CLIP model with ViT-B/16 as the image encoder. The transformer inference network consists of a 2-layer transformer and 2 MLP layers that generate the distribution of the test prompt. There are also two trainable vectors serving as the mean and variance of the probabilistic training prompt, as well as trainable position embeddings for the image and text features, respectively. The sampled test prompt is then fed into both the image and text encoders to generalize the features and classifiers. We provide an illustration in Figure 6. Note that the test prompt is used as a token in the image and text encoders. To match the size of the encoder inputs, we use two linear layers to project the test prompt to the image patch embedding space and the text embedding space, respectively.
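
The projection of the test prompt into the two encoders can be pictured with the shape-level sketch below. It is not the actual implementation: the prompt dimension, the choice to append the projected prompt as one extra token, the length of the toy text sequence, and the random weights are all assumptions for illustration. The token widths of 768 (image) and 512 (text) are those of the standard CLIP ViT-B/16 model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: prompt dimension is hypothetical; 768 and 512 are the
# token widths of the CLIP ViT-B/16 image and text transformers.
PROMPT_DIM, IMAGE_WIDTH, TEXT_WIDTH = 512, 768, 512

def make_linear(in_dim, out_dim, rng):
    # A plain linear layer with random weights (stand-in for a trained layer).
    weight = rng.standard_normal((in_dim, out_dim)) * 0.02
    bias = np.zeros(out_dim)
    return lambda v: v @ weight + bias

to_image_token = make_linear(PROMPT_DIM, IMAGE_WIDTH, rng)
to_text_token = make_linear(PROMPT_DIM, TEXT_WIDTH, rng)

def add_prompt_token(tokens, prompt_token):
    # Assumption: the projected prompt is appended as one additional token.
    return np.concatenate([tokens, prompt_token[None, :]], axis=0)

v_t = rng.standard_normal(PROMPT_DIM)                   # sampled test prompt
image_tokens = rng.standard_normal((197, IMAGE_WIDTH))  # class token + 14x14 patches
text_tokens = rng.standard_normal((16, TEXT_WIDTH))     # toy tokenized class name

image_input = add_prompt_token(image_tokens, to_image_token(v_t))
text_input = add_prompt_token(text_tokens, to_text_token(v_t))
print(image_input.shape, text_input.shape)
```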

	ImageNet	Caltech101	OxfordPets	StanfordCars	Flowers102	Food101	FGVC	SUN397	DTD	EuroSAT	UCF101
Learning rate	$2e-3$
Optimizer	SGD
Batch Size	1	4	8	6	4	4	4	2	8	10	4
Epochs	10	30	30	30	30	30	30	30	30	30	30

Table 8: Dataset-specific hyper-parameters for label shift datasets and ImageNet-based datasets. The ImageNet-based covariate shift, label shift, and concept shift datasets use the same hyperparameters.

	PACS	VLCS	Office-Home	Open-Office-Home	DomainNet	Living-17	Entity-30
Learning rate	$5e-4$
Optimizer	Adam
Training iterations	3,000 iterations				10,000 iterations	30 epochs
Batch Size	32	32	8	8	2	32	16

Table 9: Dataset-specific batch sizes for common domain generalization datasets and conditional shift datasets.

		Accuracy
Method	Iterations	Art	Clipart	Product	Real	Mean
CLIP baseline	-	79.32	67.70	86.93	87.46	80.35
Transformer adapter	20,000	78.76	64.62	87.98	84.83	79.05
Any-shift prompt	3,000	83.40	72.53	91.24	90.84	84.50

Table 10: Benefits of generalization with any-shift prompting. Directly training a transformer as an adapter of the image and textual features still easily leads to overfitting. By aggregating the training, test, and relationship information into the prompt, any-shift prompting achieves better generalization.

Inference network	Art	Clipart	Product	Real	Mean
CLIP baseline	79.32	67.70	86.93	87.46	80.35
Averaging	82.27	70.91	89.95	89.66	83.20
MLP	82.48	71.09	90.18	89.73	83.37
Transformer	83.40	72.53	91.24	90.84	84.50

Table 11: Ablations on the aggregation methods. The transformer inference network performs best since it better encodes the relationships between different information.

	Source	Target
	ImageNet	Caltech101	OxfordPets	StanfordCars	Flowers102	Food101	FGVCAircraft	SUN397	DTD	EuroSAT	UCF101	Average
CoOp [87]	71.51	93.70	89.14	64.51	68.71	85.30	18.47	64.15	41.92	46.39	66.55	63.88
CoCoOp [86]	71.02	94.43	90.14	65.32	71.88	86.06	22.94	67.36	45.73	45.37	68.21	65.74
TPT [65]	68.98	94.16	87.79	66.87	68.98	84.67	24.78	65.50	47.75	42.44	68.04	65.10
BPL [9]	70.70	93.67	90.63	65.00	70.90	86.30	24.93	67.47	46.10	45.87	68.67	65.95
MaPLe [29]	70.72	93.53	90.49	65.57	72.23	86.20	24.74	67.01	46.49	48.06	68.69	66.30
This paper	71.05	94.57	90.79	66.90	72.30	86.17	25.16	67.32	47.35	50.25	69.52	67.03

Table 12: Comparison of prompt learning methods in the cross-dataset transfer setting. Our method achieves the best overall performance on 10 test datasets.

In addition to the architecture and settings shared by all datasets, we provide the dataset-specific hyperparameters. Batch size is a hyperparameter that varies per dataset (Tables 8 and 9). For the experiments on label shift (eleven datasets) and the others based on ImageNet (ImageNet-based covariate shift and concept shift), we use the same learning rate $2e-3$ as Zhou et al. [86] with SGD. The dataset-specific batch sizes and epochs are provided in Table 8. For the covariate shift datasets PACS, VLCS, Office-Home, DomainNet and the joint shift dataset Open-Office-Home, we train the model with a learning rate of $5e-4$ for 3000 iterations using the Adam optimizer. For the conditional shift datasets Living-17 and Entity-30, we use the same learning rate $5e-4$ and the Adam optimizer for 30 epochs. The details are shown in Table 9.

Appendix C More ablations and comparisons

Benefits of generalization with prompts

In our any-shift prompting, we generate the test prompt by aggregating the training information and the test information with a transformer inference network. The test information comes from the image and textual features of the CLIP model. Instead of generating a prompt for the CLIP model, another way to achieve generalization is to directly adapt the image and textual features with the transformer network and make predictions from the adapted features. To show the benefits of generalization with our any-shift prompting, we conduct an experiment that adapts the image and textual features using the same transformer inference network, which we refer to as ``Transformer adapter''. The experimental results on Open-Office-Home are reported in Table 10. The transformer adapter performs even worse than the CLIP baseline, since it still easily overfits the training distribution. Moreover, the transformer adapter requires far more training (20,000 iterations) than any-shift prompting (3,000 iterations). The results demonstrate both the effectiveness and efficiency of our any-shift prompting for generalization across distribution shifts.

Method	Photo	Art	Cartoon	Sketch	Mean
CLIP	99.94	97.41	98.98	88.19	96.13
CLIP-D	99.94	97.61	99.02	90.03	96.65
CoOp	99.70	97.56	98.59	89.95	96.45
CoCoOp	99.94	98.09	99.19	90.77	97.00
TPT	99.82	97.68	98.92	92.58	97.25
This paper	99.94	98.86	99.32	94.53	98.16 $\pm$ 0.4

Table 13: Detailed comparisons on PACS with covariate shift.

Method	VOC	LabelMe	Caltech	SUN	Mean
CLIP	84.32	68.26	98.61	74.52	81.43
CLIP-D	82.60	68.76	98.76	72.68	80.70
CoOp	85.86	68.51	98.94	76.72	82.51
CoCoOp	86.03	70.45	99.12	77.96	83.39
TPT	86.20	71.05	99.46	80.60	84.33
This paper	88.14	72.65	100.00	85.37	86.54 $\pm$ 0.4

Table 14: Detailed comparisons on VLCS with covariate shift.

Method	Art	Clipart	Product	Real	Mean
CLIP	79.32	67.70	86.93	87.46	80.35
CLIP-D	80.47	68.83	87.93	88.80	81.51
CoOp	80.99	69.52	88.69	89.28	82.12
CoCoOp	81.78	70.09	89.32	89.89	82.77
TPT	82.45	71.18	90.03	90.15	83.45
This paper	83.70	73.00	92.50	91.44	85.16 $\pm$ 0.6

Table 15: Detailed comparisons on Office-Home.

Method	Clipart	Painting	Real	Infograph	Quickdraw	Sketch	Mean
CLIP	68.12	56.18	78.82	46.36	14.32	60.69	54.08
CLIP-D	70.83	58.02	80.52	48.85	16.39	62.84	56.24
CoOp	74.39	61.18	83.26	51.88	16.67	65.52	58.82
CoCoOp	74.82	61.56	83.98	52.68	17.47	66.10	59.43
TPT	75.09	62.77	84.67	52.65	17.28	66.98	59.90
This paper	76.08	66.62	85.03	52.56	18.05	67.26	60.93 $\pm$ 0.4

Table 16: Detailed comparisons on DomainNet.

Benefits of the transformer inference network. We also conduct experiments on Open-Office-Home with different methods for aggregating the training and test information. As the simplest alternative, we generate the test prompt by directly averaging the training prompts, the test image feature, and the textual features. In addition, we replace the transformer network with an MLP network that generates the test prompt from the averaged features. As shown in Table 11, the transformer inference network achieves the best performance, demonstrating the effectiveness of considering the relationships between different information for aggregation.
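
The difference between averaging and relation-aware aggregation can be seen in a toy sketch. Below, a single self-attention layer stands in for the transformer inference network; this is a simplification made for illustration, not the 2-layer architecture used in the experiments. The token layout, dimensions, and function names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_by_averaging(tokens):
    # Every token contributes equally, regardless of its content.
    return tokens.mean(axis=0)

def aggregate_by_attention(tokens, w_q, w_k, w_v):
    # Single-head self-attention: the weights depend on pairwise token similarity.
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    attended = weights @ v
    # Read out at the first token, assumed here to be the training prompt.
    return attended[0], weights

rng = np.random.default_rng(0)
dim = 8
# Assumed token layout: [training prompt, test image feature, 3 class-name features].
tokens = rng.standard_normal((5, dim))

w_q, w_k, w_v = (rng.standard_normal((dim, dim)) * 0.5 for _ in range(3))
prompt, weights = aggregate_by_attention(tokens, w_q, w_k, w_v)

# With zero query/key projections the attention weights become uniform,
# so attention collapses to plain averaging.
zeros, identity = np.zeros((dim, dim)), np.eye(dim)
uniform_prompt, uniform_weights = aggregate_by_attention(tokens, zeros, zeros, identity)
print(np.allclose(uniform_prompt, aggregate_by_averaging(tokens)))
```

Averaging is thus a special case of attention with content-independent weights, which offers one way to read the gap between the "Averaging" and "Transformer" rows of Table 11.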

Comparison on cross-dataset shift. Following Zhou et al. [86], we conduct experiments on the cross-dataset setting, where the model trained on ImageNet is evaluated on the other 10 datasets shown in Table 12. In this case, there are different distribution shifts for different test datasets. Compared with the other prompt learning methods, e.g., CoOp [87], CoCoOp [86], BPL [9], MaPLe [29], and test-time tuning method TPT [65], our method shows improvement on 8 of the 10 datasets, as well as the averaged result.

Detailed results on covariate shift. We also report the detailed comparisons for each test distribution on the four covariate shift datasets. The results on PACS, VLCS, Office-Home, and DomainNet are provided in Tables 13, 14, 15, and 16, respectively. Our method achieves the best performance on most of the test distributions.

Inference efficiency. Our method uses only a single feedforward pass to generate the test prompts and make predictions. Its inference time per iteration on a single V100 GPU (0.13s) is slightly higher than that of prompt tuning methods such as CoOp (0.10s) and CoCoOp (0.11s), but lower than that of TPT (0.25s), which performs a one-step optimization at test time.

		$\displaystyle p_{\Phi}(\mathbf{y}_{t}|\mathbf{x}_{t},\mathcal{Y}_{t},\mathcal{D}_{s})=\frac{1}{N_{t}}\frac{1}{N_{s}}\sum_{i=1}^{N_{t}}\sum_{j=1}^{N_{s}}p_{\Phi}(\mathbf{y}_{t}|\mathbf{x}_{t},\mathbf{v}_{t}^{(i)},\mathcal{Y}_{t}),$		(14)
		$\displaystyle\mathbf{v}_{t}^{(i)}\sim p_{\bm{\theta}}(\mathbf{v}_{t}|\mathbf{v}^{(j)}_{s},\mathbf{x}_{t},\mathcal{Y}_{t}),\quad\mathbf{v}_{s}^{(j)}\sim p(\mathbf{v}_{s}|\mathcal{D}_{s}).$
