SPTNet: An Efficient Alternative Framework for Generalized Category Discovery with Spatial Prompt Tuning

Hongjun Wang1     Sagar Vaze2     Kai Han1
1Visual AI Lab, The University of Hong Kong
2Visual Geometry Group, University of Oxford
[email protected]   [email protected]   [email protected]
Corresponding author.
Abstract

Generalized Category Discovery (GCD) aims to classify unlabelled images from both ‘seen’ and ‘unseen’ classes by transferring knowledge from a set of labelled ‘seen’ class images. A key theme in existing GCD approaches is adapting large-scale pre-trained models for the GCD task. An alternate perspective, however, is to adapt the data representation itself for better alignment with the pre-trained model. As such, in this paper, we introduce a two-stage adaptation approach termed SPTNet, which iteratively optimizes model parameters (i.e., model-finetuning) and data parameters (i.e., prompt learning). Furthermore, we propose a novel spatial prompt tuning method (SPT) which considers the spatial property of image data, enabling the method to better focus on object parts, which can transfer between seen and unseen classes. We thoroughly evaluate our SPTNet on standard benchmarks and demonstrate that our method outperforms existing GCD methods. Notably, we find our method achieves an average accuracy of 61.4% on the SSB, surpassing prior state-of-the-art methods by approximately 10%. The improvement is particularly remarkable as our method yields extra parameters amounting to only 0.117% of those in the backbone architecture. Project page: https://visual-ai.github.io/sptnet.

1 Introduction

Deep learning models have been extensively studied in image recognition He et al. (2016); Krizhevsky et al. (2017), typically relying on large-scale annotated data, as well as a ‘closed-world’ assumption: that the data to be classified shares the same classes as the labelled training data. However, this assumption limits application to real-world scenarios where the target data contains images from ‘unseen’ classes alongside ‘seen’ ones Han et al. (2019; 2020; 2021); Fini et al. (2021); Wen et al. (2023); Jia et al. (2021); Zhao & Han (2021). Recently, Category Discovery (CD) has emerged as a practical open-world learning problem, where a model trained using partially labelled data is tasked with categorizing unlabelled data that may originate from unseen categories. Initially, it was studied as Novel Category Discovery (NCD) Han et al. (2019), focusing on unlabelled data exclusively from unseen categories. Subsequently, it was extended to Generalized Category Discovery (GCD) Vaze et al. (2022), encompassing unlabelled data from both seen and unseen categories.

State-of-the-art GCD methods Vaze et al. (2022); Cao et al. (2022); Wen et al. (2023) employ pre-trained self-supervised models, such as DINO Caron et al. (2021), and partially fine-tune their parameters on the target task, taking advantage of the strong generalization properties of these representations. In this paradigm, data remains fixed while iterating over the model. However, fully fine-tuning a large pre-trained model can lead to overfitting to the labelled data, and is computationally expensive. Instead of focusing solely on the model, we find that alternatives which manipulate the data to cater to the model are both more efficient and achieve better GCD performance. Specifically, visual prompting methods (e.g., Jia et al. (2022); Bahng et al. (2022)) have recently been explored to improve model capability by modifying the input or intermediate features through the addition of extra learnable tokens. Although these methods are effective in fully supervised learning, they do not improve representations for generalization and struggle to achieve satisfactory performance in the open-world GCD task. A natural approach to integrating both advantages is to simultaneously optimize the model and data parameters. However, this non-convex bilevel optimization often leads to sub-optimal solutions for both sets of parameters.

Inspired by the expectation–maximization (EM) algorithm Dempster et al. (1977) and decomposition techniques for bilevel optimization Engelmann et al. (2020); Byeon & Van Hentenryck (2022), we introduce a two-stage iterative learning framework called SPTNet for GCD, optimizing both model parameters (i.e. model-finetuning) and data parameters (i.e. prompt learning). More specifically, the framework includes two phases: (1) In the first phase, the backbone model is frozen, and only the prompts are adjusted. (2) In the second phase, we fix the prompt parameters and update the backbone model with a contrastive loss, using an augmented data pair constructed by the raw image together with its prompted version. The prompts and model are alternately trained until convergence. In this way, our learned prompt can be considered as a learned augmentation, targeted for the downstream recognition task (see Fig. 1).

Following arguments in the GCD literature Vaze et al. (2022) that object parts are an effective vehicle to transfer knowledge between ‘seen’ and ‘unseen’ categories, we propose Spatial Prompt Tuning (SPT), which learns pixel-level prompts around local image regions. Unlike previous methods (e.g., Jia et al. (2022); Bahng et al. (2022)) that introduce learnable tokens to the hidden model space, or wrap prompts around the entire image border, SPT divides the original image into patches and attaches prompts to each patch in pixel space. The objective of SPT is to achieve improved alignment between the large pre-trained model and discriminative image regions in the target task. We conduct experiments on seven datasets using the standard evaluation protocol in the GCD setting. Our method achieves an average accuracy of 61.4%, which is higher than the previous state-of-the-art methods by around 10%, in proportional terms, on the SSB benchmark Vaze et al. (2021). Remarkably, this improvement is achieved by introducing only 0.117% extra parameters compared to all ViT-Base parameters, demonstrating the efficiency and effectiveness of our approach.

Our contributions can be summarized as follows: (1) We introduce a two-stage iterative learning framework called SPTNet, which integrates the advantages of learning both model parameters (i.e., model-finetuning) and data parameters (i.e., prompt learning) for GCD. (2) We propose a new spatial prompt method (called SPT) to adapt the data representation for better alignment with the pre-trained model. The method learns independent prompts for different spatial regions and introduces only 0.039% additional parameters compared to all ViT-Base parameters. (3) We conduct comprehensive evaluations of our method on seven datasets, including three generic (i.e., CIFAR-10, CIFAR-100, and ImageNet-100) and four fine-grained benchmarks (CUB, Stanford Cars, FGVC-Aircraft, and Herbarium19). Our method outperforms state-of-the-art methods in most cases.

2 Related work

Semi-supervised learning (SSL) alleviates the shortage of labelled training data by learning from both labelled and unlabelled data from predefined classes to obtain a strong classification model. Consistency-based approaches, including Mean-teacher Tarvainen & Valpola (2017), Mixmatch Berthelot et al. (2019) and Fixmatch Sohn et al. (2020), operate by enforcing model prediction consistency under various perturbations of the unlabelled data or over the course of training. Recent methods, such as Chen et al. (2020b; c; 2021), have shown improved SSL performance by introducing contrastive learning (e.g., Chen et al. (2020a); He et al. (2020)). Several works Wang et al. (2022b); Rizve et al. (2022); Wang et al. (2024); Sun et al. (2024) extend the standard SSL to an open-world setting.

Novel category discovery (NCD) aims at categorizing unlabelled images from unseen classes by transferring knowledge from labelled data of seen classes Han et al. (2019). Various approaches have been proposed to address NCD. For example, Han et al. (2019) introduces a two-stage training method, which first utilizes metric learning, followed by learning to cluster the unlabelled data. Han et al. (2020; 2021); Zhao & Han (2021) utilize ranking statistics to generate pseudo positives among unlabelled novel classes. Zhong et al. (2021b) transfers semantic knowledge through MixUp augmentation between seen and novel classes, as well as reliable novel anchors with other examples. Zhong et al. (2021a) proposes a neighborhood contrastive loss and hard-negative generation process by mixing novel and seen classes. Fini et al. (2021) reformulates the NCD problem into classification based on dynamic class assignments using the Sinkhorn-Knopp algorithm. Jia et al. (2021) addresses multi-modal NCD by inter- and intra-modal contrastive learning with permutation-ensembled ranking statistics. Gu et al. (2023) proposes a novel knowledge distillation framework, which utilizes a class-relation representation to regularize the learning of novel classes.

Generalized Category Discovery (GCD) extends NCD by categorizing unlabelled images from both seen and unseen categories (Vaze et al. (2022)). Vaze et al. (2022) tackles this issue by tuning the representation of the pre-trained ViT model with DINO (Caron et al. (2021); Oquab et al. (2024)) with contrastive learning, followed by semi-supervised $k$-means clustering. ORCA Cao et al. (2022) considers the problem from a semi-supervised learning perspective and introduces an adaptive margin loss for better intra-class separability for both seen and unseen classes. CiPR Hao et al. (2024) introduces a method for more effective contrastive learning and a hierarchical clustering method for GCD without requiring the category number in the unlabelled data to be known a priori. SimGCD Wen et al. (2023) proposes a parametric method with entropy regularization to improve performance. DCCL Pu et al. (2023) improves clustering accuracy by alternating between estimating underlying visual conceptions and learning conceptional representations. They also introduce a dynamic conception generation and update mechanism to ensure consistent conception learning. PromptCAL Zhang et al. (2023) introduces a two-stage framework that iteratively generates and refines affinity graphs based on the model’s current understanding of the data to enhance the semantic discriminativeness of pre-trained vision transformers. GPC Zhao et al. (2023) proposes a GMM-based method that can jointly learn robust representation for GCD and estimate the unknown category number. We also note the concurrent work Vaze et al. (2023) which improves GCD performance with a student-teacher mechanism.

Prompt learning, as the representative of data-parameter learning methods, simply prepends a few extra tokens to the input and provides an effective and efficient solution that matches the performance of full fine-tuning; it is commonly used in Natural Language Processing (NLP). Recently, prompt learning has been used in vision tasks. Particularly, Visual Prompt Tuning (VPT) Jia et al. (2022) has been introduced to optimize extra visual prompts on top of a pre-trained ViT backbone to achieve strong object recognition performance. Bahng et al. (2022) learns an additional “border” of input images as prompts to adapt large-scale pre-trained models, which improves the models’ classification accuracy. There are also some works which utilize prompts to deal with different tasks, such as classification with imbalanced data Dong et al. (2022) or domain shift Wang et al. (2022a). Shtedritski et al. (2023); Khattak et al. (2023) offer the possibility of manipulating both textual and visual modalities through prompting.

Figure 1: The overall framework of SPTNet. SPTNet alternates between data parameter tuning (stage one) and model parameter tuning (stage two). The data parameters are learnable prompts, for which we introduce spatial prompts $P_s$. The model parameters include the parameters of the top layer of the Transformer backbone $\mathcal{F}$ and a projection head $\mathcal{H}$.

3 Methods

3.1 Preliminaries

Problem statement. Assume that we have the open-world dataset $\mathcal{D}$, comprising two subsets: a labelled set $\mathcal{D}_l=\{(X_i,y_i)\}_{i=1}^{N_l}\subset\mathcal{X}_l\times\mathcal{Y}_l$ and an unlabelled set $\mathcal{D}_u=\{X_i\}_{i=1}^{N_u}\subset\mathcal{X}_u$, where $X_i\in\mathbb{R}^{3\times H\times W}$, and $H$ and $W$ are the height and width of the image. $\mathcal{Y}_l=\mathcal{C}_1$ and $\mathcal{Y}_u=\mathcal{C}=\mathcal{C}_1\cup\mathcal{C}_2$ are the label spaces of the labelled and the unlabelled samples. $\mathcal{C}$, $\mathcal{C}_1$, and $\mathcal{C}_2$ denote the label sets for ‘All’, ‘Old’, and ‘New’ categories, respectively. The objective of GCD is to categorize all the unlabelled images in $\mathcal{D}_u$, having access to labels only in $\mathcal{D}_l$. For simplicity, hereafter we omit the subscript for each image $X_i$.
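As a small illustration of the problem statement, the label spaces can be written out as plain sets. This is only a toy sketch: the class ids, image ids, and the `label_space` helper are hypothetical, and the disjointness of the ‘Old’ and ‘New’ sets is an assumption of the example rather than something stated above.

```python
# Toy illustration of the GCD label spaces (all ids below are made up).
old_classes = {0, 1, 2}                   # C_1: 'Old' classes, labels available
new_classes = {3, 4}                      # C_2: 'New' classes (assumed disjoint from C_1)
all_classes = old_classes | new_classes   # C = C_1 ∪ C_2

# Labelled set D_l: (image_id, label) pairs with labels from C_1 only.
labelled = [("img0", 0), ("img1", 1), ("img2", 2)]

# Unlabelled set D_u: images only; the hidden ground truth may come from all of C.
unlabelled = ["img3", "img4", "img5", "img6"]
hidden_truth = {"img3": 0, "img4": 3, "img5": 4, "img6": 2}


def label_space(pairs):
    """Return the set of labels that occur in a list of (image, label) pairs."""
    return {y for _, y in pairs}
```

The point of the sketch is that a GCD method sees `labelled` and `unlabelled`, never `hidden_truth`, and must assign every unlabelled image to some class in `all_classes`.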

Architecture. We consider a parametric model consisting of a feature extractor $\mathcal{F}$ and a projection head $\mathcal{H}$. For an image $X$ from $\mathcal{D}$, we can obtain its class prediction by $\hat{y}=\mathcal{H}(\mathcal{F}(X))$. Specifically, we employ a Vision Transformer (ViT) Dosovitskiy et al. (2020) as the architecture. We consider the Transformer encoder as $\mathcal{F}$ and a simple multilayer perceptron (MLP) as $\mathcal{H}$. In ViT, an image $X$ is first divided into $n$ patches $(x^1,\cdots,x^n)$, where $x^i\in\mathbb{R}^{3\times h\times w}$ and $n=(H\times W)/(h\times w)$. The patches are then mapped into $d$-dimensional latent space by a shared linear projection layer $\operatorname{e}$. Together with an extra learnable classification token CLS, the full model is formulated as:

$$
\begin{aligned}
(x^1,\cdots,x^n) &= \phi(X) \\
E_0 &= [\operatorname{e}(x^1);\ldots;\operatorname{e}(x^n)] \\
[\textit{CLS}_i;E_i] &= L_i([\textit{CLS}_{i-1};E_{i-1}]) \\
\hat{y} &= \mathcal{H}(\textit{CLS}_M),
\end{aligned}
\tag{1}
$$

where $\phi(\cdot)$ denotes the “patchify” operator that divides the input image $X$ into patches $(x^1,\cdots,x^n)$, $L_i$ denotes the $i$-th layer of the ViT, and $[\cdot]$ denotes concatenation.
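The patchify operator $\phi$ and the shared projection $\operatorname{e}$ in Eq. 1 can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names `patchify` and `embed`, the random projection weight, and any concrete sizes chosen by the caller are assumptions of the example.

```python
import numpy as np


def patchify(image, h, w):
    """Split a (3, H, W) image into n = (H*W)/(h*w) patches of shape (3, h, w)."""
    c, H, W = image.shape
    assert H % h == 0 and W % w == 0, "patch size must divide the image size"
    gh, gw = H // h, W // w  # number of patches along height and width
    # (c, gh, h, gw, w) -> (gh, gw, c, h, w) -> (n, c, h, w), row-major over the grid
    patches = image.reshape(c, gh, h, gw, w).transpose(1, 3, 0, 2, 4)
    return patches.reshape(gh * gw, c, h, w)


def embed(patches, weight):
    """Shared linear projection e: flatten each patch and map it to d dimensions."""
    flat = patches.reshape(patches.shape[0], -1)  # (n, 3*h*w)
    return flat @ weight                          # (n, d), i.e. E_0 in Eq. 1
```

For instance, a $3\times224\times224$ input with $16\times16$ patches (a common ViT configuration, assumed here only for illustration) gives $n=196$ patches.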

Model/Data parameters. We consider optimizing both model parameters and data parameters for GCD. Optimizing the model parameters is the most common way to train or fine-tune a model by minimizing the loss function on a given dataset. Previous GCD methods, such as Vaze et al. (2022); Wen et al. (2023); Pu et al. (2023); Zhang et al. (2023), fine-tune the final transformer block and the linear projection layer of a pre-trained ViT model. In our method, the model parameters are in $\mathcal{F}$ and $\mathcal{H}$. Recently, visual prompt learning techniques have been introduced to effectively adapt pre-trained large-scale models to different downstream tasks, without the need to tune the model parameters. We refer to such techniques as optimizing data parameters. Particularly, VPT Jia et al. (2022) inserts a sequence of learnable embeddings in the input for each Transformer encoder layer $L_i$ of the ViT model. Specifically, VPT freezes the feature extractor $\mathcal{F}$ but learns a set of prompt tokens $P_i=\{p^j_i;\ j=1,\cdots,b\}$ with $p^j_i\in\mathbb{R}^d$ as part of the input for layer $L_i$. The input can be denoted as $[\textit{CLS}_i;P_i;E_i]$. Instead of inserting several tunable parameters to each layer, Bahng et al. (2022) attaches learnable parameters $P_g$ to the border of the raw input image as $(x^{\prime 1},\cdots,x^{\prime n})=\phi(X+P_g)$. We propose Spatial Prompt Tuning (SPT) for GCD, as will be described in Section 3.3, which attaches a learnable prompt for each image patch. Namely, we learn a set of prompts $P_s=\{p^j;\ j=1,\cdots,n\}$ and attach the prompts to the input image patches by $(x^1+p^1,\cdots,x^n+p^n)=\phi(X)+P_s$.
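The three kinds of data parameters described above differ mainly in where the prompt is attached. The following NumPy sketch contrasts them under simplifying assumptions: the function names, the border width `m`, and the choice to give each spatial prompt $p^j$ the full shape of its patch are hypothetical and are not taken from the paper.

```python
import numpy as np


def vpt_layer_input(cls_tok, P_i, E_i):
    """VPT-style prompting: concatenate b prompt tokens in token space, [CLS_i; P_i; E_i]."""
    return np.concatenate([cls_tok, P_i, E_i], axis=0)


def border_prompt(X, P_g, m):
    """Border prompting: add P_g only on a frame of width m >= 1 around the whole image."""
    mask = np.ones(X.shape[-2:])
    mask[m:-m, m:-m] = 0.0          # zero out the interior so only the border is prompted
    return X + P_g * mask           # X + P_g, to be patchified afterwards


def spatial_prompt(patches, P_s):
    """SPT-style prompting: add one prompt p^j to each patch x^j, i.e. phi(X) + P_s."""
    assert patches.shape == P_s.shape
    return patches + P_s
```

In this sketch the VPT prompt changes the token sequence length, whereas the border and spatial prompts leave the input shape unchanged and act purely in pixel space.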

3.2 SPTNet: an alternate prompt learning framework for GCD

Large-scale pretraining (e.g., DINO Caron et al. (2021) self-supervision) is the key ingredient in existing GCD methods Vaze et al. (2022); Wen et al. (2023); Pu et al. (2023); Zhang et al. (2023). As GCD requires learning from unlabelled data, contrastive self-supervised learning is the natural choice, which uses data augmentations to create different views of the same input image. These augmentations provide an inductive bias as to what is (not) semantically meaningful in an image. In this context, prompt tuning is a clear but unexplored option that enables efficient adaptation of pre-trained models. Our insight is that the learned prompt can also be used to generate a novel view, making it a suitable choice for the contrastive framework. Simultaneously optimizing the model and prompts seems appealing, but it results in instability and sub-optimal solutions for both data and model parameters.

To mitigate the issue, inspired by the EM algorithm Dempster et al. (1977), we propose SPTNet, a two-stage alternative learning framework for GCD. The overall framework is illustrated in Fig. 1. The learning objective includes both representation learning and parametric classification, while our framework alternates between data parameter and model parameter optimization using the same learning objective.

Specifically, in vanilla contrastive learning, two different views, $X$ and $X'$, of the same input image are constructed as a positive pair. A set of other images are drawn from the dataset as negative samples $\mathcal{N}(X)=\{X^-_q;\ q=1,\cdots,Q\}$. Then, the parameters of $\mathcal{F}$ can be updated by the InfoNCE loss Oord et al. (2018) using the data triplet $(X,X',\mathcal{N}(X))$:

$$
\mathcal{L}^{\rm un}_{\rm nce}(X,X',\mathcal{N}(X);\mathcal{F},\tau_u)=-\log\frac{\exp(\cos(\mathcal{F}(X),\mathcal{F}(X'))/\tau_u)}{\sum_{q=1}^{Q}\exp(\cos(\mathcal{F}(X),\mathcal{F}(X^-_q))/\tau_u)}, \tag{2}
$$

where $\cos(\cdot,\cdot)$ is the cosine similarity between embedding feature vectors and $\tau_u$ is a temperature hyperparameter. Analogous to Eq. 2, supervised contrastive loss Khosla et al. (2020) $\mathcal{L}^{\rm sup}_{\rm nce}(X,\mathcal{P}(X),\mathcal{N}(X),y;\mathcal{F},\tau_c)$ utilizes a set of positive samples $\mathcal{P}(X)$ having the same class label $y$ in the mini-batch.
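As a concrete reading of Eq. 2, the sketch below evaluates the loss for a single anchor from precomputed feature vectors. It follows the equation as printed, with the positive pair in the numerator and only the negatives in the denominator; many public InfoNCE implementations also include the positive term in the denominator. The function names and the temperature passed by the caller are illustrative assumptions.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def info_nce(z, z_pos, z_negs, tau):
    """Eq. 2 for one anchor: z = F(X), z_pos = F(X'), z_negs = [F(X_q^-) for q in 1..Q]."""
    pos = np.exp(cosine(z, z_pos) / tau)
    neg = sum(np.exp(cosine(z, z_neg) / tau) for z_neg in z_negs)
    return float(-np.log(pos / neg))
```

The loss decreases as the two views of the same image become more similar and as the anchor moves away from the negatives, which is the behaviour the contrastive stage relies on.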

Next, to assign labels to input instances, we use parametric methods to classify them into seen or new classes, as commonly done in image recognition. In supervised contrastive learning, this is achieved through the simultaneous optimization of $\mathcal{F}$ and $\mathcal{H}$ using the cosine-softmax cross-entropy loss Gidaris & Komodakis (2018):

$$
\mathcal{L}^{\rm sup}_{\rm cls}(X,y;\mathcal{H},\mathcal{F},\tau_s)=-\sum_{\kappa}y_{\kappa}\log\frac{\exp(\cos(\mathcal{H}(\mathcal{F}(X)),W_{\kappa})/\tau_s)}{\sum_{\kappa'=1}^{|C|}\exp(\cos(\mathcal{H}(\mathcal{F}(X)),W_{\kappa'})/\tau_s)}, \tag{3}
$$

where $\mathcal{H}(\mathcal{F}(X))$ and $W_{\kappa}$ are the $\ell_2$-normalized feature and the prototype vector of class $\kappa$, respectively. $X'$ is also used in the above loss, as an additional augmented version of $X$. For the unsupervised counterpart, $X'$ is removed from the input while the prediction $\hat{y}=\mathcal{H}(\mathcal{F}(X'))$ is used as a pseudo label for self-distillation. The loss can be denoted as $\mathcal{L}^{\rm un}_{\rm cls}(X,X';\mathcal{H},\mathcal{F},\tau_t)$. Therefore, the overall loss $\mathcal{L}$ can be written as:

$$
\mathcal{L}=(1-\lambda)(\mathcal{L}^{\rm un}_{\rm nce}+\mathcal{L}^{\rm un}_{\rm cls})+\lambda(\mathcal{L}^{\rm sup}_{\rm nce}+\mathcal{L}^{\rm sup}_{\rm cls})-\epsilon\Delta, \tag{4}
$$

where $\lambda$ and $\epsilon$ are the balance factors, and $\Delta$ represents the mean-entropy-maximisation regulariser Assran et al. (2022), computed by taking the entropy of the mean prediction of all samples in a mini-batch.
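To make the classification terms and their combination concrete, the sketch below implements the cosine-softmax cross-entropy of Eq. 3 (averaged over a batch), the mean-entropy term $\Delta$, and the weighted sum of Eq. 4. It is a hedged illustration only: function names are hypothetical, integer labels replace the one-hot $y$, batch averaging and the clipping constant are choices of this example, and no hyperparameter values from the paper are assumed.

```python
import numpy as np


def l2_normalize(v):
    """Normalize each row to unit l2 norm."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def cosine_softmax(features, prototypes, tau):
    """Softmax over cosine similarities between features H(F(X)) and prototypes W."""
    logits = l2_normalize(features) @ l2_normalize(prototypes).T / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability only
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


def sup_cls_loss(features, labels, prototypes, tau_s):
    """Eq. 3 averaged over a batch, using integer class labels."""
    probs = cosine_softmax(features, prototypes, tau_s)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked).mean())


def mean_entropy(probs):
    """Delta: entropy of the mean prediction over all samples in the mini-batch."""
    p_bar = np.clip(probs.mean(axis=0), 1e-12, None)  # avoid log(0)
    return float(-(p_bar * np.log(p_bar)).sum())


def total_loss(l_un_nce, l_un_cls, l_sup_nce, l_sup_cls, delta, lam, eps):
    """Eq. 4: balance unsupervised and supervised terms, minus the entropy regulariser."""
    return (1 - lam) * (l_un_nce + l_un_cls) + lam * (l_sup_nce + l_sup_cls) - eps * delta
```

Because $\Delta$ enters Eq. 4 with a negative sign, minimizing the overall loss pushes the batch-averaged prediction towards higher entropy, i.e. towards using all classes.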

Our SPTNet alternates between optimizing data parameters and model parameters as follows:

Stage one: Fix $\mathcal{F}$ & $\mathcal{H}$ and update $P_s$. In the first stage, we attach the same set of spatial prompts $P_s$ to the input images, $X$, $X'$, and $\mathcal{N}(X)$. The framework is trained with the loss in Eq. 4, while the image patches are replaced by their ‘prompted’ version. For example, $\phi(X)$ in Eq. 1 is replaced by $\phi(X)+P_s$, and the same applies to $X'$ and $\mathcal{N}(X)$. During training, we freeze the parameters of $\mathcal{F}$ & $\mathcal{H}$ and only update the prompt parameters of $P_s$. It is worth noting that, to facilitate the generalization, the weight decay for optimizing $P_s$ is set to zero to prevent prompts from being sparse. Meanwhile, during the learning in Stage two, our spatial prompting acts as a strong data augmentation. Increasing the variation in the parameters of $P_s$ leads to more diverse ‘prompted’ image pairs to benefit the representation learning. It was also noted in the literature that more diverse data augmentation is helpful for representation learning based on contrastive learning (e.g., HaoChen et al. (2021)).

Stage two: Fix $P_s$ and update $\mathcal{F}$ & $\mathcal{H}$. In the second stage, we freeze the prompt parameters $P_s$ and learn the parameters of $\mathcal{H}$ and the top layer in $\mathcal{F}$, again, with the loss in Eq. 4. With our spatial prompt learning as a strong augmentation, we aim to obtain a representation that can better distinguish samples from different classes, as the core mechanism of contrastive learning involves implicitly clustering samples from the same class together.

Different from prior works that apply only hand-crafted augmentations, we propose to consider prompting the input with learnable prompts, i.e., $\phi(X) + P_s$, as a new type of augmentation. The ‘prompted’ version of the input can be adopted by all loss terms. In this way, our framework can enjoy a learned augmentation that varies throughout the training process, enabling $\mathcal{F}$ to learn discriminative representations. Each stage optimizes the parameters for $k$ iterations.
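
The alternation between the two stages can be sketched as a simple schedule over training steps. This is a minimal illustration under stated assumptions, not the authors' training code: the function names `active_stage` and `schedule` are hypothetical, and starting with Stage one at step 0 is an assumption based only on the order in which the stages are presented.

```python
from typing import List


def active_stage(step: int, k: int) -> str:
    """Return which parameter group is trainable at a 0-indexed step.

    Assumption: training starts with Stage one (prompts), and the two
    stages swap every k steps.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # Even-numbered blocks of k steps update the prompts P_s (F and H frozen);
    # odd-numbered blocks update the model F and H (P_s frozen).
    return "prompt" if (step // k) % 2 == 0 else "model"


def schedule(num_steps: int, k: int) -> List[str]:
    """List the active parameter group for each of the first num_steps steps."""
    return [active_stage(t, k) for t in range(num_steps)]


if __name__ == "__main__":
    # With k = 2, the stages swap every two steps.
    print(schedule(6, 2))
```

In an actual training loop, the returned label would decide which parameter group receives gradient updates at that step while the other group stays frozen.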

3.3 Spatial Prompt Tuning

Naively applying existing prompt tuning methods does not lead to satisfying performance (see the 2nd and 3rd rows in Table 4). We speculate that prompts in the hidden model space rather than input space make it harder to align inputs within the contrastive framework, as evidenced by our empirical results in Fig. 5. Besides, a key insight in GCD is that object parts are effective in transferring knowledge between old and new categories Vaze et al. (2022).

Figure 2: (a) An example of applying Spatial Prompt Tuning (SPT) to an image with height $H$ and width $W$. For each image patch $x^j$ with height $h$ and width $w$, we attach spatial prompts $P_s$ of size $m$ to it. (b) Joint spatial and global prompts for SPTNet.

Therefore, we propose Spatial Prompt Tuning (SPT) to serve as a learned data augmentation that enables the model to focus on local image object regions, while adapting the data representation from the pre-trained ViT model and maintaining the alignment with it. In SPT, we inject a small number of learnable parameters into the input image patches and keep the backbone $\mathcal{F}$ and the projection head $\mathcal{H}$ frozen during training. Unlike existing methods that introduce learnable tokens into the hidden model space Jia et al. (2022) or wrap prompts around the entire image border Bahng et al. (2022), SPT divides the image into patches and attaches learnable prompts to each patch. Specifically, let $(x^1, \cdots, x^n) = \phi(X)$ be the set of image patches divided from image $X$. For each patch $x^j \in \mathbb{R}^{3 \times h \times w}$, SPT wraps instance-agnostic prompts $P_s$ around it in a rectangular shape with a width of $m$, as illustrated in Fig. 2 (a). Thus, there are $6m(h+w-2m)$ learnable parameters for the prompts of each patch. Our SPTNet alternates between the two stages and gradually learns the spatial prompts shared across all images. As revealed in Zhao & Han (2021), both global and local spatial information benefits novel category discovery. Therefore, apart from the SPT prompts, SPTNet also wraps an additional global prompt around the entire image, as in Bahng et al. (2022), as illustrated in Fig. 2 (b).
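
The per-patch parameter count $6m(h+w-2m)$ can be checked by counting the border positions of a patch directly. The sketch below is illustrative only: `border_mask` and `spt_param_count` are hypothetical helper names, and the patch size of $16 \times 16$ used in the example is taken from the ViT-B/16 backbone.

```python
def border_mask(h, w, m):
    """Boolean h x w grid that is True on a border of width m."""
    if m < 0 or 2 * m > min(h, w):
        raise ValueError("border width m must satisfy 0 <= 2m <= min(h, w)")
    return [
        [(r < m or r >= h - m or c < m or c >= w - m) for c in range(w)]
        for r in range(h)
    ]


def spt_param_count(h, w, m, channels=3):
    """Number of learnable prompt values for one patch (one value per
    border pixel per channel)."""
    mask = border_mask(h, w, m)
    return channels * sum(cell for row in mask for cell in row)


if __name__ == "__main__":
    h = w = 16  # patch size of a ViT-B/16 backbone
    for m in (1, 2, 4):
        counted = spt_param_count(h, w, m)
        closed_form = 6 * m * (h + w - 2 * m)
        print(m, counted, closed_form)
```

The closed form follows from the border area $hw - (h-2m)(w-2m) = 2m(h+w-2m)$ multiplied by the three colour channels.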

4 Experiments

4.1 Experimental setup

Datasets. We evaluate the effectiveness of SPT on three generic image recognition datasets (i.e., CIFAR-10/100 Krizhevsky et al. (2009) and ImageNet-100 Tian et al. (2020)), three fine-grained datasets (i.e., CUB Welinder et al. (2010), Stanford Cars Krause et al. (2013), and FGVC-Aircraft Maji et al. (2013)) contained in the Semantic Shift Benchmark (SSB) Vaze et al. (2021), and the challenging large-scale fine-grained dataset Herbarium-19 Tan et al. (2019). For each dataset, we first subsample $|\mathcal{C}_1|$ seen (labelled) classes from all classes. Following Vaze et al. (2022), we subsample 80% of the samples in CIFAR-100 and 50% of the samples in all other datasets from the seen classes to construct $\mathcal{D}_l$, while the remaining images are treated as $\mathcal{D}_u$. The statistics of the datasets can be found in Table 1.
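
The split construction described above can be sketched as follows. This is a minimal sketch, not the benchmark's official split code: the function name `make_gcd_split`, the `seed` argument, and the per-class rounding are assumptions made for illustration.

```python
import random
from collections import defaultdict


def make_gcd_split(labels, seen_classes, labelled_fraction=0.5, seed=0):
    """Split sample indices into a labelled set D_l and an unlabelled set D_u.

    D_l takes `labelled_fraction` of the samples of each seen class;
    D_u receives every remaining sample (seen and unseen classes).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    labelled = []
    for c in sorted(seen_classes):
        idxs = by_class.get(c, [])
        n = int(round(labelled_fraction * len(idxs)))  # rounding is an assumption
        labelled.extend(rng.sample(idxs, n))

    labelled_set = set(labelled)
    unlabelled = [i for i in range(len(labels)) if i not in labelled_set]
    return sorted(labelled), unlabelled


if __name__ == "__main__":
    # Toy data: 4 classes with 10 samples each; classes 0 and 1 are 'seen'.
    toy_labels = [c for c in range(4) for _ in range(10)]
    d_l, d_u = make_gcd_split(toy_labels, seen_classes={0, 1})
    print(len(d_l), len(d_u))
```

Note that the unlabelled set contains samples from both seen and unseen classes, which is what distinguishes GCD from NCD.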

Table 1: Dataset statistics and training configurations.
| Dataset | Labelled #Num | Labelled #Class | Unlabelled #Num | Unlabelled #Class | $\texttt{lr}_b$ | $\texttt{wd}_b$ | $\texttt{lr}_p$ | $\texttt{wd}_p$ | $k$ | $m$ |
|---|---|---|---|---|---|---|---|---|---|---|
| CIFAR-10 Krizhevsky et al. (2009) | 12.5K | 5 | 37.5K | 10 | 3e-3 | 5e-4 | 1.0 | 0 | 20 | 1 |
| CIFAR-100 Krizhevsky et al. (2009) | 20.0K | 80 | 30.0K | 100 | 1e-3 | 5e-4 | 1.0 | 0 | 20 | 1 |
| ImageNet-100 Tian et al. (2020) | 31.9K | 50 | 95.3K | 100 | 3e-3 | 5e-4 | 10.0 | 0 | 20 | 1 |
| Herbarium 19 Tan et al. (2019) | 8.9K | 341 | 25.4K | 683 | 3e-3 | 5e-4 | 10.0 | 0 | 20 | 1 |
| CUB Welinder et al. (2010) | 1.5K | 100 | 4.5K | 200 | 0.05 | 5e-4 | 25.0 | 0 | 20 | 1 |
| Stanford Cars Krause et al. (2013) | 2.0K | 98 | 6.1K | 196 | 0.05 | 5e-4 | 25.0 | 0 | 20 | 1 |
| FGVC-Aircraft Maji et al. (2013) | 1.7K | 50 | 5.0K | 100 | 0.05 | 5e-4 | 25.0 | 0 | 20 | 1 |

Evaluation protocol. We use clustering accuracy ($ACC$) to evaluate the model performance, as per standard practice. During the evaluation, we compare the ground-truth labels $y_i$ with the predicted labels $\hat{y}_i$ and measure the ACC by $ACC = \frac{1}{|\mathcal{D}_u|} \sum_{i=1}^{|\mathcal{D}_u|} \mathds{1}(y_i = \mathcal{G}(\hat{y}_i))$, where $\mathcal{G}$ represents the optimal permutation that gives the matching between the predicted labels and the ground truth.
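
To make the role of the permutation $\mathcal{G}$ concrete, the sketch below computes clustering accuracy by trying every assignment of predicted cluster ids to ground-truth class ids. This brute-force search is a stand-in chosen for self-containment and is only practical for a handful of classes; in practice the optimal matching is usually found with the Hungarian algorithm. The function name `clustering_accuracy` is hypothetical.

```python
from itertools import permutations


def clustering_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted cluster, after the best one-to-one
    relabelling G, matches the ground-truth class."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    pred_ids = sorted(set(y_pred))
    true_ids = sorted(set(y_true))
    n = max(len(pred_ids), len(true_ids))
    # Pad with None so every predicted cluster can receive a (possibly dummy) target.
    targets = true_ids + [None] * (n - len(true_ids))

    best = 0
    for perm in permutations(targets, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))  # candidate G: cluster id -> class id
        correct = sum(1 for t, p in zip(y_true, y_pred) if mapping[p] == t)
        best = max(best, correct)
    return best / len(y_true)


if __name__ == "__main__":
    y_true = [0, 0, 1, 1, 2, 2]
    y_pred = [1, 1, 0, 0, 2, 1]  # cluster ids are arbitrary
    print(clustering_accuracy(y_true, y_pred))
```

In the toy example, the best relabelling maps cluster 1 to class 0, cluster 0 to class 1, and cluster 2 to class 2, so five of the six samples are counted as correct.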

Implementation details. We develop our SPTNet upon the SimGCD Wen et al. (2023) baseline and apply spatial prompt tuning to the pre-trained ViT-B/16 backbone Caron et al. (2021). Specifically, we take the final feature corresponding to the CLS token from the backbone as the image feature, which has a dimension of 768. For the feature extractor $\mathcal{F}$, we only fine-tune the last block. We set the spatial prompt size $m$ to 1 and the global prompt size to 30, which is the default value in Bahng et al. (2022). It is worth noting that our method yields extra parameters amounting to only 0.117% of those in the backbone architecture (see Appendix A for details). The two stages alternate every $k=20$ iterations. All prompts are trained for 1,000 epochs with a batch size of 128. We use SGD as the optimizer, employing different learning rates ($\texttt{lr}_p$, $\texttt{lr}_b$) and weight decay parameters ($\texttt{wd}_p$, $\texttt{wd}_b$) to update the prompts and the model, respectively. The training hyper-parameters, determined on the validation data splits, are shown in Table 1. We set the balancing factor $\lambda$ to 0.35 and the temperature values $\tau_u$ and $\tau_c$ to 0.07 and 1.0, respectively, following Wen et al. (2023). For the temperature values $\tau_t$ and $\tau_s$ in the classification losses, we set them to 0.07 and 0.1, respectively. All experiments are conducted using an NVIDIA GeForce RTX 3090 GPU.

4.2 Main results

Evaluation on generic datasets. We evaluate SPTNet on three generic datasets, CIFAR-10, CIFAR-100 and ImageNet-100. We compare SPTNet with previous state-of-the-art methods and two concurrent methods (DCCL Pu et al. (2023) and PromptCAL Zhang et al. (2023)). The results are shown in Table 2. We can see that our method consistently outperforms previous state-of-the-art methods. Specifically, SPTNet surpasses the baseline SimGCD by 0.2% on CIFAR-10, 1.2% on CIFAR-100, and 2.4% on ImageNet-100 for ‘All’ classes; it also outperforms concurrent methods on both CIFAR-100 and ImageNet-100. SPTNet performs on par with PromptCAL on CIFAR-10 but with far fewer learnable parameters and shorter training time (see Table 13). Note that for CIFAR-10 and CIFAR-100, the images are of extremely low resolution ($32 \times 32$). As such, limited information is provided in each patch, leading to limited gains from our proposed (local) spatial prompting. On ImageNet-100, performance boosts are difficult to obtain, as the original DINO backbone is already highly tuned for this dataset. This is evidenced by the gains (usually) being substantially smaller between the previous state-of-the-art and the simple $k$-means on raw DINO features Vaze et al. (2022).

Table 2: Evaluation on three generic image recognition datasets. Bold values represent the best results, while underlined values represent the second best results.
| Method | CIFAR-10 All | CIFAR-10 Old | CIFAR-10 New | CIFAR-100 All | CIFAR-100 Old | CIFAR-100 New | ImageNet-100 All | ImageNet-100 Old | ImageNet-100 New |
|---|---|---|---|---|---|---|---|---|---|
| $k$-means Arthur & Vassilvitskii (2006) | 83.6 | 85.7 | 82.5 | 52.0 | 52.2 | 50.8 | 72.7 | 75.5 | 71.3 |
| RankStats+ Han et al. (2021) | 46.8 | 19.2 | 60.5 | 58.2 | 77.6 | 19.3 | 37.1 | 61.6 | 24.8 |
| UNO+ Fini et al. (2021) | 68.6 | 98.3 | 53.8 | 69.5 | 80.6 | 47.2 | 70.3 | 95.0 | 57.9 |
| GCD Vaze et al. (2022) | 91.5 | 97.9 | 88.2 | 73.0 | 76.2 | 66.5 | 74.1 | 89.8 | 66.3 |
| ORCA Cao et al. (2022) | 96.9 | 95.1 | 97.8 | 74.2 | 82.1 | 67.2 | 79.2 | 93.2 | 72.1 |
| SimGCD Wen et al. (2023) | 97.1 | 95.1 | 98.1 | 80.1 | 81.2 | 77.8 | 83.0 | 93.1 | 77.9 |
| DCCL Pu et al. (2023) | 96.3 | 96.5 | 96.9 | 75.3 | 76.8 | 70.2 | 80.5 | 90.5 | 76.2 |
| PromptCAL Zhang et al. (2023) | 97.9 | 96.6 | 98.5 | 81.2 | 84.2 | 75.3 | 83.1 | 92.7 | 78.3 |
| SPTNet (Ours) | 97.3 | 95.0 | 98.6 | 81.3 | 84.3 | 75.6 | 85.4 | 93.2 | 81.4 |

Evaluation on fine-grained datasets. Table 3 presents the results on fine-grained datasets including the SSB benchmark and the Herbarium 19 dataset. The unsatisfactory performance of $k$-means and ORCA highlights the difficulty in discovering fine-grained categories due to large intra-class and small inter-class variations. In contrast, SPTNet demonstrates superior performance to SimGCD, DCCL, and PromptCAL, achieving an average absolute improvement of $\sim$5% and an average proportional improvement of $\sim$10% across all evaluated datasets in SSB, specifically on ‘All’ classes. As there is a clear semantic axis in the SSB benchmark, and data augmentations implicitly define this ‘semantic axis’ or taxonomy in contrastive learning, SPT as a learned data augmentation ultimately enhances the GCD performance. This indicates that global and local prompts assist the model in focusing on details that dominate correctness in fine-grained recognition in GCD.

Table 3: Evaluation on the Semantic Shift Benchmark (SSB) and Herbarium 19. Bold values represent the best results, while underlined values represent the second best results.
| Method | CUB All | CUB Old | CUB New | Stanford Cars All | Stanford Cars Old | Stanford Cars New | FGVC-Aircraft All | FGVC-Aircraft Old | FGVC-Aircraft New | Herbarium 19 All | Herbarium 19 Old | Herbarium 19 New |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $k$-means Arthur & Vassilvitskii (2006) | 34.3 | 38.9 | 32.1 | 12.8 | 10.6 | 13.8 | 12.9 | 12.9 | 12.8 | 13.0 | 12.2 | 13.4 |
| RankStats+ Han et al. (2021) | 33.3 | 51.6 | 24.2 | 28.3 | 61.8 | 12.1 | 27.9 | 55.8 | 12.8 | 27.9 | 55.8 | 12.8 |
| UNO+ Fini et al. (2021) | 35.1 | 49.0 | 28.1 | 35.5 | 70.5 | 18.6 | 28.3 | 53.7 | 14.7 | 28.3 | 53.7 | 14.7 |
| GCD Vaze et al. (2022) | 51.3 | 56.6 | 48.7 | 39.0 | 57.6 | 29.9 | 45.0 | 41.1 | 46.9 | 35.4 | 51.0 | 27.0 |
| ORCA Cao et al. (2022) | 36.3 | 43.8 | 32.6 | 31.9 | 42.2 | 26.9 | 31.6 | 32.0 | 31.4 | 20.9 | 30.9 | 15.5 |
| SimGCD Wen et al. (2023) | 60.3 | 65.6 | 57.7 | 53.8 | 71.9 | 45.0 | 54.2 | 59.1 | 51.8 | 43.0 | 58.0 | 35.1 |
| DCCL Pu et al. (2023) | 63.5 | 60.8 | 64.9 | 43.1 | 55.7 | 36.2 | - | - | - | - | - | - |
| PromptCAL Zhang et al. (2023) | 62.9 | 64.4 | 62.1 | 50.2 | 70.1 | 40.6 | 52.2 | 52.2 | 52.3 | 37.0 | 52.0 | 28.9 |
| SPTNet (Ours) | 65.8 | 68.8 | 65.1 | 59.0 | 79.2 | 49.3 | 59.3 | 61.8 | 58.1 | 43.4 | 58.7 | 35.2 |
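
As a worked check of the averages quoted for SSB, the snippet below recomputes the mean ‘All’ accuracy over CUB, Stanford Cars and FGVC-Aircraft from Table 3, together with the absolute and proportional improvement. It compares SPTNet against SimGCD only; the helper name `improvements` is hypothetical.

```python
# 'All' accuracies (%) on the three SSB datasets, copied from Table 3.
simgcd = {"CUB": 60.3, "Stanford Cars": 53.8, "FGVC-Aircraft": 54.2}
sptnet = {"CUB": 65.8, "Stanford Cars": 59.0, "FGVC-Aircraft": 59.3}


def improvements(baseline, method):
    """Return (baseline mean, method mean, absolute gain, proportional gain in %)."""
    base_avg = sum(baseline.values()) / len(baseline)
    meth_avg = sum(method.values()) / len(method)
    absolute = meth_avg - base_avg
    proportional = 100.0 * absolute / base_avg
    return base_avg, meth_avg, absolute, proportional


if __name__ == "__main__":
    base_avg, meth_avg, absolute, proportional = improvements(simgcd, sptnet)
    print(round(base_avg, 1), round(meth_avg, 1))
    print(round(absolute, 1), round(proportional, 1))
```

The means come out at 56.1% for SimGCD and 61.4% for SPTNet, an absolute gain of about 5.3 points and a proportional gain of about 9.4%, in line with the $\sim$5% and $\sim$10% figures stated in the text.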

4.3 Ablation study

In this part, we primarily focus on the challenging SSB to assess the effectiveness of different components and report the averaged results among CUB, Stanford Cars, and FGVC-Aircraft. As our method employs the same pre-trained model and objective function as SimGCD, we consider SimGCD as the baseline method for comparison.

Effect of prompt-related techniques. We first experiment with the strong SimGCD baseline using the recommended configuration in Wen et al. (2023). Table 4 presents the results of the ablation study on the components of our SPT. The 2nd row shows the performance of adopting the VPT method on the pre-trained SimGCD. Compared with the raw SimGCD baseline in the 1st row, the performance drops after adopting VPT. After adopting the global prompt (3rd row) on the pre-trained SimGCD, the performance increases by 0.6% on ‘All’ classes. This indicates that naively applying existing prompt tuning methods does not yield satisfactory performance on GCD; the improvement from the global prompt, though marginal, is still encouraging, as it suggests that pixel-level prompting is better suited to the GCD setting than VPT. Adopting our SPT (4th row) on the pre-trained SimGCD gives a relatively larger improvement of 1.8% on ‘All’ classes. The effectiveness of our proposed method may be attributed to the spatial design for exploring semantic discrimination in local regions. Our alternate training strategy (the ‘+Alter’ rows) further improves the performance: on ‘All’ classes, the global prompt improves from 56.7% to 57.8% and SPT from 57.9% to 59.1%. We also explore a variant of SPT that uses shared prompts across all patches (row (5)), which also demonstrates promising performance. After further introducing the global prompts (rows (7) and (8)), the performance is further improved. Row (8) corresponds to our default SPTNet, which achieves the best performance. We refer to the variant in row (5) as SPTNet-S (Shared) and the variant in row (6) as SPTNet-P (Patch). Both variants are relatively more parameter-efficient (see Appendix A for details), especially SPTNet-S, while obtaining superior performance (more results for these two variants can be found in Appendix C).

Table 4: Comparison of the effectiveness of different prompting methods on SSB. We report the average test accuracy over all component datasets of SSB (i.e., CUB, Stanford Cars and FGVC-Aircraft). ‘Shared’ and ‘Alter’ refer to a single shared prompt for all patches and alternating learning, respectively. Row (8) represents SPTNet and rows (5) and (6) represent its two variants SPTNet-S and SPTNet-P.
| No | Method config | Prompt config | All | Old | New |
|---|---|---|---|---|---|
| (1) | SimGCD Wen et al. (2023) | None (baseline) | 56.1 | 65.5 | 51.5 |
| (2) | | +VPT Jia et al. (2022) | 54.4 (-1.7) | 64.7 (-0.8) | 49.1 (-2.4) |
| (3) | | +Global Bahng et al. (2022) | 56.7 (+0.6) | 64.6 (-0.9) | 53.5 (+2.0) |
| (4) | | +SPT | 57.9 (+1.8) | 67.2 (+1.7) | 53.3 (+1.8) |
| (4) | +Alter | +Global Bahng et al. (2022) | 57.8 (+1.7) | 66.3 (+0.8) | 53.8 (+2.3) |
| (5) | | +Shared | 60.5 (+4.4) | 68.6 (+3.1) | 56.5 (+5.0) |
| (6) | | +SPT | 59.1 (+3.0) | 68.5 (+3.0) | 54.5 (+3.0) |
| (7) | +Alter | +Shared & Global Bahng et al. (2022) | 60.9 (+4.8) | 69.0 (+3.5) | 57.3 (+5.8) |
| (8) | | +SPT & Global Bahng et al. (2022) | 61.4 (+5.3) | 69.9 (+4.4) | 57.5 (+6.0) |

Effect of different training strategies. To investigate the impact of different training strategies, we conduct additional experiments on both generic and fine-grained datasets. We consider four different training strategies, namely, (i) end-to-end (3rd row): both the data parameters and the model parameters are jointly trained in an end-to-end fashion; (ii) data first (4th row): the prompt parameters are optimized first, followed by the model parameters; (iii) model first (5th row): the model parameters are optimized first, followed by the prompt parameters; and (iv) alternative (6th row): our alternating training strategy, which switches between optimizing the model and data parameters every $k$ iterations. The results are presented in Table 5. Comparing rows (3)-(6) with the SimGCD baseline in row (1), we can see that SPTNet consistently outperforms SimGCD on ‘All’ classes and that our alternating training strategy leads to the best performance. Since SPTNet is built upon the pre-trained SimGCD, one might wonder about the performance of further fine-tuning SimGCD. In the 2nd row, we show the results after further fine-tuning the pre-trained SimGCD. An improvement can be achieved, but the margin is significantly smaller than the improvement achieved by SPTNet. This suggests that both our SPT and alternate training strategy are beneficial for GCD.

Table 5: Evaluation on ImageNet-100 and SSB using different training strategies.
| No | Methods | ImageNet-100 All | ImageNet-100 Old | ImageNet-100 New | SSB All | SSB Old | SSB New |
|---|---|---|---|---|---|---|---|
| (1) | SimGCD Wen et al. (2023) | 83.0 | 93.1 | 77.9 | 56.1 | 65.5 | 51.5 |
| (2) | SimGCD (further fine-tune) | 84.3 | 93.1 | 79.7 | 57.0 | 66.0 | 52.3 |
| (3) | SPTNet (end-to-end) | 84.1 | 92.8 | 80.0 | 58.6 | 67.4 | 53.2 |
| (4) | SPTNet (data first) | 83.5 | 92.9 | 77.7 | 58.0 | 66.4 | 51.9 |
| (5) | SPTNet (model first) | 84.8 | 93.3 | 80.6 | 59.2 | 67.8 | 54.9 |
| (6) | SPTNet (alternative) | 85.4 | 93.2 | 81.4 | 61.4 | 69.9 | 57.5 |

Effects of alternating frequency and prompt size. In our alternating training strategy, we switch between the data and model parameter optimization every $k$ iterations. Meanwhile, we also need to determine the spatial prompt size $m$. In Fig. 3, we present the average ACC on SSB with varying $k$ and $m$, respectively. For $k$, we do not observe significant differences among different choices, and thus use a moderate value of 20 as our default choice. For $m$, we find that a smaller value generally leads to better performance. When $m$ is too large, the image content might be over-occluded, making it difficult for the model to properly recognize the object. We also show the effects of global prompt size in Appendix D.

Figure 3: Effects of different choices of alternating frequency (a) and prompt size (b) on SSB (i.e., CUB, Stanford Cars and FGVC-Aircraft). We report the averaged results and show the influence on ‘All’, ‘Old’ and ‘New’ classes.
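
To give a rough sense of why a large $m$ can over-occlude the content, the sketch below computes the fraction of pixels in a patch that fall inside a border of width $m$. It is an illustrative calculation only: the helper name `prompted_fraction` is hypothetical, the $16 \times 16$ patch size is taken from the ViT-B/16 backbone, and treating every border pixel as affected by the prompt is an assumption.

```python
def prompted_fraction(h, w, m):
    """Fraction of an h x w patch covered by a border of width m."""
    if m < 0 or 2 * m > min(h, w):
        raise ValueError("border width m must satisfy 0 <= 2m <= min(h, w)")
    inner = (h - 2 * m) * (w - 2 * m)  # pixels left untouched in the centre
    return 1.0 - inner / (h * w)


if __name__ == "__main__":
    for m in (1, 2, 4):
        print(m, prompted_fraction(16, 16, m))
```

For a $16 \times 16$ patch, the covered fraction grows from roughly 23% at $m=1$ to about 44% at $m=2$ and 75% at $m=4$.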

4.4 Qualitative comparison

Figure 4: t-SNE visualization of representations on CIFAR-10. SPTNet produces the most discriminative representations among all compared methods.

How do prompts affect the representations? To investigate the influence of different visual prompts, we visualize representations on CIFAR-10 through t-SNE Van der Maaten & Hinton (2008) in Fig. 4. We compare representations of the SimGCD baseline, SimGCD+VPT, SPTNet, as well as SPTNet-P (which contains only the spatial prompts). They correspond to the models in row (1), row (2), row (8), and row (6) in Table 4. Comparing the representations of SimGCD and SimGCD+VPT, VPT appears to have a negative impact on the representation, leading to clutter between seen and unseen classes (e.g., bird and dog) in the GCD setting. This is also aligned with the deteriorated performance of SimGCD+VPT in Table 4. Both SPTNet-P and SPTNet produce more discriminative features and more compact clusters than SimGCD. Thanks to the global prompt, SPTNet further improves the representation over SPTNet-P.

How do prompts affect the model’s attention? The attention map provides very helpful clues to understanding the Transformer-based models’ focus on the input. We extract the attention maps for the CLS token from different attention heads in the last layer of the ViT backbone and show the top 10% most attended patches in Fig. 5. We observe that for SimGCD and SimGCD+Global (i.e., row (4) in Table 4), different heads may focus on the same region (e.g., in the ‘seen’ example, $h_2$/$h_3$/$h_{10}$ of SimGCD and $h_4$/$h_5$/$h_9$ of SimGCD+Global) and some heads may attend to the background regions (e.g., in the ‘unseen’ example, $h_2$/$h_4$/$h_7$ for SimGCD and $h_1$/$h_5$ for SimGCD+Global). In contrast, SPT and SPT&Global attend to more diverse regions of the object and focus more on the foreground object regions, with the latter performing better.

Figure 5: Attention visualization of different heads (numbered as $h_1$ to $h_{12}$). The top 10% attended patches are shown in red.
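
Selecting the top 10% most attended patches for one head can be sketched as below. This is a hedged illustration, not the authors' visualization code: the function name `top_attended_patches` is hypothetical, the input is assumed to be a flat list of CLS-to-patch attention weights for a single head, and rounding the number of kept patches up is an assumption.

```python
import math


def top_attended_patches(attn, fraction=0.1):
    """Return the indices of the most attended patches for one head.

    `attn` is a flat list of attention weights from the CLS token to each
    patch; the top `fraction` of patches (rounded up, at least one) is kept.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must lie in (0, 1]")
    if not attn:
        return []
    k = max(1, math.ceil(fraction * len(attn)))
    # Sort patch indices by descending attention weight; ties keep index order.
    order = sorted(range(len(attn)), key=lambda i: attn[i], reverse=True)
    return sorted(order[:k])


if __name__ == "__main__":
    weights = [0.01, 0.30, 0.05, 0.02, 0.20, 0.03, 0.04, 0.15, 0.10, 0.10]
    print(top_attended_patches(weights, fraction=0.1))
    print(top_attended_patches(weights, fraction=0.5))
```

Applying the same selection to each of the twelve heads and comparing the resulting index sets gives a simple way to quantify whether heads attend to the same or to different regions.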

More results and analysis can be found in the Appendix.

5 Conclusion

In conclusion, we have introduced SPTNet, an efficient framework for Generalized Category Discovery (GCD). We propose a two-stage alternative optimization scheme, optimizing both model and data parameters, to enhance alignment between the pre-trained model and the target task. Additionally, we introduce spatial prompt tuning (SPT) as a method to focus on object parts and facilitate knowledge transfer between seen and unseen classes. Experimental evaluations demonstrate the superiority of SPTNet over existing methods.

Acknowledgements

This work is supported by Hong Kong Research Grant Council - Early Career Scheme (Grant No. 27208022), National Natural Science Foundation of China (Grant No. 62306251), and HKU Seed Fund for Basic Research.

References

  • Arthur & Vassilvitskii (2006) David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. Technical report, Stanford, 2006.
  • Assran et al. (2022) Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Mike Rabbat, and Nicolas Ballas. Masked siamese networks for label-efficient learning. In ECCV, 2022.
  • Bahng et al. (2022) Hyojin Bahng, Ali Jahanian, Swami Sankaranarayanan, and Phillip Isola. Exploring visual prompts for adapting large-scale models. arXiv preprint arXiv:2203.17274, 2022.
  • Berthelot et al. (2019) David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A Raffel. Mixmatch: A holistic approach to semi-supervised learning. In NeurIPS, 2019.
  • Byeon & Van Hentenryck (2022) Geunyeong Byeon and Pascal Van Hentenryck. Benders subproblem decomposition for bilevel problems with convex follower. INFORMS Journal on Computing, 2022.
  • Cao et al. (2022) Kaidi Cao, Maria Brbic, and Jure Leskovec. Open-world semi-supervised learning. In ICLR, 2022.
  • Caron et al. (2021) Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021.
  • Chen et al. (2020a) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020a.
  • Chen et al. (2020b) Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E Hinton. Big self-supervised models are strong semi-supervised learners. In NeurIPS, 2020b.
  • Chen et al. (2020c) Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020c.
  • Chen et al. (2021) Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In ICCV, 2021.
  • Dempster et al. (1977) Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the royal statistical society: series B (methodological), 1977.
  • Dong et al. (2022) Bowen Dong, Pan Zhou, Shuicheng Yan, and Wangmeng Zuo. Lpt: Long-tailed prompt tuning for image classification. In ICLR, 2022.
  • Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2020.
  • Engelmann et al. (2020) Alexander Engelmann, Yuning Jiang, Boris Houska, and Timm Faulwasser. Decomposition of nonconvex optimization via bi-level distributed aladin. IEEE Transactions on Control of Network Systems, 2020.
  • Fini et al. (2021) Enrico Fini, Enver Sangineto, Stéphane Lathuilière, Zhun Zhong, Moin Nabi, and Elisa Ricci. A unified objective for novel class discovery. In ICCV, 2021.
  • Gidaris & Komodakis (2018) Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. In CVPR, 2018.
  • Gu et al. (2023) Peiyan Gu, Chuyu Zhang, Ruijie Xu, and Xuming He. Class-relation knowledge distillation for novel class discovery. In ICCV, 2023.
  • Han et al. (2019) Kai Han, Andrea Vedaldi, and Andrew Zisserman. Learning to discover novel visual categories via deep transfer clustering. In ICCV, 2019.
  • Han et al. (2020) Kai Han, Sylvestre-Alvise Rebuffi, Sebastien Ehrhardt, Andrea Vedaldi, and Andrew Zisserman. Automatically discovering and learning new visual categories with ranking statistics. In ICLR, 2020.
  • Han et al. (2021) Kai Han, Sylvestre-Alvise Rebuffi, Sebastien Ehrhardt, Andrea Vedaldi, and Andrew Zisserman. Autonovel: Automatically discovering and learning novel visual categories. IEEE TPAMI, 2021.
  • Hao et al. (2024) Shaozhe Hao, Kai Han, and Kwan-Yee K Wong. Cipr: An efficient framework with cross-instance positive relations for generalized category discovery. TMLR, 2024.
  • HaoChen et al. (2021) Jeff Z HaoChen, Colin Wei, Adrien Gaidon, and Tengyu Ma. Provable guarantees for self-supervised deep learning with spectral contrastive loss. In NeurIPS, 2021.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • He et al. (2020) Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
  • Jia et al. (2022) Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In ECCV, 2022.
  • Jia et al. (2021) Xuhui Jia, Kai Han, Yukun Zhu, and Bradley Green. Joint representation learning and novel category discovery on single- and multi-modal data. In ICCV, 2021.
  • Khattak et al. (2023) Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Maple: Multi-modal prompt learning. In CVPR, 2023.
  • Khosla et al. (2020) Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. In NeurIPS, 2020.
  • Krause et al. (2013) Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In ICCV workshop, 2013.
  • Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Krizhevsky et al. (2017) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Communications of the ACM, 2017.
  • Maji et al. (2013) Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
  • Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • Oquab et al. (2024) Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. DINOv2: Learning robust visual features without supervision. TMLR, 2024.
  • Peng et al. (2019) Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. In ICCV, 2019.
  • Pu et al. (2023) Nan Pu, Zhun Zhong, and Nicu Sebe. Dynamic conceptional contrastive learning for generalized category discovery. In CVPR, 2023.
  • Rizve et al. (2022) Mamshad Nayeem Rizve, Navid Kardan, Salman Khan, Fahad Shahbaz Khan, and Mubarak Shah. Openldn: Learning to discover novel classes for open-world semi-supervised learning. In ECCV, 2022.
  • Shtedritski et al. (2023) Aleksandar Shtedritski, Christian Rupprecht, and Andrea Vedaldi. What does clip know about a red circle? visual prompt engineering for vlms. In ICCV, 2023.
  • Sohn et al. (2020) Kihyuk Sohn, David Berthelot, Nicholas Carlini, Zizhao Zhang, Han Zhang, Colin A Raffel, Ekin Dogus Cubuk, Alexey Kurakin, and Chun-Liang Li. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. In NeurIPS, 2020.
  • Sun et al. (2024) Yiyou Sun, Zhenmei Shi, and Yixuan Li. A graph-theoretic framework for understanding open-world semi-supervised learning. In NeurIPS, 2024.
  • Tan et al. (2019) Kiat Chuan Tan, Yulong Liu, Barbara Ambrose, Melissa Tulig, and Serge Belongie. The herbarium challenge 2019 dataset. arXiv preprint arXiv:1906.05372, 2019.
  • Tarvainen & Valpola (2017) Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In NeurIPS, 2017.
  • Tian et al. (2020) Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. In ECCV, 2020.
  • Van der Maaten & Hinton (2008) Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. JMLR, 2008.
  • Vaze et al. (2021) Sagar Vaze, Kai Han, Andrea Vedaldi, and Andrew Zisserman. Open-set recognition: A good closed-set classifier is all you need. In ICLR, 2021.
  • Vaze et al. (2022) Sagar Vaze, Kai Han, Andrea Vedaldi, and Andrew Zisserman. Generalized category discovery. In CVPR, 2022.
  • Vaze et al. (2023) Sagar Vaze, Andrea Vedaldi, and Andrew Zisserman. No representation rules them all in category discovery. In NeurIPS, 2023.
  • Wang et al. (2022a) Yabin Wang, Zhiwu Huang, and Xiaopeng Hong. S-prompts learning with pre-trained transformers: An occam’s razor for domain incremental learning. In NeurIPS, 2022a.
  • Wang et al. (2022b) Yidong Wang, Hao Chen, Yue Fan, Wang Sun, Ran Tao, Wenxin Hou, Renjie Wang, Linyi Yang, Zhi Zhou, Lan-Zhe Guo, et al. Usb: A unified semi-supervised learning benchmark for classification. In NeurIPS, 2022b.
  • Wang et al. (2024) Yu Wang, Zhun Zhong, Pengchong Qiao, Xuxin Cheng, Xiawu Zheng, Chang Liu, Nicu Sebe, Rongrong Ji, and Jie Chen. Discover and align taxonomic context priors for open-world semi-supervised learning. In NeurIPS, 2024.
  • Welinder et al. (2010) Peter Welinder, Steve Branson, Takeshi Mita, Catherine Wah, Florian Schroff, Serge Belongie, and Pietro Perona. Caltech-ucsd birds 200. 2010.
  • Wen et al. (2023) Xin Wen, Bingchen Zhao, and Xiaojuan Qi. Parametric classification for generalized category discovery: A baseline study. In ICCV, 2023.
  • Zhang et al. (2023) Sheng Zhang, Salman Khan, Zhiqiang Shen, Muzammal Naseer, Guangyi Chen, and Fahad Khan. Promptcal: Contrastive affinity learning via auxiliary prompts for generalized novel category discovery. In CVPR, 2023.
  • Zhao & Han (2021) Bingchen Zhao and Kai Han. Novel visual category discovery with dual ranking statistics and mutual knowledge distillation. In NeurIPS, 2021.
  • Zhao et al. (2023) Bingchen Zhao, Xin Wen, and Kai Han. Learning semi-supervised gaussian mixture models for generalized category discovery. In ICCV, 2023.
  • Zhong et al. (2021a) Zhun Zhong, Enrico Fini, Subhankar Roy, Zhiming Luo, Elisa Ricci, and Nicu Sebe. Neighborhood contrastive learning for novel class discovery. In CVPR, 2021a.
  • Zhong et al. (2021b) Zhun Zhong, Linchao Zhu, Zhiming Luo, Shaozi Li, Yi Yang, and Nicu Sebe. Openmix: Reviving known knowledge for discovering novel visual categories in an open world. In CVPR, 2021b.

Appendix

Appendix A Discussion on different variants of SPTNet

As discussed in Section 3.3, our default SPTNet has both spatial and global prompts. In Section 4.3, we also introduce two variants of SPTNet with reduced prompt parameters, SPTNet-P (Patch) and SPTNet-S (Shared). SPTNet-P attaches only the spatial prompts without the global prompt to the input (row 6 in Table 4). The spatial prompts vary for different patches. SPTNet-S attaches a single shared spatial prompt without the global prompt to the input (row 7 in Table 4). In Figure 6, we compare the prompts of SPTNet, SPTNet-P and SPTNet-S.

SPT wraps a small number of parameters around the raw input image in a rectangular frame of width $m$. As also discussed in Section 3.3, the number of parameters for the spatial prompt of each patch is $6m(h+w-2m)$. Let $h=w=16$, $m=1$, and the number of patches $n=196$. The number of parameters for a single spatial prompt is $6\times 1\times(16+16-2)=180$, so 196 such prompts give $196\times 180=35{,}280\approx 0.035$M parameters. For the global prompt, the number of parameters is $6m^{+}(H+W-2m^{+})$. Let $H=W=224$ and $m^{+}=30$. The number of parameters for the global prompt is $6\times 30\times(224+224-60)=69{,}840$. Therefore, the total number of parameters for the spatial and global prompts is $35{,}280+69{,}840=105{,}120$. As the backbone model, ViT-B, has 86M parameters, SPTNet, SPTNet-P, and SPTNet-S only introduce 0.117%, 0.039%, and 0.0002% extra parameters compared to ViT-B, respectively.
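
The parameter counts in this appendix follow directly from the frame formula and can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names `frame_params` and `frame_params_by_mask` are hypothetical (not taken from any released code), and 3 colour channels are assumed. The second function counts border pixels one by one as a cross-check of the closed form.

```python
def frame_params(height, width, m, channels=3):
    """Closed form: a frame of thickness m has 2m(h + w - 2m) pixels,
    i.e. 6m(h + w - 2m) parameters for 3 channels."""
    return 2 * channels * m * (height + width - 2 * m)


def frame_params_by_mask(height, width, m, channels=3):
    """Cross-check: count the pixels that lie within m of any edge."""
    border = sum(
        1
        for r in range(height)
        for c in range(width)
        if r < m or r >= height - m or c < m or c >= width - m
    )
    return channels * border


per_patch = frame_params(16, 16, 1)        # one spatial prompt (h = w = 16, m = 1)
spatial_total = 196 * per_patch            # one prompt per patch, n = 196
global_total = frame_params(224, 224, 30)  # global prompt (H = W = 224, m+ = 30)
total = spatial_total + global_total

print(per_patch, spatial_total, global_total, total)
# prints: 180 35280 69840 105120
```

The factor of 6 in the formula is the product of two pairs of opposite sides and three channels; the `-2m` term avoids counting the four corner blocks twice.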

Figure 6: Prompts of SPTNet, SPTNet-P and SPTNet-S. SPTNet has both distinct spatial prompts and a global prompt; SPTNet-P has multiple distinct spatial prompts; SPTNet-S has a single shared spatial prompt.

Appendix B Training configurations for SPTNet-P and SPTNet-S

We show the training configurations for SPTNet-P and SPTNet-S in Table 6, and those for SPTNet in Table 7. The prompt size of SPT $m$ is set to 1 by default for both SPTNet-P and SPTNet-S. We set the global prompt size $m^{+}$ in SPTNet to 30 (see Appendix D). For the ViT model, we resize all input images to $224\times 224$, which gives $14\times 14=196$ patches, each with a resolution of $16\times 16$ pixels.

Table 6: Training configurations for SPTNet-P / SPTNet-S.

| Dataset | $\texttt{lr}_{b}$ | $\texttt{wd}_{b}$ | $\texttt{lr}_{p}$ | $\texttt{wd}_{p}$ | $k$ | $m$ |
|---|---|---|---|---|---|---|
| CIFAR10 Krizhevsky et al. (2009) | 3e-3 | 5e-4 | 20.0 | 0 | 20 | 1 |
| CIFAR100 Krizhevsky et al. (2009) | 1e-3 | 5e-4 | 1.0 | 0 | 20 | 1 |
| ImageNet-100 Tian et al. (2020) | 3e-3 | 5e-4 | 10.0 | 0 | 20 | 1 |
| Herbarium 19 Tan et al. (2019) | 5e-3 | 5e-4 | 1.0 | 0 | 20 | 1 |
| CUB Welinder et al. (2010) | 0.05 | 5e-4 | 25.0 | 0 | 20 | 1 |
| Stanford Cars Krause et al. (2013) | 0.05 | 5e-4 | 10.0 | 0 | 20 | 1 |
| FGVC-Aircraft Maji et al. (2013) | 0.05 | 5e-4 | 1.0 | 0 | 20 | 1 |

Table 7: Training configurations for SPTNet.

| Dataset | $\texttt{lr}_{b}$ | $\texttt{wd}_{b}$ | $\texttt{lr}_{p}$ | $\texttt{wd}_{p}$ | $k$ | $m$ | $m^{+}$ |
|---|---|---|---|---|---|---|---|
| CIFAR10 Krizhevsky et al. (2009) | 3e-3 | 5e-4 | 1.0 | 0 | 20 | 1 | 30 |
| CIFAR100 Krizhevsky et al. (2009) | 3e-3 | 5e-4 | 5.0 | 0 | 20 | 1 | 30 |
| ImageNet-100 Tian et al. (2020) | 3e-3 | 5e-4 | 5.0 | 0 | 20 | 1 | 30 |
| Herbarium 19 Tan et al. (2019) | 5e-3 | 5e-4 | 1.0 | 0 | 20 | 1 | 30 |
| CUB Welinder et al. (2010) | 0.05 | 5e-4 | 25.0 | 0 | 20 | 1 | 30 |
| Stanford Cars Krause et al. (2013) | 0.05 | 5e-4 | 10.0 | 0 | 20 | 1 | 30 |
| FGVC-Aircraft Maji et al. (2013) | 0.05 | 5e-4 | 1.0 | 0 | 20 | 1 | 30 |

Appendix C Benchmarking results of SPTNet-P and SPTNet-S

We further evaluate the performance of the more parameter-efficient SPTNet variants, SPTNet-P and SPTNet-S, in Table 8 and Table 9. As can be seen, SPTNet, SPTNet-P and SPTNet-S consistently outperform the baseline in terms of ‘All’ accuracy on every dataset.

Table 8: Evaluation on three generic image recognition datasets. Bold values represent the best results, while underlined values represent the second best results.

| Method | CIFAR-10 All | CIFAR-10 Old | CIFAR-10 New | CIFAR-100 All | CIFAR-100 Old | CIFAR-100 New | ImageNet-100 All | ImageNet-100 Old | ImageNet-100 New |
|---|---|---|---|---|---|---|---|---|---|
| SimGCD Wen et al. (2023) | 97.1 | 95.1 | 98.1 | 80.1 | 81.2 | 77.8 | 83.0 | 93.1 | 77.9 |
| SPTNet-P (Ours) | 97.5 | 95.2 | 98.5 | 82.0 | 85.5 | 75.0 | 85.5 | 93.9 | 81.2 |
| SPTNet-S (Ours) | 97.5 | 95.9 | 98.3 | 81.0 | 83.8 | 75.4 | 85.5 | 94.1 | 81.2 |
| SPTNet (Ours) | 97.3 | 95.0 | 98.6 | 81.3 | 84.3 | 75.6 | 85.4 | 93.2 | 81.4 |

Table 9: Evaluation on the Semantic Shift Benchmark (SSB) and Herbarium 19. Bold values represent the best results, while underlined values represent the second best results.

| Method | CUB All | CUB Old | CUB New | Stanford Cars All | Stanford Cars Old | Stanford Cars New | FGVC-Aircraft All | FGVC-Aircraft Old | FGVC-Aircraft New | Herbarium19 All | Herbarium19 Old | Herbarium19 New |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SimGCD Wen et al. (2023) | 60.3 | 65.6 | 57.7 | 53.8 | 71.9 | 45.0 | 54.2 | 59.1 | 51.8 | 43.0 | 58.0 | 35.1 |
| SPTNet-P (Ours) | 64.6 | 70.5 | 61.6 | 55.6 | 74.4 | 46.5 | 57.2 | 60.6 | 55.5 | 43.3 | 58.0 | 35.5 |
| SPTNet-S (Ours) | 65.0 | 69.1 | 62.9 | 60.1 | 75.3 | 52.8 | 56.3 | 61.4 | 53.8 | 43.4 | 58.6 | 35.2 |
| SPTNet (Ours) | 65.8 | 68.8 | 65.1 | 59.0 | 79.2 | 49.3 | 59.3 | 61.8 | 58.1 | 43.4 | 58.7 | 35.2 |

Appendix D Effects of global prompt size $m^{+}$

As the global prompt size $m^{+}$ may affect the performance of SPTNet, we experiment with different global prompt sizes, namely $m^{+}\in\{1,10,20,30,40,50\}$. We measure accuracy on the CUB dataset using the same architecture and configurations as in the main paper. Fig. 7 shows that $m^{+}=30$ yields the best overall performance, which is the default setting in our main paper.

Figure 7: Performance of SPTNet with different global prompt sizes $m^{+}$ on CUB. We show the influence on ‘All’, ‘Old’ and ‘New’ classes. When $m^{+}$ is set to 30, SPTNet achieves the best performance on ‘All’ categories.

Appendix E Visualization of learned prompts

We visualize different prompts after convergence in Fig. 8. Except for the end-to-end strategy, SPT and its variants commonly exhibit active parameters in prompts. This is attributed to their instance-agnostic nature, which enables them to handle variations in object locations across the dataset. Consequently, parameter values do not degrade to zero for a region patch that is background in one input but foreground in another. It is also worth noting that most prompts (particularly those located at the borders) are deactivated when employing the end-to-end strategy. We hypothesize that this is due to the network unintentionally adopting a “shortcut” approach: when both model and data (prompt) parameters are optimized simultaneously, it only updates the model parameters to achieve invariance to the learned augmentation. This also supports the need for alternate training when learned augmentation is applied.
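
To make the distinction between alternate and end-to-end training concrete, the toy sketch below alternates gradient steps between a "model" parameter `w` and a "prompt" parameter `p`, freezing one while the other is updated. It is not the SPTNet training code: the scalar model `w * (x + p)`, the data, the learning rate and the function names are all made up for illustration, and reading `k` in Tables 6 and 7 as the number of iterations between switches is an assumption made here.

```python
def loss_and_grads(w, p, xs, ys):
    """Mean squared error of the toy model w * (x + p) and its gradients."""
    n = len(xs)
    residuals = [w * (x + p) - y for x, y in zip(xs, ys)]
    loss = sum(r * r for r in residuals) / n
    grad_w = sum(2 * r * (x + p) for r, x in zip(residuals, xs)) / n
    grad_p = sum(2 * r * w for r in residuals) / n
    return loss, grad_w, grad_p


def alternate_train(xs, ys, w=1.0, p=0.0, lr=0.02, k=20, iters=200):
    """Switch between a model stage and a prompt stage every k iterations."""
    losses = []
    for it in range(iters):
        loss, grad_w, grad_p = loss_and_grads(w, p, xs, ys)
        losses.append(loss)
        if (it // k) % 2 == 0:
            w -= lr * grad_w  # model stage: prompt p is frozen
        else:
            p -= lr * grad_p  # prompt stage: model weight w is frozen
    losses.append(loss_and_grads(w, p, xs, ys)[0])
    return w, p, losses


xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]  # generated by w = 2, p = 0.5
w, p, losses = alternate_train(xs, ys)
print(f"initial loss {losses[0]:.3f}, final loss {losses[-1]:.3f}")
```

Which stage runs first is arbitrary in this toy. An end-to-end variant would apply both updates in every iteration, which is the setting in which the paragraph above reports that most prompts become inactive.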

Figure 8: Visualization of different learned prompts. Parameters of SPT and its variants are mostly active, while most prompts are much less active when employing the end-to-end strategy.

Appendix F Results based on DINOv2

Here, we replace the pre-trained DINO model with the recently improved DINOv2 model Oquab et al. (2024) which is empowered with stronger representation capacity from unlabelled data. Results are shown in Table 10. We can observe that the stronger DINOv2 representation indeed enhances model performance as expected, especially in the ‘New’ categories, and SPTNet and the two variants still consistently outperform SimGCD.

Table 10: Evaluation on CUB and ImageNet-100 using the pre-trained DINOv2 model. Bold values represent the best results, while underlined values represent the second best results.

| Method | CUB All | CUB Old | CUB New | ImageNet-100 All | ImageNet-100 Old | ImageNet-100 New |
|---|---|---|---|---|---|---|
| DINO+SimGCD Wen et al. (2023) | 60.3 | 65.6 | 57.7 | 83.0 | 93.1 | 77.9 |
| DINO+SPTNet-P (Ours) | 64.6 | 70.5 | 61.6 | 85.5 | 93.9 | 81.2 |
| DINO+SPTNet-S (Ours) | 65.0 | 69.1 | 62.9 | 85.5 | 94.1 | 81.2 |
| DINO+SPTNet (Ours) | 65.8 | 68.8 | 65.1 | 85.4 | 93.2 | 81.4 |
| DINOv2+SimGCD Wen et al. (2023) | 74.9 | 78.5 | 73.1 | 88.5 | 96.2 | 84.6 |
| DINOv2+SPTNet-P (Ours) | 75.5 | 79.0 | 74.3 | 90.5 | 96.3 | 87.5 |
| DINOv2+SPTNet-S (Ours) | 75.0 | 78.3 | 73.4 | 90.6 | 96.4 | 87.6 |
| DINOv2+SPTNet (Ours) | 76.3 | 79.5 | 74.6 | 90.1 | 96.1 | 87.1 |

Appendix G Robustness of SPTNet for GCD with domain shifts

To validate the robustness of SPTNet, we test our method in a more challenging setting, GCD with domain shifts. We conduct experiments on DomainNet Peng et al. (2019), the largest unsupervised domain adaptation (UDA) dataset, which contains about 0.6 million images with 345 categories distributed among six domains. We apply the data construction process in Vaze et al. (2022) to construct the ‘Old’, ‘New’ and ‘All’ splits based on DomainNet and evaluate different methods on our constructed data. To account for domain shifts that were not considered in Vaze et al. (2022), we construct the partially labelled data by using images from both the ‘real’ domain and the ‘painting’ domain to train the model. Specifically, we utilise a subset of labelled images from select classes in the ‘real’ domain, along with unlabelled images from all classes in the ‘painting’ domain. We assess the model’s performance on both the ‘real’ and ‘painting’ domains. Additionally, we evaluate the model on images from other previously unseen domains, including ‘quickdraw’, ‘sketch’, ‘infograph’, and ‘clipart’. Results are shown in Table 11. Compared with other baseline methods in the vanilla GCD setting, we find that SPTNet can perform well on (i) the labelled and seen domain (i.e., ‘real’), (ii) the unlabelled but seen domain (i.e., ‘painting’) and (iii) unseen domains (i.e., others, including ‘quickdraw’, ‘sketch’, ‘infograph’, and ‘clipart’).

Table 11: Evaluation on the DomainNet benchmark. The model is trained on the ‘real’ and ‘painting’ domains and we report the respective results on real, painting and the remaining four domains (i.e., others). Bold values represent the best results, while underlined values represent the second best results.

| Methods | Real All | Real Old | Real New | Painting All | Painting Old | Painting New | Others All | Others Old | Others New |
|---|---|---|---|---|---|---|---|---|---|
| RankStats+ | 34.1 | 61.9 | 19.7 | 29.7 | 49.7 | 9.6 | 14.3 | 25.5 | 5.5 |
| UNO+ | 44.2 | 72.2 | 29.7 | 30.1 | 45.1 | 17.2 | 14.0 | 23.4 | 7.4 |
| ORCA | 31.9 | 49.8 | 23.5 | 28.7 | 38.5 | 7.1 | 10.4 | 19.5 | 8.1 |
| GCD | 47.3 | 53.6 | 44.1 | 32.9 | 41.8 | 23.0 | 15.2 | 22.0 | 11.1 |
| SimGCD | 61.3 | 77.8 | 52.9 | 34.5 | 35.6 | 33.5 | 16.7 | 22.5 | 12.2 |
| SPTNet (Ours) | 63.1 | 75.9 | 56.4 | 39.2 | 43.1 | 35.2 | 17.4 | 22.2 | 13.6 |

Appendix H Unknown category number

As the total number of categories (GT) cannot be accessed in the real-world setting, we evaluate our SPTNet-P with an estimated number of categories using an off-the-shelf method Vaze et al. (2022) (see Table 12). We also evaluate our method with varying category numbers (see Fig. 9). Our evaluation includes two representative datasets: CUB for fine-grained and ImageNet-100 for generic classification tasks. We find that our method outperforms SimGCD in overall (‘All’) accuracy on both datasets when the exact number of categories is unknown.

Table 12: Performance of SPTNet-P and the baseline method SimGCD with an estimated number of categories on CUB and ImageNet-100. Bold values represent the best results.

| Method | $\lvert\mathcal{C}\rvert$ | CUB All | CUB Old | CUB New | ImageNet-100 All | ImageNet-100 Old | ImageNet-100 New |
|---|---|---|---|---|---|---|---|
| SimGCD Wen et al. (2023) | GT (200/100) | 60.3 | 65.6 | 57.7 | 83.0 | 93.1 | 77.9 |
| SPTNet-P (Ours) | GT (200/100) | 64.6 | 70.5 | 61.6 | 85.5 | 93.9 | 81.2 |
| SimGCD Wen et al. (2023) | Est. (231/109) | 61.0 | 66.0 | 58.6 | 81.1 | 90.9 | 76.1 |
| SPTNet-P (Ours) | Est. (231/109) | 65.2 | 71.0 | 62.3 | 83.4 | 91.8 | 74.6 |

Figure 9: Performance with varying category numbers. We experiment with category numbers obtained by multiplying the GT number with different factors $C^{\prime}=\{0.1,0.5,1.0,2.0,10.0\}$.

We also evaluate our SPTNet-P on the CIFAR-100 dataset with fewer known categories. The results are shown in Fig. 10, where 50% of the samples from known classes are labelled. The results indicate that SPTNet is robust in few-class scenarios and outperforms the concurrent method, PromptCAL Zhang et al. (2023). It is more challenging for models to infer novel semantic clustering when fewer classes are known due to semantic shifts, resulting in decreased performance for all methods. This further demonstrates the effectiveness of our proposed method.

Figure 10: Performance with a varying number of known classes $|\mathcal{C}_{1}|$.

Appendix I Performance and time efficiency

To assess the practicality of different methods, we conducted further comparisons in terms of accuracy, training time per epoch, and inference time. The results are presented in Table 13. Our proposed SPTNet demonstrates superior accuracy while obtaining mostly the best time efficiency.

Table 13: Time efficiency of different methods on ImageNet-100 and SSB. Bold values represent the best results.

| Method | ImageNet-100 Accuracy (All) | ImageNet-100 Training time (Sec) | ImageNet-100 Inference time (Sec) | SSB Accuracy (All) | SSB Training time (Sec) | SSB Inference time (Sec) |
|---|---|---|---|---|---|---|
| GCD Vaze et al. (2022) | 74.1 | 803 | 2289 | 51.3 | 58 | 552 |
| SimGCD Wen et al. (2023) | 83.0 | 847 | 591 | 56.1 | 64 | 17 |
| PromptCAL Zhang et al. (2023) | 83.1 | 1817 | 893 | 55.1 | 492 | 103 |
| SPTNet (Ours) | 85.4 | 483 | 601 | 61.4 | 32 | 17 |
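
The SSB ‘All’ accuracies in Table 13 appear to be the unweighted mean of the CUB, Stanford Cars and FGVC-Aircraft ‘All’ accuracies from Table 9. The short check below reproduces the SimGCD and SPTNet entries under that assumption; the dictionary layout and variable names are only illustrative.

```python
# 'All' accuracies copied from Table 9 (CUB, Stanford Cars, FGVC-Aircraft).
table9_all = {
    "SimGCD": {"CUB": 60.3, "Stanford Cars": 53.8, "FGVC-Aircraft": 54.2},
    "SPTNet": {"CUB": 65.8, "Stanford Cars": 59.0, "FGVC-Aircraft": 59.3},
}

# Unweighted mean over the three SSB datasets, rounded to one decimal place.
ssb_mean = {
    method: round(sum(scores.values()) / len(scores), 1)
    for method, scores in table9_all.items()
}

print(ssb_mean)
# prints: {'SimGCD': 56.1, 'SPTNet': 61.4}
```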

Appendix J Theoretical analysis on our alternate training

To estimate the model parameters $\theta$, it is common to introduce the log-likelihood function $L(\theta)=\ln\mathcal{P}(X\mid\theta)$, which quantifies the likelihood of the parameter $\theta$ given the data $X$. As the natural logarithm is monotonically increasing, maximizing $\mathcal{P}(X\mid\theta)$ is equivalent to maximizing $L(\theta)$.

The EM algorithm is an iterative procedure designed to maximize $L(\theta)$. Let $\theta_{t}$ denote the current estimate for $\theta$ after the $t$-th iteration. Our goal is to calculate an updated estimate $\theta$ that maximizes $L(\theta)$:

$$
\begin{aligned}
L(\theta)-L\left(\theta_{t}\right) &= \ln\mathcal{P}\left(X\mid\theta\right)-\ln\mathcal{P}\left(X\mid\theta_{t}\right)\\
&= \ln\left(\sum_{p^{1:n}}\mathcal{P}(X\mid p^{1:n},\theta)\,\mathcal{P}(p^{1:n}\mid\theta)\right)-\ln\mathcal{P}\left(X\mid\theta_{t}\right)\\
&= \ln\left(\sum_{p^{1:n}}\mathcal{P}(X\mid p^{1:n},\theta)\,\mathcal{P}(p^{1:n}\mid\theta)\cdot\frac{\mathcal{P}\left(p^{1:n}\mid X,\theta_{t}\right)}{\mathcal{P}\left(p^{1:n}\mid X,\theta_{t}\right)}\right)-\ln\mathcal{P}\left(X\mid\theta_{t}\right)\\
&= \ln\left(\sum_{p^{1:n}}\mathcal{P}\left(p^{1:n}\mid X,\theta_{t}\right)\frac{\mathcal{P}(X\mid p^{1:n},\theta)\,\mathcal{P}(p^{1:n}\mid\theta)}{\mathcal{P}\left(p^{1:n}\mid X,\theta_{t}\right)}\right)-\ln\mathcal{P}\left(X\mid\theta_{t}\right)\\
&\geq \sum_{p^{1:n}}\mathcal{P}\left(p^{1:n}\mid X,\theta_{t}\right)\ln\left(\frac{\mathcal{P}(X\mid p^{1:n},\theta)\,\mathcal{P}(p^{1:n}\mid\theta)}{\mathcal{P}\left(p^{1:n}\mid X,\theta_{t}\right)}\right)-\ln\mathcal{P}\left(X\mid\theta_{t}\right)\\
&= \sum_{p^{1:n}}\mathcal{P}\left(p^{1:n}\mid X,\theta_{t}\right)\ln\left(\frac{\mathcal{P}(X\mid p^{1:n},\theta)\,\mathcal{P}(p^{1:n}\mid\theta)}{\mathcal{P}\left(p^{1:n}\mid X,\theta_{t}\right)\mathcal{P}\left(X\mid\theta_{t}\right)}\right)\\
&\triangleq H\left(\theta\mid\theta_{t}\right).
\end{aligned}
\tag{5}
$$
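
A small numerical check can make the inequality in Eq. (5) concrete. The sketch below uses a hypothetical two-state latent variable `z` standing in for $p^{1:n}$ and a single binary observation; all probabilities and function names are made up for illustration, and the joint is assumed to factorize as $\mathcal{P}(x\mid z)\,\mathcal{P}(z\mid\theta)$. It evaluates both sides of the bound for a few values of $\theta$.

```python
import math

# Toy model: P(x = 1 | z) for z in {0, 1}; the observation is x = 1.
LIK = {0: 0.2, 1: 0.9}


def prior(z, theta):
    """P(z | theta): theta is the probability that z = 1."""
    return theta if z == 1 else 1.0 - theta


def log_lik(theta):
    """L(theta) = ln P(x | theta), marginalising over z."""
    return math.log(sum(LIK[z] * prior(z, theta) for z in (0, 1)))


def lower_bound_gap(theta, theta_t):
    """H(theta | theta_t) as defined in Eq. (5), with z in place of p^{1:n}."""
    evidence_t = math.exp(log_lik(theta_t))  # P(x | theta_t)
    h = 0.0
    for z in (0, 1):
        post_t = LIK[z] * prior(z, theta_t) / evidence_t  # P(z | x, theta_t)
        h += post_t * math.log(LIK[z] * prior(z, theta) / (post_t * evidence_t))
    return h


theta_t = 0.3
for theta in (0.1, 0.5, 0.7, 0.9):
    gain = log_lik(theta) - log_lik(theta_t)
    bound = lower_bound_gap(theta, theta_t)
    print(f"theta={theta:.1f}  L(theta)-L(theta_t)={gain:+.4f}  H={bound:+.4f}")
```

In every row the left-hand side is at least as large as `H`, which is the Jensen step (the `≥` line) of the derivation.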

Let $l\left(\theta\mid\theta_{t}\right)=L\left(\theta_{t}\right)+H\left(\theta\mid\theta_{t}\right)$, which by Eq. (5) is bounded above by the likelihood function $L\left(\theta\right)$. Setting $\theta=\theta_{t}$, we observe that:

$$
\begin{aligned}
l\left(\theta_{t}\mid\theta_{t}\right) &= L\left(\theta_{t}\right)+H\left(\theta_{t}\mid\theta_{t}\right)\\
&= L\left(\theta_{t}\right)+\sum_{p^{1:n}}\mathcal{P}\left(p^{1:n}\mid X,\theta_{t}\right)\ln\frac{\mathcal{P}\left(X\mid p^{1:n},\theta_{t}\right)\mathcal{P}\left(p^{1:n}\mid\theta_{t}\right)}{\mathcal{P}\left(p^{1:n}\mid X,\theta_{t}\right)\mathcal{P}\left(X\mid\theta_{t}\right)}\\
&\neq L\left(\theta_{t}\right)+\sum_{p^{1:n}}\mathcal{P}\left(p^{1:n}\mid X,\theta_{t}\right)\ln\frac{\mathcal{P}\left(X,p^{1:n}\mid\theta_{t}\right)}{\mathcal{P}\left(X,p^{1:n}\mid\theta_{t}\right)}\\
&= L\left(\theta_{t}\right)+\sum_{p^{1:n}}\mathcal{P}\left(p^{1:n}\mid X,\theta_{t}\right)\ln 1\\
&= L\left(\theta_{t}\right).
\end{aligned}
\tag{6}
$$

Note that in general, $\mathcal{P}(X \mid p^{1:n}, \theta_t)\, \mathcal{P}(p^{1:n} \mid \theta_t)$ cannot be equal to $\mathcal{P}(X, p^{1:n} \mid \theta_t)$ since $p^{1:n}$ is conditioned on both $X$ and $\theta$. Consequently, the factorized form does not equal the joint distribution. As a result, for $\theta = \theta_t$, the functions $l(\theta \mid \theta_t)$ and $L(\theta)$ are not equal.

Our objective is to find the $\theta$ that maximizes the function $L(\theta)$. While $l(\theta \mid \theta_t)$ and $L(\theta)$ may not be equal for the current estimate $\theta = \theta_t$, it still holds that $l(\theta \mid \theta_t)$ is bounded above by $L(\theta)$. Therefore, increasing $l(\theta \mid \theta_t)$ will also increase $L(\theta)$. To achieve the greatest increase in $L(\theta)$, the EM algorithm selects an updated $\theta_{t+1}$ that maximizes $l(\theta \mid \theta_t)$:

$$
\begin{aligned}
\theta_{t+1} &= \arg\max_{\theta} \left\{ l(\theta \mid \theta_t) \right\} \\
&= \arg\max_{\theta} \left\{ L(\theta_t) + \sum_{p^{1:n}} \mathcal{P}(p^{1:n} \mid X, \theta_t) \ln \frac{\mathcal{P}(X \mid p^{1:n}, \theta)\, \mathcal{P}(p^{1:n} \mid \theta)}{\mathcal{P}(X \mid \theta_t)\, \mathcal{P}(p^{1:n} \mid X, \theta_t)} \right\}
\end{aligned}
\tag{7}
$$

Ignoring terms which are constant w.r.t. $\theta$, the equation can be further simplified:

$$
\begin{aligned}
\theta_{t+1} &= \arg\max_{\theta} \left\{ \sum_{p^{1:n}} \mathcal{P}(p^{1:n} \mid X, \theta_t) \ln \mathcal{P}(X \mid p^{1:n}, \theta)\, \mathcal{P}(p^{1:n} \mid \theta) \right\} \\
&= \arg\max_{\theta} \left\{ \sum_{p^{1:n}} \mathcal{P}(p^{1:n} \mid X, \theta_t) \ln \frac{\mathcal{P}(X, p^{1:n}, \theta)}{\mathcal{P}(p^{1:n}, \theta)} \frac{\mathcal{P}(p^{1:n}, \theta)}{\mathcal{P}(\theta)} \right\} \\
&= \arg\max_{\theta} \left\{ \sum_{p^{1:n}} \mathcal{P}(p^{1:n} \mid X, \theta_t) \ln \mathcal{P}(X, p^{1:n} \mid \theta) \right\} \\
&= \arg\max_{\theta} \left\{ \mathrm{E}_{p^{1:n} \mid X, \theta_t} \left\{ \ln \mathcal{P}(X, p^{1:n} \mid \theta) \right\} \right\}.
\end{aligned}
\tag{8}
$$

The alternate training algorithm thus consists of iterating two steps: (1) E-step: determine the conditional expectation $\mathrm{E}_{p^{1:n} \mid X, \theta_t}\{\ln \mathcal{P}(X, p^{1:n} \mid \theta)\}$; (2) M-step: maximize this expression with respect to $\theta$. It is evident that end-to-end training for maximizing $L(\theta)$ is not equivalent to two-stage learning with $l(\theta \mid \theta_t)$ in the converged state, as verified in Eq. 6. Another advantage of two-stage learning is that it provides a framework for better estimation of both model and data parameters. This is further supported by the evidence presented in Fig. 8, where end-to-end training leads to sub-optimal solutions for $p^{1:n}$.
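
As a toy illustration of the E-step/M-step alternation described above (not the SPTNet training procedure), the following minimal sketch runs EM on a 1-D mixture of two unit-variance, equal-weight Gaussians. The latent component assignments play the role of the unobserved variable and the two means play the role of $\theta$; the data, function names, and initial values are all assumptions made for this example. The log-likelihood is recorded after every iteration so that its progression can be inspected.

```python
import math
import random


def log_likelihood(xs, mus):
    """Log-likelihood of xs under an equal-weight mixture of unit-variance Gaussians."""
    norm = 1.0 / math.sqrt(2.0 * math.pi)
    weight = 1.0 / len(mus)
    total = 0.0
    for x in xs:
        density = sum(weight * norm * math.exp(-0.5 * (x - mu) ** 2) for mu in mus)
        total += math.log(density)
    return total


def e_step(xs, mus):
    """E-step: posterior responsibility of each component for each point, given current means."""
    responsibilities = []
    for x in xs:
        unnormalized = [math.exp(-0.5 * (x - mu) ** 2) for mu in mus]
        z = sum(unnormalized)
        responsibilities.append([u / z for u in unnormalized])
    return responsibilities


def m_step(xs, responsibilities):
    """M-step: means that maximize the expected complete-data log-likelihood."""
    num_components = len(responsibilities[0])
    new_mus = []
    for j in range(num_components):
        weighted_sum = sum(r[j] * x for r, x in zip(responsibilities, xs))
        total_weight = sum(r[j] for r in responsibilities)
        new_mus.append(weighted_sum / total_weight)
    return new_mus


def run_em(xs, mus, steps=20):
    """Alternate E-step and M-step, tracking the log-likelihood after each update."""
    history = [log_likelihood(xs, mus)]
    for _ in range(steps):
        responsibilities = e_step(xs, mus)
        mus = m_step(xs, responsibilities)
        history.append(log_likelihood(xs, mus))
    return mus, history


# Hypothetical toy data: two clusters centred at -2 and 3.
random.seed(0)
data = [random.gauss(-2.0, 1.0) for _ in range(200)]
data += [random.gauss(3.0, 1.0) for _ in range(200)]
means, ll_history = run_em(data, mus=[-1.0, 1.0])
```

In this toy setting, `ll_history` should be non-decreasing across iterations and `means` should end up close to the two generating means.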

Appendix K More visualization and analysis of attention maps

We visualize attention maps from different heads in the last layer of the ViT backbone for multiple datasets. The query position is selected either as the CLS token (in Fig. 11, Fig. 12, Fig. 13) or as a local region on the edge of the foreground object (in Fig. 14, Fig. 15, Fig. 16, Fig. 17). We show the top 60% most attended patches in red for different attention heads. We observe that SPTNet-P and SPTNet can attend to more salient object regions, likely due to their ability to learn local invariance. Besides, SPTNet-P and SPTNet cover more diverse regions of the salient object regardless of query position, illustrating more diverse attention patterns across heads. A similar phenomenon can be found on Stanford Cars and FGVC-Aircraft in Fig. 11 and Fig. 12 for the CLS token and in Fig. 15 and Fig. 16 for the edge position, as well as on a generic dataset (e.g., ImageNet-100) in Fig. 13 and Fig. 17.
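
The patch selection used for these visualizations can be sketched as follows. This is a minimal illustration under one reading of "top 60% most attended patches", namely keeping the 60% of patches with the largest attention weight for a given query and head; an alternative reading would keep the patches that account for 60% of the attention mass. The function names and the toy attention values are hypothetical and are not taken from the released code.

```python
def top_fraction_mask(attn_row, fraction=0.6):
    """Return a 0/1 mask marking the `fraction` of patches with the highest attention.

    attn_row: attention weights from one query (e.g. the CLS token) to every
    patch, for a single head.
    """
    n = len(attn_row)
    keep = max(1, round(fraction * n))
    # Patch indices sorted from most to least attended.
    order = sorted(range(n), key=lambda i: attn_row[i], reverse=True)
    mask = [0] * n
    for i in order[:keep]:
        mask[i] = 1
    return mask


def to_grid(mask, width):
    """Reshape a flat patch mask into rows of `width` patches for overlaying on the image."""
    return [mask[start:start + width] for start in range(0, len(mask), width)]


# Toy 2x5 patch grid for a single head (hypothetical values).
attn = [0.02, 0.20, 0.15, 0.03, 0.01, 0.05, 0.25, 0.18, 0.07, 0.04]
grid = to_grid(top_fraction_mask(attn, 0.6), width=5)
# grid == [[0, 1, 1, 0, 0], [1, 1, 1, 1, 0]]
```

Patches marked 1 would be the ones highlighted in red; repeating this per head gives one map per attention head.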

We also investigate the impact of our proposed SPT by separating the prompt and backbone from SPTNet-P. We remove the prompt component from a well-trained SPTNet-P and visualize the attention maps by feeding raw images to the backbone only, referred to as ‘SPTNet-P (w/o prompt)’. Upon comparing the attention maps with and without patch-wise prompts, we observe that SPTNet-P with prompts (i.e., the $3^{rd}$ column) exhibits clearer attention on foreground objects compared to SPTNet-P without prompts (i.e., the $2^{nd}$ column). This indicates that the learned prompts help elicit critical features for recognition. Additionally, when considering a generic dataset like ImageNet-100, we notice that there is no significant difference between the attention maps of models with and without prompts, resulting in a smaller performance boost than on the fine-grained datasets.

Figure 11: Attention visualization on Stanford Cars, for 12 different attention heads in the last layer of the ViT backbone, by querying the CLS token.
Figure 12: Attention visualization on FGVC-Aircraft, for 12 different attention heads in the last layer of the ViT backbone, by querying the CLS token.
Figure 13: Attention visualization on ImageNet-100, for 12 different attention heads in the last layer of the ViT backbone, by querying the CLS token.
Figure 14: Attention visualization on CUB, for 12 different attention heads in the last layer of the ViT backbone, by querying the point marked as green.
Figure 15: Attention visualization on Stanford Cars, for 12 different attention heads in the last layer of the ViT backbone, by querying the point marked as green.
Figure 16: Attention visualization on FGVC-Aircraft, for 12 different attention heads in the last layer of the ViT backbone, by querying the point marked as green.
Figure 17: Attention visualization on ImageNet-100, for 12 different attention heads in the last layer of the ViT backbone, by querying the point marked as green.
Figure 18: Attention visualization on CUB, for 12 different attention heads in the last layer of the ViT backbone, by querying the CLS token. SPTNet-P and SPTNet can automatically identify salient objects, likely due to their ability to learn local invariance. Upon comparing the attention maps with and without patch-wise prompts, we observe that SPTNet with prompts (i.e., the $3^{rd}$ column) exhibits more concentrated attention on the salient object compared to SPTNet without prompts (i.e., the $2^{nd}$ column). This indicates that our learned prompts help elicit critical features for recognition.
Figure 19: Attention visualization on Stanford Cars, for 12 different attention heads in the last layer of the ViT backbone, by querying the CLS token. SPTNet-P and SPTNet can automatically identify salient objects, likely due to their ability to learn local invariance. Upon comparing the attention maps with and without patch-wise prompts, we observe that SPTNet with prompts (i.e., the $3^{rd}$ column) exhibits more concentrated attention on the salient object compared to SPTNet without prompts (i.e., the $2^{nd}$ column). This indicates that our learned prompts help elicit critical features for recognition.
Figure 20: Attention visualization on FGVC-Aircraft, for 12 different attention heads in the last layer of the ViT backbone, by querying the CLS token. SPTNet-P and SPTNet can automatically identify salient objects, likely due to their ability to learn local invariance. Upon comparing the attention maps with and without patch-wise prompts, we observe that SPTNet with prompts (i.e., the $3^{rd}$ column) exhibits more concentrated attention on the salient object compared to SPTNet without prompts (i.e., the $2^{nd}$ column). This indicates that our learned prompts help elicit critical features for recognition.
Figure 21: Attention visualization on ImageNet-100, for 12 different attention heads in the last layer of the ViT backbone, by querying the CLS token. SPTNet-P and SPTNet can automatically identify salient objects, likely due to their ability to learn local invariance. Upon comparing the attention maps with and without patch-wise prompts, we observe that SPTNet with prompts (i.e., the $3^{rd}$ column) exhibits more concentrated attention on the salient object compared to SPTNet without prompts (i.e., the $2^{nd}$ column). This indicates that our learned prompts help elicit critical features for recognition.

To explore the difference in performance boost between generic and fine-grained datasets, we transfer a pre-trained model from a fine-grained dataset to a generic one in Fig. 22. Since the overlap between ImageNet-100 and the fine-grained datasets only contains various types of birds, we select some bird samples from ImageNet-100 as inputs and visualize the attention maps of two models: (1) trained on ImageNet-100 and (2) trained on CUB. The model trained on CUB appears to focus more on local regions, while the one trained on ImageNet-100 pays more attention to the entire object. Furthermore, based on the quantitative comparison presented in Section 4.3 and Appendix C, we observe that SPTNet-P outperforms SPTNet on ImageNet-100 but performs worse than SPTNet on CUB. Additionally, as depicted in Fig. 22, more diverse attention across different regions of the object corresponds to improved performance. This indicates that the ability of the model to attend to various object regions is beneficial for achieving better results. For instance, when comparing ‘SPTNet-P’ and ‘SPTNet’ trained on ImageNet-100, we observe that ‘SPTNet-P’ exhibits a higher concentration on the objects than ‘SPTNet’. This observation aligns with the quantitative comparison, in which ‘SPTNet-P’ performs better than ‘SPTNet’ on ImageNet-100. Similarly, when considering ‘SPTNet-P (from CUB)’ and ‘SPTNet (from CUB)’ trained on CUB, we notice that ‘SPTNet’ demonstrates a stronger focus on the objects than ‘SPTNet-P’. This observation is consistent with the quantitative comparison, in which ‘SPTNet’ outperforms ‘SPTNet-P’ on CUB.

Figure 22: Attention visualization for models trained on ImageNet-100 and CUB by applying them to bird images from ImageNet-100. The models trained on CUB are marked as ‘(from CUB)’.

Appendix L Broader impacts

Our study is among the efforts to extend the capability of AI systems from the closed world to the open world. In particular, it will play a positive role in fostering next-generation AI systems capable of categorizing and organizing open-world data automatically. However, our method still has several limitations. First, although we have achieved encouraging results on public datasets, interpretability still needs improvement, as the underlying principles of how the system makes its decisions remain unclear. Second, cross-domain robustness is not satisfactory, as can be seen from the results in the setting of GCD with domain shifts: although our method achieves the best overall and new-class discovery results, its performance still has significant room for improvement. Additionally, in the vanilla GCD setting, methods typically rely on a pre-trained model (e.g., DINO) as a feature extractor and may therefore inherit its drawbacks (e.g., discrimination and privacy issues).