Detail Reinforcement Diffusion Model: Augmentation Fine-Grained Visual Categorization in Few-Shot Conditions

Tianxu Wu, Shuo Ye, Shuhuang Chen, Qinmu Peng and Xinge You

This work was supported in part by the National Key R&D Program of China 2022YFC3301000, in part by the Fundamental Research Funds for the Central Universities, HUST: 2023JYCXJJ031. Co-corresponding authors: Shuo Ye (e-mail: [email protected]), Qinmu Peng (e-mail: [email protected]).

Tianxu Wu, Shuo Ye, and Shuhuang Chen are with the School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China.

Qinmu Peng and Xinge You are with the School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China.

©2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Abstract

The challenge in fine-grained visual categorization (FGVC) lies in exploring the subtle differences between subclasses and achieving accurate discrimination. Previous research has relied on large-scale annotated data and pre-trained deep models to achieve this objective. However, when only a limited number of samples is available, such methods often struggle to learn the details of instances and perform effective recognition. Using diffusion models for data augmentation has gained widespread attention, but the high level of detail required for fine-grained images makes it challenging to employ existing methods directly. To address this issue, we propose a novel approach termed the detail reinforcement diffusion model (DRDM), which leverages the extensive knowledge of large models for fine-grained data augmentation and comprises two key components: discriminative semantic recombination (DSR) and spatial knowledge reference (SKR). Specifically, DSR is designed to extract implicit similarity relationships from the labels and reconstruct the semantic mapping between labels and instances, which enables better discrimination of subtle differences between subclasses. Furthermore, we introduce the SKR module, which incorporates the distributions of different datasets as references in the feature space. This allows the SKR to aggregate the high-dimensional distribution of subclass features in few-shot FGVC tasks, thus expanding the decision boundary. Through these two critical components, we effectively utilize the knowledge from large models to address the issue of data scarcity, resulting in improved performance for fine-grained visual recognition tasks. Extensive experiments demonstrate the consistent performance gain offered by our DRDM.

Index Terms:

Fine-grained visual categorization, few-shot learning, stable diffusion

I Introduction

Fine-grained visual categorization (FGVC) aims to recognize subclasses that exhibit tiny visual distinctions within the same large class (e.g., birds [1]). Related studies have been extensively applied to autonomous vehicles [2, 3] and pharmaceutical products [4, 5]. Compared to general images, fine-grained images usually have similar features and are affected by interference such as posture, perspective, and occlusion [6]. Therefore, the key to achieving FGVC often lies in discovering discriminative regions. This is often achieved through a localization branch network [7, 8] or learned implicitly in end-to-end training [9, 10, 11]. While automatically identifying these regions from a large-scale labeled dataset is feasible, many practical FGVC tasks lack such datasets because annotating fine-grained data is time-consuming, and labeling rare subclasses demands expert knowledge. For instance, discerning subtle feature differences among subtypes of diseases in the medical domain, identifying minute variations among components in the industrial sector, and recognizing specific types of pests or diseases in ecology all rely heavily on expert annotation. The ability of deep neural networks to handle fine-grained few-shot learning (FSL) is therefore crucial for practical applications. Unfortunately, existing methods still perform much worse than weakly supervised methods on several few-shot benchmarks [12]. Networks often struggle to select the correct regions for recognition and tend to overfit pseudo-features from the training data [13].

Figure 1: Feature contamination resulting from semantic misalignment during data augmentation using large models. This is specifically evident in the form of (a) irrelevant augmented data and (b) the loss of discriminative details.

Utilizing external information (e.g., multi-view [14] or multi-party [15] information) can significantly enhance the performance of FSL, but this involves complex information acquisition pathways. A direct method to mitigate overfitting in FSL is data augmentation [16]. However, reliably obtaining diverse data remains a challenging problem, since the augmented instances should contain the discriminative features of their classes and exhibit high intra-class diversity [17]. Unfortunately, this often breaks down in adversarial learning methods [18, 19, 20], which fall short in generating diverse samples. Recently, leveraging prior knowledge from large models (e.g., stable diffusion [21]) for data augmentation has demonstrated significant potential. However, this success has not extended seamlessly to FGVC. One reason is feature contamination, as depicted in Figure 1, which manifests as augmented data that are unrelated to the original data or that suffer from detail loss. In (a), when the label Geococcyx is used as input, the model fails to generate the expected result and instead produces an unrelated composite animal image. Similarly, in (b), when the input image is a Crested Auklet, the generated images possess bird-like structures, but their detailed features do not align with the target subclass. This phenomenon has been observed across different types of datasets. One contributing factor is believed to be the specialized nature of fine-grained labels: the nouns within these labels are less common than those describing general images. Consequently, this inherent imbalance during pre-training makes it difficult for models to learn fine-grained information and accurately establish a mapping between labels and semantic features. As a result, large models struggle to depict fine-grained features during data augmentation.
Using such feature-contaminated images to train fine-grained models would severely impair the models’ understanding of instances. Moreover, with limited instances, the data feature points struggle to cover the intricate boundaries between different categories. Models may lean towards adopting simplistic decision boundaries, preventing them from capturing the complex classification scenarios present in the real world.

We argue that, since fine-grained labels reflect a certain level of expertise, the naming of subclasses adheres to specific conventions. For instance, the labels Parakeet Auklet and Crested Auklet both contain Auklet in their names. This kind of textual similarity implicitly encodes fundamental subclass features, such as a red beak and a short tail. By utilizing the inherent resemblance in label descriptions, the data augmentation process can be constrained, thereby effectively enhancing the performance of FGVC in few-shot conditions. To this end, we propose a detail reinforcement diffusion model (DRDM). Specifically, the discriminative semantic recombination module is designed to explicitly emphasize subclass-specific differences from a labeling perspective. It then utilizes the extracted similarity relationships to guide and constrain the data augmentation process performed by diffusion models. Meanwhile, the spatial knowledge reference module is designed to incorporate the data distributions of various data types as reference points in the feature space. This approach addresses the challenge of poorly defined decision boundaries in FSL due to limited data, thereby enhancing the model’s instance understanding. Our model demonstrates notable scalability and can integrate knowledge supplementation from different data modalities, leveraging reference knowledge from datasets of distinct types. Our main contributions are summarized as follows:

• We analyze the limitations of applying the diffusion model to fine-grained image data augmentation and propose a Discriminative Semantic Recombination (DSR) module. This module explores the relationships between instance labels and image information under weakly supervised conditions, thus enhancing the details of augmented data.

• We propose a Spatial Knowledge Reference (SKR) approach that introduces the distributions of different data types as references in the feature space. This encourages the model to find clear and distinct boundaries for fine-grained features in the high-dimensional space, thereby enhancing the model’s understanding of instances.

• We conduct extensive experiments on three benchmark datasets. The results demonstrate that the proposed DRDM achieves favorable performance on the few-shot fine-grained (FSFG) problem and significantly improves overall performance compared to other methods.

II Related Work

II-A Fine-Grained Visual Categorization in Few-Shot Setting

Fine-grained visual categorization (FGVC) aims to achieve a refined classification of subclasses within a large class, where instances have similar appearance features and discriminative regions exist only locally. Previous research achieves this by utilizing large-scale annotated data and pre-trained deep models. However, when only few-shot data are available, those methods may become less effective [22, 23, 24]. To alleviate the pressure caused by the reduction in data quantity, relevant methods can be roughly divided into three categories: metric learning [25, 26], optimization, and data augmentation. Specifically, metric learning uses predefined metrics to learn deep representations of instances and assigns images to categories by calculating the distance or similarity between them. In this process, pose-normalized representations are often used, which first locate the semantic parts in each image and then describe the image by characterizing the appearance of each part [27]. Optimization methods often use transfer learning. Specifically, a representation is first learned on the source data with conventional deep learning, and then either a simple classifier is trained on the target data over the fixed representation [28, 29] or the model is fine-tuned [30]. Most data augmentation methods are based on the assumption that intra-class variations caused by pose, background, or lighting conditions are shared between categories. Such variations can be modeled as low-level statistical information [31] or pairwise transformations [32, 33] and applied directly to new samples.

Although these methods have proven effective in general FSL tasks, the gains achieved on fine-grained datasets are minimal. For metric learning methods, the extracted features struggle to form tight clusters for new classes, because inter-class distances are small and even small changes in the feature space can blur class boundaries [34, 35, 36]. For optimization methods, fine-tuning large models on FGVC images is challenging because discriminative features often exist only locally. With limited samples, pre-trained models often struggle to comprehend instance details and perform effective recognition. Data augmentation methods have shown significant promise. However, they also require careful design due to the risk of exacerbating the imbalance between discriminative and non-discriminative features [37, 38].

II-B Basic Principles of Diffusion Model

Diffusion models are a class of latent variable models comprising a forward noise-injection process and a reverse denoising process. During the forward process, noise is gradually added to the data, and each step is a Gaussian transition according to the following Markovian process:

$$q\left(\boldsymbol{x}_{t}|\boldsymbol{x}_{t-1}\right)=\mathcal{N}\left(\sqrt{\alpha_{t}}\boldsymbol{x}_{t-1},\beta_{t}\mathbf{I}\right),\quad\forall t\in\left\{1,\dots,T\right\}, \tag{1}$$

$$q\left(\boldsymbol{x}_{1:T}|\boldsymbol{x}_{0}\right)=\prod_{t=1}^{T}{q\left(\boldsymbol{x}_{t}|\boldsymbol{x}_{t-1}\right)}, \tag{2}$$

where $T$ is the number of diffusion steps and the mean and variance of the Gaussian noise are determined by $\beta_{t}$, with $\alpha_{t}=1-\beta_{t}$. The reverse process is another Gaussian transition

$$p_{\theta}\left(\boldsymbol{x}_{t-1}|\boldsymbol{x}_{t}\right)=\mathcal{N}\left(\boldsymbol{x}_{t-1}|\boldsymbol{\mu}_{\theta}\left(\boldsymbol{x}_{t},t\right),\sigma_{\theta}^{2}\left(\boldsymbol{x}_{t},t\right)\mathbf{I}\right), \tag{3}$$

where the mean $\boldsymbol{\mu}_{\theta}\left(\boldsymbol{x}_{t},t\right)$ can be seen as a combination of $\boldsymbol{x}_{t}$ and a noise prediction network $\epsilon_{\theta}\left(\boldsymbol{x}_{t},t\right)$. The maximum likelihood estimate of the optimal mean is

$$\tilde{\boldsymbol{\mu}}_{\theta}\left(\boldsymbol{x}_{t},t\right)=\frac{1}{\sqrt{\alpha_{t}}}\left(\boldsymbol{x}_{t}-\frac{\beta_{t}}{\sqrt{1-\bar{\alpha}_{t}}}\mathbb{E}\left(\epsilon|\boldsymbol{x}_{t}\right)\right), \tag{4}$$

where $\bar{\alpha}_{t}=\prod_{s=1}^{t}\alpha_{s}$ denotes the cumulative product of the noise schedule.

To get the optimal mean, $\epsilon_{\theta}\left(\boldsymbol{x}_{t},t\right)$ can be learned by a noise prediction objective

$$\min_{\theta}\mathbb{E}_{\boldsymbol{x}_{0}\sim q\left(\boldsymbol{x}\right),\epsilon\sim\mathcal{N}\left(0,\mathbf{I}\right),t}\left\|\epsilon_{\theta}\left(\boldsymbol{x}_{t},t\right)-\epsilon\right\|_{2}^{2}. \tag{5}$$
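The forward transition of Eq. (1) and the noise-prediction objective of Eq. (5) can be illustrated with a minimal pure-Python sketch. The linear $\beta_{t}$ schedule, its endpoint values, the toy three-dimensional $\boldsymbol{x}_{0}$, and all function names are assumptions made for illustration; the closed-form sampling of $\boldsymbol{x}_{t}$ from $\boldsymbol{x}_{0}$ is the standard DDPM identity rather than something stated in this paper.

```python
import math
import random

def make_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumed choice). Index t-1 holds the value for step t."""
    betas = [beta_start + (beta_end - beta_start) * t / max(T - 1, 1) for t in range(T)]
    alphas = [1.0 - b for b in betas]          # alpha_t = 1 - beta_t
    alpha_bars, running = [], 1.0
    for a in alphas:                           # cumulative product of alpha_s, s <= t
        running *= a
        alpha_bars.append(running)
    return betas, alphas, alpha_bars

def forward_step(x_prev, alpha_t, beta_t, rng):
    """One Gaussian transition of Eq. (1): x_t ~ N(sqrt(alpha_t) x_{t-1}, beta_t I)."""
    return [math.sqrt(alpha_t) * v + math.sqrt(beta_t) * rng.gauss(0.0, 1.0)
            for v in x_prev]

def noise_to_step(x0, alpha_bar_t, eps):
    """Closed form x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps (standard DDPM identity)."""
    return [math.sqrt(alpha_bar_t) * v + math.sqrt(1.0 - alpha_bar_t) * e
            for v, e in zip(x0, eps)]

def noise_prediction_loss(eps_pred, eps):
    """Squared L2 error between predicted and true noise, the integrand of Eq. (5)."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps))

rng = random.Random(0)
betas, alphas, alpha_bars = make_schedule(T=1000)
x0 = [0.5, -1.0, 2.0]                          # toy data point
t = 500
eps = [rng.gauss(0.0, 1.0) for _ in x0]
x_t = noise_to_step(x0, alpha_bars[t - 1], eps)
# A perfect noise predictor would return eps itself, giving zero loss.
loss = noise_prediction_loss(eps, eps)
```

In practice the predictor is a neural network evaluated on $\boldsymbol{x}_{t}$ and $t$, and the loss is averaged over random draws of $\boldsymbol{x}_{0}$, $\epsilon$, and $t$.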

For conditional generation, the condition information $c$ must also be taken into account when training the noise prediction network:

$$\min_{\theta}\mathbb{E}_{\boldsymbol{x}_{0}\sim q\left(\boldsymbol{x}\right),\epsilon\sim\mathcal{N}\left(0,\mathbf{I}\right),t,c}\left\|\epsilon_{\theta}\left(\boldsymbol{x}_{t},t,c\right)-\epsilon\right\|_{2}^{2}. \tag{6}$$

This process is typically implemented using the U-Net [39] architecture. To leverage information from different modalities, recent studies have also incorporated Transformer attention [40, 41, 42, 43] modules (including a self-attention layer, a cross-attention layer, and a fully connected feed-forward network) for feature alignment [44]. Specifically, the attention layer operates on queries $\boldsymbol{Q}\in\mathbb{R}^{n\times d_{k}}$ and key-value pairs $\boldsymbol{K}\in\mathbb{R}^{m\times d_{k}}$, $\boldsymbol{V}\in\mathbb{R}^{m\times d_{v}}$:

$$A(\boldsymbol{Q},\boldsymbol{K},\boldsymbol{V})=\text{softmax}\left(\frac{\boldsymbol{Q}\boldsymbol{K}^{T}}{\sqrt{d_{k}}}\right)\boldsymbol{V}, \tag{7}$$

where $n$ is the number of queries, $m$ is the number of key-value pairs, $d_{k}$ is the dimension of the keys, and $d_{v}$ is the dimension of the values. In the self-attention layer, $\boldsymbol{x}\in\mathbb{R}^{n\times d_{x}}$ is the only input. In the cross-attention layer of the conditioned diffusion model, there are two inputs, $\boldsymbol{x}\in\mathbb{R}^{n\times d_{x}}$ and $\boldsymbol{c}\in\mathbb{R}^{m\times d_{c}}$, where $\boldsymbol{x}$ is the output from the prior block and $\boldsymbol{c}$ represents the condition information. However, diffusion models cannot be used directly for data augmentation of fine-grained images, because fine-grained labels are highly specialized and pre-trained models have difficulty understanding the mapping between labels and semantics (see more details in Section IV).
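The attention operation of Eq. (7) can be sketched in a few lines of plain Python. This is a minimal illustration under simplifying assumptions: the learned projection matrices that map $\boldsymbol{x}$ and $\boldsymbol{c}$ to queries, keys, and values are omitted, matrices are nested lists, and the function names are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Eq. (7): softmax(Q K^T / sqrt(d_k)) V with Q: n x d_k, K: m x d_k, V: m x d_v."""
    d_k = len(K[0])
    d_v = len(V[0])
    out = []
    for q in Q:
        # Scaled dot product of this query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)])
    return out

# Cross-attention flavour: queries come from image features x, keys and values
# from condition tokens c (projections omitted in this sketch).
x_queries = [[1.0, 0.0], [0.0, 1.0]]               # n = 2, d_k = 2
c_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # m = 3
c_values = [[1.0], [2.0], [3.0]]                   # d_v = 1
attended = attention(x_queries, c_keys, c_values)  # shape n x d_v
```

Each output row is a convex combination of the value rows, so here every entry of `attended` lies between 1.0 and 3.0. Self-attention corresponds to deriving `Q`, `K`, and `V` from the same input.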

III Method

In this section, we describe our DRDM. As shown in Figure 2, it consists of two core components: the discriminative semantic recombination module and the spatial knowledge reference module.

III-A Notation

In few-shot FGVC tasks, the dataset is divided into a meta-training set $D_{base}=\left\{\left(x_{i},y_{i}\right),y_{i}\in C_{base}\right\}$ and a meta-testing set $D_{novel}=\left\{\left(x_{k},y_{k}\right),y_{k}\in C_{novel}\right\}$, where $C_{base}$ and $C_{novel}$ represent the base and novel classes, respectively, and $C_{base}\cap C_{novel}=\emptyset$. Here, $x_{k}$ and $y_{k}$ denote the input image and the class name, respectively. Furthermore, the training and testing phases of few-shot FGVC tasks are typically composed of distinct episodes. Each episode includes a labeled support set $S=\left\{\left(x_{k},y_{k}\right)\right\}_{k=1}^{N\times K}$ and an unlabeled query set $Q=\left\{\left(x_{k},y_{k}\right)\right\}_{k=1}^{N\times U}$. In these sets, $N$ signifies the number of randomly selected classes, and $S$ and $Q$ share the same classes. $K$ and $U$ represent the numbers of labeled and unlabeled samples per class, respectively, with $S\cap Q=\emptyset$. For ease of reference, we have compiled some important symbols and definitions in Table I.

TABLE I: Partial symbols and definition explanations.

Symbol	Definition
$D_{base}$ , $D_{novel}$	Meta-training and meta-test set
$D_{extra}$	Additional meta-training set
$F_{i,j}^{S}$ , $F_{i,j}^{Q}$	Computed feature of support and query set
$F_{i,j}^{E}$	Computed feature of extra set
$F_{i}^{P}$	Prototype representative of $i$ -th class
$R_{i,c}^{intra}$ , $R_{i,c}^{inter}$	Intra-class and inter-class representation score
$w_{i}^{intra}$ , $w_{i}^{inter}$	Intra-class and inter-class attention weights
$w_{i}$	Channel attention weights
$\mathcal{L}_{MS}$	Multi-Similarity loss
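
The episodic notation of Section III-A can be made concrete with a short sketch that draws one $N$-way $K$-shot episode with $U$ queries per class. The dictionary layout, the image identifiers, and the function name are hypothetical and only serve to illustrate how $S$ and $Q$ share classes while remaining disjoint.

```python
import random

def sample_episode(images_by_class, n_way, k_shot, u_query, rng):
    """Draw N classes, then K support and U query samples per class without overlap."""
    classes = rng.sample(sorted(images_by_class), n_way)
    support, query = [], []
    for c in classes:
        picked = rng.sample(images_by_class[c], k_shot + u_query)
        support += [(x, c) for x in picked[:k_shot]]   # labeled set S
        query += [(x, c) for x in picked[k_shot:]]     # set Q; labels kept only for scoring
    return support, query

# Toy dataset: 10 classes with 30 image identifiers each (placeholder names).
toy_data = {f"class_{c}": [f"img_{c}_{i}" for i in range(30)] for c in range(10)}
rng = random.Random(0)
S, Q = sample_episode(toy_data, n_way=5, k_shot=1, u_query=16, rng=rng)
```

With these settings, `S` holds $N\times K=5$ pairs and `Q` holds $N\times U=80$ pairs.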

III-B Discriminative Semantic Recombination (DSR)

As mentioned above, the implicit similarity descriptions in the textual modality encompass the fundamental features of the subclasses. We argue that transferring this part of similarity knowledge from the text space to the image space would aid in augmenting the data with more intricate details, thereby generating features with stronger representational capabilities. To achieve this, we first conduct similarity measurements in the text space. In this process, the model needs to establish a connection between fine-grained labels and instances, which becomes challenging under the few-shot paradigm. Utilizing too few instances to fine-tune a large model is not sufficient for the model to acquire enough discriminative knowledge and may lead to severe overfitting. Recently, a novel fine-tuning paradigm has emerged with the use of Adapter methods (e.g., AdapterFusion [45], AdapterDrop [46], and K-Adapter [47]). This paradigm involves adding Adapter modules to certain layers of a pre-trained model and freezing the pre-trained backbone during fine-tuning. The Adapter modules are responsible for learning specific downstream task knowledge, thereby avoiding the issues of full model fine-tuning and catastrophic forgetting.

Inspired by this, we introduce Adapter knowledge layers into the process of interconnecting the textual and visual features of the diffusion model. With only a few parameters specifically designed for the fine-grained task, these Adapter knowledge layers store instance-specific knowledge, thereby mitigating the overfitting issues that may arise from fine-tuning. We insert an adapter module into the U-Net architecture after the cross-attention layer for visual features. At the same time, an adapter is added after the pre-trained text encoder to transfer the semantic understanding. Defining the prompt as $V$, expressed as “a photo of a [class label]”, the text feature becomes $F_{C}=\Phi_{AD}\left(Z\left(V\right)\right)$, where $Z$ represents the encoder used for the prompt input, such as CLIP [48], and $\Phi_{AD}$ signifies the added adapter module. The loss function for the CLIP branch is denoted as $\mathcal{L}_{C}=\mathcal{L}_{MS}\left(F_{C}\right)$, where $\mathcal{L}_{MS}$ refers to the MS loss [49], computed as follows:

$$\mathcal{L}_{MS}=\frac{1}{m}\sum_{i=1}^{m}\left\{\frac{1}{2}\log\left[1+\sum_{k\in\mathcal{P}_{i}}e^{-2\left(S_{ik}-\frac{1}{2}\right)}\right]+\frac{1}{40}\log\left[1+\sum_{k\in\mathcal{N}_{i}}e^{40\left(S_{ik}-\frac{1}{2}\right)}\right]\right\}, \tag{8}$$

where, following [49], $m$ is the number of samples, $S_{ik}$ is the similarity between the $i$-th and $k$-th samples, and $\mathcal{P}_{i}$ and $\mathcal{N}_{i}$ are the sets of positive and negative pairs for the $i$-th sample.

The reconstruction loss of Stable Diffusion ( $\mathcal{L}_{SD}$ ) is defined as follows:

$$\mathcal{L}_{SD}=\mathbb{E}_{t,x_{0},c,\epsilon}\left\|\epsilon_{\theta}\left(x_{t},t,c\right)-\epsilon\right\|^{2}_{2}. \tag{9}$$

The overall loss function in DSR is given by:

$$\mathcal{L}_{DSR}=\mathcal{L}_{SD}+\alpha\mathcal{L}_{C}, \tag{10}$$

where $\mathcal{L}_{C}$ is the CLIP branch loss, and $\alpha$ is a hyperparameter controlling the trade-off between the reconstruction loss and the CLIP branch loss.
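
As a rough illustration of Eqs. (8) and (10), the sketch below evaluates the MS loss with the constants 2, 40, and 1/2 that appear in Eq. (8) and then combines it with a reconstruction loss. Several points are assumptions rather than details given in the text: cosine similarity is used for $S_{ik}$, positives and negatives are defined by equal or different integer labels, no pair mining is performed, and the feature vectors and function names are toy placeholders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (assumed form of S_ik)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def ms_loss(features, labels, pos_scale=2.0, neg_scale=40.0, margin=0.5):
    """Eq. (8) evaluated directly, without the pair mining step of the original MS loss."""
    m = len(features)
    total = 0.0
    for i in range(m):
        pos_sum, neg_sum = 0.0, 0.0
        for k in range(m):
            if k == i:
                continue
            s = cosine(features[i], features[k])
            if labels[k] == labels[i]:          # k in P_i
                pos_sum += math.exp(-pos_scale * (s - margin))
            else:                               # k in N_i
                neg_sum += math.exp(neg_scale * (s - margin))
        total += math.log1p(pos_sum) / pos_scale + math.log1p(neg_sum) / neg_scale
    return total / m

def dsr_loss(l_sd, l_c, alpha=0.5):
    """Eq. (10): reconstruction loss plus alpha times the CLIP branch loss."""
    return l_sd + alpha * l_c

feats = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
aligned = ms_loss(feats, labels=[0, 0, 1, 1])   # labels agree with feature similarity
shuffled = ms_loss(feats, labels=[0, 1, 0, 1])  # labels contradict feature similarity
combined = dsr_loss(l_sd=1.0, l_c=aligned, alpha=0.5)
```

The loss is smaller when same-label features are similar and different-label features are dissimilar, which is the behaviour the CLIP branch relies on to pull related label embeddings together.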

III-C Spatial Knowledge Reference (SKR)

One of the challenges in few-shot FGVC is the limited number of samples, which results in the subclass features becoming highly scattered when mapped to a high-dimensional space. Consequently, the classifier struggles to identify clear and distinct subclass feature boundaries in the high-dimensional space, significantly affecting the accuracy and performance of FGVC tasks. We argue that leveraging the knowledge from other datasets as a reference can significantly enhance the model’s understanding of similar instances.

One reason is that data from similar classes tend to lie closer together in the high-dimensional space, as shown in Figure 3. It can be observed that CUB overlaps significantly in feature distribution with NABirds, Stanford Dogs with Oxford-IIIT Pet, and Stanford Cars with CompCars, with the distribution distances noticeably smaller than those to other classes of data. Referencing similar datasets can help the model understand the subtle differences between instances with similar features, encouraging it to find clear and distinct boundaries for fine-grained features in the high-dimensional space. However, existing research on few-shot FGVC fails to utilize foundational knowledge from similar data, as the datasets’ training processes are disjoint. Additionally, directly applying knowledge from other datasets not only fails to improve the model’s performance but also leads to severe learning degradation.

To address these issues, we have designed a knowledge reference module, which aims to effectively incorporate knowledge from other datasets in a coherent manner during the training process. This module allows the model to benefit from the shared knowledge of similar data, leading to improved performance in few-shot FGVC tasks.

During the training phase on the base classes, in addition to the support set and query set, we also introduce an additional set $E=\left\{\left(x_{k},y_{k}\right)\right\}_{k=1}^{W\times K}\subseteq D_{extra}$, where $W$ represents the number of classes drawn from the additional dataset, and $D_{extra}=\left\{\left(x_{i},y_{i}\right),y_{i}\in C_{extra}\right\}$, satisfying $C_{base}\cap C_{extra}=\emptyset$ and $C_{novel}\cap C_{extra}=\emptyset$. The training framework involves a feature extractor network, denoted as $f\left(\cdot|\theta\right)$, which computes features for the different sets. The prototype of each class is expressed as $F_{i}^{P}=\frac{1}{K}\sum_{j=1}^{K}{F_{i,j}^{S}}$, where $F_{i,j}^{S}$ denotes the $j$-th support feature of the $i$-th class. Inspired by TDM [36], we apply channel attention to the features of both the support set and the query set. First, the intra-class representation score across channel dimensions is defined by:

$$R_{i,c}^{intra}=\frac{1}{H\times W}\left\|F_{i,c}^{P}-M_{i}^{P}\right\|^{2}, \tag{11}$$

where $H$ and $W$ represent the height and width of the feature map (here $W$ is the spatial width, not the number of extra classes introduced above) and $M_{i}^{P}\in\mathbb{R}^{H\times W}$ represents the mean prototype feature across channels. Second, the inter-class representation score is calculated by:

$$R_{i,c}^{inter}=\min_{j\in C,\,j\neq i}\frac{1}{H\times W}\left\|F_{i,c}^{P}-M_{j}^{P}\right\|^{2}. \tag{12}$$
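
The two representation scores can be computed with a small sketch. It assumes each prototype is stored as a list of channels, each channel being a flattened list of $H\times W$ spatial values, and it reads Eq. (12) as taking the minimum over the other classes $j\neq i$; the function names and toy numbers are illustrative only.

```python
def channel_mean(proto):
    """Mean spatial map M_i^P across channels; proto is C lists of H*W values."""
    n_channels = len(proto)
    return [sum(ch[p] for ch in proto) / n_channels for p in range(len(proto[0]))]

def mean_sq_dist(a, b):
    """Squared L2 distance between two flattened maps, divided by H*W."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def representation_scores(protos):
    """Return (R_intra, R_inter), each indexed as [class i][channel c]."""
    means = [channel_mean(p) for p in protos]
    intra, inter = [], []
    for i, proto in enumerate(protos):
        # Eq. (11): distance of each channel to the class's own mean map.
        intra.append([mean_sq_dist(ch, means[i]) for ch in proto])
        # Eq. (12): distance of each channel to the nearest other class's mean map.
        inter.append([min(mean_sq_dist(ch, means[j])
                          for j in range(len(protos)) if j != i)
                      for ch in proto])
    return intra, inter

# Two classes, two channels, H*W = 2 (toy numbers).
protos = [
    [[1.0, 1.0], [3.0, 3.0]],   # class 0, mean map [2, 2]
    [[0.0, 0.0], [0.0, 0.0]],   # class 1, mean map [0, 0]
]
r_intra, r_inter = representation_scores(protos)
# r_intra == [[1.0, 1.0], [0.0, 0.0]]
# r_inter == [[1.0, 9.0], [4.0, 4.0]]
```

For class 0, the second channel is far from class 1's mean map (score 9.0), so it would be a strongly class-discriminative channel in this toy case.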

Subsequently, the obtained intra-class representation score and inter-class representation score are passed through the fully connected network to obtain the attention weights for different channels of the $i$ -th class:

$$w_{i}^{intra}=f_{intra}\left(R_{i}^{intra}\right), \tag{13}$$

$$w_{i}^{inter}=f_{inter}\left(R_{i}^{inter}\right), \tag{14}$$

where $f_{intra}\left(\cdot\right)$ and $f_{inter}\left(\cdot\right)$ denote the fully connected networks for intra-class scores and inter-class scores, respectively. The final channel attention weight is given by:

$$w_{i}=\left(w_{i}^{intra}+w_{i}^{inter}\right)/2. \tag{15}$$

We apply channel attention weight to the prototype representation and query set of each class:

$$G_{i}^{S}=w_{i}\odot F_{i}^{P}, \tag{16}$$

$$G_{i,j}^{Q}=w_{i}\odot F_{i,j}^{Q}. \tag{17}$$

Finally, according to ProtoNet [25], the inference results of query set are given by

$$p\left(\left.y=i\right|x\right)=\frac{\exp\left(-dist\left(G_{i}^{S},G_{i}^{Q}\right)\right)}{\sum\nolimits_{j=1}^{N}{\exp\left(-dist\left(G_{j}^{S},G_{j}^{Q}\right)\right)}}, \tag{18}$$

where $dist\left(\cdot,\cdot\right)$ denotes a distance measure between features. The SKR computation gives rise to a corresponding loss term, $\mathcal{L}_{SKR}=\mathcal{L}_{MS}\left(Cat\left(F^{S},F^{E}\right)\right)$, with $Cat\left(\cdot\right)$ representing the concatenation operation. The overall loss function for the final classification network is expressed as:

$$\mathcal{L}_{CLS}=-\frac{1}{N\times U}\sum_{k=1}^{N\times U}{\left(\boldsymbol{y}_{k}^{T}\log\left(\boldsymbol{p}_{k}\right)\right)}+\beta\mathcal{L}_{SKR}, \tag{19}$$

where $\boldsymbol{y}_{k}$ stands for the one-hot vector and $\boldsymbol{p}_{k}$ for predicted probability, and $\beta$ is a hyperparameter controlling the balance between the classification cross-entropy loss and the SKR loss.
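
The inference path from prototypes to class probabilities (Eqs. (16) to (18)) and the per-query form of Eq. (19) can be sketched as follows. The sketch makes several assumptions: features are flattened to plain vectors, squared Euclidean distance stands in for $dist$, the channel weights are supplied directly instead of being produced by $f_{intra}$ and $f_{inter}$, and the SKR term is passed in as a precomputed number. All names are hypothetical.

```python
import math

def prototype(support_feats):
    """Class prototype F_i^P: element-wise mean of the K support features."""
    k = len(support_feats)
    return [sum(f[d] for f in support_feats) / k for d in range(len(support_feats[0]))]

def sq_euclid(a, b):
    """Squared Euclidean distance (an assumed choice for dist)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify(query_feat, protos, weights):
    """Eqs. (16)-(18): weight prototype and query per class, then softmax over -dist."""
    logits = []
    for proto, w in zip(protos, weights):
        g_s = [wi * p for wi, p in zip(w, proto)]        # G_i^S = w_i * F_i^P
        g_q = [wi * q for wi, q in zip(w, query_feat)]   # G_i^Q = w_i * F^Q
        logits.append(-sq_euclid(g_s, g_q))
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def classification_loss(probs, true_class, l_skr, beta=0.5):
    """Single-query version of Eq. (19): cross-entropy plus beta times the SKR loss."""
    return -math.log(probs[true_class]) + beta * l_skr

protos = [prototype([[1.0, 0.0], [1.0, 0.2]]),   # class 0
          prototype([[0.0, 1.0], [0.2, 1.0]])]   # class 1
weights = [[1.0, 1.0], [1.0, 1.0]]               # uniform channel attention (placeholder)
probs = classify([0.9, 0.1], protos, weights)
loss = classification_loss(probs, true_class=0, l_skr=0.0)
```

Here the query lies close to the class 0 prototype, so `probs[0]` is the larger entry. Non-uniform `weights` would rescale individual channels before the distance is taken, which is how the channel attention changes the decision.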

IV Experiments

In this section, we extensively evaluated the performance of our approach. We compared the performance of our approach with the latest state-of-the-art (SOTA) methods on each network architecture. The experimental settings, implementation details, and results for diverse tasks are described below.

IV-A Datasets and Experimental Setup

Experiments are conducted on three widely used datasets. All datasets provide fixed train and test splits. The details are summarized in Table II.

TABLE II: The splits of the datasets. C_all is the total number of subclasses, and C_train, C_val, and C_test are the numbers of training, validation, and test subclasses, respectively. The classes of the subsets are disjoint.

Dataset	C_all	C_train	C_val	C_test
CUB-200-2011[1]	200	100	50	50
Stanford Dogs[50]	120	60	30	30
Stanford Cars[51]	196	130	17	49

For the CUB dataset, our data split is the same as in [52]. For the Cars dataset, we adhere to the same data split as [53]. The Dogs dataset comprises 90 subclasses designated for training and validation, along with an additional 30 subclasses for testing. To achieve effective spatial knowledge referencing, we employed the NABirds [54], Oxford-IIIT Pet [55], and CompCars [56] datasets as supplementary knowledge sources for the CUB, Dogs, and Cars datasets, respectively. In the experiments, a ResNet [57] pre-trained on ImageNet serves as the backbone, and all input images are cropped to $84\times 84$. The model is trained with stochastic gradient descent (SGD) with a momentum of 0.9 for all datasets. The initial learning rate was set to 0.001 for the main branch and 0.01 for the remaining layers. Our implementation is based on PyTorch and runs on an NVIDIA GeForce RTX 3090 Ti GPU. In the N-way K-shot scenario, we carried out few-shot classification on 10,000 randomly sampled episodes, each containing 16 queries per class. We report the average classification accuracy along with 95% confidence intervals, as in [36].
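
The reporting protocol (mean accuracy over episodes with a 95% confidence interval) can be sketched as below. The normal approximation with the factor 1.96 is an assumption about how the interval is computed, and the episode accuracies are simulated here with an arbitrary per-query success rate of 0.9 purely to exercise the code; they are not results from the paper.

```python
import math
import random
import statistics

def mean_and_ci95(episode_accs):
    """Mean accuracy and the half-width of a normal-approximation 95% interval."""
    n = len(episode_accs)
    mean = statistics.fmean(episode_accs)
    half_width = 1.96 * statistics.stdev(episode_accs) / math.sqrt(n)
    return mean, half_width

# Simulated 5-way episodes with 16 queries per class (80 queries per episode).
# 1,000 episodes keep the toy run cheap; the protocol above uses 10,000.
rng = random.Random(0)
queries_per_episode = 5 * 16
accs = [sum(rng.random() < 0.9 for _ in range(queries_per_episode)) / queries_per_episode
        for _ in range(1000)]
mean_acc, ci = mean_and_ci95(accs)
report = f"{100 * mean_acc:.2f} +/- {100 * ci:.2f}"
```

Because the half-width shrinks with the square root of the number of episodes, averaging over many episodes is what makes sub-percent intervals such as those in the result tables attainable.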

IV-B Model Configuration

Model configuration experiments are conducted to verify the validity of the individual component and to determine the hyperparameters.

Multi-Similarity Loss ( $\boldsymbol{\alpha}$ ): To verify the effectiveness of MS loss and investigate the influence of the parameter $\alpha$ , extensive experiments are carried out on the three datasets, and the results are presented in Table III.

TABLE III: Experimental results using varied $\alpha$. “w/o” means learning without MS loss. The best performance is indicated in bold.

$\alpha$	0.1	0.3	0.5	0.7	0.9	w/o
CUB	88.09	88.40	88.53	88.35	88.05	88.14
Dogs	71.38	72.11	72.28	71.93	72.51	72.04
Car	80.58	80.83	81.03	80.80	80.63	80.38

When $\alpha$ was set properly (e.g., $\alpha\in[0.3, 0.7]$), the MS loss could effectively embed the information contained in the labels into the image space, to some extent facilitating the model’s comprehension of similar instances. However, increasing $\alpha$ beyond a certain point led to a slight decline in our model’s performance. One possible reason is that the model relies too heavily on the relationships among labels within the few-shot training paradigm, potentially resulting in overfitting. This suggests that $\alpha=0.5$ is a reliable choice for DRDM.

TABLE IV: Few-shot classification accuracy on the CUB, Stanford Dogs, and Stanford Cars datasets. All experiments are 5-way classification with the same backbone network (ResNet12). The best performance is indicated in bold.

Method	CUB-200-2011		Stanford Dogs		Stanford Cars
Method	1-shot	5-shot	1-shot	5-shot	1-shot	5-shot
MTL [58] (CVPR@2019)	73.31 $\pm$ 0.92	82.29 $\pm$ 0.51	54.96 $\pm$ 1.03	68.76 $\pm$ 0.65	-	-
MetaOptNet [59] (CVPR@2019)	75.15 $\pm$ 0.46	87.09 $\pm$ 0.30	65.48 $\pm$ 0.49	79.39 $\pm$ 0.25	-	-
S2M2 [60] (WACV@2020)	71.43 $\pm$ 0.28	85.55 $\pm$ 0.52	-	-	-	-
Neg-Cosine [61] (ECCV@2020)	72.66 $\pm$ 0.85	89.40 $\pm$ 0.43	-	-	-	-
A2 [62] (ECCV@2020)	74.22 $\pm$ 1.09	88.65 $\pm$ 0.55	-	-	-	-
MattML [63] (IJCAI@2020)	66.29 $\pm$ 0.56	80.34 $\pm$ 0.30	54.84 $\pm$ 0.53	71.34 $\pm$ 0.38	66.11 $\pm$ 0.54	82.80 $\pm$ 0.28
BSNet(R&C) [64] (TIP@2021)	65.89 $\pm$ 1.00	80.99 $\pm$ 0.63	51.06 $\pm$ 0.94	68.60 $\pm$ 0.73	54.12 $\pm$ 0.96	73.47 $\pm$ 0.75
VFD* [34] (ICCV@2021)	79.12 $\pm$ 0.83	91.48 $\pm$ 0.39	57.04 $\pm$ 0.89	72.95 $\pm$ 0.70	-	-
APF [65] (PR@2022)	78.73 $\pm$ 0.84	89.77 $\pm$ 0.47	60.89 $\pm$ 0.98	78.14 $\pm$ 0.62	78.14 $\pm$ 0.84	87.42 $\pm$ 0.57
TDM [36] (CVPR@2022)	84.36 $\pm$ 0.19	93.37 $\pm$ 0.10	57.32 $\pm$ 0.22	75.26 $\pm$ 0.16	67.10 $\pm$ 0.22	86.05 $\pm$ 0.12
CFMA [66] (IS@2022)	74.68 $\pm$ 1.38	90.91 $\pm$ 0.94	-	-	-	-
QGN [67] (PR@2023)	83.82 $\pm$ 0.00	91.22 $\pm$ 0.00	-	-	-	-
T2L [68] (KBS@2023)	71.04 $\pm$ 1.21	83.44 $\pm$ 0.94	52.12 $\pm$ 1.14	70.83 $\pm$ 1.09	56.80 $\pm$ 1.23	74.10 $\pm$ 1.65
TasNet [69] (PR@2023)	83.89 $\pm$ 0.69	91.35 $\pm$ 0.53	-	-	-	-
Ours	89.99 $\pm$ 0.14	94.63 $\pm$ 0.10	72.68 $\pm$ 0.17	80.43 $\pm$ 0.12	81.53 $\pm$ 0.15	90.03 $\pm$ 0.12

Spatial Knowledge Reference ( $\boldsymbol{\beta}$ ): To investigate the influence of SKR, we measured the impact of the number of introduced reference subclasses (N) under different $\beta$ values for both the 1-shot and 5-shot scenarios on the three datasets. The results are shown in Figure 4.

As can be seen, on the CUB dataset, the model’s accuracy improved after incorporating subclasses as reference knowledge. However, when the number exceeded five, a slight reduction in accuracy was observed. This might be because the model has difficulty comprehending the target subclasses in the few-shot setting when too many reference subclasses are introduced. In addition, accuracy increased with $\beta$ up to 0.5 and then decreased modestly as $\beta$ was raised from 0.5 to 0.7, suggesting that at $\beta=0.5$ the model already leverages a sufficient amount of reference knowledge. Similar results can be observed on the other two datasets. Therefore, we chose $N=5$ and $\beta=0.5$ for all subsequent experiments, as this setting balances computational complexity and accuracy.

IV-C Performance Evaluation

The results of DRDM compared with recent SOTA methods on three datasets are presented in Table IV. On the CUB dataset, the strongest prior method is TDM [36], whose performance can be credited to its channel attention mechanism, which produces a support weight representing the channel-wise discriminative power for each subclass. Benefiting from the embedding and learning of textual feature space relationships, our approach improves on it by 5.63% and 1.26% in the 1-shot and 5-shot settings, respectively. On the Dogs dataset, the best prior performance is achieved by MetaOptNet [59]. This approach likewise overlooks the potential structural relationships within the labels; in comparison, our method achieves improvements of 7.20% and 1.04% in the 1-shot and 5-shot settings, respectively. On the Cars dataset, the best prior performance is achieved by APF [65], over which our method improves by 3.39% and 2.61% in the 1-shot and 5-shot settings, respectively. These results provide compelling evidence for the effectiveness of DRDM.

To validate the advantage of our proposed algorithm in fine-grained data augmentation, we also compared it with data augmentation methods based on diffusion models. Instances augmented by different diffusion models are shown in Figure 5.

TABLE V: Comparison of data augmentation methods based on diffusion models. The best performance is indicated in bold.

Method	CUB-200-2011		Stanford Dogs		Stanford Cars
Method	1-shot	5-shot	1-shot	5-shot	1-shot	5-shot
LDM [21] (CVPR@2022)	82.82 $\pm$ 0.18	90.12 $\pm$ 0.12	58.77 $\pm$ 0.18	70.48 $\pm$ 0.16	73.70 $\pm$ 0.18	86.50 $\pm$ 0.12
VQ-Diffusion [70] (CVPR@2022)	81.54 $\pm$ 0.19	89.96 $\pm$ 0.12	60.99 $\pm$ 0.19	70.60 $\pm$ 0.16	62.02 $\pm$ 0.21	85.12 $\pm$ 0.13
Ours	89.99 $\pm$ 0.14	94.63 $\pm$ 0.10	72.68 $\pm$ 0.17	80.43 $\pm$ 0.12	81.53 $\pm$ 0.15	90.03 $\pm$ 0.12

It was observed that, while the data augmented by existing diffusion-based models could capture the basic outline and features of the instances (e.g., Spotted Catbird, Bobolink, and Parakeet Auklet), the limited number of training samples makes it challenging for these models to capture finer details. For instance, the feathers of the Spotted Catbird are green, but the augmented data shows grey feathers. Moreover, the labels of fine-grained images require a certain level of expertise, making it difficult for large models to establish a clear mapping between labels and semantic features during pre-training. As a result, the augmented images may not correctly represent the instances (e.g., the augmented data of Geococcyx and Chuck-will-widow). This leads to significant feature contamination, impeding model learning in a few-shot setting.

This conclusion is supported quantitatively in Table V. To ensure a fair comparison, all results were obtained using the same framework. It can be observed that using SOTA diffusion-based models for data augmentation leads to a decrease in 1-shot accuracy compared with the best prior results in Table IV, with declines of 1.54%, 4.49%, and 4.44% on the three datasets, respectively. This reveals the prevalence of feature contamination and its impact on few-shot FGVC learning.

TABLE VI: Ablation studies of the DRDM on three datasets. The best performance is indicated in bold.

Dataset	Framework		Framework + SKR		Framework + DSR		DRDM
Dataset	1-shot	5-shot	1-shot	5-shot	1-shot	5-shot	1-shot	5-shot
CUB	84.18 $\pm$ 0.19	93.44 $\pm$ 0.10	86.43 $\pm$ 0.16	93.24 $\pm$ 0.10	88.14 $\pm$ 0.15	93.62 $\pm$ 0.10	89.99 $\pm$ 0.14	94.63 $\pm$ 0.10
Dogs	61.44 $\pm$ 0.22	78.73 $\pm$ 0.15	62.53 $\pm$ 0.23	79.69 $\pm$ 0.15	72.14 $\pm$ 0.18	79.86 $\pm$ 0.15	72.68 $\pm$ 0.17	80.43 $\pm$ 0.12
Cars	67.10 $\pm$ 0.22	86.05 $\pm$ 0.12	71.79 $\pm$ 0.21	88.79 $\pm$ 0.12	80.58 $\pm$ 0.17	89.52 $\pm$ 0.11	81.53 $\pm$ 0.15	90.03 $\pm$ 0.12

IV-D Ablation Study

To evaluate the proposed DRDM, an ablation study was conducted. “Framework” refers to the base structure without either of the proposed strategies. On this basis, the influence of the DSR and SKR strategies on learning was explored. The experimental results are presented in Table VI.

Taking CUB as an example, SKR achieved an improvement of up to 2.25% (in the 1-shot setting) over “Framework”. Next, we evaluated the proposed DSR, which is used to explore the relationships between instance labels and image information under weakly supervised conditions. This module improved the 1-shot accuracy by 3.96% over “Framework”, which suggests that extracting potential similarity relationships in labels helps the model understand features better. When SKR and DSR were used together, the model achieved its best performance. Since our method uses a noise prediction network, we further validated the model’s efficacy by subjecting it solely to noise addition. We conducted this experiment on the CUB dataset: the accuracy for 5-way-1-shot and 5-way-5-shot was 81.72% and 91.98%, respectively. Compared to “Framework”, accuracy decreased by 2.46% and 1.46%, respectively. This indicates that merely adding noise does not improve the model’s generalization ability and in fact interferes with its understanding of fine-grained targets.
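The gains quoted above can be re-derived directly from Table VI. The following is a minimal sketch that only recomputes differences from the 1-shot values already listed in the table; the dictionary and function names are illustrative and are not part of the original work.

```python
# 1-shot accuracies (%) copied from Table VI.
TABLE_VI_1SHOT = {
    "CUB":  {"Framework": 84.18, "Framework + SKR": 86.43,
             "Framework + DSR": 88.14, "DRDM": 89.99},
    "Dogs": {"Framework": 61.44, "Framework + SKR": 62.53,
             "Framework + DSR": 72.14, "DRDM": 72.68},
    "Cars": {"Framework": 67.10, "Framework + SKR": 71.79,
             "Framework + DSR": 80.58, "DRDM": 81.53},
}


def gains_over_framework(table):
    """Absolute accuracy gain of each variant over the Framework column."""
    gains = {}
    for dataset, row in table.items():
        base = row["Framework"]
        gains[dataset] = {
            name: round(acc - base, 2)
            for name, acc in row.items()
            if name != "Framework"
        }
    return gains


if __name__ == "__main__":
    for dataset, row in gains_over_framework(TABLE_VI_1SHOT).items():
        print(dataset, row)
```

For CUB this reproduces the 2.25% (SKR) and 3.96% (DSR) figures discussed in the text, and shows that the 1-shot gains on Dogs and Cars follow the same ordering (SKR < DSR < DRDM).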

TABLE VII: Performance combined with the current state-of-the-art FSL methods.

Method	CUB-200-2011		Stanford Dogs		Stanford Cars
Method	1-shot	5-shot	1-shot	5-shot	1-shot	5-shot
ProtoNet [25] (NIPS@2017)	77.66 $\pm$ 0.21	89.42 $\pm$ 0.12	45.92 $\pm$ 0.21	67.50 $\pm$ 0.17	47.60 $\pm$ 0.21	72.81 $\pm$ 0.18
ProtoNet + DRDM	82.58 $\pm$ 0.18	90.80 $\pm$ 0.11	58.92 $\pm$ 0.18	70.45 $\pm$ 0.16	61.66 $\pm$ 0.20	74.37 $\pm$ 0.17
FRN [52] (CVPR@2021)	83.55 $\pm$ 0.19	92.92 $\pm$ 0.10	55.49 $\pm$ 0.21	74.54 $\pm$ 0.16	62.07 $\pm$ 0.22	83.18 $\pm$ 0.14
FRN + DRDM	85.40 $\pm$ 0.17	93.52 $\pm$ 0.10	61.88 $\pm$ 0.22	77.86 $\pm$ 0.15	79.25 $\pm$ 0.19	88.23 $\pm$ 0.10

IV-E Scalability Analysis

We also apply the augmented data produced by our method to existing models to evaluate the gains it brings to existing algorithms. As shown in Table VII, on the CUB dataset our algorithm achieves 1-shot accuracy improvements of 4.92% and 1.85% for ProtoNet and FRN, respectively. Similar conclusions can be drawn for the Dogs and Cars datasets. This indicates that our algorithm has good portability and can effectively enhance the performance of other few-shot learning models.
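To illustrate how augmented data can be plugged into a prototype-based few-shot learner such as ProtoNet [25], the sketch below builds class prototypes from the union of real and synthetic support features and classifies queries by nearest prototype. How the augmented samples are actually merged with the support set is not specified in this section, so simple concatenation is an assumption made here, and all function names are hypothetical.

```python
import numpy as np


def prototypes(features, labels, n_way):
    """Mean embedding per class. features: (n, d); labels: (n,) in [0, n_way).

    Every class must have at least one support sample.
    """
    return np.stack([features[labels == c].mean(axis=0) for c in range(n_way)])


def classify(queries, protos):
    """Assign each query to the prototype at the smallest squared Euclidean distance."""
    d2 = ((queries[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def augmented_episode(real_feats, real_labels, synth_feats, synth_labels,
                      queries, n_way):
    """Assumption: synthetic support features are simply concatenated with real ones."""
    feats = np.concatenate([real_feats, synth_feats])
    labels = np.concatenate([real_labels, synth_labels])
    return classify(queries, prototypes(feats, labels, n_way))


if __name__ == "__main__":
    # Toy 2-way 1-shot episode with one synthetic sample per class.
    real = np.array([[0.0, 0.0], [10.0, 10.0]])
    synth = np.array([[1.0, 1.0], [9.0, 9.0]])
    lab = np.array([0, 1])
    q = np.array([[0.2, 0.1], [9.5, 9.8]])
    print(augmented_episode(real, lab, synth, lab, q, n_way=2))
```

In the 1-shot case each prototype would otherwise be a single embedding, so any additional faithful sample changes the prototype substantially; this is one plausible reading of why the gains in Table VII are larger for 1-shot than for 5-shot.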

We also visualize the similarity of label features on the three datasets to illustrate how our DSR module addresses semantic misalignment. The results are presented in Figure 7. It can be observed that after applying DSR, the similarity relationships between labels are effectively constrained. This also explains why our augmentation can effectively avoid generating irrelevant images. Similar results can be observed in the other two datasets. This indicates that our method can better guide the learning process to improve the generalization ability of the model.

We also assessed the impact of different types of reference data on the knowledge referencing module. The 1-shot results are reported in Table VIII.
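A label-similarity visualization of this kind is typically built from a pairwise similarity matrix over label embeddings. The sketch below computes a cosine similarity matrix with NumPy; the use of cosine similarity and the toy vectors are assumptions for illustration only, since this section does not state which similarity measure or text encoder underlies Figure 7.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between row vectors of shape (n_labels, d)."""
    e = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(e, axis=1, keepdims=True)
    e = e / np.clip(norms, 1e-12, None)  # guard against zero-length vectors
    return e @ e.T


if __name__ == "__main__":
    # Toy vectors standing in for label embeddings; NOT real encoder outputs.
    labels = ["Spotted Catbird", "Bobolink", "Parakeet Auklet"]
    toy = np.array([[1.0, 0.0, 0.0],
                    [0.8, 0.6, 0.0],
                    [0.0, 0.0, 1.0]])
    sim = cosine_similarity_matrix(toy)
    for name, row in zip(labels, sim):
        print(f"{name:16s}", np.round(row, 2))
```

The resulting matrix can be rendered as a heatmap; off-diagonal entries close to 1 would indicate labels whose embeddings are hard to tell apart.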

TABLE VIII: Impact of data types on knowledge referencing. “Pet” represents the “Oxford-IIIT Pet” dataset.

Dataset	Baseline	NABirds	Pet	CompCars
CUB [1]	84.18	86.43	85.34	85.09
Dogs [50]	61.44	61.55	62.53	60.89
Cars [51]	67.10	68.95	68.84	71.79

It was observed that leveraging external data to enhance the model’s understanding of instances under limited samples was effective. When the reference instances differ substantially from the target instances, embedding spatial relationships did not significantly impact the model’s performance. However, in cases of similar data types, the model demonstrated greater confidence in discrimination. We also employed t-SNE [71] to qualitatively illustrate the impact of knowledge referencing on features. Additionally, we utilized the spatial utilization ratio $\rho$ [72] to quantitatively characterize the degree of subclass aggregation. $\rho$ represents the ratio of the average inter-class distance to the average intra-class distance of features. A higher value indicates a more compact feature distribution and a wider decision boundary, which allows the model to make decisions more confidently. The results are presented in Figure 6. Compared to the results of the framework, using label similarity relationships to constrain model learning led to feature distribution aggregation. In fact, on the Cars and Dogs datasets, $\rho$ even doubled. Subsequently, with the introduction of appropriate knowledge as reference, the feature distribution exhibited further intra-subclass aggregation and wider inter-subclass boundaries, ultimately benefiting the model’s final decision-making process. Compared to the ProtoNet and FRN methods, our algorithm still exhibited the highest spatial utilization ratio, which partially explains why we achieved the best results.
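As a concrete reading of the definition of $\rho$ given above, the sketch below computes the ratio of average inter-class distance to average intra-class distance. The specific choices are assumptions, not taken from the paper or from [72]: inter-class distance is measured between class centroids, intra-class distance from each sample to its own centroid, and both use the Euclidean norm.

```python
import numpy as np


def spatial_utilization_ratio(features, labels):
    """rho = mean inter-class distance / mean intra-class distance.

    Assumed definitions (illustrative only):
      intra: mean Euclidean distance of each sample to its class centroid
      inter: mean Euclidean distance over all pairs of distinct centroids
    Requires at least two classes and a non-zero intra-class spread.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)  # sorted unique class ids
    centroids = np.stack([features[labels == c].mean(axis=0) for c in classes])

    # Map every sample to the row index of its own centroid.
    idx = np.searchsorted(classes, labels)
    intra = np.linalg.norm(features - centroids[idx], axis=1).mean()

    # Pairwise centroid distances, upper triangle only (distinct pairs).
    diff = centroids[:, None, :] - centroids[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    inter = dist[np.triu_indices(len(classes), k=1)].mean()
    return inter / intra


if __name__ == "__main__":
    loose = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
    tight = np.array([[0.0, 0.5], [0.0, 1.5], [10.0, 0.5], [10.0, 1.5]])
    y = np.array([0, 0, 1, 1])
    print(spatial_utilization_ratio(loose, y))
    print(spatial_utilization_ratio(tight, y))
```

In the toy example the two classes keep the same centroids, but halving the within-class spread doubles $\rho$, which matches the interpretation that a higher value corresponds to tighter subclass aggregation relative to the separation between subclasses.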

V Conclusions

Augmenting data with subtle and consistent discriminative features is one effective approach to achieving reliable few-shot FGVC. In this paper, we introduce a diffusion-based solution that explores latent instance similarity relationships within labels and employs external data for feature referencing, achieving enhanced data augmentation through detail reinforcement. The experimental results show that solely employing a diffusion-based class generation model for data augmentation cannot effectively address the challenge of few-shot learning. In contrast, our method leverages the constraints of similarity relationships within labels and the reference from external knowledge, effectively alleviating label feature contamination and achieving a substantial performance improvement. We also qualitatively and quantitatively validated the impact of utilizing reference knowledge from different classes on the model. Our approach can effectively extract knowledge from similar subclass data while avoiding the influence of irrelevant subclasses. It should be noted that our method primarily relies on label similarity for feature constraints, potentially overlooking underlying cross-hierarchical label relationships. In the future, we aim to explore additional correlated information, such as cross-hierarchical label relationships or structural information in a single-instance feature, to further enhance the model’s generalization capabilities.

References

[1] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The caltech-ucsd birds-200-2011 dataset.” California Institute of Technology, 2011.
[2] J. Mańdziuk, “New shades of the vehicle routing problem: Emerging problem formulations and computational intelligence solution methods,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 3, no. 3, pp. 230–244, 2019.
[3] K. Sadeghi, A. Banerjee, and S. K. Gupta, “A system-driven taxonomy of attacks and defenses in adversarial machine learning,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 4, no. 4, pp. 450–467, 2020.
[4] J. Yi, H. Zhang, J. Mao, Y. Chen, H. Zhong, and Y. Wang, “Pharmaceutical foreign particle detection: An efficient method based on adaptive convolution and multiscale attention,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 6, no. 6, pp. 1302–1313, 2022.
[5] J. Du, K. Guan, Y. Zhou, Y. Li, and T. Wang, “Parameter-free similarity-aware attention module for medical image classification and segmentation,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 7, no. 3, pp. 845–857, 2023.
[6] S. Ye, Y. Wang, Q. Peng, X. You, and C. P. Chen, “The image data and backbone in weakly supervised fine-grained visual categorization: A revisit and further thinking,” IEEE Transactions on Circuits and Systems for Video Technology, 2023.
[7] X. He, Y. Peng, and J. Zhao, “Stackdrl: Stacked deep reinforcement learning for fine-grained visual categorization.” in IJCAI, 2018, pp. 741–747.
[8] X. Zheng, L. Qi, Y. Ren, and X. Lu, “Fine-grained visual categorization by localizing object parts with single image,” IEEE Transactions on Multimedia, vol. 23, pp. 1187–1199, 2020.
[9] Y. Ding, Y. Zhou, Y. Zhu, Q. Ye, and J. Jiao, “Selective sparse sampling for fine-grained image recognition,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6599–6608.
[10] S. Ye, Q. Peng, W. Sun, J. Xu, Y. Wang, X. You, and Y.-M. Cheung, “Discriminative suprasphere embedding for fine-grained visual categorization,” IEEE Transactions on Neural Networks and Learning Systems, 2022.
[11] Y. Wang, S. Ye, S. Yu, and X. You, “R2-trans: Fine-grained visual categorization with redundancy reduction,” arXiv preprint arXiv:2204.10095, 2022.
[12] Z. Hong, S. Chen, G. Xie, W. Yang, J. Zhao, Y. Shao, Q. Peng, and X. You, “Semantic compression embedding for generative zero-shot learning,” IJCAI, Vienna, Austria, vol. 7, pp. 956–963, 2022.
[13] Y. Shu, B. Yu, H. Xu, and L. Liu, “Improving fine-grained visual recognition in low data regimes via self-boosting attention mechanism,” in Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXV. Springer, 2022, pp. 449–465.
[14] Y. Wang, X. Pei, and H. Zhan, “Fine-grained graph learning for multi-view subspace clustering,” IEEE Transactions on Emerging Topics in Computational Intelligence, pp. 1–12, 2023.
[15] K.-Y. Feng, M. Gong, K. Pan, H. Zhao, Y. Wu, and K. Sheng, “Model sparsification for communication-efficient multi-party learning via contrastive distillation in image classification,” IEEE Transactions on Emerging Topics in Computational Intelligence, pp. 1–14, 2023.
[16] K. Li, Y. Zhang, K. Li, and Y. Fu, “Adversarial feature hallucination networks for few-shot learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 13 470–13 479.
[17] H. Le and D. Samaras, “Shadow removal via shadow image decomposition,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 8578–8587.
[18] R. Zhang, T. Che, Z. Ghahramani, Y. Bengio, and Y. Song, “Metagan: An adversarial approach to few-shot learning,” Advances in Neural Information Processing Systems, vol. 31, 2018.
[19] H. Gao, Z. Shou, A. Zareian, H. Zhang, and S.-F. Chang, “Low-shot learning via covariance-preserving adversarial augmentation networks,” Advances in Neural Information Processing Systems, vol. 31, 2018.
[20] S. Tsutsui, Y. Fu, and D. Crandall, “Meta-reinforced synthetic data for one-shot fine-grained visual recognition,” Advances in Neural Information Processing Systems, vol. 32, 2019.
[21] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 684–10 695.
[22] W. Li, Z. Wang, X. Yang, C. Dong, P. Tian, T. Qin, J. Huo, Y. Shi, L. Wang, Y. Gao et al., “Libfewshot: A comprehensive library for few-shot learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[23] S. Yan, N. Dong, L. Zhang, and J. Tang, “Clip-driven fine-grained text-image person re-identification,” IEEE Transactions on Image Processing, 2023.
[24] Z. Xin, S. Chen, T. Wu, Y. Shao, W. Ding, and X. You, “Few-shot object detection: Research advances and challenges,” Information Fusion, p. 102307, 2024.
[25] J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot learning,” Advances in Neural Information Processing Systems, vol. 30, 2017.
[26] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales, “Learning to compare: Relation network for few-shot learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1199–1208.
[27] L. Tang, D. Wertheimer, and B. Hariharan, “Revisiting pose-normalization for fine-grained few-shot recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 14 352–14 361.
[28] W.-Y. Chen, Y.-C. Liu, Z. Kira, Y.-C. F. Wang, and J.-B. Huang, “A closer look at few-shot classification,” arXiv preprint arXiv:1904.04232, 2019.
[29] Y. Tian, Y. Wang, D. Krishnan, J. B. Tenenbaum, and P. Isola, “Rethinking few-shot image classification: a good embedding is all you need?” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16. Springer, 2020, pp. 266–282.
[30] G. S. Dhillon, P. Chaudhari, A. Ravichandran, and S. Soatto, “A baseline for few-shot image classification,” arXiv preprint arXiv:1909.02729, 2019.
[31] X. Yin, X. Yu, K. Sohn, X. Liu, and M. Chandraker, “Feature transfer learning for deep face recognition with under-represented data,” arXiv preprint arXiv:1803.09014, 2018.
[32] B. Hariharan and R. Girshick, “Low-shot visual recognition by shrinking and hallucinating features,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3018–3027.
[33] E. Schwartz, L. Karlinsky, J. Shtok, S. Harary, M. Marder, A. Kumar, R. Feris, R. Giryes, and A. Bronstein, “Delta-encoder: an effective sample synthesis method for few-shot object recognition,” Advances in Neural Information Processing Systems, vol. 31, 2018.
[34] J. Xu, H. Le, M. Huang, S. Athar, and D. Samaras, “Variational feature disentangling for fine-grained few-shot classification,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 8812–8821.
[35] B. Zhang, X. Li, Y. Ye, Z. Huang, and L. Zhang, “Prototype completion with primitive knowledge for few-shot learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3754–3762.
[36] S. Lee, W. Moon, and J.-P. Heo, “Task discrepancy maximization for fine-grained few-shot classification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5331–5340.
[37] “Coping with change: Learning invariant and minimum sufficient representations for fine-grained visual categorization,” Computer Vision and Image Understanding, vol. 237, p. 103837, 2023.
[38] Z. Hong, Z. Wang, L. Shen, Y. Yao, Z. Huang, S. Chen, C. Yang, M. Gong, and T. Liu, “Improving non-transferable representation learning by harnessing content and style,” in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=FYKVPOHCpE
[39] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
[40] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
[41] X. Liu, Z. Hu, H. Ling, and Y.-m. Cheung, “Mtfh: A matrix tri-factorization hashing framework for efficient cross-modal retrieval,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 3, pp. 964–981, 2019.
[42] X. Liu, X. Wang, and Y.-m. Cheung, “Fddh: Fast discriminative discrete hashing for large-scale cross-modal retrieval,” IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 11, pp. 6306–6320, 2021.
[43] S.-J. Peng, Y. Fan, Y.-m. Cheung, X. Liu, Z. Cui, and T. Li, “Towards efficient cross-modal anomaly detection using triple-adaptive network and bi-quintuple contrastive learning,” IEEE Transactions on Emerging Topics in Computational Intelligence, 2023.
[44] G. Baykal, H. F. Karagoz, T. Binhuraib, and G. Unal, “Protodiffusion: Classifier-free diffusion guidance with prototype learning,” arXiv preprint arXiv:2307.01924, 2023.
[45] J. Pfeiffer, A. Kamath, A. Rücklé, K. Cho, and I. Gurevych, “Adapterfusion: Non-destructive task composition for transfer learning,” arXiv preprint arXiv:2005.00247, 2020.
[46] A. Rücklé, G. Geigle, M. Glockner, T. Beck, J. Pfeiffer, N. Reimers, and I. Gurevych, “Adapterdrop: On the efficiency of adapters in transformers,” arXiv preprint arXiv:2010.11918, 2020.
[47] R. Wang, D. Tang, N. Duan, Z. Wei, X. Huang, G. Cao, D. Jiang, M. Zhou et al., “K-adapter: Infusing knowledge into pre-trained models with adapters,” arXiv preprint arXiv:2002.01808, 2020.
[48] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763.
[49] X. Wang, X. Han, W. Huang, D. Dong, and M. R. Scott, “Multi-similarity loss with general pair weighting for deep metric learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 5022–5030.
[50] A. Khosla, N. Jayadevaprakash, B. Yao, and F.-F. Li, “Novel dataset for fine-grained image categorization: Stanford dogs,” in Proc. CVPR workshop on fine-grained visual categorization (FGVC), vol. 2, no. 1. Citeseer, 2011.
[51] J. Krause, M. Stark, J. Deng, and L. Fei-Fei, “3d object representations for fine-grained categorization,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2013, pp. 554–561.
[52] D. Wertheimer, L. Tang, and B. Hariharan, “Few-shot classification with feature map reconstruction networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8012–8021.
[53] W. Li, L. Wang, J. Xu, J. Huo, Y. Gao, and J. Luo, “Revisiting local descriptor based image-to-class measure for few-shot learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7260–7268.
[54] G. Van Horn, S. Branson, R. Farrell, S. Haber, J. Barry, P. Ipeirotis, P. Perona, and S. Belongie, “Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 595–604.
[55] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. Jawahar, “Cats and dogs,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012, pp. 3498–3505.
[56] L. Yang, P. Luo, C. Change Loy, and X. Tang, “A large-scale car dataset for fine-grained categorization and verification,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3973–3981.
[57] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[58] Q. Sun, Y. Liu, T.-S. Chua, and B. Schiele, “Meta-transfer learning for few-shot learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 403–412.
[59] K. Lee, S. Maji, A. Ravichandran, and S. Soatto, “Meta-learning with differentiable convex optimization,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 10 657–10 665.
[60] P. Mangla, N. Kumari, A. Sinha, M. Singh, B. Krishnamurthy, and V. N. Balasubramanian, “Charting the right manifold: Manifold mixup for few-shot learning,” in Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2020, pp. 2218–2227.
[61] B. Liu, Y. Cao, Y. Lin, Q. Li, Z. Zhang, M. Long, and H. Hu, “Negative margin matters: Understanding margin in few-shot classification,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16. Springer, 2020, pp. 438–455.
[62] A. Afrasiyabi, J.-F. Lalonde, and C. Gagné, “Associative alignment for few-shot image classification,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16. Springer, 2020, pp. 18–35.
[63] Y. Zhu, C. Liu, and S. Jiang, “Multi-attention meta learning for few-shot fine-grained image recognition.” in IJCAI, 2020, pp. 1090–1096.
[64] X. Li, J. Wu, Z. Sun, Z. Ma, J. Cao, and J.-H. Xue, “Bsnet: Bi-similarity network for few-shot fine-grained image classification,” IEEE Transactions on Image Processing, vol. 30, pp. 1318–1331, 2020.
[65] H. Tang, C. Yuan, Z. Li, and J. Tang, “Learning attention-guided pyramidal features for few-shot fine-grained recognition,” Pattern Recognition, vol. 130, p. 108792, 2022.
[66] P. Li, G. Zhao, and X. Xu, “Coarse-to-fine few-shot classification with deep metric learning,” Information Sciences, vol. 610, pp. 592–604, 2022.
[67] B. Munjal, A. Flaborea, S. Amin, F. Tombari, and F. Galasso, “Query-guided networks for few-shot fine-grained classification and person search,” Pattern Recognition, vol. 133, p. 109049, 2023.
[68] N. Sun and P. Yang, “T2l: Trans-transfer learning for few-shot fine-grained visual categorization with extended adaptation,” Knowledge-Based Systems, vol. 264, p. 110329, 2023.
[69] M.-H. Pan, H.-Y. Xin, C.-Q. Xia, and H.-B. Shen, “Few-shot classification with task-adaptive semantic feature learning,” Pattern Recognition, vol. 141, p. 109594, 2023.
[70] S. Gu, D. Chen, J. Bao, F. Wen, B. Zhang, D. Chen, L. Yuan, and B. Guo, “Vector quantized diffusion model for text-to-image synthesis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10 696–10 706.
[71] L. Van der Maaten and G. Hinton, “Visualizing data using t-sne.” Journal of Machine Learning Research, vol. 9, no. 11, 2008.
[72] Y. Tian, X. Yu, B. Fan, F. Wu, H. Heijnen, and V. Balntas, “Sosnet: Second order similarity regularization for local descriptor learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 11 016–11 025.