Detail Reinforcement Diffusion Model: Augmentation Fine-Grained Visual Categorization in Few-Shot Conditions

Tianxu Wu, Shuo Ye, Shuhuang Chen, Qinmu Peng and Xinge You

This work was supported in part by the National Key R&D Program of China 2022YFC3301000, in part by the Fundamental Research Funds for the Central Universities, HUST: 2023JYCXJJ031. Co-corresponding authors: Shuo Ye (e-mail: [email protected]), Qinmu Peng (e-mail: [email protected]). Tianxu Wu, Shuo Ye, and Shuhuang Chen are with the School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China. Qinmu Peng and Xinge You are with the School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China.

©2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Abstract

The challenge in fine-grained visual categorization (FGVC) lies in how to explore the subtle differences between different subclasses and achieve accurate discrimination. Previous research has relied on large-scale annotated data and pre-trained deep models to achieve this objective. However, when only a limited number of samples is available, such methods often struggle to accurately learn the details of instances and perform effective recognition. Using diffusion models for data augmentation has gained widespread attention, but the high level of detail required for fine-grained images makes it challenging for existing methods to be directly employed. To address this issue, we propose a novel approach termed the detail reinforcement diffusion model (DRDM), which leverages the extensive knowledge of large models for fine-grained data augmentation and comprises two key components: discriminative semantic recombination (DSR) and spatial knowledge reference (SKR). Specifically, DSR is designed to extract implicit similarity relationships from the labels and reconstruct the semantic mapping between labels and instances, which enables better discrimination of subtle differences between different subclasses. Furthermore, we introduce the SKR module, which incorporates the distributions of different datasets as references in the feature space. This allows the SKR to aggregate the high-dimensional distribution of subclass features in few-shot FGVC tasks, thus expanding the decision boundary. Through these two critical components, we effectively utilize the knowledge from large models to address the issue of data scarcity, resulting in improved performance for fine-grained visual recognition tasks. Extensive experiments demonstrate the consistent performance gain offered by our DRDM.

Index Terms:
Fine-grained visual categorization, few-shot learning, stable diffusion

I Introduction

Fine-grained visual categorization (FGVC) aims to recognize subclasses that exhibit tiny visual distinctions within the same large class (e.g., birds [1]). Related studies have been extensively applied to autonomous vehicles [2, 3] and pharmaceutical products [4, 5]. Compared to general images, fine-grained images usually have similar features and are affected by interferences such as posture, perspective, and occlusion [6]. Therefore, the key to achieving FGVC often lies in discovering discriminative regions. This process is often achieved through a localization branch network [7, 8] or implicitly learned in end-to-end training [9, 10, 11]. While automatically identifying these regions from a large-scale labeled dataset is feasible, many practical FGVC tasks lack such datasets because annotating fine-grained data is time-consuming, and labeling rare subclasses demands domain expertise. For instance, discerning subtle feature differences among subtypes of diseases in the medical domain, identifying minute variations among components in the industrial sector, and recognizing specific types of pests or diseases in ecology all rely heavily on expert annotation. The ability of deep neural networks to handle fine-grained few-shot learning (FSL) is therefore crucial for practical applications. Unfortunately, existing methods still perform much worse than weakly supervised methods on several few-shot benchmarks [12]. Networks often struggle to select the correct regions for recognition and tend to overfit pseudo-features from the training data [13].

Figure 1: Feature contamination resulting from semantic misalignment during data augmentation using large models. This is specifically evident in the form of (a) irrelevant augmented data and (b) the loss of discriminative details.

Utilizing external information (e.g., multi-view [14] or multi-party [15] information) can significantly enhance the performance of FSL, but this involves complex information acquisition pathways. A direct method to mitigate overfitting in FSL is data augmentation [16]. However, reliably obtaining diverse data remains a challenging problem, as the augmented instances should contain discriminative features of the classes and exhibit high intra-class diversity [17]. Unfortunately, this often breaks down in adversarial learning methods [18, 19, 20], which fall short in generating diverse samples. Recently, leveraging prior knowledge from large models (e.g., stable diffusion [21]) for data augmentation has demonstrated significant potential. However, this success has not seamlessly extended to FGVC. One reason is feature contamination, as depicted in Figure 1, which manifests as augmented data being unrelated to the original data or suffering from detail loss. In (a), when the label Geococcyx is used as input, the model fails to generate the expected result and instead produces an unrelated composite animal image. Similarly, in (b), when the input image is a Crested Auklet, the generated images possess bird-like structures, but their detailed features do not align with the target subclass. This phenomenon has been observed across different types of datasets. One contributing factor is believed to be the specialized nature of fine-grained labels: the nouns within these labels are less common than those describing general images. Consequently, this inherent imbalance in the pre-training data makes it difficult for models to learn fine-grained information and accurately establish a mapping between labels and semantic features. As a result, they struggle to depict fine-grained features during data augmentation with large models. Training fine-grained models on such contaminated images would severely impair the models’ understanding of instances. Moreover, with limited instances, the data feature points struggle to cover the intricate boundaries between different categories. Models may then adopt simplistic decision boundaries, preventing them from capturing the complex classification scenarios present in the real world.

We argue that since fine-grained labels reflect specialized expertise, the naming of subclasses tends to adhere to specific conventions. For instance, the labels Parakeet Auklet and Crested Auklet both contain Auklet. This kind of textual similarity implicitly encodes fundamental subclass features, such as a red beak and a short tail. By utilizing the inherent resemblance in label descriptions, the data augmentation process can be constrained, thereby effectively enhancing the performance of FGVC in few-shot conditions. Based on this observation, we propose the detail reinforcement diffusion model (DRDM). Specifically, the discriminative semantic recombination module is designed to explicitly emphasize subclass-specific differences from a labeling perspective. It then utilizes the extracted similarity relationships to guide and constrain the data augmentation process performed by diffusion models. Meanwhile, the spatial knowledge reference module is designed to incorporate diverse data distributions from various data types as reference points into the feature space. This approach addresses the challenge of poorly defined decision boundaries in FSL due to limited data, thereby enhancing the model’s instance understanding. Our model demonstrates notable scalability and can seamlessly integrate knowledge supplementation from different data modalities, leveraging reference knowledge from datasets of distinct types. Our main contributions are summarized as follows:

  • We analyzed the limitations of applying the diffusion model to fine-grained image data augmentation and proposed a Discriminative Semantic Recombination (DSR) module. This module effectively explores the relationships between instance labels and image information under weakly supervised conditions, thus enhancing the details of augmented data.

  • We proposed a Spatial Knowledge Reference (SKR) approach that introduces the distributions of different data types as references in the feature space. This encourages the model to find clear and distinct boundaries for fine-grained features in the high-dimensional space, thereby enhancing the model’s understanding of instances.

  • We conducted extensive experiments on three benchmark datasets, and the results demonstrate that the proposed DRDM achieves favorable performance on the few-shot fine-grained (FSFG) problem and significantly improves overall performance compared to other methods.

II Related Work

II-A Fine-Grained Visual Categorization in Few-Shot Setting

Fine-grained visual categorization (FGVC) aims to achieve a refined classification of subclasses within a large class, where instances have similar appearance features and discriminative regions only exist locally. Previous research achieves this by utilizing large-scale annotated data and pre-trained deep models. However, when only few-shot data are available, those methods may become less effective [22, 23, 24]. To alleviate the pressure caused by the reduction in data quantity, relevant methods can be roughly divided into three categories: metric learning [25, 26], optimization, and data augmentation. Specifically, metric learning uses predefined metrics to learn the deep representation of instances and assigns images to categories by calculating the distance or similarity between them. In this process, pose-normalized representations are often used, which first locate the semantic parts in each image and then describe the image by characterizing the appearance of each part [27]. Optimization methods often use transfer learning: a model is first trained on the source data with standard deep learning, and then either a simple classifier is trained on the target data over the fixed representation [28, 29] or the model is fine-tuned [30]. Most data augmentation methods are based on the assumption that intra-category variations caused by pose, background, or lighting conditions are shared between categories. Such variations can be modeled as low-level statistical information [31] or pairwise transformations [32, 33], and can be directly applied to new samples.

Although these methods have proven effective in general FSL tasks, the gains achieved on fine-grained datasets are minimal. For metric learning methods, the extracted features struggle to form tight clusters for new classes because, with small inter-class distances, even small changes in the feature space can blur class boundaries [34, 35, 36]. For the optimization methods, fine-tuning large models on FGVC images is challenging because discriminative features often exist only locally. With limited samples, pre-trained models often struggle to comprehend instance details and perform effective recognition. The data augmentation methods have shown significant promise. However, they also require careful design due to the risk of exacerbating the imbalance between discriminative and non-discriminative features [37, 38].

II-B Basic Principles of Diffusion Model

Diffusion models are a type of latent variable model that includes a forward noise-injection process and a reverse process. During the forward process, noise is gradually added to the data. Each step in the forward process is a Gaussian transition according to the following Markovian process

$$q\left(\boldsymbol{x}_{t} \mid \boldsymbol{x}_{t-1}\right)=\mathcal{N}\left(\sqrt{\alpha_{t}}\,\boldsymbol{x}_{t-1},\ \beta_{t}\mathbf{I}\right),\quad \forall t\in\left\{1,\dots,T\right\}, \tag{1}$$

$$q\left(\boldsymbol{x}_{1:T} \mid \boldsymbol{x}_{0}\right)=\prod_{t=1}^{T} q\left(\boldsymbol{x}_{t} \mid \boldsymbol{x}_{t-1}\right), \tag{2}$$

where $T$ is the number of diffusion steps, the mean and variance of the Gaussian noise are determined by $\beta_{t}$, and $\alpha_{t}=1-\beta_{t}$. The reverse process is another Gaussian transition

$$p_{\theta}\left(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_{t}\right)=\mathcal{N}\left(\boldsymbol{x}_{t-1} \mid \boldsymbol{\mu}_{\theta}\left(\boldsymbol{x}_{t},t\right),\ \sigma_{\theta}^{2}\left(\boldsymbol{x}_{t},t\right)\mathbf{I}\right), \tag{3}$$

where the mean value $\boldsymbol{\mu}_{\theta}\left(\boldsymbol{x}_{t},t\right)$ can be seen as the combination of $\boldsymbol{x}_{t}$ and a noise prediction network $\epsilon_{\theta}\left(\boldsymbol{x}_{t},t\right)$. The maximum likelihood estimate of the optimal mean is

$$\tilde{\boldsymbol{\mu}}_{\theta}\left(\boldsymbol{x}_{t},t\right)=\frac{1}{\sqrt{\alpha_{t}}}\left(\boldsymbol{x}_{t}-\frac{\beta_{t}}{\sqrt{1-\bar{\alpha}_{t}}}\,\mathbb{E}\left(\epsilon \mid \boldsymbol{x}_{t}\right)\right), \tag{4}$$

where $\bar{\alpha}_{t}=\prod_{s=1}^{t}\alpha_{s}$ denotes the cumulative product of the $\alpha_{s}$.
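As a rough illustration of Eqs. (1) and (4), the following sketch noises a toy vector and evaluates the reverse mean. It is only a sketch under stated assumptions: the linear $\beta_{t}$ schedule, its endpoints, and the helper names (`make_schedule`, `forward_step`, `jump_to_step`, `reverse_mean`) are hypothetical and not taken from the paper. The closed-form jump $\boldsymbol{x}_{t}=\sqrt{\bar{\alpha}_{t}}\boldsymbol{x}_{0}+\sqrt{1-\bar{\alpha}_{t}}\epsilon$ is the standard consequence of chaining Eq. (1), and the true noise is substituted for $\mathbb{E}(\epsilon \mid \boldsymbol{x}_{t})$.

```python
import math
import random


def make_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (assumed for illustration, T > 1); lists are indexed by t = 0..T."""
    betas = [0.0] + [
        beta_start + (beta_end - beta_start) * (t - 1) / (T - 1) for t in range(1, T + 1)
    ]
    alphas = [1.0 - b for b in betas]  # alpha_t = 1 - beta_t, with alpha_0 = 1
    alpha_bars = [1.0]  # alpha_bar_0 = 1 (empty product)
    for t in range(1, T + 1):
        alpha_bars.append(alpha_bars[-1] * alphas[t])
    return betas, alphas, alpha_bars


def forward_step(x_prev, t, betas, alphas, rng):
    """Draw x_t ~ N(sqrt(alpha_t) * x_{t-1}, beta_t * I): one transition of Eq. (1)."""
    return [
        math.sqrt(alphas[t]) * v + math.sqrt(betas[t]) * rng.gauss(0.0, 1.0)
        for v in x_prev
    ]


def jump_to_step(x0, t, alpha_bars, eps):
    """Closed form obtained by chaining Eq. (1) for t steps, using one noise draw eps."""
    a = alpha_bars[t]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * e for v, e in zip(x0, eps)]


def reverse_mean(x_t, t, eps_hat, betas, alphas, alpha_bars):
    """Mean of Eq. (4); eps_hat stands in for E(eps | x_t)."""
    scale = betas[t] / math.sqrt(1.0 - alpha_bars[t])
    return [(v - scale * e) / math.sqrt(alphas[t]) for v, e in zip(x_t, eps_hat)]


rng = random.Random(0)
T = 1000
betas, alphas, alpha_bars = make_schedule(T)

x0 = [0.5, -1.0, 2.0]  # a toy three-"pixel" image
x1 = forward_step(x0, 1, betas, alphas, rng)  # one small noising step

t = 300
eps = [rng.gauss(0.0, 1.0) for _ in x0]
x_t = jump_to_step(x0, t, alpha_bars, eps)
mu = reverse_mean(x_t, t, eps, betas, alphas, alpha_bars)  # uses the true noise
```

When the true noise is plugged in, `mu` coincides with the mean of the Gaussian posterior $q(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_{t},\boldsymbol{x}_{0})$. In practice the noise is unknown at sampling time, which is why a network is trained to predict it.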

To obtain the optimal mean, $\epsilon_{\theta}\left(\boldsymbol{x}_{t},t\right)$ can be learned by a noise prediction objective

$$\min_{\theta}\ \mathbb{E}_{\boldsymbol{x}_{0}\sim q\left(\boldsymbol{x}\right),\,\epsilon\sim\mathcal{N}\left(0,\mathbf{I}\right),\,t}\left\|\epsilon_{\theta}\left(\boldsymbol{x}_{t},t\right)-\epsilon\right\|_{2}^{2}. \tag{5}$$

However, in the context of conditional generation, the condition information $c$ should be considered during the training of the noise prediction network

$$\min_{\theta}\ \mathbb{E}_{\boldsymbol{x}_{0}\sim q\left(\boldsymbol{x}\right),\,\epsilon\sim\mathcal{N}\left(0,\mathbf{I}\right),\,t,\,c}\left\|\epsilon_{\theta}\left(\boldsymbol{x}_{t},t,c\right)-\epsilon\right\|_{2}^{2}. \tag{6}$$
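The objectives in Eqs. (5) and (6) can be approximated by Monte Carlo sampling over $\boldsymbol{x}_{0}$, $t$, and $\epsilon$. The sketch below is a toy illustration only: the function `noise_prediction_loss`, the two stand-in predictors, the linear schedule, and the two-class data are all hypothetical, and no neural network is involved. The data are built so that each sample equals its class mean, which lets a predictor that sees the condition $c$ recover the noise exactly.

```python
import math
import random


def noise_prediction_loss(eps_model, data, alpha_bars, cond=None, n_samples=256, seed=0):
    """Monte Carlo estimate of Eq. (5) (cond=None) or Eq. (6) (cond given).

    eps_model(x_t, t, c) returns the predicted noise; data is a list of vectors x_0;
    alpha_bars[t] is the cumulative product alpha_1 * ... * alpha_t for t = 1..T.
    """
    rng = random.Random(seed)
    T = len(alpha_bars) - 1
    total = 0.0
    for _ in range(n_samples):
        i = rng.randrange(len(data))
        x0 = data[i]
        c = None if cond is None else cond[i]
        t = rng.randint(1, T)  # uniform over diffusion steps
        eps = [rng.gauss(0.0, 1.0) for _ in x0]
        a = alpha_bars[t]
        x_t = [math.sqrt(a) * v + math.sqrt(1.0 - a) * e for v, e in zip(x0, eps)]
        pred = eps_model(x_t, t, c)
        total += sum((p - e) ** 2 for p, e in zip(pred, eps))
    return total / n_samples


# Assumed linear beta schedule with T = 100 steps.
T = 100
alpha_bars = [1.0]
for step in range(1, T + 1):
    beta = 1e-4 + (0.02 - 1e-4) * (step - 1) / (T - 1)
    alpha_bars.append(alpha_bars[-1] * (1.0 - beta))

# Toy data: every sample equals its class mean, so the class fully determines x_0.
class_means = {0: [1.0, 1.0], 1: [-1.0, -1.0]}
cond = [0, 1, 0, 1]
data = [class_means[c] for c in cond]


def zero_model(x_t, t, c):
    """Baseline that ignores its inputs and always predicts zero noise."""
    return [0.0] * len(x_t)


def oracle_model(x_t, t, c):
    """Idealized conditional predictor that inverts the noising step using the class mean."""
    a = alpha_bars[t]
    return [(v - math.sqrt(a) * mean) / math.sqrt(1.0 - a) for v, mean in zip(x_t, class_means[c])]


loss_zero = noise_prediction_loss(zero_model, data, alpha_bars)
loss_cond = noise_prediction_loss(oracle_model, data, alpha_bars, cond=cond)
```

In this idealized setting the conditional predictor drives the loss to numerical zero, whereas the zero baseline stays near the noise dimension. The gap hints at why an accurate mapping from the condition to image content matters for conditional generation.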

This process is typically implemented using the U-Net [39] architecture. To leverage information from different modalities, recent studies have also incorporated Transformer attention modules [40, 41, 42, 43] (comprising a self-attention layer, a cross-attention layer, and a fully connected feed-forward network) for feature alignment [44]. Specifically, the attention layer operates on queries $\boldsymbol{Q}\in\mathbb{R}^{n\times d_{k}}$ and key-value pairs $\boldsymbol{K}\in\mathbb{R}^{m\times d_{k}}$, $\boldsymbol{V}\in\mathbb{R}^{m\times d_{v}}$:

$$A\left(\boldsymbol{Q},\boldsymbol{K},\boldsymbol{V}\right)=\text{softmax}\left(\frac{\boldsymbol{Q}\boldsymbol{K}^{T}}{\sqrt{d_{k}}}\right)\boldsymbol{V}, \tag{7}$$

where $n$ is the number of queries, $m$ is the number of key-value pairs, $d_{k}$ is the dimension of the keys, and $d_{v}$ is the dimension of the values. In the self-attention layer, $\boldsymbol{x}\in\mathbb{R}^{n\times d_{x}}$ is the only input. In the cross-attention layer of the conditioned diffusion model, there are two inputs, $\boldsymbol{x}\in\mathbb{R}^{n\times d_{x}}$ and $\boldsymbol{c}\in\mathbb{R}^{m\times d_{c}}$, where $\boldsymbol{x}$ is the output of the preceding block and $\boldsymbol{c}$ represents the condition information. However, diffusion models cannot be used directly for data augmentation of fine-grained images, because fine-grained labels require a certain level of expertise and pre-trained models have difficulty understanding the mapping between labels and semantics (see more details in Section IV).
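To make the difference between the two attention layers concrete, the following dependency-free sketch evaluates Eq. (7) once with all inputs derived from $\boldsymbol{x}$ (self-attention) and once with keys and values derived from $\boldsymbol{c}$ (cross-attention). The projection matrices (`w_q`, `w_k_*`, `w_v_*`), their random initialization, and the chosen sizes are assumptions for illustration; the text above does not specify them.

```python
import math
import random


def matmul(a, b):
    """Plain-Python matrix product of a (p x q) and b (q x r)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(scores):
    """Row-wise softmax with max subtraction for numerical stability."""
    out = []
    for row in scores:
        peak = max(row)
        exps = [math.exp(s - peak) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention(q, k, v):
    """Eq. (7): softmax(Q K^T / sqrt(d_k)) V. Returns the output and the weights."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = softmax_rows(scores)
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n, m, d_x, d_c, d_k, d_v = 3, 2, 4, 5, 4, 6
x = random_matrix(n, d_x, rng)  # output of the preceding block
c = random_matrix(m, d_c, rng)  # condition information, e.g. text tokens

# Hypothetical projections: queries always come from x.
w_q = random_matrix(d_x, d_k, rng)
w_k_self, w_v_self = random_matrix(d_x, d_k, rng), random_matrix(d_x, d_v, rng)
w_k_cross, w_v_cross = random_matrix(d_c, d_k, rng), random_matrix(d_c, d_v, rng)

# Self-attention: keys and values are also projected from x (n x n weights).
self_out, self_w = attention(matmul(x, w_q), matmul(x, w_k_self), matmul(x, w_v_self))
# Cross-attention: keys and values are projected from c (n x m weights).
cross_out, cross_w = attention(matmul(x, w_q), matmul(c, w_k_cross), matmul(c, w_v_cross))
```

Both calls return $n\times d_{v}$ outputs. Only the shape of the attention weights changes, from $n\times n$ to $n\times m$, reflecting that each position in $\boldsymbol{x}$ now attends to the condition tokens.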

Figure 2: Overview of the proposed method. Our framework first uses DSR to constrain similarity relations from the labels, thereby enhancing the model’s understanding of instance-specific features. Then, during the classification process, we introduce instance features from different datasets for comparative reference, ensuring that the learned features possess a stronger representational capacity and robustness.

III Method

In this section, we describe our DRDM. As shown in Figure 2, it consists of two core components: the discriminative semantic recombination (DSR) module and the spatial knowledge reference (SKR) module.

III-A Notation

In few-shot FGVC tasks, the dataset is divided into a meta-training set $D_{base}=\left\{\left(x_{i},y_{i}\right),y_{i}\in C_{base}\right\}$ and a meta-testing set $D_{novel}=\left\{\left(x_{k},y_{k}\right),y_{k}\in C_{novel}\right\}$, where $C_{base}$ and $C_{novel}$ represent the base and novel classes, respectively, and $C_{base}\cap C_{novel}=\emptyset$. Here, $x_{k}$ and $y_{k}$ denote the input image and the class name, respectively. Furthermore, the training and testing phases of few-shot FGVC tasks are typically composed of distinct episodes. Each episode includes a labeled support set $S=\left\{\left(x_{k},y_{k}\right)\right\}_{k=1}^{N\times K}$ and an unlabeled query set $Q=\left\{\left(x_{k},y_{k}\right)\right\}_{k=1}^{N\times U}$. In these sets, $N$ is the number of randomly selected classes, and $S$ and $Q$ share the same classes. $K$ and $U$ are the numbers of labeled and unlabeled samples per class, respectively, with $S\cap Q=\emptyset$. For ease of reference, we have compiled some important symbols and definitions in Table I.

TABLE I: Partial symbols and definition explanations.
| Symbol | Definition |
|---|---|
| $D_{base}$, $D_{novel}$ | Meta-training and meta-testing sets |
| $D_{extra}$ | Additional meta-training set |
| $F_{i,j}^{S}$, $F_{i,j}^{Q}$ | Computed features of the support and query sets |
| $F_{i,j}^{E}$ | Computed feature of the extra set |
| $F_{i}^{P}$ | Prototype representative of the $i$-th class |
| $R_{i,c}^{intra}$, $R_{i,c}^{inter}$ | Intra-class and inter-class representation scores |
| $w_{i}^{intra}$, $w_{i}^{inter}$ | Intra-class and inter-class attention weights |
| $w_{i}$ | Channel attention weights |
| $\mathcal{L}_{MS}$ | Multi-Similarity loss |
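The episode construction in the notation above can be sketched as follows. This is a minimal illustration assuming a dataset stored as (image identifier, class name) pairs; the function `sample_episode`, the synthetic identifiers, and the 5-way 1-shot setting with 15 queries per class are hypothetical choices, not settings reported in this section.

```python
import random
from collections import defaultdict


def sample_episode(dataset, n_way, k_shot, u_query, rng):
    """Sample one N-way episode: K support and U query samples per class, kept disjoint."""
    by_class = defaultdict(list)
    for x, y in dataset:
        by_class[y].append(x)
    # Only classes with enough samples can provide both support and query items.
    eligible = [c for c, xs in by_class.items() if len(xs) >= k_shot + u_query]
    classes = rng.sample(eligible, n_way)
    support, query = [], []
    for c in classes:
        picked = rng.sample(by_class[c], k_shot + u_query)  # sampling without replacement
        support += [(x, c) for x in picked[:k_shot]]
        query += [(x, c) for x in picked[k_shot:]]
    return support, query


rng = random.Random(0)
# Synthetic stand-in for D_novel: 10 classes with 20 image identifiers each.
d_novel = [(f"class_{c}/img_{i:02d}", f"class_{c}") for c in range(10) for i in range(20)]
support, query = sample_episode(d_novel, n_way=5, k_shot=1, u_query=15, rng=rng)
```

Because each class is sampled without replacement and then split, $S$ and $Q$ cover the same $N$ classes while remaining disjoint. The query labels are kept here only so that predictions can be scored.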

III-B Discriminative Semantic Recombination (DSR)

As mentioned above, the implicit similarity descriptions in the textual modality encompass the fundamental features of the subclasses. We argue that transferring this part of similarity knowledge from the text space to the image space would aid in augmenting the data with more intricate details, thereby generating features with stronger representational capabilities. To achieve this, we first conduct similarity measurements in the text space. In this process, the model needs to establish a connection between fine-grained labels and instances, which becomes challenging under the few-shot paradigm. Utilizing too few instances to fine-tune a large model is not sufficient for the model to acquire enough discriminative knowledge and may lead to severe overfitting. Recently, a novel fine-tuning paradigm has emerged with the use of Adapter methods (e.g., AdapterFusion [45], AdapterDrop [46], and K-Adapter [47]). This paradigm involves adding Adapter modules to certain layers of a pre-trained model and freezing the pre-trained backbone during fine-tuning. The Adapter modules are responsible for learning specific downstream task knowledge, thereby avoiding the issues of full model fine-tuning and catastrophic forgetting.
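The Adapter paradigm described above can be illustrated with a small bottleneck module placed after a frozen layer. The sketch assumes a common residual bottleneck design (down-projection, nonlinearity, up-projection, skip connection) with a GELU nonlinearity and a zero-initialized up-projection. These choices, the class name `Adapter`, and the stand-in `frozen_layer` are hypothetical, and this section does not specify the adapter architecture used in DRDM.

```python
import math
import random


def gelu(v):
    """Exact GELU nonlinearity, written with the error function."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


class Adapter:
    """Residual bottleneck adapter: h + W_up * gelu(W_down * h)."""

    def __init__(self, dim, bottleneck, rng):
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(bottleneck)] for _ in range(dim)]
        # Zero-initialized up-projection: the adapter starts as the identity map.
        self.w_up = [[0.0] * dim for _ in range(bottleneck)]

    def __call__(self, h):
        z = [
            gelu(sum(hv * row[j] for hv, row in zip(h, self.w_down)))
            for j in range(len(self.w_up))
        ]
        delta = [sum(zv * row[i] for zv, row in zip(z, self.w_up)) for i in range(len(h))]
        return [hv + dv for hv, dv in zip(h, delta)]

    def num_parameters(self):
        return sum(len(r) for r in self.w_down) + sum(len(r) for r in self.w_up)


def frozen_layer(h):
    """Stand-in for a frozen pre-trained layer; it has no trainable state."""
    return [math.tanh(2.0 * v) for v in h]


rng = random.Random(0)
adapter = Adapter(dim=8, bottleneck=2, rng=rng)
h = [rng.gauss(0.0, 1.0) for _ in range(8)]

backbone_out = frozen_layer(h)
before = adapter(backbone_out)  # identical to the frozen output at initialization

# Emulate a training update that touches only the adapter (first output coordinate).
adapter.w_up[0][0] = 0.5
adapter.w_up[1][0] = 0.5
after = adapter(backbone_out)
```

Only the adapter weights are modified here, so the frozen layer keeps its pre-trained behavior and the number of task-specific parameters stays small (32 in this toy configuration).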

Inspired by this, we introduce Adapter knowledge layers into the process that interconnects the textual and visual features of the diffusion model. With only a few parameters specifically designed for the fine-grained task, these Adapter knowledge layers store instance-specific knowledge, thereby mitigating the overfitting issues that may arise from fine-tuning. For visual features, we insert an adapter module into the U-Net architecture after the cross-attention layer. At the same time, an adapter is added after the pre-trained text encoder to transfer the semantic understanding. Defining the prompt as $V$, expressed as “a photo of a [class label]”, the formulation becomes $F_{C}=\Phi_{AD}\left(Z\left(V\right)\right)$, where $Z$ represents the encoder used for the prompt input, such as CLIP [48], and $\Phi_{AD}$ signifies the added adapter module. The loss function for the CLIP branch is denoted as $\mathcal{L}_{C}=\mathcal{L}_{MS}\left(F_{C}\right)$, where $\mathcal{L}_{MS}$ refers to the MS loss [49], computed as follows:

$$\mathcal{L}_{MS}=\frac{1}{m}\sum_{i=1}^{m}\left\{\frac{1}{2}\log\left[1+\sum_{k\in\mathcal{P}_{i}}e^{-2\left(S_{ik}-\frac{1}{2}\right)}\right]+\frac{1}{40}\log\left[1+\sum_{k\in\mathcal{N}_{i}}e^{40\left(S_{ik}-\frac{1}{2}\right)}\right]\right\}. \tag{8}$$
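To make Eq. (8) concrete, the following is a minimal sketch of the MS loss in plain Python. It rests on several assumptions that this section does not state: $S_{ik}$ is taken to be the cosine similarity between features $i$ and $k$, $\mathcal{P}_{i}$ and $\mathcal{N}_{i}$ are the samples that share or do not share the label of anchor $i$, and no pair mining is applied. The names `cosine_similarity` and `ms_loss` are illustrative only.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity S_ik between two feature vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ms_loss(features, labels, pos_scale=2.0, neg_scale=40.0, margin=0.5):
    """Eq. (8) with the constants 2, 40 and 1/2; no pair mining (assumption)."""
    m = len(features)
    total = 0.0
    for i in range(m):
        pos_sum = 0.0  # sum over P_i: samples with the same label as anchor i
        neg_sum = 0.0  # sum over N_i: samples with a different label
        for k in range(m):
            if k == i:
                continue
            s_ik = cosine_similarity(features[i], features[k])
            if labels[k] == labels[i]:
                pos_sum += math.exp(-pos_scale * (s_ik - margin))
            else:
                neg_sum += math.exp(neg_scale * (s_ik - margin))
        total += math.log1p(pos_sum) / pos_scale + math.log1p(neg_sum) / neg_scale
    return total / m
```

With these constants, a negative pair whose similarity exceeds 1/2 is penalized much more sharply than a positive pair whose similarity falls below it, which is what pushes look-alike subclasses apart.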

The reconstruction loss of Stable Diffusion ($\mathcal{L}_{SD}$) is defined as follows:

$$\mathcal{L}_{SD}=\mathbb{E}_{t,x_{0},c,\epsilon}\left\|\epsilon_{\theta}\left(x_{t},t,c\right)-\epsilon\right\|_{2}^{2}. \tag{9}$$

The overall loss function in DSR is given by:

$$\mathcal{L}_{DSR}=\mathcal{L}_{SD}+\alpha\mathcal{L}_{C}, \tag{10}$$

where $\mathcal{L}_{C}$ is the CLIP branch loss, and $\alpha$ is a hyperparameter controlling the trade-off between the reconstruction loss and the CLIP branch loss.
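The way Eqs. (9) and (10) fit together can be sketched for a single training sample as follows. This is an illustration under stated assumptions, not the training code: the forward noising step uses the standard DDPM form $x_{t}=\sqrt{\bar{\alpha}_{t}}\,x_{0}+\sqrt{1-\bar{\alpha}_{t}}\,\epsilon$, which this section does not spell out, the expectation in Eq. (9) is replaced by a one-sample estimate, and the names `add_noise`, `noise_prediction_loss`, and `dsr_loss` are hypothetical.

```python
import math
import random


def add_noise(x0, alpha_bar_t, rng):
    """Assumed DDPM forward step: returns the noisy sample x_t and the noise eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(alpha_bar_t) * x + math.sqrt(1.0 - alpha_bar_t) * e
           for x, e in zip(x0, eps)]
    return x_t, eps


def noise_prediction_loss(eps_pred, eps):
    """One-sample estimate of Eq. (9): squared L2 norm of the noise error."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps))


def dsr_loss(loss_sd, loss_c, alpha=0.5):
    """Eq. (10): reconstruction loss plus the alpha-weighted CLIP branch loss."""
    return loss_sd + alpha * loss_c
```

In practice `eps_pred` would come from the U-Net $\epsilon_{\theta}\left(x_{t},t,c\right)$ and `loss_c` from the MS loss on the adapted text features.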

III-C Spatial Knowledge Reference (SKR)

One of the challenges in few-shot FGVC is the limited number of samples, which results in the subclass features becoming highly scattered when mapped to a high-dimensional space. Consequently, the classifier struggles to identify clear and distinct subclass feature boundaries in the high-dimensional space, significantly affecting the accuracy and performance of FGVC tasks. We argue that leveraging the knowledge from other datasets as a reference can significantly enhance the model’s understanding of similar instances.

One reason is that data from similar classes tend to lie closer together in the high-dimensional feature space, as shown in Figure 3.

Figure 3: Analysis of dataset feature distributions. All features are extracted using a pre-trained ResNet-50. (a) Qualitative analysis, where points of different colors represent different datasets. (b) Quantitative analysis, which calculates the distances between the centers of each dataset.

It can be observed that the feature distribution of CUB overlaps significantly with that of NABirds, Stanford Dogs with Oxford-IIIT Pet, and Stanford Cars with CompCars, and that the distances between these distributions are noticeably smaller than those to other classes of data. Referencing similar datasets can help the model understand the subtle differences between instances with similar features, encouraging it to find clear and distinct boundaries for fine-grained features in the high-dimensional space. However, existing research on few-shot FGVC fails to utilize foundational knowledge from similar data, because the training processes of the datasets are disjoint. Additionally, directly applying knowledge from other datasets not only fails to improve the model’s performance but also leads to severe learning degradation.
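The center-to-center comparison in Figure 3(b) could be computed along the following lines. This is only a sketch: the caption says that distances between dataset centers are calculated but does not name the metric, so Euclidean distance is an assumption here, and `centroid` and `center_distances` are hypothetical helper names. Features are assumed to have been extracted beforehand, for example with the pre-trained ResNet-50 mentioned in the caption.

```python
import math


def centroid(features):
    """Mean feature vector of one dataset."""
    n = len(features)
    dim = len(features[0])
    return [sum(f[d] for f in features) / n for d in range(dim)]


def center_distances(datasets):
    """Pairwise Euclidean distance (assumed metric) between dataset centers.

    `datasets` maps a dataset name to a list of feature vectors.
    """
    centers = {name: centroid(feats) for name, feats in datasets.items()}
    names = sorted(centers)
    return {
        (a, b): math.dist(centers[a], centers[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }
```

A small distance between two datasets, such as a target dataset and its reference dataset, is what motivates using one as spatial knowledge for the other.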

To address these issues, we have designed a knowledge reference module, which aims to effectively incorporate knowledge from other datasets in a coherent manner during the training process. This module allows the model to benefit from the shared knowledge of similar data, leading to improved performance in few-shot FGVC tasks.

During base-class training, in addition to the support set and query set, we introduce an additional set $E=\left\{\left(x_{k},y_{k}\right)\right\}_{k=1}^{W\times K}\subseteq D_{extra}$, where $W$ represents the number of classes added from the additional dataset, and $D_{extra}=\left\{\left(x_{i},y_{i}\right),y_{i}\in C_{extra}\right\}$, satisfying $C_{base}\cap C_{extra}=\emptyset$ and $C_{novel}\cap C_{extra}=\emptyset$. The training framework involves a feature extractor network, denoted as $f\left(\cdot|\theta\right)$, which computes features for the different sets.
The prototype of each class is expressed as $F_{i}^{P}=\frac{1}{K}\sum_{j=1}^{K}F_{i,j}^{S}$, where $F_{i,j}^{S}$ denotes the $j$-th support feature of the $i$-th class. Inspired by TDM [36], we apply channel attention to the features of both the support set and the query set. First, the intra-class representation score across channel dimensions is defined by:

$$R_{i,c}^{intra}=\frac{1}{H\times W}\left\|F_{i,c}^{P}-M_{i}^{P}\right\|^{2}, \tag{11}$$

where $H$ and $W$ represent the height and width of the feature map, and $M_{i}^{P}\in\mathbb{R}^{H\times W}$ represents the mean prototype feature across channels. Second, the inter-class representation score is calculated by:

$$R_{i,c}^{inter}=\underset{i\in C,j\in C,i\neq j}{\min}\frac{1}{H\times W}\left\|F_{i,c}^{P}-M_{j}^{P}\right\|^{2}. \tag{12}$$

Subsequently, the intra-class and inter-class representation scores are passed through fully connected networks to obtain the attention weights for the different channels of the $i$-th class:

$$w_{i}^{intra}=f_{intra}\left(R_{i}^{intra}\right), \tag{13}$$
$$w_{i}^{inter}=f_{inter}\left(R_{i}^{inter}\right), \tag{14}$$

where $f_{intra}\left(\cdot\right)$ and $f_{inter}\left(\cdot\right)$ denote the fully connected networks for intra-class scores and inter-class scores, respectively. The final channel attention weight is given by:

$$w_{i}=\left(w_{i}^{intra}+w_{i}^{inter}\right)/2. \tag{15}$$

We apply channel attention weight to the prototype representation and query set of each class:

$$G_{i}^{S}=w_{i}\odot F_{i}^{P}, \tag{16}$$
$$G_{i,j}^{Q}=w_{i}\odot F_{i,j}^{Q}. \tag{17}$$
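The chain from prototypes to re-weighted features in Eqs. (11)–(17) can be traced with a small sketch. Each feature is represented as a list of $C$ channels, with every channel flattened to $H\times W$ values. The fully connected networks $f_{intra}$ and $f_{inter}$ are not specified in this section, so they are passed in as arbitrary callables; all function names are hypothetical.

```python
def prototype(support_features):
    """Class prototype F_i^P: mean of the K support features of one class."""
    k = len(support_features)
    channels = len(support_features[0])
    positions = len(support_features[0][0])
    return [[sum(f[c][p] for f in support_features) / k for p in range(positions)]
            for c in range(channels)]


def channel_mean(proto):
    """M_i^P: mean over channels at every spatial position."""
    channels = len(proto)
    return [sum(proto[c][p] for c in range(channels)) / channels
            for p in range(len(proto[0]))]


def mean_sq_dist(u, v):
    """(1 / (H*W)) * ||u - v||^2 for two flattened maps."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def representation_scores(protos):
    """Eq. (11) and Eq. (12) for every class i and channel c."""
    means = [channel_mean(p) for p in protos]
    intra, inter = [], []
    for i, proto in enumerate(protos):
        intra.append([mean_sq_dist(ch, means[i]) for ch in proto])
        inter.append([min(mean_sq_dist(ch, means[j])
                          for j in range(len(protos)) if j != i)
                      for ch in proto])
    return intra, inter


def channel_weights(intra_i, inter_i, f_intra, f_inter):
    """Eqs. (13)-(15); f_intra and f_inter stand in for the FC networks."""
    return [(a + b) / 2 for a, b in zip(f_intra(intra_i), f_inter(inter_i))]


def apply_weights(weights, feature):
    """Eqs. (16)-(17): scale every channel of a feature by its weight."""
    return [[weights[c] * v for v in feature[c]] for c in range(len(feature))]
```

Note that the inter-class score requires at least two classes in the episode, since the minimum is taken over the other classes.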

Finally, according to ProtoNet [25], the inference results of query set are given by

$$p\left(y=i|x\right)=\frac{\exp\left(-dist\left(G_{i}^{S},G_{i}^{Q}\right)\right)}{\sum_{j=1}^{N}\exp\left(-dist\left(G_{j}^{S},G_{j}^{Q}\right)\right)}, \tag{18}$$

where $dist\left(\cdot\right)$ denotes a distance measure between features. The computation of SKR gives rise to the corresponding loss term, $\mathcal{L}_{SKR}=\mathcal{L}_{MS}\left(Cat\left(F^{S},F^{E}\right)\right)$, with $Cat\left(\cdot\right)$ representing the concatenation operation. The overall loss function for the final classification network is expressed as:

$$\mathcal{L}_{CLS}=-\frac{1}{N\times U}\sum_{k=1}^{N\times U}\left(\boldsymbol{y}_{k}^{T}\log\left(\boldsymbol{p}_{k}\right)\right)+\beta\mathcal{L}_{SKR}, \tag{19}$$

where $\boldsymbol{y}_{k}$ stands for the one-hot label vector and $\boldsymbol{p}_{k}$ for the predicted probability, and $\beta$ is a hyperparameter controlling the balance between the classification cross-entropy loss and the SKR loss.
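Eqs. (18) and (19) amount to a softmax over negative distances followed by a cross-entropy term plus the weighted SKR loss. The sketch below illustrates this under simplifying assumptions: $dist\left(\cdot\right)$ is taken to be the squared Euclidean distance, as is common for ProtoNet-style classifiers, features are flat vectors that have already been re-weighted, and the SKR loss is supplied as a precomputed number. The function names are hypothetical.

```python
import math


def sq_euclidean(u, v):
    """Assumed choice for dist(.): squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def class_probabilities(query, prototypes):
    """Eq. (18): softmax over negative distances to each class prototype."""
    neg = [-sq_euclidean(query, p) for p in prototypes]
    shift = max(neg)  # subtract the maximum for numerical stability
    exps = [math.exp(d - shift) for d in neg]
    norm = sum(exps)
    return [e / norm for e in exps]


def classification_loss(queries, targets, prototypes, loss_skr, beta=0.5):
    """Eq. (19): mean cross-entropy over all queries plus beta * L_SKR."""
    cross_entropy = 0.0
    for query, target in zip(queries, targets):
        cross_entropy -= math.log(class_probabilities(query, prototypes)[target])
    return cross_entropy / len(queries) + beta * loss_skr
```

Because the target vector is one-hot, $\boldsymbol{y}_{k}^{T}\log\left(\boldsymbol{p}_{k}\right)$ reduces to the log-probability of the true class, which is what the loop accumulates.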

IV Experiments

In this section, we extensively evaluated the performance of our approach. We compared the performance of our approach with the latest state-of-the-art (SOTA) methods on each network architecture. The experimental settings, implementation details, and results for diverse tasks are described below.

IV-A Datasets and Experimental Setup

Experiments are conducted on three widely used datasets. All datasets provide fixed train and test splits. The details are summarized in Table II.

TABLE II: The splits of datasets. $C_{all}$ is the total number of subclasses, and $C_{train}$, $C_{val}$, and $C_{test}$ are the numbers of training, validation, and test subclasses, respectively. The classes of the subsets are disjoint.

| Dataset | $C_{all}$ | $C_{train}$ | $C_{val}$ | $C_{test}$ |
|---|---|---|---|---|
| CUB-200-2011 [1] | 200 | 100 | 50 | 50 |
| Stanford Dogs [50] | 120 | 60 | 30 | 30 |
| Stanford Cars [51] | 196 | 130 | 17 | 49 |

For the CUB dataset, our data split is the same as that of [52]. For the Cars dataset, we adhere to the data split of [53]. The Dogs dataset comprises 90 subclasses designated for training and validation, along with an additional 30 subclasses for testing. To achieve effective spatial knowledge referencing, we employed the NABirds [54], Oxford-IIIT Pet [55], and CompCars [56] datasets as supplementary knowledge sources for the CUB, Dogs, and Cars datasets, respectively. In the experiments, ResNet [57] pre-trained on ImageNet serves as the backbone, and all input images are cropped to $84\times 84$. The model is trained with stochastic gradient descent (SGD) with a momentum of 0.9 for all datasets. The initial learning rate was set to 0.001 for the main branch and 0.01 for the remaining layers. Our implementation is based on PyTorch and runs on an NVIDIA GeForce RTX 3090 Ti GPU. In the N-way K-shot scenario, we carried out few-shot classification on 10,000 randomly sampled episodes, each containing 16 queries per class. We report the average classification accuracy along with 95% confidence intervals, as in [36].
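The evaluation protocol described above (N-way K-shot episodes with a fixed number of queries per class, reported as a mean with a 95% confidence interval) can be sketched as follows. This is a minimal illustration: the interval uses the normal approximation $1.96\,\sigma/\sqrt{n}$, which is an assumption since the text does not state how the interval is computed, and `sample_episode` and `mean_with_ci95` are hypothetical names.

```python
import math
import random
import statistics


def sample_episode(class_to_items, n_way, k_shot, n_query, rng):
    """Draw one N-way K-shot episode with n_query queries per class."""
    classes = rng.sample(sorted(class_to_items), n_way)
    support, query = [], []
    for label in classes:
        items = rng.sample(class_to_items[label], k_shot + n_query)
        support += [(x, label) for x in items[:k_shot]]
        query += [(x, label) for x in items[k_shot:]]
    return support, query


def mean_with_ci95(accuracies):
    """Mean accuracy and half-width of a normal-approximation 95% interval."""
    mean = statistics.fmean(accuracies)
    half_width = 1.96 * statistics.stdev(accuracies) / math.sqrt(len(accuracies))
    return mean, half_width
```

For a 5-way 1-shot episode with 16 queries per class, each episode contains 5 support samples and 80 query samples. With 10,000 episodes the half-width shrinks with $1/\sqrt{10000}$, which is consistent with the small intervals reported for episode-based methods in the tables.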

IV-B Model Configuration

Model configuration experiments are conducted to verify the validity of the individual component and to determine the hyperparameters.

Multi-Similarity Loss ($\boldsymbol{\alpha}$): To verify the effectiveness of MS loss and investigate the influence of the parameter $\alpha$, extensive experiments are carried out on the three datasets, and the results are presented in Table III.

TABLE III: Experimental results using varied $\alpha$. “w/o” means learning without MS loss. The best performance is indicated in bold.

| $\alpha$ | 0.1 | 0.3 | 0.5 | 0.7 | 0.9 | w/o |
|---|---|---|---|---|---|---|
| CUB | 88.09 | 88.40 | 88.53 | 88.35 | 88.05 | 88.14 |
| Dogs | 71.38 | 72.11 | 72.28 | 71.93 | 72.51 | 72.04 |
| Cars | 80.58 | 80.83 | 81.03 | 80.80 | 80.63 | 80.38 |

When $\alpha$ was set properly (e.g., $\alpha\in[0.3, 0.7]$), the MS loss could effectively embed the information contained in the labels into the image space, to some extent facilitating the model’s comprehension of similar instances. However, increasing $\alpha$ beyond a certain point led to a slight decline in our model’s performance. One possible reason is that the model relies too heavily on the relationships among labels within the few-shot training paradigm, potentially resulting in overfitting. This suggests that $\alpha=0.5$ could be a reliable choice for DRDM.
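One simple way to read Table III is to average the accuracy over the three datasets for each setting. Averaging is an illustrative choice made here, not a selection procedure stated in the text; the numbers are copied from Table III, and the names `TABLE_III` and `mean_accuracy` are hypothetical.

```python
# Accuracy per setting, in the order (CUB, Dogs, Cars), copied from Table III.
TABLE_III = {
    "0.1": (88.09, 71.38, 80.58),
    "0.3": (88.40, 72.11, 80.83),
    "0.5": (88.53, 72.28, 81.03),
    "0.7": (88.35, 71.93, 80.80),
    "0.9": (88.05, 72.51, 80.63),
    "w/o": (88.14, 72.04, 80.38),
}


def mean_accuracy(row):
    """Unweighted mean over the three datasets."""
    return sum(row) / len(row)


# Setting with the highest dataset-averaged accuracy.
best_alpha = max(TABLE_III, key=lambda a: mean_accuracy(TABLE_III[a]))
```

Under this reading $\alpha=0.5$ has the highest average, even though $\alpha=0.9$ scores highest on Dogs alone.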

TABLE IV: Few-shot classification accuracy on the CUB, Stanford Dogs, and Stanford Cars datasets. All experiments are from a 5-way classification with the same backbone network (ResNet12). The best performance is indicated in bold.

| Method | CUB-200-2011 1-shot | CUB-200-2011 5-shot | Stanford Dogs 1-shot | Stanford Dogs 5-shot | Stanford Cars 1-shot | Stanford Cars 5-shot |
|---|---|---|---|---|---|---|
| MTL [58] (CVPR@2019) | 73.31±0.92 | 82.29±0.51 | 54.96±1.03 | 68.76±0.65 | - | - |
| MetaOptNet [59] (CVPR@2019) | 75.15±0.46 | 87.09±0.30 | 65.48±0.49 | 79.39±0.25 | - | - |
| S2M2 [60] (WACV@2020) | 71.43±0.28 | 85.55±0.52 | - | - | - | - |
| Neg-Cosine [61] (ECCV@2020) | 72.66±0.85 | 89.40±0.43 | - | - | - | - |
| A2 [62] (ECCV@2020) | 74.22±1.09 | 88.65±0.55 | - | - | - | - |
| MattML [63] (IJCAI@2020) | 66.29±0.56 | 80.34±0.30 | 54.84±0.53 | 71.34±0.38 | 66.11±0.54 | 82.80±0.28 |
| BSNet(R&C) [64] (TIP@2021) | 65.89±1.00 | 80.99±0.63 | 51.06±0.94 | 68.60±0.73 | 54.12±0.96 | 73.47±0.75 |
| VFD* [34] (ICCV@2021) | 79.12±0.83 | 91.48±0.39 | 57.04±0.89 | 72.95±0.70 | - | - |
| APF [65] (PR@2022) | 78.73±0.84 | 89.77±0.47 | 60.89±0.98 | 78.14±0.62 | 78.14±0.84 | 87.42±0.57 |
| TDM [36] (CVPR@2022) | 84.36±0.19 | 93.37±0.10 | 57.32±0.22 | 75.26±0.16 | 67.10±0.22 | 86.05±0.12 |
| CFMA [66] (IS@2022) | 74.68±1.38 | 90.91±0.94 | - | - | - | - |
| QGN [67] (PR@2023) | 83.82±0.00 | 91.22±0.00 | - | - | - | - |
| T2L [68] (KBS@2023) | 71.04±1.21 | 83.44±0.94 | 52.12±1.14 | 70.83±1.09 | 56.80±1.23 | 74.10±1.65 |
| TasNet [69] (PR@2023) | 83.89±0.69 | 91.35±0.53 | - | - | - | - |
| Ours | 89.99±0.14 | 94.63±0.10 | 72.68±0.17 | 80.43±0.12 | 81.53±0.15 | 90.03±0.12 |

Spatial Knowledge Reference ($\boldsymbol{\beta}$): To investigate the influence of SKR, we measured the impact of the number of introduced subclasses ($N$) under different $\beta$ values for both the 1-shot and 5-shot scenarios on the three datasets. The results are shown in Figure 4.

Figure 4: The impact of different $N$ and $\beta$ on the learning process. The horizontal and vertical axes represent the number of introduced subclasses and the selection of $\beta$ values, respectively. The first and second rows represent the results for the 1-shot and 5-shot scenarios, respectively, with different columns showing the results for different datasets. The color intensity is used to visualize the level of accuracy, where darker shades indicate higher accuracy.

As can be seen, on the CUB dataset, the model’s accuracy exhibited a positive trend after incorporating subclasses as reference knowledge. However, when the number exceeded five, a slight reduction in accuracy was observed. This might be attributed to the model’s difficulty in comprehending the target subclasses in the few-shot setting when too many subclasses were introduced. In addition, we observed that increasing $\beta$ up to 0.5 led to higher accuracy, whereas the performance modestly decreased as $\beta$ was increased from 0.5 to 0.7, suggesting that at $\beta=0.5$ the model was already able to leverage a sufficient amount of knowledge. Similar results can be observed on the other two datasets. Therefore, we chose $N=5$ and $\beta=0.5$ for all subsequent experiments, as this setting balances computational complexity and accuracy.

IV-C Performance Evaluation

Figure 5: Comparison of fine-grained image data augmented by Diffusion-Based Models. The first row depicts the source images, the second and third rows demonstrate the results generated by part of existing methods, and the last row exhibits our results.

The experimental results and analysis of DRDM compared with recent SOTA methods on three datasets are presented in Table IV. On the CUB dataset, the TDM [36] method achieved the previous SOTA performance, which can be credited to its channel attention mechanism. This mechanism produces a support weight that represents the channel-wise discriminative power for each subclass. Benefiting from the embedding and learning of textual feature space relationships, our approach achieved performance improvements of 5.63% and 1.26% in the 1-shot and 5-shot settings, respectively. On the Dogs dataset, the previous SOTA performance was achieved by MetaOptNet [59]. Similarly, this approach overlooks the potential structural relationships within the labels. Consequently, our method achieved performance improvements of 7.20% and 1.04% in the 1-shot and 5-shot settings, respectively. On the Cars dataset, the previous SOTA performance was achieved by APF [65]. In comparison, our method achieved performance improvements of 3.39% and 2.61% in the 1-shot and 5-shot settings, respectively. These results provide compelling evidence for the effectiveness of DRDM.
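The improvements quoted above follow directly from Table IV as differences in accuracy (percentage points) between our results and the strongest prior method on each dataset. The short check below reproduces that arithmetic; the values are copied from Table IV, and the names `OURS`, `BASELINES`, and `gains` are hypothetical.

```python
# (dataset, shots) -> accuracy of our method, copied from Table IV.
OURS = {
    ("CUB", 1): 89.99, ("CUB", 5): 94.63,
    ("Dogs", 1): 72.68, ("Dogs", 5): 80.43,
    ("Cars", 1): 81.53, ("Cars", 5): 90.03,
}

# dataset -> (compared method, 1-shot accuracy, 5-shot accuracy) from Table IV.
BASELINES = {
    "CUB": ("TDM", 84.36, 93.37),
    "Dogs": ("MetaOptNet", 65.48, 79.39),
    "Cars": ("APF", 78.14, 87.42),
}


def gains():
    """Accuracy gain over the compared method, rounded to two decimals."""
    result = {}
    for dataset, (_method, one_shot, five_shot) in BASELINES.items():
        result[(dataset, 1)] = round(OURS[(dataset, 1)] - one_shot, 2)
        result[(dataset, 5)] = round(OURS[(dataset, 5)] - five_shot, 2)
    return result
```

In every case the 1-shot gain is larger than the 5-shot gain, which is where augmentation would be expected to matter most.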

To validate the unique advantage of our proposed algorithm in fine-grained data augmentation, we also compared it with data augmentation methods based on diffusion models. Instances augmented by different diffusion models are shown in Figure 5.

TABLE V: Comparison of data augmentation methods based on diffusion model. The best performance is indicated in bold.

| Method | CUB-200-2011 1-shot | CUB-200-2011 5-shot | Stanford Dogs 1-shot | Stanford Dogs 5-shot | Stanford Cars 1-shot | Stanford Cars 5-shot |
|---|---|---|---|---|---|---|
| LDM [21] (CVPR@2022) | 82.82±0.18 | 90.12±0.12 | 58.77±0.18 | 70.48±0.16 | 73.70±0.18 | 86.50±0.12 |
| VQ-Diffusion [70] (CVPR@2022) | 81.54±0.19 | 89.96±0.12 | 60.99±0.19 | 70.60±0.16 | 62.02±0.21 | 85.12±0.13 |
| Ours | 89.99±0.14 | 94.63±0.10 | 72.68±0.17 | 80.43±0.12 | 81.53±0.15 | 90.03±0.12 |

It was observed that, while the data augmented by existing diffusion-based models could capture the basic outline and features of the instances (e.g., Spotted Catbird, Bobolink, and Parakeet Auklet), the limited number of training samples makes it challenging for these models to capture finer details. For instance, the feathers of the Spotted Catbird are green, but the augmented data shows grey feathers. Moreover, the labels of fine-grained images require a certain level of expertise, making it difficult for large models to establish a clear mapping between labels and semantic features during pre-training. As a result, the augmented images may not correctly represent the instances (e.g., the augmented data of Geococcyx and Chuck-will-widow). This leads to significant feature contamination, impeding model learning in a few-shot setting.

This conclusion is supported quantitatively in Table V. To ensure a fair comparison, all results were obtained using the same framework. It can be observed that using SOTA diffusion-based models for data augmentation leads to a decrease in 1-shot accuracy compared to prior research, with declines of 1.54%, 4.49%, and 4.44% on the three datasets, respectively (relative to the best prior results in Table IV). This reveals the prevalence of feature contamination and its impact on few-shot FGVC learning.

TABLE VI: Ablation studies of the DRDM on three datasets. The best performance is indicated in bold.

| Dataset | Framework 1-shot | Framework 5-shot | Framework + SKR 1-shot | Framework + SKR 5-shot | Framework + DSR 1-shot | Framework + DSR 5-shot | DRDM 1-shot | DRDM 5-shot |
|---|---|---|---|---|---|---|---|---|
| CUB | 84.18±0.19 | 93.44±0.10 | 86.43±0.16 | 93.24±0.10 | 88.14±0.15 | 93.62±0.10 | 89.99±0.14 | 94.63±0.10 |
| Dogs | 61.44±0.22 | 78.73±0.15 | 62.53±0.23 | 79.69±0.15 | 72.14±0.18 | 79.86±0.15 | 72.68±0.17 | 80.43±0.12 |
| Cars | 67.10±0.22 | 86.05±0.12 | 71.79±0.21 | 88.79±0.12 | 80.58±0.17 | 89.52±0.11 | 81.53±0.15 | 90.03±0.12 |
Figure 6: The feature visualization using t-SNE [71]. Each row displays the results of different datasets, while each column represents the distribution of features at different stages. Each dot denotes the feature of a sample and different colors represent different categories. $R_{inter}$ and $R_{intra}$ indicate the degree of inter-class and intra-class compactness, respectively. $\rho$ denotes space utilization [72].

IV-D Ablation Study

To evaluate the proposed DRDM, an ablation study was conducted. “Framework” refers to the base structure used without either of the proposed strategies. On this basis, the influence of the DSR and SKR strategies on learning was explored. The experimental results are presented in Table VI.

Taking CUB as an example, it can be observed that SKR has achieved a maximum 2.25% performance improvement compared to Framework. Next, the effectiveness of the proposed DSR was evaluated, which is used to explore the relationships between instance labels and image information under weakly supervised conditions. We observed that this module improved the accuracy by 3.96% based on “Framework”. This suggests that extracting potential similarity relationships in labels will help the model understand features better. When SKR and DSR were used together, the model achieved optimal performance. Since our method uses a noise prediction network, we further validated the model’s efficacy by subjecting it solely to noise addition. We conducted experiments on the CUB dataset and reported the corresponding results. The accuracy for 5-way-1-shot and 5-way-5-shot was 81.72% and 91.98%, respectively. Compared to the Framework, there was a decrease of 2.46% and 1.46% in accuracy, respectively. This indicates that merely adding noise not only does not help improve the model’s generalization ability but also interferes with the model’s understanding of fine-grained targets.

TABLE VII: Performance combined with the current state-of-the-art FSL methods.

| Method | CUB-200-2011 1-shot | CUB-200-2011 5-shot | Stanford Dogs 1-shot | Stanford Dogs 5-shot | Stanford Cars 1-shot | Stanford Cars 5-shot |
|---|---|---|---|---|---|---|
| ProtoNet [25] (NIPS@2017) | 77.66±0.21 | 89.42±0.12 | 45.92±0.21 | 67.50±0.17 | 47.60±0.21 | 72.81±0.18 |
| ProtoNet + DRDM | 82.58±0.18 | 90.80±0.11 | 58.92±0.18 | 70.45±0.16 | 61.66±0.20 | 74.37±0.17 |
| FRN [52] (CVPR@2021) | 83.55±0.19 | 92.92±0.10 | 55.49±0.21 | 74.54±0.16 | 62.07±0.22 | 83.18±0.14 |
| FRN + DRDM | 85.40±0.17 | 93.52±0.10 | 61.88±0.22 | 77.86±0.15 | 79.25±0.19 | 88.23±0.10 |

IV-E Scalability Analysis

We also apply the augmented data generated by our method to existing models to evaluate the gains our approach brings to existing algorithms. As shown in Table VII, on the CUB dataset in the 1-shot setting, our algorithm achieves a top-1 performance improvement of 4.92% for ProtoNet and 1.85% for FRN. Similar conclusions can be drawn for the Dogs and Cars datasets. This indicates that our algorithm has good portability and can effectively enhance the performance of other few-shot learning models.

Figure 7: The first and second rows represent the results before and after using DSR, respectively. It can be observed that the effective separation of textual relationships has been achieved.

Simultaneously, we visualize the similarity of label features on three datasets to illustrate the improvement of our DSR model in addressing semantic misalignment issues. The results are presented in Figure 7. It can be observed that after applying DSR, the similarity relationships between labels are effectively constrained. This also reveals the reasons why augmented data can effectively prevent the generation of irrelevant images. Similar results can be observed in the other two datasets. This indicates that our method can better guide the learning process to improve the generalization ability of the model.

We also assess the impact of different types of reference data on the knowledge referencing module. The 1-shot results are reported in Table VIII.

TABLE VIII: Impact of data types on knowledge referencing (1-shot accuracy, %). Columns other than Baseline name the external reference dataset; “Pet” represents the “Oxford-IIIT Pet” dataset.
Dataset Baseline NABirds Pet CompCars
CUB [1] 84.18 86.43 85.34 85.09
Dogs [50] 61.44 61.55 62.53 60.89
Cars [51] 67.10 68.95 68.84 71.79
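To make the table easier to read, the sketch below converts each entry of Table VIII into a gain over the baseline and reports which reference dataset helps each target most. The numbers are copied from the table; the variable and function names are illustrative.

```python
# 1-shot accuracies (%) copied from Table VIII.
table_viii = {
    "CUB":  {"Baseline": 84.18, "NABirds": 86.43, "Pet": 85.34, "CompCars": 85.09},
    "Dogs": {"Baseline": 61.44, "NABirds": 61.55, "Pet": 62.53, "CompCars": 60.89},
    "Cars": {"Baseline": 67.10, "NABirds": 68.95, "Pet": 68.84, "CompCars": 71.79},
}


def reference_gains(row):
    """Gain over the baseline (percentage points) for each reference dataset."""
    base = row["Baseline"]
    return {ref: round(acc - base, 2) for ref, acc in row.items() if ref != "Baseline"}


best_reference = {}
for target, row in table_viii.items():
    per_reference = reference_gains(row)
    best_reference[target] = max(per_reference, key=per_reference.get)
    print(target, per_reference, "best:", best_reference[target])
```

In each row the largest gain comes from the reference dataset of the same object type (birds, pets, cars), and the one negative entry is CompCars used as a reference for Dogs.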

Leveraging external data to enhance the model’s understanding of instances under limited samples proved effective. When the reference data differ substantially from the target instances, embedding spatial relationships did not significantly affect the model’s performance. When the data types are similar, however, the model discriminated with greater confidence.

We also employed t-SNE [71] to qualitatively illustrate the impact of knowledge referencing on features, and used the spatial utilization ratio ρ [72] to quantify the degree of subclass aggregation. ρ is the ratio of the average inter-class distance to the average intra-class distance of the features. A higher value indicates a more compact feature distribution and a wider decision boundary, which allows the model to make decisions more confidently. The results are presented in Figure 6. Compared with the baseline framework, using label similarity relationships to constrain model learning led to aggregation of the feature distribution; on the Cars and Dogs datasets, ρ even doubled. With the subsequent introduction of appropriate reference knowledge, the feature distribution exhibited further intra-subclass aggregation and wider inter-subclass boundaries, ultimately benefiting the model’s final decision-making process. Compared with ProtoNet and FRN, our algorithm still exhibited the highest spatial utilization ratio, which partially explains why it achieved the best results.
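The ratio ρ described above can be sketched as follows. This is one plausible reading of the definition, not necessarily the exact formulation of [72]: the inter-class distance is assumed to be the mean Euclidean distance between class centroids, and the intra-class distance is assumed to be the mean distance of samples to their own centroid, averaged over classes. The function name and the synthetic data are illustrative.

```python
import numpy as np


def spatial_utilization_ratio(features, labels):
    """Ratio of average inter-class to average intra-class distance.

    Assumption: distances are Euclidean and measured via class centroids.
    Requires at least two classes.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    centroids = np.stack([features[labels == c].mean(axis=0) for c in classes])

    # Intra-class: per-class mean distance to the centroid, averaged over classes.
    intra = np.mean([
        np.linalg.norm(features[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(classes)
    ])

    # Inter-class: mean distance over all unordered pairs of centroids.
    diffs = centroids[:, None, :] - centroids[None, :, :]
    pair_dist = np.linalg.norm(diffs, axis=-1)
    upper = np.triu_indices(len(classes), k=1)
    inter = pair_dist[upper].mean()
    return inter / intra


# Synthetic 2-D features: same class centers, different within-class spread.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
labels = np.repeat(np.arange(3), 50)
tight = centers[labels] + 0.5 * rng.standard_normal((150, 2))
loose = centers[labels] + 2.0 * rng.standard_normal((150, 2))
rho_tight = spatial_utilization_ratio(tight, labels)
rho_loose = spatial_utilization_ratio(loose, labels)
print(rho_tight > rho_loose)
```

With the class centers held fixed, tightening each cluster raises ρ, which is the sense in which a higher value indicates more compact subclasses.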

V Conclusions

Augmenting data with subtle and consistent discriminative features is one effective approach to achieving reliable few-shot FGVC. In this paper, we introduce a diffusion-based solution that explores latent instance similarity relationships within labels and employs external data for feature referencing, achieving enhanced data augmentation through detail reinforcement. The experimental results show that employing a diffusion-based class generation model alone for data augmentation cannot effectively address the challenge of few-shot learning. In contrast, our method leverages the constraints of similarity relationships within labels and the reference from external knowledge, effectively alleviating label feature contamination and achieving a substantial performance improvement. We also qualitatively and quantitatively validated the impact of reference knowledge from different classes on the model: our approach can effectively extract knowledge from similar subclass data while avoiding the influence of irrelevant subclasses. One limitation should be noted: our method primarily relies on label similarity for feature constraints and may overlook underlying cross-hierarchical label relationships. In the future, we aim to explore additional correlated information, such as cross-hierarchical label relationships or structural information within single-instance features, to further enhance the model’s generalization capabilities.

References

  • [1] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The caltech-ucsd birds-200-2011 dataset.”   California Institute of Technology, 2011.
  • [2] J. Mańdziuk, “New shades of the vehicle routing problem: Emerging problem formulations and computational intelligence solution methods,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 3, no. 3, pp. 230–244, 2019.
  • [3] K. Sadeghi, A. Banerjee, and S. K. Gupta, “A system-driven taxonomy of attacks and defenses in adversarial machine learning,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 4, no. 4, pp. 450–467, 2020.
  • [4] J. Yi, H. Zhang, J. Mao, Y. Chen, H. Zhong, and Y. Wang, “Pharmaceutical foreign particle detection: An efficient method based on adaptive convolution and multiscale attention,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 6, no. 6, pp. 1302–1313, 2022.
  • [5] J. Du, K. Guan, Y. Zhou, Y. Li, and T. Wang, “Parameter-free similarity-aware attention module for medical image classification and segmentation,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 7, no. 3, pp. 845–857, 2023.
  • [6] S. Ye, Y. Wang, Q. Peng, X. You, and C. P. Chen, “The image data and backbone in weakly supervised fine-grained visual categorization: A revisit and further thinking,” IEEE Transactions on Circuits and Systems for Video Technology, 2023.
  • [7] X. He, Y. Peng, and J. Zhao, “Stackdrl: Stacked deep reinforcement learning for fine-grained visual categorization.” in IJCAI, 2018, pp. 741–747.
  • [8] X. Zheng, L. Qi, Y. Ren, and X. Lu, “Fine-grained visual categorization by localizing object parts with single image,” IEEE Transactions on Multimedia, vol. 23, pp. 1187–1199, 2020.
  • [9] Y. Ding, Y. Zhou, Y. Zhu, Q. Ye, and J. Jiao, “Selective sparse sampling for fine-grained image recognition,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6599–6608.
  • [10] S. Ye, Q. Peng, W. Sun, J. Xu, Y. Wang, X. You, and Y.-M. Cheung, “Discriminative suprasphere embedding for fine-grained visual categorization,” IEEE Transactions on Neural Networks and Learning Systems, 2022.
  • [11] Y. Wang, S. Ye, S. Yu, and X. You, “R2-trans: Fine-grained visual categorization with redundancy reduction,” arXiv preprint arXiv:2204.10095, 2022.
  • [12] Z. Hong, S. Chen, G. Xie, W. Yang, J. Zhao, Y. Shao, Q. Peng, and X. You, “Semantic compression embedding for generative zero-shot learning,” IJCAI, Vienna, Austria, vol. 7, pp. 956–963, 2022.
  • [13] Y. Shu, B. Yu, H. Xu, and L. Liu, “Improving fine-grained visual recognition in low data regimes via self-boosting attention mechanism,” in Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXV.   Springer, 2022, pp. 449–465.
  • [14] Y. Wang, X. Pei, and H. Zhan, “Fine-grained graph learning for multi-view subspace clustering,” IEEE Transactions on Emerging Topics in Computational Intelligence, pp. 1–12, 2023.
  • [15] K.-Y. Feng, M. Gong, K. Pan, H. Zhao, Y. Wu, and K. Sheng, “Model sparsification for communication-efficient multi-party learning via contrastive distillation in image classification,” IEEE Transactions on Emerging Topics in Computational Intelligence, pp. 1–14, 2023.
  • [16] K. Li, Y. Zhang, K. Li, and Y. Fu, “Adversarial feature hallucination networks for few-shot learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 13 470–13 479.
  • [17] H. Le and D. Samaras, “Shadow removal via shadow image decomposition,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 8578–8587.
  • [18] R. Zhang, T. Che, Z. Ghahramani, Y. Bengio, and Y. Song, “Metagan: An adversarial approach to few-shot learning,” Advances in Neural Information Processing Systems, vol. 31, 2018.
  • [19] H. Gao, Z. Shou, A. Zareian, H. Zhang, and S.-F. Chang, “Low-shot learning via covariance-preserving adversarial augmentation networks,” Advances in Neural Information Processing Systems, vol. 31, 2018.
  • [20] S. Tsutsui, Y. Fu, and D. Crandall, “Meta-reinforced synthetic data for one-shot fine-grained visual recognition,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [21] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 684–10 695.
  • [22] W. Li, Z. Wang, X. Yang, C. Dong, P. Tian, T. Qin, J. Huo, Y. Shi, L. Wang, Y. Gao et al., “Libfewshot: A comprehensive library for few-shot learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • [23] S. Yan, N. Dong, L. Zhang, and J. Tang, “Clip-driven fine-grained text-image person re-identification,” IEEE Transactions on Image Processing, 2023.
  • [24] Z. Xin, S. Chen, T. Wu, Y. Shao, W. Ding, and X. You, “Few-shot object detection: Research advances and challenges,” Information Fusion, p. 102307, 2024.
  • [25] J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot learning,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [26] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales, “Learning to compare: Relation network for few-shot learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1199–1208.
  • [27] L. Tang, D. Wertheimer, and B. Hariharan, “Revisiting pose-normalization for fine-grained few-shot recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 14 352–14 361.
  • [28] W.-Y. Chen, Y.-C. Liu, Z. Kira, Y.-C. F. Wang, and J.-B. Huang, “A closer look at few-shot classification,” arXiv preprint arXiv:1904.04232, 2019.
  • [29] Y. Tian, Y. Wang, D. Krishnan, J. B. Tenenbaum, and P. Isola, “Rethinking few-shot image classification: a good embedding is all you need?” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16.   Springer, 2020, pp. 266–282.
  • [30] G. S. Dhillon, P. Chaudhari, A. Ravichandran, and S. Soatto, “A baseline for few-shot image classification,” arXiv preprint arXiv:1909.02729, 2019.
  • [31] X. Yin, X. Yu, K. Sohn, X. Liu, and M. Chandraker, “Feature transfer learning for deep face recognition with under-represented data,” arXiv preprint arXiv:1803.09014, 2018.
  • [32] B. Hariharan and R. Girshick, “Low-shot visual recognition by shrinking and hallucinating features,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3018–3027.
  • [33] E. Schwartz, L. Karlinsky, J. Shtok, S. Harary, M. Marder, A. Kumar, R. Feris, R. Giryes, and A. Bronstein, “Delta-encoder: an effective sample synthesis method for few-shot object recognition,” Advances in Neural Information Processing Systems, vol. 31, 2018.
  • [34] J. Xu, H. Le, M. Huang, S. Athar, and D. Samaras, “Variational feature disentangling for fine-grained few-shot classification,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 8812–8821.
  • [35] B. Zhang, X. Li, Y. Ye, Z. Huang, and L. Zhang, “Prototype completion with primitive knowledge for few-shot learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3754–3762.
  • [36] S. Lee, W. Moon, and J.-P. Heo, “Task discrepancy maximization for fine-grained few-shot classification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5331–5340.
  • [37] “Coping with change: Learning invariant and minimum sufficient representations for fine-grained visual categorization,” Computer Vision and Image Understanding, vol. 237, p. 103837, 2023.
  • [38] Z. Hong, Z. Wang, L. Shen, Y. Yao, Z. Huang, S. Chen, C. Yang, M. Gong, and T. Liu, “Improving non-transferable representation learning by harnessing content and style,” in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=FYKVPOHCpE
  • [39] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention.   Springer, 2015, pp. 234–241.
  • [40] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
  • [41] X. Liu, Z. Hu, H. Ling, and Y.-m. Cheung, “Mtfh: A matrix tri-factorization hashing framework for efficient cross-modal retrieval,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 3, pp. 964–981, 2019.
  • [42] X. Liu, X. Wang, and Y.-m. Cheung, “Fddh: Fast discriminative discrete hashing for large-scale cross-modal retrieval,” IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 11, pp. 6306–6320, 2021.
  • [43] S.-J. Peng, Y. Fan, Y.-m. Cheung, X. Liu, Z. Cui, and T. Li, “Towards efficient cross-modal anomaly detection using triple-adaptive network and bi-quintuple contrastive learning,” IEEE Transactions on Emerging Topics in Computational Intelligence, 2023.
  • [44] G. Baykal, H. F. Karagoz, T. Binhuraib, and G. Unal, “Protodiffusion: Classifier-free diffusion guidance with prototype learning,” arXiv preprint arXiv:2307.01924, 2023.
  • [45] J. Pfeiffer, A. Kamath, A. Rücklé, K. Cho, and I. Gurevych, “Adapterfusion: Non-destructive task composition for transfer learning,” arXiv preprint arXiv:2005.00247, 2020.
  • [46] A. Rücklé, G. Geigle, M. Glockner, T. Beck, J. Pfeiffer, N. Reimers, and I. Gurevych, “Adapterdrop: On the efficiency of adapters in transformers,” arXiv preprint arXiv:2010.11918, 2020.
  • [47] R. Wang, D. Tang, N. Duan, Z. Wei, X. Huang, G. Cao, D. Jiang, M. Zhou et al., “K-adapter: Infusing knowledge into pre-trained models with adapters,” arXiv preprint arXiv:2002.01808, 2020.
  • [48] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning.   PMLR, 2021, pp. 8748–8763.
  • [49] X. Wang, X. Han, W. Huang, D. Dong, and M. R. Scott, “Multi-similarity loss with general pair weighting for deep metric learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 5022–5030.
  • [50] A. Khosla, N. Jayadevaprakash, B. Yao, and F.-F. Li, “Novel dataset for fine-grained image categorization: Stanford dogs,” in Proc. CVPR workshop on fine-grained visual categorization (FGVC), vol. 2, no. 1.   Citeseer, 2011.
  • [51] J. Krause, M. Stark, J. Deng, and L. Fei-Fei, “3d object representations for fine-grained categorization,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2013, pp. 554–561.
  • [52] D. Wertheimer, L. Tang, and B. Hariharan, “Few-shot classification with feature map reconstruction networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8012–8021.
  • [53] W. Li, L. Wang, J. Xu, J. Huo, Y. Gao, and J. Luo, “Revisiting local descriptor based image-to-class measure for few-shot learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7260–7268.
  • [54] G. Van Horn, S. Branson, R. Farrell, S. Haber, J. Barry, P. Ipeirotis, P. Perona, and S. Belongie, “Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 595–604.
  • [55] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. Jawahar, “Cats and dogs,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.   IEEE, 2012, pp. 3498–3505.
  • [56] L. Yang, P. Luo, C. Change Loy, and X. Tang, “A large-scale car dataset for fine-grained categorization and verification,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3973–3981.
  • [57] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • [58] Q. Sun, Y. Liu, T.-S. Chua, and B. Schiele, “Meta-transfer learning for few-shot learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 403–412.
  • [59] K. Lee, S. Maji, A. Ravichandran, and S. Soatto, “Meta-learning with differentiable convex optimization,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 10 657–10 665.
  • [60] P. Mangla, N. Kumari, A. Sinha, M. Singh, B. Krishnamurthy, and V. N. Balasubramanian, “Charting the right manifold: Manifold mixup for few-shot learning,” in Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2020, pp. 2218–2227.
  • [61] B. Liu, Y. Cao, Y. Lin, Q. Li, Z. Zhang, M. Long, and H. Hu, “Negative margin matters: Understanding margin in few-shot classification,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16.   Springer, 2020, pp. 438–455.
  • [62] A. Afrasiyabi, J.-F. Lalonde, and C. Gagné, “Associative alignment for few-shot image classification,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16.   Springer, 2020, pp. 18–35.
  • [63] Y. Zhu, C. Liu, and S. Jiang, “Multi-attention meta learning for few-shot fine-grained image recognition.” in IJCAI, 2020, pp. 1090–1096.
  • [64] X. Li, J. Wu, Z. Sun, Z. Ma, J. Cao, and J.-H. Xue, “Bsnet: Bi-similarity network for few-shot fine-grained image classification,” IEEE Transactions on Image Processing, vol. 30, pp. 1318–1331, 2020.
  • [65] H. Tang, C. Yuan, Z. Li, and J. Tang, “Learning attention-guided pyramidal features for few-shot fine-grained recognition,” Pattern Recognition, vol. 130, p. 108792, 2022.
  • [66] P. Li, G. Zhao, and X. Xu, “Coarse-to-fine few-shot classification with deep metric learning,” Information Sciences, vol. 610, pp. 592–604, 2022.
  • [67] B. Munjal, A. Flaborea, S. Amin, F. Tombari, and F. Galasso, “Query-guided networks for few-shot fine-grained classification and person search,” Pattern Recognition, vol. 133, p. 109049, 2023.
  • [68] N. Sun and P. Yang, “T2l: Trans-transfer learning for few-shot fine-grained visual categorization with extended adaptation,” Knowledge-Based Systems, vol. 264, p. 110329, 2023.
  • [69] M.-H. Pan, H.-Y. Xin, C.-Q. Xia, and H.-B. Shen, “Few-shot classification with task-adaptive semantic feature learning,” Pattern Recognition, vol. 141, p. 109594, 2023.
  • [70] S. Gu, D. Chen, J. Bao, F. Wen, B. Zhang, D. Chen, L. Yuan, and B. Guo, “Vector quantized diffusion model for text-to-image synthesis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10 696–10 706.
  • [71] L. Van der Maaten and G. Hinton, “Visualizing data using t-sne.” Journal of Machine Learning Research, vol. 9, no. 11, 2008.
  • [72] Y. Tian, X. Yu, B. Fan, F. Wu, H. Heijnen, and V. Balntas, “Sosnet: Second order similarity regularization for local descriptor learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 11 016–11 025.
Tianxu Wu is currently pursuing the Ph.D. degree in the School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan, China. His research interests include machine learning and computer vision.
Shuo Ye is currently a full-time Ph.D. student in the School of Electronic Information and Communications, Huazhong University of Science and Technology (HUST), China. His current research interests span computer vision and voice signal processing, with topics such as automatic speech recognition and fine-grained image categorization.
Shuhuang Chen is currently pursuing the Ph.D. degree in the School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan, China. His research interests include machine learning and computer vision.
Qinmu Peng received the Ph.D. degree from the Department of Computer Science, Hong Kong Baptist University, Hong Kong, in 2015. He is currently an Assistant Professor with the School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan, China, and the Shenzhen Research Institute, Huazhong University of Science and Technology, Shenzhen, China. His current research interests include medical image processing, pattern recognition, machine learning, and computer vision.
Xinge You (M’08-SM’10) received the B.S. and M.S. degrees in mathematics from Hubei University, Wuhan, China, and the Ph.D. degree from the Department of Computer Science, Hong Kong Baptist University, Hong Kong, in 1990, 2000, and 2004, respectively. He is currently a Professor with the School of Electronic Information and Communications, Huazhong University of Science and Technology, China. His research interests include pattern recognition, image and signal processing, computer vision, and machine learning. He has published more than 100 papers in venues such as the IEEE Transactions on Pattern Analysis and Machine Intelligence, TCB, the IEEE Transactions on Image Processing, and CVPR. He is a Senior Member of the IEEE.