Extract More from Less: Efficient Fine-Grained Visual Recognition in Low-Data Regimes

Dmitry, Abduragim, Mihail, Mohammad
Mohamed bin Zayed University of Artificial Intelligence
UAE, Abu Dhabi

Abstract

The emerging task of fine-grained image classification in low-data regimes assumes the presence of low inter-class variance and large intra-class variation along with a highly limited number of training samples per class. However, traditional ways of separately dealing with fine-grained categorisation and extremely scarce data may be inefficient when both of these harsh conditions are present together. In this paper, we present a novel framework, called AD-Net, aiming to enhance deep neural network performance on this challenge by leveraging the power of Augmentation and Distillation techniques. Specifically, our approach is designed to refine learned features through self-distillation on augmented samples, mitigating harmful overfitting. We conduct comprehensive experiments on popular fine-grained image classification benchmarks where our AD-Net demonstrates consistent improvement over traditional fine-tuning and state-of-the-art low-data techniques. Remarkably, with the smallest data available, our framework shows an outstanding relative accuracy increase of up to 45% compared to standard ResNet-50 and up to 27% compared to the closest SOTA runner-up. We emphasise that our approach is practically architecture-independent and adds zero extra cost at inference time. Additionally, we provide an extensive study on the impact of each component of the framework, highlighting the importance of each in achieving optimal performance. Source code and trained models are publicly available at github.com/demidovd98/fgic_lowd.

1 Introduction

Overview. Deep learning models, inherently data-hungry, require vast quantities of annotated data for effective training [Dosovitskiy et al.(2020)Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Dehghani, Minderer, Heigold, Gelly, et al.]. However, acquiring such extensive datasets, particularly with meticulous annotations, is labour-intensive and challenging. Following this constraint, the task of low-data image classification assumes that only a highly limited number of annotated samples per class is available with no other information [Schmarje et al.(2021)Schmarje, Santarossa, Schröder, and Koch, Yang et al.(2022)Yang, Song, King, and Xu, Xu et al.(2021)Xu, Zhang, Hu, Wang, Wang, Wei, Bai, and Liu]. Despite the advantages of the existing methodologies, they often come with computationally expensive and architecture-dependent extra modules [Shu et al.(2022)Shu, Yu, Xu, and Liu] and in some cases still demand a substantial minimum quantity of data [Wang et al.(2021)Wang, Yu, and Gao] or expect the input data to be in a specific domain only [Demidov et al.(2024)Demidov, Al Majzoub, Kumar, and Khan].

Figure 1: An overview of our proposed pre-processing pipeline. Three random crops of different sizes are generated from each input image and all of them are further augmented with the same set of transformations. The largest cropped region is further used for classification and the other two (mid- and small-size) are used as target and source in a distillation objective. Only random crop is shown as an applied augmentation for simplicity.

A potentially helpful traditional way to address this bottleneck is to artificially increase data amounts by applying various data augmentations [Simonyan and Zisserman(2014), Cubuk et al.(2018)Cubuk, Zoph, Mane, Vasudevan, and Le, Yun et al.(2019)Yun, Han, Oh, Chun, Choe, and Yoo]. However, our investigation demonstrates that, when applied within the standard transfer learning pipeline [He et al.(2016)He, Zhang, Ren, and Sun], these approaches are not practically efficient under the extremely low amount of data. Another challenging domain within computer vision is fine-grained visual recognition, where the goal is to learn distinctive features for visually or semantically similar sub-classes of the same meta-category [Schmarje et al.(2021)Schmarje, Santarossa, Schröder, and Koch, Demidov. et al.(2023)Demidov., Sharif., Abdurahimov., Cholakkal., and Khan.]. This task may be difficult even for human annotators due to specific domain knowledge requirements [Yang et al.(2022)Yang, Song, King, and Xu, Xu et al.(2021)Xu, Zhang, Hu, Wang, Wang, Wei, Bai, and Liu], making large-scale data acquisition and labelling impractical. Therefore, the problem of low-data learning is especially important for fine-grained classification tasks where the available data per class is often scarce [Yang et al.(2022)Yang, Song, King, and Xu, Xu et al.(2021)Xu, Zhang, Hu, Wang, Wang, Wei, Bai, and Liu].

Problem Statement. Our work focuses on fine-grained image classification (FGIC) under low-data regimes, where both intra-class variation is high and the data is scarce, so conventional solutions may be limited [Simonyan and Zisserman(2014)]. This challenge is complicated by the intricate nature of the FGIC task and limited research in transfer learning for low-data problems. Most of the few existing solutions [Shu et al.(2022)Shu, Yu, Xu, and Liu, Lu et al.(2022)Lu, Lu, Yu, and Wang] aim to directly introduce learning constraints without addressing the data variation insufficiency and often at the cost of noticeable time and resource increase. Therefore, motivated by the under-explored low-data learning domain, we provide a solution that effectively solves fine-grained classification problems by utilising a soft and indirectly regularised transfer learning technique in order to mitigate overfitting challenges.

Approach. We propose a novel approach which performs two crucial tasks: it leverages image augmentations to enrich the feature space and utilises self-distillation to facilitate knowledge refinement among feature layers. Unlike traditional methods, our single-model approach integrates the classification and teacher-student distillation pipelines in one framework. Specifically, in our solution, we leverage a self-distillation objective to maintain close feature distributions between teacher and student outputs for different views of each image, which helps to enhance learned representations. Our detailed study offers the following contributions:

• We propose an end-to-end low-data framework which includes two feature distillation branches for model purification by soft and implicit regularisation.

• Our architecture-independent solution allows a model to learn and refine representations by enriching the variability of data through extra augmented image views.

• We demonstrate that our approach achieves state-of-the-art results on popular FGIC benchmarks under extremely low data regimes and helps to reduce overfitting.

• With our framework, the performance gain comes at zero extra cost at inference time.

In Section 4.3, we present an extensive study of the architecture design, objective functions, and augmentation types, contributing to the evolving landscape of both fine-grained image classification and low-data-regime learning.

2 Related Work

Fine-grained Image Classification. There exist multiple approaches for the fine-grained image classification setting. Traditionally, pre-trained R-CNNs [Zhang et al.(2014)Zhang, Donahue, Girshick, and Darrell] and part detectors [Branson et al.(2014)Branson, Van Horn, Belongie, and Perona] were utilised to detect similarities between specific parts of the object in the image. Modern approaches, in contrast, tend to utilise end-to-end architectures [Zhuang et al.(2020)Zhuang, Wang, and Qiao] where a mutual feature vector is obtained from multiple backbones and further used to compare to unique image representations such that the model can distinguish between difficult classes [Lin et al.(2015)Lin, RoyChowdhury, and Maji, Gao et al.(2016)Gao, Beijbom, Zhang, and Darrell, Zheng et al.(2019)Zheng, Fu, Zha, and Luo]. Another way to solve the FGIC problem is to focus on robust loss functions [Chang et al.(2020)Chang, Ding, Xie, Bhunia, Li, Ma, Wu, Guo, and Song], for example by combining a discriminability component, which uses a channel-wise attention mechanism to force all feature channels belonging to the same class to be discriminative, with a diversity component, which promotes mutually exclusive channels. Similar methods also focus on better feature extraction, like aggregating the important tokens from each transformer layer to compensate for local, low-level and middle-level information [Wang et al.(2021)Wang, Yu, and Gao]. Some of the recent approaches [Lagunas et al.(2023)Lagunas, Impata, Martinez, Fernandez, Georgakis, Braun, and Bertrand, Chou et al.(2022)Chou, Lin, and Kao] consider utilising the heavy Vision Transformer (ViT) [Dosovitskiy et al.(2020)Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Dehghani, Minderer, Heigold, Gelly, et al.] architecture and semi-supervised settings; however, such backbones require a computationally expensive pre-training phase and are noticeably slow at inference time.

Low-Data Learning. The specific task of fine-grained image classification in low-data regimes [Horn and Perona(2017)] is not widely explored, and, to our knowledge, there exist only a few approaches that attempt to solve it. One of them uses a self-boosting attention mechanism in CNNs to focus on more descriptive parts of the classes [Shu et al.(2022)Shu, Yu, Xu, and Liu]. However, this FGIC method is designed in an architecture-specific way and cannot be extended to other backbones. There exist other approaches that attempt to improve the training data quality by utilising saliency maps in CNNs to specifically target low-data regime settings. This can be seen in [Flores et al.(2019)Flores, Gonzalez-Garcia, van de Weijer, and Raducanu] where the authors incorporate a specific attention mechanism in the original image classification pipeline such that the model focuses on the important distinctive parts of the image when classifying similar classes. In [Tang et al.(2022)Tang, Yuan, Li, and Tang] the authors use a multi-scale feature pyramid and a multi-level attention pyramid on a backbone network to progressively aggregate features from different granular spaces. They further present an attention-guided refinement strategy in collaboration with a multi-level attention pyramid to reduce the uncertainty brought by backgrounds conditioned by limited samples. Nevertheless, the existing methods are often sub-optimal or architecture-specific and resource-demanding, leaving room for a more elegant solution.

Augmentations. Recent developments in augmentation strategies for self-supervised models include methods like MultiCrop [Caron et al.(2020a)Caron, Misra, Mairal, Goyal, Bojanowski, and Joulin], which employs a mix of views with different resolutions, enhancing encoding space variability without increasing memory or compute requirements. Another powerful augmentation is ScaleMix [Wang et al.(2022)Wang, Fan, Tian, Kihara, and Chen], which generates new views of an image by mixing two views of potentially different scales using binary masking. ScaleMix is similar to CutMix [Yun et al.(2019)Yun, Han, Oh, Chun, Choe, and Yoo], but it operates exclusively on views from the same image and produces a single view of standard size; it thereby approximates the benefits of multiple crops without requiring separate processing of small crops. Both MultiCrop and ScaleMix introduce extra variance to the encoding space, enriching the representations.

Distillation. Another recent approach utilised for efficient learning is model distillation during training. Self-supervised learning methodologies have increasingly embraced distillation objectives as a central principle for robust representation learning in the absence of explicit labels [Jaiswal et al.(2021)Jaiswal, Babu, Zadeh, Banerjee, and Makedon, Mazumder et al.(2021)Mazumder, Singh, and Namboodiri]. Empirical studies in the literature, including works such as [Caron et al.(2021)Caron, Touvron, Misra, Jégou, Mairal, Bojanowski, and Joulin, Grill et al.(2020)Grill, Strub, Altché, Tallec, Richemond, Buchatskaya, Doersch, Pires, Guo, Azar, Piot, Kavukcuoglu, Munos, and Valko, Zbontar et al.(2021)Zbontar, Jing, Misra, LeCun, and Deny], demonstrate reasonable performance, particularly in the challenging paradigm of zero-shot learning [Sultana et al.(2022)Sultana, Naseer, Khan, Khan, and Khan]. Drawing motivation from this paradigm, we attempt to integrate the distillation techniques into our approach to target the challenging scenarios of limited data availability.

3 Method

3.1 Baseline Framework

Transfer Learning. In traditional transfer learning, a backbone is usually pre-trained on a huge and visually diverse dataset (such as ImageNet [Deng et al.(2009)Deng, Dong, Socher, Li, Li, and Fei-Fei]) and is further fine-tuned on a smaller fine-grained downstream task in order to transfer generic low- and mid-level features learned on the large and diverse dataset. This type of initialisation is especially beneficial in the case of using smaller fine-grained datasets [Lagunas et al.(2023)Lagunas, Impata, Martinez, Fernandez, Georgakis, Braun, and Bertrand, Darvish et al.(2022)Darvish, Pouramini, and Bahador]. Adopting standard terminology in transfer learning, we define the source domain $D_{S}$ with data $X_{S}$ and the target domain $D_{T}$ with data $X_{T}$ . Here, $X=\{x_{1},...,x_{n}\}$ represents the feature set within domain-specific feature spaces $\chi_{S}$ and $\chi_{T}$ , respectively. The objective in this context is to predict the label $y^{\prime}$ for a new instance $x^{\prime}$ in the target domain, utilising a label space $Y$ and learning a prediction function $f:\chi_{T}\rightarrow Y$ . This function is derived from training pairs $(x_{i},y_{i})$ , where $x_{i}\in X_{S}\cup X_{T}$ and $y_{i}\in Y$ , effectively mapping the learned features to the desired outputs.

Limitations. Although the traditional fine-tuning strategy is useful for standard datasets with relative abundance of labelled samples, it may be sub-optimal on tasks with highly limited data. In low-data regimes, this practically leads to dramatic overfitting to the few given samples per class. Moreover, previous works on low-data learning [Shu et al.(2022)Shu, Yu, Xu, and Liu, Wang et al.(2021)Wang, Yu, and Gao] have demonstrated that the unconstrained baselines mostly focus on the most obvious but less distinctive visual features. Taking this into account, we address these challenges of the low-data setting with our proposed approach, which is able to reduce harmful overfitting by refining learned representations through self-distillation on multiple augmented views of the input image.

3.2 Our Approach

Overall Architecture. In this paper, we propose a framework, called AD-Net, which aims to achieve the enhancement of deep neural network performance through the combination of Augmentation and Distillation techniques. Such a combination leads to substantial quality improvement of learned information per image, which allows for the neural network to learn better representations with fewer images. Our primary focus is on the application of self-distillation techniques, incorporating an additional distillation loss applied to augmented input samples to facilitate knowledge enhancement within the network. More specifically, we explore the utilisation of various data augmentations to further improve the model’s robustness and experiment with different distillation loss functions and components, providing a comprehensive analysis of their effects on the model’s performance. Taking this into account, our proposed framework incorporates a multi-branch configuration, functioning in a manner akin to Siamese networks [Jaiswal et al.(2021)Jaiswal, Babu, Zadeh, Banerjee, and Makedon, Mazumder et al.(2021)Mazumder, Singh, and Namboodiri]. This design choice allows for weight sharing between branches, enhancing computational efficiency and facilitating the consolidation of learned features. Despite the shared weights, each branch is uniquely controlled by distinct loss objectives, emphasising their individual contributions to the overall learning process. Specifically, the branches can be delineated as the classification branch and the distillation branches for simplicity. Figure 2 provides an overview of our proposed framework.

Classification Branch. This branch replicates the standard supervised training process, which follows the conventional fine-tuning procedure [Ramdan et al.(2020)Ramdan, Heryana, Arisal, Kusumo, and Pardede] with an ordinary classification objective. Full-sized images with traditional data augmentation techniques applied are fed into this branch in order to separately obtain the class prediction distribution.

Distillation Branches. This part of the architecture performs knowledge distillation with two differently augmented views of the input image. More specifically, one "source" branch and one "target" branch are utilised to generate corresponding source and target representations of the original images, which are further used for self-distillation [Touvron et al.(2021)Touvron, Cord, Douze, Massa, Sablayrolles, and Jegou, Chen et al.(2020)Chen, Kornblith, Swersky, Norouzi, and Hinton]. This encourages the model to do feature refinement towards more useful class-specific patterns. In more detail, both branches utilise the same shared-weight backbone from the main classification branch, but it is now fed with random mid-sized and small crops of the original image. Specifically, following [Wang et al.(2022)Wang, Fan, Tian, Kihara, and Chen], we randomly sample two categories of patches, larger target crops ranging between 65-85% of the original image content and smaller source crops with an area range of 1-65% in order to capture fine-grained class-specific patterns [Caron et al.(2021)Caron, Touvron, Misra, Jégou, Mairal, Bojanowski, and Joulin]. This kind of image space limitation can help to increase data variation and introduce implicit feature pruning via information dropout [Touvron et al.(2022)Touvron, Cord, and Jégou]. Next, all sampled regions are resized to the fixed resolution and augmented with the same set of data augmentations used for the classification branch. Thus, for each input image, one cropped region from both categories is fed into the corresponding branches of the model to obtain two different representations. Finally, inspired by [Touvron et al.(2021)Touvron, Cord, Douze, Massa, Sablayrolles, and Jegou], a self-distillation technique is applied to the corresponding feature outputs of previously sampled image crops, which encourages the model to learn better class-specific patterns. This helps to overcome quick overfitting towards explicit but weak features.
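The crop sampling described above can be sketched as follows. This is only an illustrative sketch, not the released implementation: the function names, the aspect-ratio range of 3/4 to 4/3 (a common default for random resized crops), the retry count, and the centred fallback are all assumptions; only the two area ranges (65-85% for target crops and 1-65% for source crops) come from the text.

```python
import math
import random


def sample_crop_box(height, width, area_range, ratio_range=(3 / 4, 4 / 3),
                    rng=random, max_tries=10):
    """Return (top, left, crop_h, crop_w) covering a random fraction of the image area."""
    full_area = height * width
    for _ in range(max_tries):
        target_area = full_area * rng.uniform(*area_range)
        # Sample the aspect ratio log-uniformly (assumed, not stated in the paper).
        log_ratio = rng.uniform(math.log(ratio_range[0]), math.log(ratio_range[1]))
        ratio = math.exp(log_ratio)
        crop_w = int(round(math.sqrt(target_area * ratio)))
        crop_h = int(round(math.sqrt(target_area / ratio)))
        if 0 < crop_w <= width and 0 < crop_h <= height:
            top = rng.randint(0, height - crop_h)
            left = rng.randint(0, width - crop_w)
            return top, left, crop_h, crop_w
    # Fallback (assumed): centred crop with the image's aspect ratio and mid-range area.
    scale = math.sqrt(sum(area_range) / 2)
    crop_h = max(1, int(height * scale))
    crop_w = max(1, int(width * scale))
    return (height - crop_h) // 2, (width - crop_w) // 2, crop_h, crop_w


def sample_distillation_views(height, width, rng=random):
    """Sample one larger target crop and one smaller source crop for a single image."""
    return {
        "target": sample_crop_box(height, width, (0.65, 0.85), rng=rng),
        "source": sample_crop_box(height, width, (0.01, 0.65), rng=rng),
    }


if __name__ == "__main__":
    boxes = sample_distillation_views(448, 448, random.Random(0))
    for name, (top, left, crop_h, crop_w) in boxes.items():
        fraction = crop_h * crop_w / (448 * 448)
        print(f"{name}: top={top} left={left} size={crop_h}x{crop_w} area={fraction:.2f}")
```

In a full pipeline, each box would then be cut from the image, resized to the fixed input resolution, and passed through the same augmentations as the classification view.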

Loss Formulation. In our approach, we deploy a combination of two distinct loss functions to optimise learning efficiency. The primary classification objective, denoted as $\mathcal{L}_{main}$ , explicitly facilitates the probability distribution prediction towards a pre-defined class. Concurrently, the secondary distillation objective $\mathcal{L}_{dist}$ serves as an implicit regularisation mechanism [Wang et al.(2022)Wang, Fan, Tian, Kihara, and Chen]. More specifically, the classification loss is applied to the output logits of the classification branch, while the distillation loss is applied to the feature maps derived from the target and source distillation branches (refer to Section 2). For our main classification objective, we employ the traditionally used cross-entropy loss [Mao et al.(2023)Mao, Mohri, and Zhong, Hui and Belkin(2021), Zhang and Sabuncu(2018)]:

$$\mathcal{L}_{main}=-\frac{1}{N}\sum_{n=1}^{N}\sum_{i=1}^{C}y_{ni}\log(\hat{y}_{ni}) \tag{1}$$

where $y_{ni}$ is the ground truth label and $\hat{y}_{ni}$ is the prediction of the model.
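As a worked illustration of Eq. (1), the following minimal sketch computes the batch-averaged cross-entropy in plain Python. It assumes one-hot ground-truth labels given as class indices and assumes that the predictions $\hat{y}_{ni}$ are obtained by a softmax over the classification logits; the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of C logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(batch_logits, batch_labels):
    """Eq. (1): average over N samples of -sum_i y_ni * log(y_hat_ni).

    With one-hot labels, the inner sum reduces to -log of the probability
    assigned to the true class.
    """
    total = 0.0
    for logits, label in zip(batch_logits, batch_labels):
        probs = softmax(logits)
        total += -math.log(probs[label])
    return total / len(batch_labels)


if __name__ == "__main__":
    # Uniform logits over C = 4 classes give a loss of log(4).
    print(cross_entropy([[0.0, 0.0, 0.0, 0.0]], [2]), math.log(4))
```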

Meanwhile, for the distillation objective, we utilise Kullback–Leibler divergence [Kim et al.(2021)Kim, Oh, Kim, Cho, and Yun]:

$$\mathcal{L}_{dist}=\frac{1}{N}\sum_{n=1}^{N}\sum_{i=1}^{C}P^{t}_{ni}\log\left(\frac{P^{t}_{ni}}{Q^{s}_{ni}}\right) \tag{2}$$

where $P^{t}_{ni}$ and $Q^{s}_{ni}$ are feature distributions from target and source distillation branches.

In our experiments, we apply the softmax function to the feature outputs to convert them into normalised probability distributions. The main intuition behind this decision is to force the model to produce similar feature distributions for different regions of the same image. Finally, following a widely accepted way of combining objectives [Shu et al.(2022)Shu, Yu, Xu, and Liu, Kurtulus et al.(2023)Kurtulus, Li, Dauphin, and Cubuk, Demidov et al.(2024)Demidov, Al Majzoub, Kumar, and Khan], our aggregated loss function $\mathcal{L}_{agg}$ , consisting of the classification and distillation components, is:

$$\mathcal{L}_{agg}=\mathcal{L}_{main}+\alpha\cdot\mathcal{L}_{dist} \tag{3}$$

where $\alpha$ is a hyper-parameter determining the weight of the distillation loss in the final objective (for more details and ablation analysis refer to Section 4.3).
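To make Eqs. (2) and (3) concrete, the sketch below computes the KL-based distillation term on softmax-normalised feature vectors and combines it with a precomputed classification loss. It is a plain-Python illustration under stated assumptions: the function names, the small `eps` guard, and the example value of `alpha` are hypothetical, and whether gradients flow through the target branch is not covered here.

```python
import math


def softmax(features):
    """Numerically stable softmax turning a feature vector into a distribution."""
    peak = max(features)
    exps = [math.exp(v - peak) for v in features]
    total = sum(exps)
    return [e / total for e in exps]


def kl_distillation(target_feats, source_feats, eps=1e-12):
    """Eq. (2): batch mean of sum_i P_ni * log(P_ni / Q_ni).

    P comes from the target branch and Q from the source branch;
    eps is an assumed guard against log(0) after numerical underflow.
    """
    total = 0.0
    for target, source in zip(target_feats, source_feats):
        p = softmax(target)
        q = softmax(source)
        total += sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    return total / len(target_feats)


def aggregated_loss(loss_main, loss_dist, alpha):
    """Eq. (3): L_agg = L_main + alpha * L_dist."""
    return loss_main + alpha * loss_dist


if __name__ == "__main__":
    target = [[1.0, 2.0, 3.0]]
    source = [[3.0, 2.0, 1.0]]
    loss_dist = kl_distillation(target, source)
    # loss_main = 1.5 and alpha = 0.5 are placeholder values for illustration only.
    print(loss_dist, aggregated_loss(1.5, loss_dist, alpha=0.5))
```

The divergence is zero when both branches produce the same distribution and grows as the source distribution drifts away from the target one, which is what makes it act as a consistency regulariser between views.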

4 Experiments

4.1 Experimental Details

Datasets. We explore the properties of our approach on the following popular FGIC datasets: CUB-200-2011 [Wah et al.(2011)Wah, Branson, Welinder, Perona, and Belongie], Stanford Cars [Krause et al.(2013)Krause, Deng, Stark, and Fei-Fei], and FGVC-Aircraft [Maji et al.(2013)Maji, Kannala, Rahtu, Blaschko, and Vedaldi]. These datasets were chosen due to their balanced class sizes and a similar number of images per class, which allows for a fairer evaluation (for details refer to Appendix B.1). In order to imitate low-data regimes, we strictly follow the sampling proposed in [Shu et al.(2022)Shu, Yu, Xu, and Liu], where samples are randomly drawn from the training set to provide 10%, 15%, 30%, and 50% of the data.
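A low-data split of this kind could be simulated as in the sketch below. Note the assumptions: the text only states that samples are drawn randomly from the training set, so the per-class (stratified) sampling, the rounding of the per-class count, the fixed seed, and the function name are illustrative choices rather than the protocol of the cited work.

```python
import random
from collections import defaultdict


def sample_low_data_subset(labels, fraction, seed=0):
    """Return sorted indices of a random subset keeping about `fraction` of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    subset = []
    for label in sorted(by_class):
        indices = by_class[label]
        # Keep at least one sample per class (assumed safeguard).
        keep = max(1, round(len(indices) * fraction))
        subset.extend(rng.sample(indices, keep))
    return sorted(subset)


if __name__ == "__main__":
    # Toy label list: 200 classes with 30 training images each.
    labels = [class_id for class_id in range(200) for _ in range(30)]
    subset = sample_low_data_subset(labels, fraction=0.10)
    print(len(subset))
```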

Baselines. In order to provide a fair comparison, in our experiments we utilise methods based on popular CNN and ViT architectures, where for all models (including ours) we perform transfer learning by fine-tuning. For the vanilla baselines we also consider two setups: traditional fine-tuning and advanced fine-tuning with our proposed training recipe designed for low-data learning. For the comparison, we first consider methods specifically designed for the fine-grained classification task, such as: Full Bilinear Pooling (FBP) [Lin et al.(2015)Lin, RoyChowdhury, and Maji], Compact Bilinear Pooling with Tensor Sketch projection (CBP-TS) [Gao et al.(2016)Gao, Beijbom, Zhang, and Darrell], Hierarchical Bilinear Pooling (HBP) [Yu et al.(2018)Yu, Zhao, Zheng, Zhang, and You], Deep Bilinear Transformation (DBTNet-50) [Zheng et al.(2019)Zheng, Fu, Zha, and Luo], and FFVT [Wang et al.(2021)Wang, Yu, and Gao]. As a main competitor, we evaluate the first and currently the only specifically designed approach for low-data in FGIC [Shu et al.(2022)Shu, Yu, Xu, and Liu], which is a CNN-based model with a self-boosting attention mechanism, by preserving the authors’ codebase and training hyper-parameters. For our approach, we similarly use a family of ResNet models [He et al.(2016)He, Zhang, Ren, and Sun] as a main backbone along with such other popular baselines as: GoogLeNet [Szegedy et al.(2015)Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, and Rabinovich], DenseNet [Huang et al.(2016)Huang, Liu, and Weinberger], Inception v3 [Szegedy et al.(2016)Szegedy, Vanhoucke, Ioffe, Shlens, and Wojna], and ViT [Dosovitskiy et al.(2020)Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Dehghani, Minderer, Heigold, Gelly, et al.]. All considered models were standardly pre-trained on ImageNet and further fine-tuned on the above-mentioned fine-grained datasets for both vanilla baselines and our AD-Net.

Implementation Details. Compared to the traditional fine-tuning procedure, initially designed for a sufficient amount of samples per class, our proposed training recipe includes several adaptations for the low-data regimes (for application guidelines refer to Appendix D.1). First, following a common idea of increasing the main learning rate value $lr$ by a fixed ratio for the last classification layer $lr_{cls}$ , we automatically update this ratio depending on the percentage of the available data $\mathcal{D}_{aval}$ with the following rule:

$$lr_{cls}=lr\cdot\left(\mathrm{round}\left(\frac{100}{\mathcal{D}_{aval}}\cdot 2\right)/2\right). \tag{4}$$

This assumption is valid in the simulated setup due to similar dataset sizes, but it can be further extended by replacing the available data percentage with the number of images per class adjusted to the number of classes. Additionally, we utilise a learning rate scheduler with downscaling $lr$ for 5 iterations $iter\in[1:5]$ by $0.1$ at each $(steps_{max}\cdot 0.5^{iter})$ training step. For other implementation details and training hyper-parameters refer to Appendix B.2.
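The learning-rate rule of Eq. (4) and the scheduler can be sketched as below. This is a literal reading of the text rather than the released training code: half-up rounding, truncating milestones to integer steps, and the function names are assumptions, and any base learning rate passed in is a placeholder.

```python
import math


def classifier_lr(base_lr, data_percent):
    """Eq. (4): scale the base learning rate by round(100 / D_aval * 2) / 2."""
    # Half-up rounding is assumed; Python's built-in round() rounds ties to even.
    ratio = math.floor(100.0 / data_percent * 2 + 0.5) / 2
    return base_lr * ratio


def lr_milestones(max_steps, num_drops=5):
    """Steps max_steps * 0.5**iter for iter = 1..5, sorted in increasing order."""
    return sorted(int(max_steps * 0.5 ** i) for i in range(1, num_drops + 1))


def lr_at_step(base_lr, step, max_steps, gamma=0.1):
    """Multiply the learning rate by gamma once for every milestone already reached."""
    drops = sum(1 for milestone in lr_milestones(max_steps) if step >= milestone)
    return base_lr * gamma ** drops


if __name__ == "__main__":
    for percent in (10, 15, 30, 50):
        # With base_lr = 1.0 the result is the ratio itself: 10.0, 6.5, 3.5, 2.0.
        print(percent, classifier_lr(1.0, percent))
    print(lr_milestones(3200))
```

For the four data percentages used in the experiments, the ratio works out to 10, 6.5, 3.5, and 2, so the classification layer is trained with a proportionally larger learning rate as the data becomes scarcer.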

Table 1: Comparison of different approaches using various percentages of the data on three popular FGIC datasets. Our proposed solution achieves consistent improvement in performance over other methods across low data settings. Best results are highlighted in bold.

Dataset	Method	10%	15%	30%	50%
CUB-200-2011 [Wah et al.(2011)Wah, Branson, Welinder, Perona, and Belongie]	ResNet-50	36.99	48.88	62.60	73.23
	FBP	37.88	49.12	63.27	73.70
	CBP-TS	37.12	47.82	62.24	72.37
	HBP	38.57	50.12	63.86	74.18
	DBTNet-50	37.67	49.52	63.16	73.28
	SAM (ResNet-50)	40.24	52.05	64.07	73.92
	SAM (FBP)	41.83	52.35	65.19	74.54
	Ours (ResNet-50)	47.51	60.08	71.11	77.67
Stanford Cars [Krause et al.(2013)Krause, Deng, Stark, and Fei-Fei]	ResNet-50	37.45	53.01	75.26	83.56
	FBP	40.13	55.07	76.42	85.10
	CBP-TS	37.77	54.87	75.51	84.80
	HBP	40.02	55.82	76.81	85.31
	DBTNet-50	39.48	55.24	76.52	86.52
	SAM (ResNet-50)	39.96	55.02	76.69	84.85
	SAM (FBP)	43.19	57.42	77.63	85.71
	Ours (ResNet-50)	55.09	67.42	81.53	87.41
FGVC-Aircraft [Maji et al.(2013)Maji, Kannala, Rahtu, Blaschko, and Vedaldi]	ResNet-50	43.52	53.17	71.32	78.61
	FBP	45.16	55.06	72.12	79.93
	CBP-TS	44.63	54.79	71.32	79.60
	HBP	45.28	56.12	72.58	81.47
	DBTNet-50	45.35	56.36	73.06	81.26
	SAM (ResNet-50)	46.73	56.02	72.59	79.21
	SAM (FBP)	47.97	57.47	73.43	80.86
	Ours (ResNet-50)	55.81	62.59	74.44	81.73

4.2 Results and Analysis

Quantitative Analysis. In order to thoroughly investigate our proposed method, we conduct experiments on different label proportions of the popular FGIC benchmarks and further compare the results with the state-of-the-art approaches. Results in Table 1 demonstrate that our self-distillation component significantly improves the performance of the vanilla ResNet-50 across all of the sampled percentages on all considered datasets, while also outperforming other SOTA methods in the low-data settings. Specifically, with the smallest data available, our framework shows an outstanding relative accuracy increase of up to 45% compared to standard ResNet-50 and up to 27% compared to the closest SOTA runner-up. Our solution also outperforms the SAM framework [Shu et al.(2022)Shu, Yu, Xu, and Liu], which is specifically designed for low-data fine-grained classification, while not introducing any extra resource-consuming and architecture-specific module. It can be observed that the most significant improvement is achieved with the smallest number of samples per class available, while the relative performance increase goes down with more data used for training. This could be explained by the overall data saturation at larger data percentages, a behaviour shared by every considered method. Nevertheless, the benchmark analysis demonstrates the ability of our solution to adapt to setups with limited data by performing feature refinement through self-distillation on the existing samples. For a detailed comparison including other low-performing baselines refer to Appendix C.4.1. Additionally, we demonstrate the high transferability of our approach by utilising it on top of most of the popular CNN- and ViT-based backbones (see Appendix C.4.2 for details). The absolute improvement varies between 3-10%, showcasing that distilling local augmentations is practically an architecture-independent technique.

Qualitative Analysis. In order to analyse the reasons behind the significant performance improvement with our approach, we examine the model's performance under different levels of noise added to the input images. Specifically, we evaluate the model's prediction uncertainty by using Monte Carlo Dropout [Gal and Ghahramani(2016)]. We first sample a random image from the validation set and create a set of 1000 perturbed copies of the image by adding Gaussian noise with varying intensity. We then perform a forward pass using vanilla ResNet and our AD-Net for each image in the set (comparison illustrated in Figure 3). We can observe that while the vanilla model's predictions have a high false positive rate on noisy samples, the predictions of our AD-Net have lower dispersion and are more stable due to more robust purified features. For more qualitative analysis and visualisations refer to Appendix C.1 and C.6.

4.3 Ablation Study

Architecture Design. In order to justify the chosen multi-branch pipeline, we provide an architecture design ablation for our framework (see Table 2). Specifically, we first measure the performance of the standard classification branch with the naive fine-tuning procedure. Next, we check the performance of this model with our advanced low-data training recipe, which includes an adaptive learning rate ratio for the final classification layer and a scheduler. Further, we add a single distillation branch which is fed with a small cropped region and then feature distillation is performed between the classification and distillation feature outputs. Finally, we show the performance of our best solution with one target and one source distillation branches which are fed with corresponding larger target crop and smaller source crop in order to perform a separate self-distillation procedure.

Table 2: Ablation results on our framework architecture design and training recipe. The experiments are performed on ResNet-50 and the CUB dataset with 10% of the data.

Distillation	Our recipe	Augmentation	Acc, %
-	$\times$	Basic	36.99
-	✓	Basic	40.05
Single-branch	✓	ScaleMix	45.98
	✓	MultiCrop	46.22
	✓	Basic	47.09
Double-branch	✓	ScaleMix	46.11
	✓	MultiCrop	45.91
	✓	AsymAug	47.18
(Ours)	✓	Basic	47.51

Interestingly, the application of advanced augmentation techniques, such as ScaleMix [Wang et al.(2022)Wang, Fan, Tian, Kihara, and Chen], MultiCrop [Caron et al.(2020a)Caron, Misra, Mairal, Goyal, Bojanowski, and Joulin], and AsymAug [Wang et al.(2022)Wang, Fan, Tian, Kihara, and Chen], within our self-distillation branches demonstrated some gains but did not outperform the basic augmentations. We assume that this could be explained by the strong nature of these augmentations, whose harsh perturbations may be less efficient in this setting. For more ablation analysis and limitations refer to Appendix C.2, C.3, and C.5.

5 Conclusion

Our proposed AD-Net framework demonstrates exceptional results on the fine-grained image classification task under low-data regimes, which typically requires the identification of local and subtle class-specific features. This is achieved by integrating advanced augmentation techniques and a distinctive self-distillation strategy, wherein a single model processes both the original image and its transformed views. This multi-input approach enhances the model’s ability to recognise and differentiate intricate features unique to each class.

Through comprehensive experiments, we have established that our model surpasses existing state-of-the-art techniques by a significant margin, particularly excelling in scenarios with the most limited data (see Appendix C.6). Specifically, with the smallest data available, our AD-Net shows an outstanding relative accuracy increase of up to 45% compared to standard ResNet-50 and up to 27% compared to the closest SOTA runner-up. This success is attributed to the synergistic combination of augmentations and a tailored objective function, which collectively improve the model's learning quality. In detail, we explain this noticeable performance gain by the effect of the self-distillation objective, which provides a separate and more detailed source of information by enforcing feature space alignment for the augmented views of the same input image. Such a setup allows our AD-Net to avoid quick overfitting typical for the standard fine-tuning procedure (refer to Appendix C.1).

We also emphasise that our proposed framework is practically architecture-independent since it requires only the final feature representation output from the utilised baseline. Along with zero extra costs at inference time, the above-mentioned benefits make our AD-Net framework highly suitable for low-data fine-grained image classification problems.

References

[Branson et al.(2014)Branson, Van Horn, Belongie, and Perona] Steve Branson, Grant Van Horn, Serge Belongie, and Pietro Perona. Bird species categorization using pose normalized deep convolutional nets. arXiv preprint arXiv:1406.2952, 2014.
[Caron et al.(2020a)Caron, Misra, Mairal, Goyal, Bojanowski, and Joulin] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. Advances in neural information processing systems, 33:9912–9924, 2020a.
[Caron et al.(2020b)Caron, Misra, Mairal, Goyal, Bojanowski, and Joulin] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, Red Hook, NY, USA, 2020b. Curran Associates Inc. ISBN 9781713829546.
[Caron et al.(2021)Caron, Touvron, Misra, Jégou, Mairal, Bojanowski, and Joulin] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers, 2021.
[Chang et al.(2020)Chang, Ding, Xie, Bhunia, Li, Ma, Wu, Guo, and Song] Dongliang Chang, Yifeng Ding, Jiyang Xie, Ayan Kumar Bhunia, Xiaoxu Li, Zhanyu Ma, Ming Wu, Jun Guo, and Yi-Zhe Song. The devil is in the channels: Mutual-channel loss for fine-grained image classification. IEEE Transactions on Image Processing, 29:4683–4695, 2020.
[Chen et al.(2020)Chen, Kornblith, Swersky, Norouzi, and Hinton] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. Big self-supervised models are strong semi-supervised learners, 2020.
[Chou et al.(2022)Chou, Lin, and Kao] Po-Yung Chou, Cheng-Hung Lin, and Wen-Chung Kao. A novel plug-in module for fine-grained visual classification. arXiv preprint arXiv:2202.03822, 2022.
[Cubuk et al.(2018)Cubuk, Zoph, Mane, Vasudevan, and Le] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.
[Cubuk et al.(2020)Cubuk, Zoph, Shlens, and Le] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 702–703, 2020.
[Darvish et al.(2022)Darvish, Pouramini, and Bahador] Mahdi Darvish, Mahsa Pouramini, and Hamid Bahador. Towards fine-grained image classification with generative adversarial networks and facial landmark detection. In 2022 International Conference on Machine Vision and Image Processing (MVIP), pages 1–6. IEEE, 2022.
[Demidov. et al.(2023)Demidov., Sharif., Abdurahimov., Cholakkal., and Khan.] Dmitry Demidov, Muhammad Sharif, Aliakbar Abdurahimov, Hisham Cholakkal, and Fahad Khan. Salient mask-guided vision transformer for fine-grained classification. In Proceedings of the 18th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2023) - Volume 4: VISAPP, pages 27–38. INSTICC, SciTePress, 2023. ISBN 978-989-758-634-7. 10.5220/0011611100003417.
[Demidov et al.(2024)Demidov, Al Majzoub, Kumar, and Khan] Dmitry Demidov, Roba Al Majzoub, Amandeep Kumar, and Fahad Khan. Distilling local texture features for colorectal tissue classification in low data regimes. In Xiaohuan Cao, Xuanang Xu, Islem Rekik, Zhiming Cui, and Xi Ouyang, editors, Machine Learning in Medical Imaging, pages 357–366, Cham, 2024. Springer Nature Switzerland. ISBN 978-3-031-45676-3.
[Deng et al.(2009)Deng, Dong, Socher, Li, Li, and Fei-Fei] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009. 10.1109/CVPR.2009.5206848.
[Dosovitskiy et al.(2020)Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Dehghani, Minderer, Heigold, Gelly, et al.] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[Flores et al.(2019)Flores, Gonzalez-Garcia, van de Weijer, and Raducanu] Carola Figueroa Flores, Abel Gonzalez-Garcia, Joost van de Weijer, and Bogdan Raducanu. Saliency for fine-grained object recognition in domains with scarce training data. Pattern Recognition, 94:62–73, 2019.
[Gal and Ghahramani(2016)] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning, 2016.
[Gao et al.(2016)Gao, Beijbom, Zhang, and Darrell] Yang Gao, Oscar Beijbom, Ning Zhang, and Trevor Darrell. Compact bilinear pooling. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 317–326, 2016. 10.1109/CVPR.2016.41.
[Grill et al.(2020)Grill, Strub, Altché, Tallec, Richemond, Buchatskaya, Doersch, Pires, Guo, Azar, Piot, Kavukcuoglu, Munos, and Valko] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap your own latent: A new approach to self-supervised learning, 2020.
[He et al.(2021)He, Chen, Liu, Kortylewski, Yang, Bai, Wang, and Yuille] Ju He, Jieneng Chen, Shuai Liu, Adam Kortylewski, Cheng Yang, Yutong Bai, Changhu Wang, and Alan Loddon Yuille. Transfg: A transformer architecture for fine-grained recognition. In AAAI Conference on Artificial Intelligence, 2021. URL https://api.semanticscholar.org/CorpusID:232233178.
[He et al.(2016)He, Zhang, Ren, and Sun] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[Hoffer et al.(2019)Hoffer, Ben-Nun, Hubara, Giladi, Hoefler, and Soudry] Elad Hoffer, Tal Ben-Nun, Itay Hubara, Niv Giladi, Torsten Hoefler, and Daniel Soudry. Augment your batch: better training with larger batches. arXiv preprint arXiv:1901.09335, 2019.
[Horn and Perona(2017)] Grant van Horn and Pietro Perona. The devil is in the tails: Fine-grained classification in the wild. arXiv preprint arXiv:1709.01450, 2017.
[Huang et al.(2016)Huang, Liu, and Weinberger] Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. Densely connected convolutional networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269, 2016. URL https://api.semanticscholar.org/CorpusID:9433631.
[Hui and Belkin(2021)] Like Hui and Mikhail Belkin. Evaluation of neural architectures trained with square loss vs cross-entropy in classification tasks, 2021.
[Jaiswal et al.(2021)Jaiswal, Babu, Zadeh, Banerjee, and Makedon] Ashish Jaiswal, Ashwin Ramesh Babu, Mohammad Zaki Zadeh, Debapriya Banerjee, and Fillia Makedon. A survey on contrastive self-supervised learning, 2021.
[Kim et al.(2021)Kim, Oh, Kim, Cho, and Yun] Taehyeon Kim, Jaehoon Oh, NakYil Kim, Sangwook Cho, and Se-Young Yun. Comparing kullback-leibler divergence and mean squared error loss in knowledge distillation, 2021.
[Krause et al.(2013)Krause, Deng, Stark, and Fei-Fei] Jonathan Krause, Jia Deng, Michael Stark, and Li Fei-Fei. Collecting a large-scale dataset of fine-grained cars. 2013. URL https://api.semanticscholar.org/CorpusID:16632981.
[Kurtulus et al.(2023)Kurtulus, Li, Dauphin, and Cubuk] Emirhan Kurtulus, Zichao Li, Yann Dauphin, and Ekin D. Cubuk. Tied-augment: controlling representation similarity improves data augmentation. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023.
[Lagunas et al.(2023)Lagunas, Impata, Martinez, Fernandez, Georgakis, Braun, and Bertrand] Manuel Lagunas, Brayan Impata, Victor Martinez, Virginia Fernandez, Christos Georgakis, Sofia Braun, and Felipe Bertrand. Transfer learning for fine-grained classification using semi-supervised learning and visual transformers. arXiv preprint arXiv:2305.10018, 2023.
[Lin et al.(2015)Lin, RoyChowdhury, and Maji] Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji. Bilinear cnn models for fine-grained visual recognition. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 1449–1457, 2015. 10.1109/ICCV.2015.170.
[Lu et al.(2022)Lu, Lu, Yu, and Wang] Ziqian Lu, Zheming Lu, Yunlong Yu, and Zonghui Wang. Learn more from less: Generalized zero-shot learning with severely limited labeled data. Neurocomputing, 477:25–35, 2022. ISSN 0925-2312. https://doi.org/10.1016/j.neucom.2022.01.007. URL https://www.sciencedirect.com/science/article/pii/S0925231222000078.
[Maji et al.(2013)Maji, Kannala, Rahtu, Blaschko, and Vedaldi] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. Technical report, 2013.
[Mao et al.(2023)Mao, Mohri, and Zhong] Anqi Mao, Mehryar Mohri, and Yutao Zhong. Cross-entropy loss functions: Theoretical analysis and applications, 2023.
[Mazumder et al.(2021)Mazumder, Singh, and Namboodiri] Pratik Mazumder, Pravendra Singh, and Vinay P. Namboodiri. Fair visual recognition in limited data regime using self-supervision and self-distillation, 2021.
[Radford et al.(2021)Radford, Kim, Hallacy, Ramesh, Goh, Agarwal, Sastry, Askell, Mishkin, Clark, et al.] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[Ramdan et al.(2020)Ramdan, Heryana, Arisal, Kusumo, and Pardede] Ade Ramdan, Ana Heryana, Andria Arisal, R. Budiarianto S. Kusumo, and Hilman F. Pardede. Transfer learning and fine-tuning for deep learning-based tea diseases detection on small datasets. In 2020 International Conference on Radar, Antenna, Microwave, Electronics, and Telecommunications (ICRAMET), pages 206–211, 2020.
[Schmarje et al.(2021)Schmarje, Santarossa, Schröder, and Koch] Lars Schmarje, Monty Santarossa, Simon-Martin Schröder, and Reinhard Koch. A survey on semi-, self-and unsupervised learning for image classification. IEEE Access, 9:82146–82168, 2021.
[Shu et al.(2022)Shu, Yu, Xu, and Liu] Yangyang Shu, Baosheng Yu, Haiming Xu, and Lingqiao Liu. Improving fine-grained visual recognition in low data regimes via self-boosting attention mechanism. In European Conference on Computer Vision, pages 449–465. Springer, 2022.
[Simonyan and Zisserman(2014)] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[Sultana et al.(2022)Sultana, Naseer, Khan, Khan, and Khan] Maryam Sultana, Muzammal Naseer, Muhammad Haris Khan, Salman Khan, and Fahad Shahbaz Khan. Self-distilled vision transformer for domain generalization, 2022.
[Szegedy et al.(2015)Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, and Rabinovich] C. Szegedy, Wei Liu, Yangqing Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, Los Alamitos, CA, USA, jun 2015. IEEE Computer Society. 10.1109/CVPR.2015.7298594. URL https://doi.ieeecomputersociety.org/10.1109/CVPR.2015.7298594.
[Szegedy et al.(2016)Szegedy, Vanhoucke, Ioffe, Shlens, and Wojna] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826, 2016. 10.1109/CVPR.2016.308.
[Tang et al.(2022)Tang, Yuan, Li, and Tang] Hao Tang, Chengcheng Yuan, Zechao Li, and Jinhui Tang. Learning attention-guided pyramidal features for few-shot fine-grained recognition. Pattern Recognition, 130:108792, 2022.
[Touvron et al.(2021)Touvron, Cord, Douze, Massa, Sablayrolles, and Jegou] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herve Jegou. Training data-efficient image transformers & distillation through attention. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 10347–10357. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/touvron21a.html.
[Touvron et al.(2022)Touvron, Cord, and Jégou] Hugo Touvron, Matthieu Cord, and Hervé Jégou. Deit iii: Revenge of the vit, 2022.
[Wah et al.(2011)Wah, Branson, Welinder, Perona, and Belongie] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
[Wang et al.(2021)Wang, Yu, and Gao] Jun Wang, Xiaohan Yu, and Yongsheng Gao. Feature fusion vision transformer for fine-grained visual categorization. In British Machine Vision Conference, 2021. URL https://api.semanticscholar.org/CorpusID:235742913.
[Wang and Qi(2022)] Xiao Wang and Guo-Jun Qi. Contrastive learning with stronger augmentations. IEEE transactions on pattern analysis and machine intelligence, 45(5):5549–5560, 2022.
[Wang et al.(2022)Wang, Fan, Tian, Kihara, and Chen] Xiao Wang, Haoqi Fan, Yuandong Tian, Daisuke Kihara, and Xinlei Chen. On the importance of asymmetry for siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16570–16579, 2022.
[Xu et al.(2021)Xu, Zhang, Hu, Wang, Wang, Wei, Bai, and Liu] Mengde Xu, Zheng Zhang, Han Hu, Jianfeng Wang, Lijuan Wang, Fangyun Wei, Xiang Bai, and Zicheng Liu. End-to-end semi-supervised object detection with soft teacher. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3060–3069, 2021.
[Yang et al.(2022)Yang, Song, King, and Xu] Xiangli Yang, Zixing Song, Irwin King, and Zenglin Xu. A survey on deep semi-supervised learning. IEEE Transactions on Knowledge and Data Engineering, 2022.
[Yu et al.(2018)Yu, Zhao, Zheng, Zhang, and You] Chaojian Yu, Xinyi Zhao, Qi Zheng, Peng Zhang, and Xinge You. Hierarchical bilinear pooling for fine-grained visual recognition. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss, editors, Computer Vision – ECCV 2018, pages 595–610, Cham, 2018. Springer International Publishing. ISBN 978-3-030-01270-0.
[Yun et al.(2019)Yun, Han, Oh, Chun, Choe, and Yoo] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6023–6032, 2019.
[Zbontar et al.(2021)Zbontar, Jing, Misra, LeCun, and Deny] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction, 2021.
[Zhang et al.(2014)Zhang, Donahue, Girshick, and Darrell] Ning Zhang, Jeff Donahue, Ross Girshick, and Trevor Darrell. Part-based r-cnns for fine-grained category detection. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I 13, pages 834–849. Springer, 2014.
[Zhang and Sabuncu(2018)] Zhilu Zhang and Mert R. Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels, 2018.
[Zheng et al.(2019)Zheng, Fu, Zha, and Luo] Heliang Zheng, Jianlong Fu, Zheng-Jun Zha, and Jiebo Luo. Learning deep bilinear transformation for fine-grained image representation. Curran Associates Inc., Red Hook, NY, USA, 2019.
[Zhuang et al.(2020)Zhuang, Wang, and Qiao] Peiqin Zhuang, Yali Wang, and Yu Qiao. Learning attentive pairwise interaction for fine-grained classification. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 13130–13137, 2020.

Appendix A Additional Field Overview

A.1 Efficient Learning

Augmentations. Augmentation techniques are crucial in advancing deep learning models [Wang et al.(2022)Wang, Fan, Tian, Kihara, and Chen], providing strategies that enhance training efficiency, improve generalisation, and bolster model robustness. One recent foundational work is batch augmentation [Hoffer et al.(2019)Hoffer, Ben-Nun, Hubara, Giladi, Hoefler, and Soudry], a powerful strategy that utilises large batches comprising multiple transformations of each sample. This not only accelerates training by reducing the number of stochastic gradient descent (SGD) updates but also acts as a regulariser, leading to improved generalisation. Another popular approach is RandAugment [Cubuk et al.(2020)Cubuk, Zoph, Shlens, and Le], designed for automated data augmentation with a narrowed search space and a simple parameterisation. This method consistently outperforms previous automated augmentation techniques, demonstrating its efficiency across various tasks and datasets. Recent advances in contrastive learning also demonstrate that the so-called Stronger Augmentations [Wang and Qi(2022)] significantly enhance contrastive learning by enforcing distributional divergence between images augmented with such "strong" perturbations as random cropping and flipping. Addressing limitations associated with regional dropout strategies, CutMix [Yun et al.(2019)Yun, Han, Oh, Chun, Choe, and Yoo] involves cutting and pasting patches among training images. This ensures information preservation, promotes object localisation capabilities, and improves the model’s resilience against input corruption.

Appendix B Experimental Details

B.1 Datasets

Table 3: The details of three fine-grained visual classification datasets used for the experiments.

Dataset	Categories	Classes	Images
CUB-200-2011 [Wah et al.(2011)Wah, Branson, Welinder, Perona, and Belongie]	Birds	200	11,788
Stanford Cars [Krause et al.(2013)Krause, Deng, Stark, and Fei-Fei]	Cars	196	16,185
FGVC-Aircraft [Maji et al.(2013)Maji, Kannala, Rahtu, Blaschko, and Vedaldi]	Airplanes	102	10,200

B.2 Implementation Details

For our self-distillation part, after performing random sampling from the input image, we resize both the target and source sampled regions to $224\times 224$ pixels. The motivation is to expose the model to a wider range of scales of the input images so that it can learn more scale-invariant representations, which are usually assumed to be already present in standard-sized datasets. Our training setup includes the standard SGD optimiser with a momentum of 0.9, a learning rate of 0.03, and a training batch size of 24 for all datasets. All experiments have been conducted on a single NVIDIA RTX 6000 GPU using the PyTorch framework and the APEX utility for mixed precision training.
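As an illustration of the sampling step described above, the following minimal sketch draws a random crop box from an image and computes the factors needed to resize it to $224\times 224$. The function names and the crop area range `scale=(0.3, 1.0)` are hypothetical choices made for this example and are not taken from our implementation; only the values in `TRAIN_CONFIG` are stated in the text.

```python
import random

# Hyper-parameters stated in Section B.2 (the dictionary layout is illustrative).
TRAIN_CONFIG = {
    "optimizer": "SGD",
    "momentum": 0.9,
    "learning_rate": 0.03,
    "batch_size": 24,
    "input_size": 224,
}


def sample_crop_box(width, height, scale=(0.3, 1.0), rng=random):
    """Return (left, top, right, bottom) of a random crop inside the image.

    `scale` is an ASSUMED range for the crop area relative to the full image;
    the crop keeps the aspect ratio of the original image.
    """
    area_fraction = rng.uniform(*scale)
    side_fraction = area_fraction ** 0.5
    crop_w = max(1, int(round(width * side_fraction)))
    crop_h = max(1, int(round(height * side_fraction)))
    left = rng.randint(0, width - crop_w)
    top = rng.randint(0, height - crop_h)
    return left, top, left + crop_w, top + crop_h


def resize_factors(box, out_size=TRAIN_CONFIG["input_size"]):
    """Horizontal and vertical scaling that maps the crop to out_size x out_size."""
    left, top, right, bottom = box
    return out_size / (right - left), out_size / (bottom - top)


if __name__ == "__main__":
    rng = random.Random(0)
    # Two independently sampled regions, e.g. for the target and source branches.
    for name in ("target", "source"):
        box = sample_crop_box(500, 375, rng=rng)
        print(name, box, resize_factors(box))
```

Because both regions are resized to the same resolution, smaller crops are magnified more strongly, which is one simple way to present the same object at several scales.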

Appendix C Additional Analysis

C.1 Qualitative Analysis

In order to analyse the reason behind the significant performance improvement with our approach, we provide a direct comparison of training behaviour for a fine-tuned vanilla ResNet-50 and our self-distilled AD-Net with the same backbone. It can be observed from Figure 4 (left) that the model trained with the standard fine-tuning procedure tends to overfit the limited data samples quickly and does not allow further refinement, while our framework is able to avoid overfitting more effectively. This can be explained by the effect of the additional distillation objective, which introduces an independent and more detailed source of information by enforcing feature space alignment for augmented views of the same image.

This specific effect is demonstrated in Figure 4 (right), where we provide the evolution of both classification and distillation objectives separately. Specifically, categorisation loss gets saturated quickly and becomes insignificant after the first 10% of training time, while our self-distillation component provides a noticeable effect throughout the whole training process. This is especially useful in low-data regimes, where models tend to quickly overfit to the main classification loss and do not obtain significant update signals, which is not the case for our proposed multi-component objective function.

C.2 Ablation Study

Distillation Loss. To rigorously evaluate the impact of the distillation loss within our framework, a series of experiments was conducted on the CUB 10% dataset. First, in order to find the most effective type of distillation loss, we explore various objective functions applied to different types of outputs. In Table 4 we illustrate the variance in performance among different loss functions when applied to measure the disparity between features or logits coming from the target and source distillation branches in our architecture. Notably, the Cross Entropy and Focal Loss functions, when applied to the output logits of both branches, demonstrate inferior performance. In contrast, computing the loss on the output features (converted to normalised distributions with softmax) is more effective, with KL divergence showing the best results compared to the L1 and L2 objectives. One plausible explanation for the superior efficacy of KL divergence over other loss functions could be its ability to compare more sophisticated abstractions coming from differently augmented views.

Table 4: Effects caused by the type of distillation loss on the final metric. KL divergence computed between feature outputs of distillation branches shows the best performance.

Loss	Features	Logits
Cross Entropy	-	39.09
Focal Loss	-	39.32
L1 (MAE)	43.96	-
L2 (MSE)	41.67	-
Kullback–Leibler (KL)	47.51	-
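To make the feature-level objectives compared in Table 4 concrete, the sketch below converts two feature vectors to normalised distributions with softmax and evaluates the KL, L1, and L2 discrepancies between them. It is a plain-Python illustration rather than our training code; the function names, the toy feature values, and the direction of the divergence (target relative to source) are assumptions made for this example.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def l1_distance(p, q):
    """Mean absolute error between two distributions."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q)) / len(p)


def l2_distance(p, q):
    """Mean squared error between two distributions."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q)) / len(p)


def distillation_losses(target_features, source_features):
    """Compare softmax-normalised features of the two distillation branches.

    ASSUMPTION: the target branch plays the role of the reference
    distribution p and the source branch the role of q.
    """
    p = softmax(target_features)
    q = softmax(source_features)
    return {
        "kl": kl_divergence(p, q),
        "l1": l1_distance(p, q),
        "l2": l2_distance(p, q),
    }


if __name__ == "__main__":
    # Toy feature vectors standing in for the two branch outputs.
    print(distillation_losses([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
```

All three quantities vanish when the two branches produce identical features and grow as the distributions diverge; they differ in how strongly they penalise mismatches in low-probability dimensions.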

Following the identification of the most effective loss function for our distillation branch, we proceeded to investigate its influence on the aggregated loss, denoted as $\mathcal{L}_{agg}$ in Eq. 3, by varying the weight coefficient $\alpha$. In order to find the optimal value, we experiment with both constant and variable $\alpha$ values, and summarise the results in Table 5. Our heuristic findings reveal that the model achieves its peak performance with $\alpha$ set to 0.1.

Table 5: Study of the effect of the $\alpha$ coefficient in the aggregated loss $\mathcal{L}_{agg}$ from Eq. 3. The $\alpha$ decay is a linear decay from 1.0 to 0.01.

$\alpha$	1	0.5	0.1	0.01	$\alpha$ decay
Acc, %	45.98	46.13	47.51	44.08	45.6
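The variable-$\alpha$ setting from Table 5 can be sketched as follows. The linear decay from 1.0 to 0.01 is stated in the caption; everything else is an assumption made for illustration: the function names, the per-step granularity of the schedule, and in particular the additive form `cls_loss + alpha * distill_loss`, which stands in for Eq. 3 and may differ from its exact definition.

```python
def linear_alpha_decay(step, total_steps, start=1.0, end=0.01):
    """Linearly interpolate alpha from `start` (first step) to `end` (last step)."""
    last_step = max(total_steps - 1, 1)
    progress = min(max(step / last_step, 0.0), 1.0)
    return start + (end - start) * progress


def aggregate_loss(cls_loss, distill_loss, alpha=0.1):
    """ASSUMED form of the aggregated objective: classification + alpha * distillation."""
    return cls_loss + alpha * distill_loss


if __name__ == "__main__":
    total_steps = 5
    for step in range(total_steps):
        alpha = linear_alpha_decay(step, total_steps)
        # Toy loss values; in training these come from the model outputs.
        print(step, round(alpha, 4), round(aggregate_loss(1.0, 2.0, alpha), 4))
```

With a constant coefficient, the same `aggregate_loss` helper is simply called with a fixed `alpha`, for example the value 0.1 that performs best in Table 5.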

C.3 Limitations

Although our approach demonstrates a significant performance gain in the low-data setting, we also analyse and acknowledge its current limitations. First, due to the extra forward passes for each of the separate branches, the training time increases by approximately 35-90 % compared to vanilla fine-tuning (depending on the architecture type, see Tab. 2). However, this overhead is absent at inference time, since our solution adds zero compute and time cost after training. Second, the performance gain of our approach is inversely related to the size and diversity of a dataset (refer to Table 1), which may lead to a less significant accuracy improvement when data is abundant. Lastly, our solution requires a hyper-parameter $\alpha$ in Eq. 3 to control the influence of the distillation objective on the overall loss, and a poorly chosen value may sometimes lead to unstable training (due to the nature of the KL divergence loss). We suggest that our current heuristic choice could potentially be replaced by an independent learnable parameter.

C.4 Quantitative Analysis

C.4.1 State-of-the-art comparison

Main Results. In Table 6 we provide the full comparison of different approaches on all data percentages including full datasets.

Other Results. Additionally, in Table 7, we compare other methods potentially suitable for the low-data setting, namely SwAV [Caron et al.(2020b)Caron, Misra, Mairal, Goyal, Bojanowski, and Joulin], pre-trained in a self-supervised way and then fine-tuned, and CLIP [Radford et al.(2021)Radford, Kim, Hallacy, Ramesh, Goh, Agarwal, Sastry, Askell, Mishkin, Clark, et al.], a self-supervised vision-language model used for zero-shot inference. As can be seen, compared to our AD-Net, they demonstrate promising but unstable results.

Table 6: Comparison of different approaches using various percentages of the data on three popular FGIC datasets. Our proposed solution achieves consistent improvement in performance over other methods across low data settings. Best results are highlighted in bold.

Dataset	Method	Training data percentage
		10%	15%	30%	50%	100%
CUB-200-2011	ResNet-50	36.99	48.88	62.60	73.23	81.34
	FBP	37.88	49.12	63.27	73.70	82.52
	CBP-TS	37.12	47.82	62.24	72.37	81.48
	HBP	38.57	50.12	63.86	74.18	86.12
	DBTNet-50	37.67	49.52	63.16	73.28	86.04
	SAM (ResNet-50)	40.24	52.05	64.07	73.92	81.62
	SAM (FBP)	41.83	52.35	65.19	74.54	81.86
	Ours (ResNet-50)	47.51	60.08	71.11	77.67	82.06
Stanford Cars	ResNet-50	37.45	53.01	75.26	83.56	91.02
	FBP	40.13	55.07	76.42	85.10	91.63
	CBP-TS	37.77	54.87	75.51	84.80	89.52
	HBP	40.02	55.82	76.81	85.31	92.73
	DBTNet-50	39.48	55.24	76.52	86.52	94.32
	SAM (ResNet-50)	39.96	55.02	76.69	84.85	91.06
	SAM (FBP)	43.19	57.42	77.63	85.71	91.48
	Ours (ResNet-50)	55.09	67.42	81.53	87.41	91.96
FGVC-Aircraft	ResNet-50	43.52	53.17	71.32	78.61	87.13
	FBP	45.16	55.06	72.12	79.93	87.32
	CBP-TS	44.63	54.79	71.32	79.60	84.58
	HBP	45.28	56.12	72.58	81.47	89.74
	DBTNet-50	45.35	56.36	73.06	81.26	90.86
	SAM (ResNet-50)	46.73	56.02	72.59	79.21	86.74
	SAM (FBP)	47.97	57.47	73.43	80.86	87.46
	Ours (ResNet-50)	55.81	62.59	74.44	81.73	88.64
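For readers who wish to convert the accuracies in Table 6 into absolute and relative gains, the following sketch applies the usual definition, relative gain = 100 × (ours − reference) / reference, to the 10% column. The numbers are copied from Table 6; the dictionary layout and function names are illustrative only.

```python
# Accuracies (%) from the 10% column of Table 6.
TABLE6_10_PERCENT = {
    "CUB-200-2011": {"ResNet-50": 36.99, "SAM (FBP)": 41.83, "Ours (ResNet-50)": 47.51},
    "Stanford Cars": {"ResNet-50": 37.45, "SAM (FBP)": 43.19, "Ours (ResNet-50)": 55.09},
    "FGVC-Aircraft": {"ResNet-50": 43.52, "SAM (FBP)": 47.97, "Ours (ResNet-50)": 55.81},
}


def absolute_gain(ours, reference):
    """Difference in accuracy, in percentage points."""
    return ours - reference


def relative_gain(ours, reference):
    """Accuracy increase relative to the reference method, in percent."""
    return 100.0 * (ours - reference) / reference


if __name__ == "__main__":
    for dataset, scores in TABLE6_10_PERCENT.items():
        ours = scores["Ours (ResNet-50)"]
        for reference_name in ("ResNet-50", "SAM (FBP)"):
            reference = scores[reference_name]
            print(
                f"{dataset} vs {reference_name}: "
                f"+{absolute_gain(ours, reference):.2f} points, "
                f"{relative_gain(ours, reference):.1f} % relative"
            )
```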

Table 7: Experiments with other models on datasets with 10% of training data. Results for CLIP are obtained using zero-shot classification. Vanilla results and ours are included for comparison.

Type	Method	CUB	Cars	Air
CNN	ResNet-50 (vanilla) [He et al.(2016)He, Zhang, Ren, and Sun]	36.99	37.45	43.52
	SwAV (ResNet-50) [Caron et al.(2020b)Caron, Misra, Mairal, Goyal, Bojanowski, and Joulin]	16.91	36.19	49.49
	CLIP (ResNet-50, zero-shot) [Radford et al.(2021)Radford, Kim, Hallacy, Ramesh, Goh, Agarwal, Sastry, Askell, Mishkin, Clark, et al.]	-	55.80	19.30
	Ours (ResNet-50)	47.51	55.09	55.81
ViT	ViT-B/32 (vanilla) [Dosovitskiy et al.(2020)Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Dehghani, Minderer, Heigold, Gelly, et al.]	65.60	28.21	33.84
	TransFG (ViT-B/32) [He et al.(2021)He, Chen, Liu, Kortylewski, Yang, Bai, Wang, and Yuille]	64.91	-	-
	CLIP (ViT-B/32, zero-shot) [Radford et al.(2021)Radford, Kim, Hallacy, Ramesh, Goh, Agarwal, Sastry, Askell, Mishkin, Clark, et al.]	-	59.40	21.20
	Ours (ViT-B/32)	69.27	33.34	36.01

C.4.2 Transferability

Additionally, in Table 8 we demonstrate the high transferability of our approach by utilising it on top of most of the popular CNN- and ViT-based backbones. The absolute improvement ranges from roughly 2 to 10 percentage points, showcasing that distilling local augmentations of input images indeed promotes feature refinement and is practically an architecture-independent technique.

Table 8: Transferability study of AD-Net on the CUB dataset with 10% of training data. Column $\Delta$ shows the absolute performance increase with our approach compared to a vanilla backbone.

Type	Backbone	Vanilla	Ours	$\Delta$
CNN	ResNet-18	34.79	41.39	+6.60
	ResNet-34	36.83	45.81	+8.98
	ResNet-50	36.99	47.51	+10.52
	ResNet-101	40.19	49.44	+9.25
	GoogleNet	33.32	39.11	+5.79
	Inception v3	40.16	44.88	+4.72
	DenseNet-169	41.42	49.13	+7.71
ViT	ViT-B/32	65.60	69.27	+3.67
	FFVT B/32	65.79	68.13	+2.34

C.5 Augmentations with Naive Fine-tuning

Additionally, we conduct experiments with some advanced augmentation techniques, such as ScaleMix [Wang et al.(2022)Wang, Fan, Tian, Kihara, and Chen], MultiCrop [Caron et al.(2020a)Caron, Misra, Mairal, Goyal, Bojanowski, and Joulin], and AsymAug [Wang et al.(2022)Wang, Fan, Tian, Kihara, and Chen], applied in a naive way with standard fine-tuning. As can be observed in Table 9, without our distillation technique, the ResNet-50 performance with the advanced augmentations is below the baseline with basic augmentations. By basic augmentations we mean the classical list of augmentations recommended in the literature for each dataset: random cropping, colour jittering, random horizontal flip, and subsequent normalisation.

Table 9: Results with existing augmentation techniques, such as ScaleMix, MultiCrop, and AsymAug, applied in a naive way with standard fine-tuning. Advanced augmentations include more geometrical and colour-related perturbations from the popular AutoAugment approach. As can be observed, without our distillation technique, the performance is below the baseline ResNet-50 with basic augmentations. The results were obtained in the 10% low-data regime on the CUB training set.

Augmentation	Standard procedure	Our procedure
Basic augmentations	36.99	40.05
Advanced augmentations	27.82	30.68
ScaleMix [Wang et al.(2022)Wang, Fan, Tian, Kihara, and Chen]	28.54	26.99
MultiCrop [Caron et al.(2020a)Caron, Misra, Mairal, Goyal, Bojanowski, and Joulin]	28.06	28.90
AsymAug [Wang et al.(2022)Wang, Fan, Tian, Kihara, and Chen]	30.29	30.96

We assume that more advanced and complex types of augmentations harm the learning process under low-data regimes, since the model may be unable to capture the locally important patterns due to harsh image perturbations.

C.6 Activation Maps Visualisation

In order to investigate the reason behind the significant performance improvement with our approach on the lowest data settings, we provide the difference in feature activation maps between the vanilla ResNet-50 and our AD-Net based on the same backbone. In Figures 5 and 6 we can clearly observe higher quality of the activation area from our method (more attention to the distinctive foreground regions), which explains its noticeable performance gain.

Appendix D Application Guidelines

D.1 Answers for Potential Questions

1. “How to decide whether AD-Net should be used for a given scenario?”

A: Our method is intended for scenarios where the standard fine-tuning procedure shows poor performance due to a small amount of available labelled images.

2. “How many images should be collected at least?”

A: After a thorough investigation, we have concluded that the exact answer depends on multiple factors, such as the number of classes, the number of images per class, and the size and capacity of a chosen baseline.

Therefore, we believe some small prior experiments are needed to make the final decision. Specifically, we recommend starting with at least 5 images per class, and further increasing the amount until the minimum desired performance is achieved.

Although the general rule is “the more data - the better”, our solution was designed specifically for the cases where obtaining a lot of labelled data may be impractical.

3. “When should AD-Net be switched to a different method as more training data are obtained?”

A: We suggest tracking the overall performance increase compared to an initially chosen baseline as the amount of data grows. Once the performance gain becomes negligible, a different method can be used instead.

However, our solution is targeting the cases where the total amount of labelled data is highly limited and obtaining more samples may be too difficult.
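The tracking procedure suggested in the answer above can be sketched as a simple rule: record the accuracy of the baseline and of AD-Net at each data budget and report the first budget at which the gain falls below a chosen threshold. The threshold `min_gain=1.0` percentage point and the function name are hypothetical choices for this example; the accuracies are the CUB-200-2011 rows of Table 6.

```python
def first_negligible_gain(records, min_gain=1.0):
    """Return the smallest data percentage at which the gain drops below `min_gain`.

    `records` holds (data_percentage, baseline_accuracy, method_accuracy) tuples.
    `min_gain` is an ASSUMED threshold in percentage points; returns None if the
    gain stays at or above the threshold for every recorded budget.
    """
    for percentage, baseline_acc, method_acc in sorted(records):
        if method_acc - baseline_acc < min_gain:
            return percentage
    return None


# CUB-200-2011: ResNet-50 baseline vs Ours (ResNet-50), from Table 6.
CUB_RECORDS = [
    (10, 36.99, 47.51),
    (15, 48.88, 60.08),
    (30, 62.60, 71.11),
    (50, 73.23, 77.67),
    (100, 81.34, 82.06),
]

if __name__ == "__main__":
    print(first_negligible_gain(CUB_RECORDS, min_gain=1.0))
```

In practice the threshold should reflect the cost of maintaining the extra training branches relative to the accuracy that is considered worth keeping.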
