An Efficient NAS-based Approach for Handling Imbalanced Datasets

Zhiwei Yao
School of Computer Science and Technology
University of Science and Technology of China
Abstract

Class imbalance is a common issue in real-world data distributions, negatively impacting the training of accurate classifiers. Traditional approaches to mitigate this problem fall into three main categories: class re-balancing, information transfer, and representation learning. In this paper, we introduce a novel approach to enhance performance on long-tailed datasets by optimizing the backbone architecture through neural architecture search (NAS). Our research shows that an architecture’s accuracy on a balanced dataset does not reliably predict its performance on imbalanced datasets. This necessitates a complete NAS run on long-tailed datasets, which can be computationally expensive. To address this computational challenge, we focus on existing work, called IMB-NAS, which proposes efficiently adapting a NAS super-network trained on a balanced source dataset to an imbalanced target dataset. This paper provides a detailed description of the fundamental techniques behind IMB-NAS, including NAS and architecture transfer. Among various adaptation strategies, we find that the most effective approach is to retrain the linear classification head with a re-weighted loss while keeping the backbone NAS super-network, trained on the balanced source dataset, frozen. Finally, we conduct a series of experiments on the imbalanced CIFAR dataset for performance evaluation. Our conclusions agree with those reported in the IMB-NAS paper.

Index Terms:
Neural Architecture Search, Imbalanced Dataset.

1 Introduction

The natural world exhibits a long-tail data distribution, as shown in Fig. 1, where a small number of classes account for the majority of data samples, while the remaining data is spread across numerous minority classes. Much of the previous work [1, 2, 3] has concentrated on improving the accuracy of fixed backbone architectures like ResNet-32. In contrast, our work aims to optimize the backbone architecture using neural architecture search (NAS). This is particularly important as current practices require neural architectures to be optimized for the size and latency constraints of small edge devices.

Figure 1: Long-tailed data distribution.

To enhance the backbone architecture, we leverage recent advancements in Neural Architecture Search (NAS) [4], which primarily focuses on datasets that are balanced across classes. This raises a critical question: Is an architecture optimized on a balanced dataset also optimal for imbalanced datasets? When architecture search and model training are performed on imbalanced datasets, the model is prone to bias towards the head classes, which have many samples. This bias results in significantly lower accuracy on the tail classes, which have only a few samples. For example, Duggal et al. [5] trained identical architectures on datasets with varying distributions. Their findings revealed that the performance of the model varied substantially depending on the distribution of the data. Specifically, the architectures showed high accuracy on balanced datasets but struggled with imbalanced datasets. This demonstrates the need for effective strategies to mitigate class imbalance during both the search and training phases of neural network development.

Executing a NAS procedure for each target dataset demands significant computational resources and rapidly becomes infeasible when dealing with multiple target datasets. To address this challenge, we study the scheme proposed in [5], which takes a more efficient approach: adapting architectural rankings from balanced datasets to imbalanced ones. This approach leverages the strength of NAS while minimizing computational costs. Specifically, it reuses a NAS super-network trained on balanced data and adapts it to imbalanced data by retraining only the linear classification head. This strategy significantly reduces the computational burden, as it involves training only a linear layer on top of the pre-trained super-network. Extensive experiments in IMB-NAS [5] reveal a key insight: the adaptation procedure is most influenced by the linear classification head trained on top of the backbone. This finding suggests that the backbone, once trained on a balanced dataset, can generalize well to imbalanced datasets with minimal additional training. Based on this insight, we implement this scheme on the imbalanced CIFAR dataset [6]: we reuse a NAS super-network backbone trained on balanced CIFAR-10 and retrain only the classification head to adapt efficiently to imbalanced CIFAR-100.

The remainder of the paper is organized as follows. Section 2 discusses related work on neural architecture search, long-tailed data learning, and architecture transfer. In Section 3, we first introduce some preliminaries on DARTS [7] and then present a detailed description of the scheme in [5]. Section 4 presents the experimental setup and evaluation results, and Section 5 concludes the paper.

2 Related Works

2.1 Neural architecture search

NAS is a method for automating the design of neural network architectures, typically involving search space design, search strategy design, and performance estimation strategy. Search space design creates a diverse range of possible architectures, such as cell-based spaces like NASNets [8] and DARTS, or macro search spaces like those used in ShuffleNet [9] and MobileNet [10] models. Search strategy design focuses on efficiently identifying high-performing architectures within the search space. Common strategies include reinforcement learning [11, 8], where RL agents iteratively propose and evaluate architectures, receiving rewards based on their performance. Evolutionary algorithms [12, 13] apply principles such as mutation and selection to evolve a population of architectures, exploring a broad range of designs. Gradient-based methods, such as those used in DARTS [7], optimize architectures within a continuous relaxation of the search space, enabling more efficient searching compared to discrete methods. Performance estimation strategies aim to cheaply estimate the quality (e.g., accuracy or efficiency) of an architecture, using techniques like proxy tasks and weight sharing to reduce the computational cost of NAS [14, 15]. Proxy tasks involve training architectures on smaller or simplified versions of the target task to quickly evaluate their performance, while weight sharing trains a single super-network that contains all possible architectures within the search space, allowing for rapid evaluation without training each one from scratch. All of these approaches typically search for optimal architectures using fully balanced datasets. However, our experiments demonstrate that the set of optimal architectures can vary significantly between balanced and imbalanced datasets. This finding underscores the need for developing new NAS methods or efficient adaptation strategies to search for optimal architectures on real-world, imbalanced datasets.

2.2 Long-tailed data learning

Class imbalance, particularly the long-tail distribution, is a significant challenge in many real-world applications. Long-tailed data refers to datasets where a few classes (head classes) have a large number of samples, while many other classes (tail classes) have relatively few samples. Prior research on addressing long-tail imbalance can be broadly categorized into three primary approaches, as detailed in the survey of [16]. The first is class re-balancing, which aims to mitigate the effects of class imbalance by adjusting the training data or the loss function. It includes techniques such as data re-sampling [17, 18], loss re-weighting [1, 19, 3, 20], and logit adjustment [21, 22, 23]. In data re-sampling, minority classes are oversampled or majority classes are undersampled. Loss re-weighting assigns higher weights to tail classes in the loss function. Logit adjustment modifies the logits (the outputs before the final activation function) in a way that accounts for class imbalance, thus helping the model to better distinguish between minority and majority classes. The second is information augmentation, which includes transfer learning [24, 25], leveraging pre-trained models from balanced datasets, and data augmentation [26], generating additional samples for minority classes through techniques like GANs. The third is module improvement, which focuses on enhancing the model’s architecture and learning process. Module improvement encompasses techniques in representation learning [27], classifier design [28], decoupled training [1], and ensembling [2]. Distinct from these existing approaches, the work in [5] explores a novel direction for enhancing performance on long-tail datasets by optimizing the backbone architecture through neural architecture search. This new approach complements existing methods and can be used in conjunction with them to further improve accuracy and efficiency on imbalanced datasets.

2.3 Architecture transfer

Previous work evaluates the robustness of architectures to distributional shifts in training datasets. Neural Architecture Transfer [29] investigates the transferability of architectures from large-scale to small-scale fine-grained datasets. However, this approach has two main limitations: it only considers balanced source and target datasets, and it assumes that all target datasets are known in advance, which is not practical for many industrial applications.

NASTransfer [30] addresses transferability between large-scale datasets, including highly imbalanced ones such as ImageNet-22k. Their approach is practically useful for very large datasets (e.g., ImageNet-22k) for which direct search is prohibitive. However, when direct search is feasible (e.g., on ImageNet), it typically leads to better architectures than proxy search. Our work differs by focusing on directly adapting a super-network pre-trained on fully balanced datasets to imbalanced ones. This approach emphasizes efficiency by retraining only the linear classification head while keeping the backbone frozen. By doing so, we significantly reduce the computational effort required compared to performing a full search on the target dataset.

3 Methodology

Figure 2: An overview of DARTS: (a) Operations on the edges are initially unspecified. (b) Continuous relaxation of the search space is achieved by placing a mixture of candidate operations on each edge. (c) Joint optimization of the mixing probabilities and the network weights is performed by solving a bilevel optimization problem. (d) The final architecture is derived from the learned mixing probabilities.

In this work, we focus on training a super-network on a balanced dataset and adapting only the linear classifier on the target dataset for computer vision tasks such as image classification. We introduce the technical details of the scheme in [5] in the following three parts:

3.1 Preliminaries

Consider a training dataset $\mathcal{D}=\{x_i, y_i\}$, where $x_i$ denotes an image and $y_i$ its corresponding label. Let $n_j$ be the number of training images in class $j$, and let $C$ be the number of classes. Under the assumption of a long-tail distribution, after sorting classes by decreasing cardinality, we observe that $n_i \geq n_j$ for $i<j$, and $n_1 \gg n_C$. We denote a deep neural network by $\phi$, which comprises a backbone $\phi(\alpha, w_\alpha)$ with architecture $\alpha$ and weights $w_\alpha$, and a linear classifier $\phi(w_\theta)$. The model $\phi$ is trained using a combination of a training loss and a loss re-weighting strategy.

For balanced datasets, the network is typically trained using the standard cross-entropy (CE) loss. In contrast, for imbalanced datasets, a re-weighting strategy [19] is applied to mitigate the bias towards majority classes. Specifically, samples from class $j$ are re-weighted by a factor of $\frac{1-\gamma}{1-\gamma^{n_j}}$, where $\gamma$ is a hyperparameter controlling the degree of re-weighting. This technique helps to balance the influence of each class during training, thereby improving performance on minority classes.
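To make the re-weighting factor concrete, the following minimal sketch computes $\frac{1-\gamma}{1-\gamma^{n_j}}$ for each class. The class counts, the value $\gamma = 0.999$, and the final normalization (rescaling the weights so they sum to the number of classes) are illustrative assumptions, not values taken from the experiments in this paper.

```python
def class_balanced_weights(class_counts, gamma=0.999):
    """Return one weight per class: (1 - gamma) / (1 - gamma ** n_j).

    Assumes every class has at least one sample and 0 <= gamma < 1.
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    if min(class_counts) < 1:
        raise ValueError("every class needs at least one sample")
    return [(1.0 - gamma) / (1.0 - gamma ** n) for n in class_counts]


def normalize(weights):
    """Rescale weights so they sum to the number of classes (assumed convention)."""
    scale = len(weights) / sum(weights)
    return [w * scale for w in weights]


# Hypothetical head, medium, and tail class sizes.
counts = [5000, 500, 50]
weights = normalize(class_balanced_weights(counts, gamma=0.999))
```

With $\gamma = 0$ every class receives weight 1 (no re-weighting), while values of $\gamma$ close to 1 push the weights towards inverse class frequency, so tail classes receive the largest weights.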

The backbone $\phi(\alpha, w_\alpha)$ is responsible for feature extraction, transforming input images into high-level representations. These representations are then fed into the linear classifier $\phi(w_\theta)$, which maps them to the final output classes. The training process involves optimizing both $w_\alpha$ and $w_\theta$ to minimize the overall loss. To handle class imbalance effectively, and inspired by previous works [20, 31], the re-weighting strategy is often applied only after a few initial epochs of standard training, a scheme known as delayed re-weighting (DRW). This approach allows the model to first learn general features before focusing on underrepresented classes. The combined loss function can be expressed as:

$$\mathcal{L}_{\text{total}}=\mathcal{L}_{\mathrm{CE}}+\lambda\,\mathcal{L}_{\mathrm{RW}}, \tag{1}$$

where $\mathcal{L}_{\mathrm{CE}}$ is the cross-entropy loss, $\mathcal{L}_{\mathrm{RW}}$ is the re-weighted loss, and $\lambda$ is a scaling factor.

3.2 Differentiable Architecture Search

Differentiable Architecture Search (DARTS) represents a significant advancement in the field of neural architecture search (NAS). Traditional NAS methods often rely on reinforcement learning or evolutionary algorithms, which are computationally expensive due to the need to train and evaluate a large number of candidate architectures. In contrast, DARTS introduces a differentiable approach that allows for the efficient optimization of neural network architectures using gradient-based methods.

As shown in Fig. 2, in DARTS, the architecture search space is parameterized by a set of continuous variables that represent the probabilities of choosing different operations (e.g., convolutions, pooling) at each layer of the network. These continuous variables are optimized jointly with the network weights using standard gradient descent techniques. By formulating the search process as a differentiable problem, DARTS can efficiently explore the architecture space and converge to an optimal architecture in a fraction of the time required by traditional methods.

The DARTS framework consists of two main phases: the search phase and the evaluation phase. During the search phase, a super-network that encompasses all possible architectures within the search space is trained using a mixture of operations weighted by the learned continuous variables. The super-network is represented as a directed acyclic graph (DAG), where each node corresponds to a network layer, and each edge represents a candidate operation. We train the super-network with backbone $\phi(\alpha, w_\alpha)$ and classifier $\phi(w_\theta)$ on a training dataset $\mathcal{D}$ via the following minimization:

$$w_{\alpha,\mathcal{D}}^{*},\, w_{\theta,\mathcal{D}}^{*}=\min_{w_\alpha, w_\theta}\ \underset{\alpha\sim\mathcal{A}}{\mathbb{E}}\left(\mathcal{L}\left(\phi(w_\theta),\phi(\alpha,w_\alpha);\mathcal{D}\right)\right). \tag{2}$$

Here, $\mathcal{L}$ denotes the loss function and $\alpha\sim\mathcal{A}$ indicates sampling from the search space $\mathcal{A}$ via uniform or attentive sampling. The expectation $\mathbb{E}$ is computed over the sampled architectures $\alpha$, which are combined using the continuous variables.

The architecture of the neural network is represented as a weighted sum of candidate operations, where the weight of each operation is determined by the continuous variable $\alpha$. For each edge $(i,j)$ in the DAG, the operation $o^{(i,j)}$ is a weighted sum of all possible operations:

$$o^{(i,j)}(x)=\sum_{o\in\mathcal{O}}\frac{\exp\left(\alpha_{o}^{(i,j)}\right)}{\sum_{o'\in\mathcal{O}}\exp\left(\alpha_{o'}^{(i,j)}\right)}\,o(x) \tag{3}$$

Here, $\mathcal{O}$ is the set of all candidate operations (e.g., convolutions, pooling), and $\alpha_{o}^{(i,j)}$ is the weight for operation $o$ on edge $(i,j)$. The softmax function ensures that the weights sum to 1, making the operation selection differentiable.
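The mixed operation in Eq. (3) can be illustrated with a small sketch. The candidate operations below are hypothetical scalar functions standing in for convolution, identity, and zero operations; in an actual DARTS cell they would be tensor operations and the $\alpha$ values would be learned.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_op(x, alphas, ops):
    """Softmax-weighted sum of all candidate operations applied to x (Eq. 3)."""
    probs = softmax(alphas)
    return sum(p * op(x) for p, op in zip(probs, ops))


# Hypothetical scalar stand-ins for the candidate operations on one edge.
candidate_ops = [lambda x: 2.0 * x, lambda x: x, lambda x: 0.0]

# Equal architecture parameters give each operation the same mixing weight.
y_uniform = mixed_op(3.0, [0.0, 0.0, 0.0], candidate_ops)

# A dominant parameter makes the mixture behave almost like a single operation.
y_peaked = mixed_op(3.0, [50.0, 0.0, 0.0], candidate_ops)
```

Because the output is a smooth function of the $\alpha$ values, gradients can flow to them during training, which is what makes the search differentiable.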

After the search phase, the learned continuous variables $\alpha$ are used to derive the final discrete architecture. The most likely operations, as indicated by $\alpha$, are selected to form the final architecture. This derived architecture is then retrained from scratch to validate its performance. The objective during this phase is to find the architecture $\alpha^{*}$ that maximizes the validation accuracy as follows:

$$\alpha_{\mathcal{D}}^{*}=\max_{\alpha\in\mathcal{A}}\operatorname{Acc}\left(\phi(w_\theta),\phi(\alpha,w_\alpha);\mathcal{D}\right) \tag{4}$$

This maximization is commonly carried out using evolutionary algorithms or reinforcement learning techniques. In the following sections, we will explore efficient adaptation methods for modifying a NAS super-network, originally trained on a balanced dataset, to perform well on an imbalanced dataset.
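The discretization step described above, in which the most likely operation is kept on each edge, can be sketched as follows. This is a simplified illustration: it takes a per-edge argmax only, the operation names and $\alpha$ values are hypothetical, and any additional pruning rules used by a particular DARTS implementation are not modeled.

```python
def derive_architecture(alpha, op_names):
    """Keep, for every edge, the candidate operation with the largest alpha.

    Softmax is monotonic, so the argmax over raw alphas equals the argmax
    over the softmax mixing probabilities.
    """
    arch = {}
    for edge, logits in alpha.items():
        best = max(range(len(logits)), key=lambda k: logits[k])
        arch[edge] = op_names[best]
    return arch


# Hypothetical candidate operations and learned architecture parameters.
op_names = ["conv_3x3", "max_pool_3x3", "skip_connect"]
alpha = {
    (0, 1): [0.2, 1.5, -0.3],
    (0, 2): [2.0, 0.1, 0.4],
    (1, 2): [-1.0, 0.0, 0.7],
}
arch = derive_architecture(alpha, op_names)
```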

3.3 Rank adaptation procedures

Given source and target datasets $\mathcal{D}_s, \mathcal{D}_t$, the initial step is to train a super-network on $\mathcal{D}_s$ by solving the following optimization problem:

$$w_{\alpha,\mathcal{D}_s}^{*},\, w_{\theta,\mathcal{D}_s}^{*}=\min_{w_\alpha, w_\theta}\ \underset{\alpha\sim\mathcal{A}}{\mathbb{E}}\left(\mathcal{L}\left(\phi(w_\theta),\phi(\alpha,w_\alpha);\mathcal{D}_s\right)\right) \tag{5}$$

The primary objective is to adapt the optimal super-network weights $w_{\alpha,\mathcal{D}_s}^{*}, w_{\theta,\mathcal{D}_s}^{*}$ obtained from the source dataset $\mathcal{D}_s$ to the target dataset $\mathcal{D}_t$, which is characterized by class imbalance. The most efficient strategy for this adaptation involves freezing the backbone of the network and adapting only the linear classifier on $\mathcal{D}_t$ by minimizing the re-weighted loss function $\mathcal{L}_{\mathrm{RW}}$:

$$w_{\theta,\mathcal{D}_t}^{*}=\min_{w_\theta}\ \underset{\alpha\sim\mathcal{A}}{\mathbb{E}}\left(\mathcal{L}_{\mathrm{RW}}\left(\phi(w_\theta),\phi\left(\alpha,w_{\alpha,\mathcal{D}_s}^{*}\right);\mathcal{D}_t\right)\right). \tag{6}$$

Here, $\mathcal{L}_{\mathrm{RW}}$ is the re-weighted loss function tailored to handle class imbalance. This method significantly reduces computational costs as only the classifier is retrained, while the backbone remains unchanged. The super-network resulting from this procedure contains backbone weights $w_{\alpha,\mathcal{D}_s}^{*}$ trained on $\mathcal{D}_s$ and classifier weights $w_{\theta,\mathcal{D}_t}^{*}$ trained on $\mathcal{D}_t$. Alternatively, another approach involves fine-tuning both the backbone and the classifier on the target dataset. This procedure is more computationally intensive but allows the backbone itself to adapt to the new data distribution. This can be done by minimizing the delayed re-weighted loss $\mathcal{L}_{\mathrm{DRW}}$:

$$w_{\alpha,\mathcal{D}_t}^{**},\, w_{\theta,\mathcal{D}_t}^{*}=\min_{w_\alpha, w_\theta}\ \underset{\alpha\sim\mathcal{A}}{\mathbb{E}}\left(\mathcal{L}_{\mathrm{DRW}}\left(\phi(w_\theta),\phi\left(\alpha,w_{\alpha,\mathcal{D}_s}^{*}\right);\mathcal{D}_t\right)\right). \tag{7}$$

In this context, the double star on $w_{\alpha,\mathcal{D}_t}^{**}$ indicates that the weights were obtained by fine-tuning $w_{\alpha,\mathcal{D}_s}^{*}$ using a reduced learning rate and fewer training epochs. Note that the delayed re-weighted loss $\mathcal{L}_{\mathrm{DRW}}$ starts as the unweighted loss $\mathcal{L}$ during the initial epochs and transitions to the re-weighted loss $\mathcal{L}_{\mathrm{RW}}$ in later epochs. Although this second adaptation procedure is more computationally demanding because it also involves updating the backbone, it is still much less intensive than performing a full search on the target dataset.
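The switch from the unweighted loss to the re-weighted loss can be sketched as a per-epoch choice of class weights. The switch epoch of 160, the value $\gamma = 0.999$, and the class counts are hypothetical placeholders; the class-balanced factor is the one defined in Section 3.1.

```python
import math


def drw_class_weights(epoch, class_counts, switch_epoch, gamma=0.999):
    """Uniform weights before switch_epoch, class-balanced weights from then on."""
    if epoch < switch_epoch:
        return [1.0] * len(class_counts)
    return [(1.0 - gamma) / (1.0 - gamma ** n) for n in class_counts]


def weighted_nll(probs, label, weights):
    """Weighted negative log-likelihood of one sample given class probabilities."""
    return -weights[label] * math.log(probs[label])


counts = [5000, 50]  # hypothetical head and tail class sizes
early = drw_class_weights(0, counts, switch_epoch=160)    # unweighted phase
late = drw_class_weights(160, counts, switch_epoch=160)   # re-weighted phase
loss_early = weighted_nll([0.5, 0.5], label=1, weights=early)
```

In the early phase every sample contributes equally, so the loss reduces to standard cross-entropy; in the late phase a tail-class sample is weighted more heavily than a head-class sample.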

The third and most computationally demanding method involves conducting a direct search on the target dataset using the delayed re-weighted loss $\mathcal{L}_{\mathrm{DRW}}$. This approach aims to identify the optimal architecture and weights from scratch, using the following optimization:

$$w_{\alpha,\mathcal{D}_t}^{*},\, w_{\theta,\mathcal{D}_t}^{*}=\min_{w_\alpha, w_\theta}\ \underset{\alpha\sim\mathcal{A}}{\mathbb{E}}\left(\mathcal{L}_{\mathrm{DRW}}\left(\phi(w_\theta),\phi(\alpha,w_\alpha);\mathcal{D}_t\right)\right) \tag{8}$$

Although this method incurs high computational costs, it potentially yields the most tailored architecture for the imbalanced dataset. The three adaptation strategies are summarized in Table I.

Figure 3: Four different label distributions of the CIFAR-10 dataset: (a) Balance, (b) Exponential ($\mu$ = 0.1), (c) Exponential ($\mu$ = 0.01), (d) Step ($\mu$ = 0.01).

TABLE I: Summary of rank adaptation procedures.

| Symbol | Eqn | Details |
|---|---|---|
| P0 | (5) | No adaptation. |
| P1 | (6) | Freeze backbone, retrain classifier on $\mathcal{D}_t$. |
| P2 | (7) | Fine-tune backbone and retrain classifier on $\mathcal{D}_t$. |
| P3 | (8) | Re-train backbone and classifier on $\mathcal{D}_t$. |

4 Experiment Evaluation

4.1 Implementation details

Datasets. To simulate real-world class imbalance scenarios, we constructed imbalanced versions of the CIFAR-10 and CIFAR-100 datasets by sub-sampling from their original training splits. CIFAR-10 contains 60,000 color images of size $32 \times 32 \times 3$, divided into 10 distinct classes, with 50,000 images for training and 10,000 for testing. CIFAR-100 has the same number of images but consists of 100 classes, which makes classification more challenging. On the resulting CIFAR-10-LT and CIFAR-100-LT datasets, we explored three types of label distribution following [31]:

  • Balance: Every class keeps all of its samples from the original training set, i.e., 5,000 per class for CIFAR-10.

  • Step: The number of samples in each class of the latter half is scaled by the imbalance factor $\mu$, so that each of these classes has $5000 \times \mu$ samples.

  • Exponential: For each class $i \in \{0, 1, \ldots, 9\}$, the number of samples is set to $5000 \times \mu^{i/9}$.
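The three distributions above can be sketched as a small helper that returns the number of training samples kept per class. The function name, the string labels, and the truncation to an integer are assumptions made for illustration; only the formulas come from the list above.

```python
def class_sample_counts(num_classes, max_per_class, imbalance_type, factor=1.0):
    """Number of training samples kept for each class.

    imbalance_type is one of "balance", "step" or "exp" (hypothetical labels);
    factor is the imbalance factor (mu in the text).
    """
    if imbalance_type == "balance":
        return [max_per_class] * num_classes
    if imbalance_type == "step":
        # The first half keeps all samples, the latter half is scaled by factor.
        half = num_classes // 2
        reduced = int(max_per_class * factor)
        return [max_per_class] * half + [reduced] * (num_classes - half)
    if imbalance_type == "exp":
        # Class i keeps max_per_class * factor ** (i / (num_classes - 1)) samples.
        return [
            int(max_per_class * factor ** (i / (num_classes - 1)))
            for i in range(num_classes)
        ]
    raise ValueError(f"unknown imbalance type: {imbalance_type}")

# CIFAR-10 settings corresponding to the four panels of Fig. 3.
balance = class_sample_counts(10, 5000, "balance")
exp_01 = class_sample_counts(10, 5000, "exp", factor=0.1)
exp_001 = class_sample_counts(10, 5000, "exp", factor=0.01)
step_001 = class_sample_counts(10, 5000, "step", factor=0.01)
```

For the exponential profile the first class always keeps all 5,000 samples and the last class keeps $5000 \times \mu$, with the classes in between decaying geometrically.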

Specifically, our experiments involve four different label distributions. Taking CIFAR-10 as an example, the per-class sample distributions are illustrated in Fig. 3. Figs. 3(b)-3(c) show that the smaller the imbalance factor $\mu$, the more pronounced the long-tail characteristics of the label distribution.

Sub-network and super-network training details. Following IMB-NAS, we trained a sub-network on the balanced CIFAR-10 and CIFAR-100 datasets for 200 epochs. Training began with an initial learning rate of 0.1, which was decayed by a factor of 0.01 at epochs 160 and 180, and the model was optimized with the cross-entropy loss. For the imbalanced versions of these datasets, we applied a re-weighting strategy inspired by [19].

We trained a super-network for 500 epochs, starting with an initial learning rate of 0.1 that was reduced by a factor of 0.01 at epochs 300 and 400. For imbalanced datasets, re-weighting is applied from epoch 350 onward to address class imbalance. To identify the best sub-network, we employed the evolutionary search strategy described in [4], with 20 generations, a population of 50, a crossover number of 25, a mutation number of 25, a mutation probability of 0.1, and a top-k of 10 for the final architecture selection.
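The search loop with these hyperparameters can be sketched as follows. This is a simplified illustration in the spirit of [4], not the exact procedure used in the experiments: the architecture encoding (one operation index per layer), the toy fitness function, and the omission of resource constraints are all assumptions. In the actual setting the fitness would be the validation accuracy of a sub-network evaluated with weights inherited from the super-network.

```python
import random

def evolutionary_search(fitness, num_layers, num_choices, generations=20,
                        population_size=50, num_crossover=25, num_mutation=25,
                        mutation_prob=0.1, top_k=10, seed=0):
    """Return the top_k architectures found, best first."""
    rng = random.Random(seed)

    def random_arch():
        return tuple(rng.randrange(num_choices) for _ in range(num_layers))

    population = [random_arch() for _ in range(population_size)]
    top = []
    for _ in range(generations):
        # Keep the top_k best architectures seen so far.
        candidates = set(top) | set(population)
        top = sorted(candidates, key=fitness, reverse=True)[:top_k]

        children = []
        # Crossover: each layer choice is taken from one of two parents.
        for _ in range(num_crossover):
            a, b = rng.choice(top), rng.choice(top)
            children.append(tuple(rng.choice(pair) for pair in zip(a, b)))
        # Mutation: each layer choice is resampled with probability mutation_prob.
        for _ in range(num_mutation):
            parent = rng.choice(top)
            children.append(tuple(
                rng.randrange(num_choices) if rng.random() < mutation_prob else gene
                for gene in parent
            ))
        population = children

    candidates = set(top) | set(population)
    return sorted(candidates, key=fitness, reverse=True)[:top_k]

# Toy fitness (hypothetical): number of layers matching a fixed target encoding.
TARGET = (1, 0, 3, 2, 1, 0)

def toy_fitness(arch):
    return sum(1 for gene, goal in zip(arch, TARGET) if gene == goal)

best_archs = evolutionary_search(toy_fitness, num_layers=6, num_choices=4)
```

Each generation produces 25 crossover children and 25 mutated children, which together form the next population of 50.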

Adaptation Strategies. To adapt a super-network from a balanced to an imbalanced dataset, as described in [5], we fine-tuned the network for 200 epochs with an initial learning rate of 0.01, which was decayed by a factor of 0.01 at epoch 100. For procedure P1, we introduce re-weighting at epoch 1. For P2, we delay the re-weighting to epoch 100. For P3, we follow the NAS strategy detailed above.
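The fine-tuning settings for P1 and P2 can be summarized in a small configuration sketch. The class and function names are hypothetical, epochs are assumed to be numbered from 1, and the flag `train_backbone` only records whether the backbone is updated (P2) or frozen (P1); it is not tied to any particular framework call.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AdaptationProcedure:
    name: str
    train_backbone: bool       # False: backbone frozen, only the classifier is retrained
    reweight_from_epoch: int   # first epoch (1-based) that uses the re-weighted loss

P1 = AdaptationProcedure("P1", train_backbone=False, reweight_from_epoch=1)
P2 = AdaptationProcedure("P2", train_backbone=True, reweight_from_epoch=100)

def learning_rate(epoch, base_lr=0.01, decay=0.01, milestones=(100,)):
    """Step schedule: multiply base_lr by decay once per milestone reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * decay ** passed

def epoch_plan(procedure, epoch):
    """Settings applied in a given fine-tuning epoch (1 to 200)."""
    return {
        "lr": learning_rate(epoch),
        "train_backbone": procedure.train_backbone,
        "reweighted_loss": epoch >= procedure.reweight_from_epoch,
    }

plan_p1_start = epoch_plan(P1, 1)
plan_p2_late = epoch_plan(P2, 150)
```

Under these assumptions, P1 uses the re-weighted loss for the whole run, whereas P2 trains with the unweighted loss for the first 99 epochs and switches to the re-weighted loss at the same epoch at which the learning rate drops.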

Experiment Setup. We performed our experiments on an AMAX deep learning workstation equipped with an Intel(R) Xeon(R) Gold 5218R CPU, 8 NVIDIA GeForce RTX 3090 GPUs, and 256 GB of RAM. All model training and evaluation were implemented using the PyTorch framework [32].

4.2 IMB-NAS Evaluation

Given a NAS super-network trained on a source dataset $\mathcal{D}_s$, the goal is to adapt it efficiently to a target dataset $\mathcal{D}_t$. Table II presents the results for scenarios where $\mathcal{D}_s$ is CIFAR-10 and $\mathcal{D}_t$ is CIFAR-100 with varying levels of imbalance. The baseline (P0) takes the best sub-networks found on $\mathcal{D}_s$ and retrains them on $\mathcal{D}_t$. The paragon (P3) searches and trains directly on $\mathcal{D}_t$ and is intended as the reference for the accuracy that a full search on the target can reach.

The adaptation procedures (P1 and P2) are displayed in the middle rows. Both adaptation methods surpass the baseline at higher imbalance levels, indicating that architectures optimized on $\mathcal{D}_s$ do not transfer well to an imbalanced $\mathcal{D}_t$. Consistent with IMB-NAS, we found that P1 outperforms P2 in every setting. This result is unexpected, since P2 adapts the NAS backbone with the target data, whereas P1 retains the backbone from the source dataset. It suggests that class imbalance poses a greater challenge for NAS backbone optimization than the domain differences between CIFAR-10 and CIFAR-100. We also find that P1 and P2 achieve accuracy close to that of P3 while avoiding much of its computational cost. This efficiency is crucial for practical applications where computational resources are limited. These findings underscore the importance of selecting the adaptation strategy carefully when applying NAS to imbalanced datasets.

| Symbol \ Imbalance ratio (factor) | 0 (Balance) | 0.1 (Exp) | 0.01 (Exp) | 0.01 (Step) |
|---|---|---|---|---|
| baseline P0 | 52.83 | 48.43 | 41.18 | 38.43 |
| P1 | 52.76 | 49.39 | 42.37 | 39.58 |
| P2 | 52.71 | 49.05 | 42.02 | 39.21 |
| paragon P3 | 52.86 | 49.54 | 42.32 | 39.49 |

TABLE II: CIFAR10-Balanced → CIFAR100
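
The comparison against the baseline can be made explicit with a short calculation over the values reported in Table II. The snippet below is only an illustrative aid; the variable and function names are hypothetical, and the numbers are copied from the table.

```python
# Accuracy values copied from Table II; columns follow the table order.
SETTINGS = ["0 (Balance)", "0.1 (Exp)", "0.01 (Exp)", "0.01 (Step)"]
TABLE_II = {
    "P0": [52.83, 48.43, 41.18, 38.43],
    "P1": [52.76, 49.39, 42.37, 39.58],
    "P2": [52.71, 49.05, 42.02, 39.21],
    "P3": [52.86, 49.54, 42.32, 39.49],
}

def gain_over_baseline(results, baseline="P0"):
    """Accuracy difference of every procedure relative to the baseline row."""
    base = results[baseline]
    return {
        name: [round(acc - ref, 2) for acc, ref in zip(accs, base)]
        for name, accs in results.items()
        if name != baseline
    }

gains = gain_over_baseline(TABLE_II)
for name, deltas in gains.items():
    print(name, dict(zip(SETTINGS, deltas)))
```

On the balanced target the adapted procedures are marginally below the baseline, while on the three imbalanced targets P1 and P2 both improve on it, with P1 ahead of P2 in every column.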

5 Conclusions

This work aims to improve performance on class-imbalanced datasets by optimizing the backbone architecture. We begin by reviewing related work on NAS and on techniques for handling long-tailed datasets. Our investigation revealed that an architecture's performance on balanced datasets does not consistently predict its effectiveness on imbalanced datasets. This insight implies that re-running NAS for each target dataset might be necessary. To avoid the substantial computational cost of re-running NAS, we explored an existing approach called IMB-NAS, which adapts a NAS super-network trained on balanced datasets to imbalanced ones. IMB-NAS introduces several adaptation methods and finds that re-training the linear classification head while keeping the NAS super-network backbone frozen outperforms the other adaptation strategies. To further understand IMB-NAS, we conducted a series of experiments on long-tailed CIFAR datasets to evaluate its performance. Our experimental results generally aligned with our expectations, confirming the effectiveness of IMB-NAS.

References

  • [1] B. Kang, S. Xie, M. Rohrbach, Z. Yan, A. Gordo, J. Feng, and Y. Kalantidis, “Decoupling representation and classifier for long-tailed recognition,” arXiv preprint arXiv:1910.09217, 2019.
  • [2] B. Zhou, Q. Cui, X.-S. Wei, and Z.-M. Chen, “Bbn: Bilateral-branch network with cumulative learning for long-tailed visual recognition,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 9719–9728.
  • [3] R. Duggal, S. Freitas, S. Dhamnani, D. H. Chau, and J. Sun, “Har: Hardness aware reweighting for imbalanced datasets,” in 2021 IEEE International Conference on Big Data (Big Data).   IEEE, 2021, pp. 735–745.
  • [4] Z. Guo, X. Zhang, H. Mu, W. Heng, Z. Liu, Y. Wei, and J. Sun, “Single path one-shot neural architecture search with uniform sampling,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI 16.   Springer, 2020, pp. 544–560.
  • [5] R. Duggal, S. Peng, H. Zhou, and D. H. Chau, “Imb-nas: Neural architecture search for imbalanced datasets,” arXiv preprint arXiv:2210.00136, 2022.
  • [6] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” 2009.
  • [7] H. Liu, K. Simonyan, and Y. Yang, “Darts: Differentiable architecture search,” arXiv preprint arXiv:1806.09055, 2018.
  • [8] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferable architectures for scalable image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 8697–8710.
  • [9] X. Zhang, X. Zhou, M. Lin, and J. Sun, “Shufflenet: An extremely efficient convolutional neural network for mobile devices,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6848–6856.
  • [10] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
  • [11] B. Baker, O. Gupta, N. Naik, and R. Raskar, “Designing neural network architectures using reinforcement learning,” arXiv preprint arXiv:1611.02167, 2016.
  • [12] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le, and A. Kurakin, “Large-scale evolution of image classifiers,” in International conference on machine learning.   PMLR, 2017, pp. 2902–2911.
  • [13] R. Duggal, H. Zhou, S. Yang, Y. Xiong, W. Xia, Z. Tu, and S. Soatto, “Compatibility-aware heterogeneous visual search,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 10723–10732.
  • [14] B. Baker, O. Gupta, R. Raskar, and N. Naik, “Accelerating neural architecture search using performance prediction,” arXiv preprint arXiv:1705.10823, 2017.
  • [15] S. Falkner, A. Klein, and F. Hutter, “Bohb: Robust and efficient hyperparameter optimization at scale,” in International conference on machine learning.   PMLR, 2018, pp. 1437–1446.
  • [16] Y. Zhang, B. Kang, B. Hooi, S. Yan, and J. Feng, “Deep long-tailed learning: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • [17] H. He, Y. Bai, E. A. Garcia, and S. Li, “Adasyn: Adaptive synthetic sampling approach for imbalanced learning,” in 2008 IEEE international joint conference on neural networks (IEEE world congress on computational intelligence).   IEEE, 2008, pp. 1322–1328.
  • [18] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “Smote: synthetic minority over-sampling technique,” Journal of artificial intelligence research, vol. 16, pp. 321–357, 2002.
  • [19] Y. Cui, M. Jia, T.-Y. Lin, Y. Song, and S. Belongie, “Class-balanced loss based on effective number of samples,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 9268–9277.
  • [20] R. Duggal, S. Freitas, S. Dhamnani, D. H. Chau, and J. Sun, “Elf: An early-exiting framework for long-tailed classification,” arXiv preprint arXiv:2006.11979, 2020.
  • [21] A. K. Menon, S. Jayasumana, A. S. Rawat, H. Jain, A. Veit, and S. Kumar, “Long-tail learning via logit adjustment,” arXiv preprint arXiv:2007.07314, 2020.
  • [22] J. Tian, Y.-C. Liu, N. Glaser, Y.-C. Hsu, and Z. Kira, “Posterior re-calibration for imbalanced datasets,” Advances in neural information processing systems, vol. 33, pp. 8101–8113, 2020.
  • [23] S. Zhang, Z. Li, S. Yan, X. He, and J. Sun, “Distribution alignment: A unified framework for long-tail visual recognition,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 2361–2370.
  • [24] Y.-X. Wang, D. Ramanan, and M. Hebert, “Learning to model the tail,” Advances in neural information processing systems, vol. 30, 2017.
  • [25] X. Yin, X. Yu, K. Sohn, X. Liu, and M. Chandraker, “Feature transfer learning for face recognition with under-represented data,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 5704–5713.
  • [26] P. Chu, X. Bian, S. Liu, and H. Ling, “Feature space augmentation for long-tailed data,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIX 16.   Springer, 2020, pp. 694–710.
  • [27] Z. Liu, Z. Miao, X. Zhan, J. Wang, B. Gong, and S. X. Yu, “Large-scale long-tailed recognition in an open world,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 2537–2546.
  • [28] T.-Y. Wu, P. Morgado, P. Wang, C.-H. Ho, and N. Vasconcelos, “Solving long-tailed recognition with deep realistic taxonomic classifier,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VIII 16.   Springer, 2020, pp. 171–189.
  • [29] Z. Lu, G. Sreekumar, E. Goodman, W. Banzhaf, K. Deb, and V. N. Boddeti, “Neural architecture transfer,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 9, pp. 2971–2989, 2021.
  • [30] R. Panda, M. Merler, M. S. Jaiswal, H. Wu, K. Ramakrishnan, U. Finkler, C.-F. R. Chen, M. Cho, R. Feris, D. Kung et al., “Nastransfer: Analyzing architecture transferability in large scale neural architecture search,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 10, 2021, pp. 9294–9302.
  • [31] K. Cao, C. Wei, A. Gaidon, N. Arechiga, and T. Ma, “Learning imbalanced datasets with label-distribution-aware margin loss,” Advances in neural information processing systems, vol. 32, 2019.
  • [32] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” Advances in neural information processing systems, vol. 32, 2019.