An Efficient NAS-based Approach for Handling Imbalanced Datasets

Zhiwei Yao
School of Computer Science and Technology
University of Science and Technology of China

Abstract

Class imbalance is a common issue in real-world data distributions, negatively impacting the training of accurate classifiers. Traditional approaches to mitigate this problem fall into three main categories: class re-balancing, information transfer, and representation learning. In this paper, we introduce a novel approach to enhance performance on long-tailed datasets by optimizing the backbone architecture through neural architecture search (NAS). Our research shows that an architecture’s accuracy on a balanced dataset does not reliably predict its performance on imbalanced datasets. This necessitates a complete NAS run on long-tailed datasets, which can be computationally expensive. To address this computational challenge, we focus on existing work, called IMB-NAS, which proposes efficiently adapting a NAS super-network trained on a balanced source dataset to an imbalanced target dataset. A detailed description of the fundamental techniques for IMB-NAS is provided in this paper, including NAS and architecture transfer. Among various adaptation strategies, we find that the most effective approach is to retrain the linear classification head with reweighted loss while keeping the backbone NAS super-network trained on the balanced source dataset frozen. Finally, we conducted a series of experiments on the imbalanced CIFAR dataset for performance evaluation. Our conclusions are the same as those proposed in the IMB-NAS paper.

Index Terms:

Neural Architecture Search, Imbalanced Dataset.

1 Introduction

The natural world exhibits a long-tail data distribution, as shown in Fig. 1, where a small number of classes account for the majority of data samples, while the remaining data is spread across numerous minority classes. Much of the previous work [1, 2, 3] has concentrated on improving the accuracy of fixed backbone architectures like ResNet-32. In contrast, our work aims to optimize the backbone architecture using neural architecture search (NAS). This is particularly important as current practices require neural architectures to be optimized for the size and latency constraints of small edge devices.

Figure 1: Long-tailed data distribution.

To enhance the backbone architecture, we leverage recent advancements in Neural Architecture Search (NAS) [4], which primarily focuses on datasets that are balanced across classes. This raises a critical question: Is an architecture optimized on a balanced dataset also optimal for imbalanced datasets? When architecture search and model training are performed on imbalanced datasets, the model is prone to bias towards the head classes with massive samples. This bias results in significantly lower accuracy on the tail classes, which have only a few samples. For example, Duggal et al. [5] conducted experiments in which they trained identical architectures on datasets with varying distributions. Their findings revealed that the performance of the model varied substantially depending on the distribution of the data. Specifically, the architectures showed high accuracy on balanced datasets but struggled with imbalanced datasets. This demonstrates the critical need for effective strategies to mitigate class imbalance during both the search and training phases of neural network development.

Executing a NAS procedure for each target dataset demands significant computational resources and quickly becomes infeasible when dealing with multiple target datasets. To address this challenge, we study the scheme proposed in [5], which takes a more efficient approach: adapting architectural rankings from balanced datasets to imbalanced ones. This approach leverages the strength of NAS while minimizing computational costs. Specifically, it reuses a NAS super-network trained on balanced data and adapts it to imbalanced data by retraining only the linear classification head, so the additional training is limited to a single linear layer on top of the pre-trained super-network. Extensive experiments in IMB-NAS [5] reveal a key insight: the adaptation procedure is most influenced by the linear classification head trained on top of the backbone. This finding suggests that the backbone, once trained on a balanced dataset, can generalize well to imbalanced datasets with minimal additional training. Based on this insight, we implement this scheme on the imbalanced CIFAR datasets [6]: we reuse a NAS super-network backbone trained on balanced CIFAR-10 and retrain only the classification head to adapt efficiently to imbalanced CIFAR-100.

The remainder of the paper is organized as follows. Section 2 discusses related work, including neural architecture search, long-tailed data learning, and architecture transfer. In Section 3, we first introduce some preliminaries on DARTS [7] and then present a detailed description of the scheme in [5]. Section 4 presents the experimental setup and evaluation results, and Section 5 concludes the paper.

2 Related Works

2.1 Neural architecture search

NAS is a method for automating the design of neural network architectures, typically involving search space design, search strategy design, and performance estimation strategy. Search space design creates a diverse range of possible architectures, such as cell-based spaces like NASNets [8] and DARTS [7], or macro search spaces like those used in ShuffleNet [9] and MobileNet [10] models. Search strategy design focuses on efficiently identifying high-performing architectures within the search space. Common strategies include reinforcement learning [11, 8], where RL agents iteratively propose and evaluate architectures, receiving rewards based on their performance. Evolutionary algorithms [12, 13] apply principles such as mutation and selection to evolve a population of architectures, exploring a broad range of designs. Gradient-based methods, such as those used in DARTS [7], optimize architectures within a continuous relaxation of the search space, enabling more efficient searching compared to discrete methods. Performance estimation strategies aim to cheaply estimate the quality (e.g., accuracy or efficiency) of an architecture, using techniques like proxy tasks and weight sharing to reduce the computational cost of NAS [14, 15]. Proxy tasks involve training architectures on smaller or simplified versions of the target task to quickly evaluate their performance, while weight sharing trains a single super-network that contains all possible architectures within the search space, allowing for rapid evaluation without training each one from scratch. All of these approaches typically search for optimal architectures using fully balanced datasets. However, our experiments demonstrate that the set of optimal architectures can vary significantly between balanced and imbalanced datasets. This finding underscores the need for developing new NAS methods or efficient adaptation strategies to search for optimal architectures on real-world, imbalanced datasets.

2.2 Long-tailed data learning

Class imbalance, particularly the long-tail distribution, is a significant challenge in many real-world applications. Long-tailed data refers to datasets where a few classes (head classes) have a large number of samples, while many other classes (tail classes) have relatively few samples. Prior research on addressing long-tail imbalance can be broadly categorized into three primary approaches, as detailed in the survey of [16]. The first is class re-balancing, which aims to mitigate the effects of class imbalance by adjusting the training data or the loss function. It includes techniques such as data re-sampling [17, 18], loss re-weighting [1, 19, 3, 20], and logit adjustment [21, 22, 23]. In data re-sampling, minority classes are oversampled or majority classes are undersampled. Loss re-weighting assigns higher weights to tail classes in the loss function, and logit adjustment modifies the logits (outputs before the final activation function) in a way that accounts for class imbalance, thus helping the model to better distinguish between minority and majority classes. The second is information augmentation, which includes transfer learning [24, 25], which leverages pre-trained models from balanced datasets, and data augmentation [26], which generates additional samples for minority classes through techniques like GANs. The third is module improvement, which focuses on enhancing the model’s architecture and learning process. Module improvement encompasses techniques in representation learning [27], classifier design [28], decoupled training [1], and ensembling [2]. Distinct from these existing approaches, the work in [5] explores a novel direction for enhancing performance on long-tail datasets by optimizing the backbone architecture through neural architecture search. This new approach complements existing methods and can be used in conjunction with them to further improve accuracy and efficiency on imbalanced datasets.

2.3 Architecture transfer

Previous work evaluates the robustness of architectures to distributional shifts in training datasets. Neural Architecture Transfer [29] investigates the transferability of architectures from large-scale to small-scale fine-grained datasets. However, this approach has two main limitations: it only considers balanced source and target datasets, and it assumes that all target datasets are known in advance, which is not practical for many industrial applications.

NASTransfer [30] addresses transferability between large-scale datasets, including highly imbalanced ones such as ImageNet-22k. Their approach is practically useful for very large datasets (e.g., ImageNet-22k) for which direct search is prohibitive; however, when direct search is feasible (e.g., on ImageNet), it typically leads to better architectures than proxy search. Our work differs by focusing on directly adapting a super-network pre-trained on fully balanced datasets to imbalanced ones. This approach emphasizes efficiency by retraining only the linear classification head while keeping the backbone frozen. By doing so, we significantly reduce the computational effort required compared to performing a full search on the target dataset.

3 Methodology

In this work, we mainly focus on searching for a super-network on the balanced dataset and adapting only the linear classifier on the target dataset to solve the computer vision problem (e.g., image classification). We will introduce some technical details related to the scheme in [5], including the following three parts:

3.1 Preliminaries

Consider a training dataset $\mathcal{D}=\left\{x_{i},y_{i}\right\}$, where $x_{i}$ denotes an image and $y_{i}$ its corresponding label. Let $n_{j}$ be the number of training images in class $j\in\{1,\ldots,C\}$. Under the assumption of a long-tail distribution, after sorting classes by decreasing cardinality, we observe that $n_{i}\geq n_{j}$ for $i<j$, and $n_{1}\gg n_{C}$. We denote a deep neural network by $\phi$, which comprises a backbone $\phi\left(\alpha,w_{\alpha}\right)$ with architecture $\alpha$ and weights $w_{\alpha}$, and a linear classifier $\phi\left(w_{\theta}\right)$. The model $\phi$ is trained using a combination of a training loss and a loss re-weighting strategy.

For balanced datasets, the network is typically trained using the standard cross-entropy loss (CE). In contrast, for imbalanced datasets, a re-weighting strategy [19] is applied to mitigate the bias towards majority classes. Specifically, samples from class $j$ are re-weighted by a factor of $\frac{1-\gamma}{1-\gamma^{n_{j}}}$, where $\gamma$ is a hyperparameter controlling the degree of re-weighting. This technique helps to balance the influence of each class during training, thereby improving performance on minority classes.
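As a brief illustration of the re-weighting factor above, the following sketch computes $\frac{1-\gamma}{1-\gamma^{n_{j}}}$ for a few class sizes. The value $\gamma=0.9999$ and the class counts are assumptions chosen for illustration; they are not the settings reported in this paper.

```python
def class_balanced_weights(class_counts, gamma=0.9999):
    """Per-class factor (1 - gamma) / (1 - gamma ** n_j).

    gamma=0.9999 is an assumed default, not a value taken from the text.
    """
    return [(1.0 - gamma) / (1.0 - gamma ** n) for n in class_counts]


# Hypothetical head, medium and tail class sizes.
counts = [5000, 500, 50]
weights = class_balanced_weights(counts)

for n, w in zip(counts, weights):
    print(f"n_j = {n:5d}  ->  weight = {w:.6f}")
```

Classes with fewer samples receive larger factors, and a class with a single sample receives a factor of exactly 1, which is the upper limit of the scheme.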

The backbone $\phi\left(\alpha,w_{\alpha}\right)$ is responsible for feature extraction, transforming input images into high-level representations. These representations are then fed into the linear classifier $\phi\left(w_{\theta}\right)$, which maps them to the final output classes. The training process involves optimizing both $w_{\alpha}$ and $w_{\theta}$ to minimize the overall loss. To handle the class imbalance effectively, inspired by previous works [20, 31], the re-weighting strategy is often applied after a few initial epochs of standard training, known as delayed re-weighting (DRW). This approach allows the model to first learn general features before focusing on underrepresented classes. The combined loss function can be expressed as:

$$\mathcal{L}_{\text{total}}=\mathcal{L}_{\mathrm{CE}}+\lambda\mathcal{L}_{\mathrm{RW}}, \tag{1}$$

where $\mathcal{L}_{\mathrm{CE}}$ is the cross-entropy loss, $\mathcal{L}_{\mathrm{RW}}$ is the re-weighted loss, and $\lambda$ is a scaling factor.

3.2 Differentiable Architecture Search

Differentiable Architecture Search (DARTS) represents a significant advancement in the field of neural architecture search (NAS). Traditional NAS methods often rely on reinforcement learning or evolutionary algorithms, which are computationally expensive due to the need to train and evaluate a large number of candidate architectures. In contrast, DARTS introduces a differentiable approach that allows for the efficient optimization of neural network architectures using gradient-based methods.

As shown in Fig. 2, in DARTS, the architecture search space is parameterized by a set of continuous variables that represent the probabilities of choosing different operations (e.g., convolutions, pooling) at each layer of the network. These continuous variables are optimized jointly with the network weights using standard gradient descent techniques. By formulating the search process as a differentiable problem, DARTS can efficiently explore the architecture space and converge to an optimal architecture in a fraction of the time required by traditional methods.

The DARTS framework consists of two main phases: the search phase and the evaluation phase. During the search phase, a super-network that encompasses all possible architectures within the search space is trained using a mixture of operations weighted by the learned continuous variables. The super-network is represented as a directed acyclic graph (DAG), where each node corresponds to a latent representation (e.g., a feature map), and each edge represents a candidate operation. We train the super-network with backbone $\phi\left(\alpha,w_{\alpha}\right)$ and classifier $\phi\left(w_{\theta}\right)$ on a training dataset $\mathcal{D}$ via the following minimization:

$$w_{\alpha,\mathcal{D}}^{*},w_{\theta,\mathcal{D}}^{*}=\min_{w_{\alpha},w_{\theta}}\underset{\alpha\sim\mathcal{A}}{\mathbb{E}}\left(\mathcal{L}\left(\phi\left(w_{\theta}\right),\phi\left(\alpha,w_{\alpha}\right);\mathcal{D}\right)\right). \tag{2}$$

Here, $\mathcal{L}$ denotes the loss function and $\alpha\sim\mathcal{A}$ indicates sampling from the search space $\mathcal{A}$ via uniform or attentive sampling. The expectation $\mathbb{E}$ is computed over the sampled architectures $\alpha$, which are combined using the continuous variables.

The architecture of the neural network is represented as a weighted sum of candidate operations, where the weight of each operation is determined by the continuous variable $\alpha$ . For each edge $(i,j)$ in the DAG, the operation $o^{(i,j)}$ is a weighted sum of all possible operations:

$$o^{(i,j)}(x)=\sum_{o\in\mathcal{O}}\frac{\exp\left(\alpha_{o}^{(i,j)}\right)}{\sum_{o^{\prime}\in\mathcal{O}}\exp\left(\alpha_{o^{\prime}}^{(i,j)}\right)}o(x). \tag{3}$$

Here, $\mathcal{O}$ is the set of all candidate operations (e.g., convolutions, pooling), and $\alpha_{o}^{(i,j)}$ is the weight for operation $o$ on edge $(i,j)$ . The softmax function ensures that the weights sum to 1, making the operation selection differentiable.
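To make Eq. (3) concrete, the sketch below evaluates a mixed operation on a single edge. For readability the candidate operations act on a scalar and are hypothetical stand-ins (identity, zero, and a doubling map in place of a learned convolution); they are not the operation set used in DARTS or in this paper.

```python
import math


def softmax(alphas):
    """Numerically stable softmax over the architecture parameters of one edge."""
    m = max(alphas)
    exps = [math.exp(a - m) for a in alphas]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_op(x, alphas, ops):
    """Softmax-weighted sum of all candidate operations applied to x (Eq. 3)."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, ops))


# Hypothetical candidate set O, chosen only for illustration.
OPS = [
    lambda x: x,        # identity / skip connection
    lambda x: 0.0,      # "zero" operation
    lambda x: 2.0 * x,  # stand-in for a parametrised operation
]

uniform = mixed_op(3.0, [0.0, 0.0, 0.0], OPS)  # equal weights on all operations
peaked = mixed_op(3.0, [0.0, 0.0, 20.0], OPS)  # almost all weight on the last one
print(uniform, peaked)
```

With equal parameters the edge averages the three outputs; as one parameter grows, the mixed operation approaches that single candidate, which is what makes the later discretization step reasonable.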

After the search phase, the learned continuous variables $\alpha$ are used to derive the final discrete architecture. The most likely operations, as indicated by $\alpha$, are selected to form the optimal architecture. This derived architecture is then retrained from scratch to validate its performance. The objective during this phase is to find the architecture $\alpha^{*}$ that maximizes the validation accuracy as follows:

$$\alpha_{\mathcal{D}}^{*}=\max_{\alpha\in\mathcal{A}}\operatorname{Acc}\left(\phi\left(w_{\theta}\right),\phi\left(\alpha,w_{\alpha}\right);\mathcal{D}\right). \tag{4}$$

This maximization is commonly carried out using evolutionary algorithms or reinforcement learning techniques. In the following sections, we will explore efficient adaptation methods for modifying a NAS super-network, originally trained on a balanced dataset, to perform well on an imbalanced dataset.

3.3 Rank adaptation procedures

Given source and target datasets $\mathcal{D}_{s},\mathcal{D}_{t}$, the initial step involves training a super-network on $\mathcal{D}_{s}$ by solving the following optimization problem:

$$w_{\alpha,\mathcal{D}_{s}}^{*},w_{\theta,\mathcal{D}_{s}}^{*}=\min_{w_{\alpha},w_{\theta}}\underset{\alpha\sim\mathcal{A}}{\mathbb{E}}\left(\mathcal{L}\left(\phi\left(w_{\theta}\right),\phi\left(\alpha,w_{\alpha}\right);\mathcal{D}_{s}\right)\right). \tag{5}$$

The primary objective is to adapt the optimal super-network weights $w_{\alpha,\mathcal{D}_{s}}^{*},w_{\theta,\mathcal{D}_{s}}^{*}$ obtained from the source dataset $\mathcal{D}_{s}$ to the target dataset $\mathcal{D}_{t}$, which is characterized by class imbalance. The most efficient strategy for this adaptation involves freezing the backbone of the network and adapting only the linear classifier on $\mathcal{D}_{t}$ by minimizing the re-weighted loss function $\mathcal{L}_{RW}$:

$$w_{\theta,\mathcal{D}_{t}}^{*}=\min_{w_{\theta}}\underset{\alpha\sim\mathcal{A}}{\mathbb{E}}\left(\mathcal{L}_{RW}\left(\phi\left(w_{\theta}\right),\phi\left(\alpha,w_{\alpha,\mathcal{D}_{s}}^{*}\right);\mathcal{D}_{t}\right)\right). \tag{6}$$

Here, $\mathcal{L}_{RW}$ is the re-weighted loss function tailored to handle class imbalance. This method significantly reduces computational costs as only the classifier is retrained, while the backbone remains unchanged. The super-network resulting from this procedure contains backbone weights $w_{\alpha,\mathcal{D}_{s}}^{*}$ trained on $\mathcal{D}_{s}$ and classifier weights $w_{\theta,\mathcal{D}_{t}}^{*}$ trained on $\mathcal{D}_{t}$. A second approach fine-tunes both the backbone and the classifier on the target dataset. This procedure is more computationally intensive but allows for better adaptation to the new data distribution. It can be carried out by minimizing the delayed re-weighted loss $\mathcal{L}_{DRW}$:

$$w_{\alpha,\mathcal{D}_{t}}^{**},w_{\theta,\mathcal{D}_{t}}^{*}=\min_{w_{\alpha},w_{\theta}}\underset{\alpha\sim\mathcal{A}}{\mathbb{E}}\left(\mathcal{L}_{DRW}\left(\phi\left(w_{\theta}\right),\phi\left(\alpha,w_{\alpha,\mathcal{D}_{s}}^{*}\right);\mathcal{D}_{t}\right)\right). \tag{7}$$

In this context, the double star on $w_{\alpha,\mathcal{D}_{t}}^{**}$ indicates the weights were obtained via fine-tuning $w_{\alpha,\mathcal{D}_{s}}^{*}$ using a reduced learning rate and fewer training epochs. It’s important to note that the delayed re-weighted loss $\mathcal{L}_{DRW}$ starts as the unweighted loss $\mathcal{L}$ during the initial epochs and transitions to the re-weighted loss $\mathcal{L}_{RW}$ in later epochs. Although this second adaptation procedure is more computationally demanding because it also involves updating the backbone, it is still much less intensive than performing a full search on the target dataset.
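The delayed re-weighting schedule described above can be sketched as a function that returns the per-class loss weights for a given epoch. The function name, the value $\gamma=0.9999$, and the convention that re-weighting starts exactly at the switch epoch are assumptions made for this illustration.

```python
def drw_class_weights(epoch, switch_epoch, class_counts, gamma=0.9999):
    """Per-class loss weights under delayed re-weighting (DRW).

    Before `switch_epoch` every class has weight 1 (plain cross-entropy);
    from `switch_epoch` onward the class-balanced factor
    (1 - gamma) / (1 - gamma ** n_j) is used. gamma is an assumed value.
    """
    if epoch < switch_epoch:
        return [1.0] * len(class_counts)
    return [(1.0 - gamma) / (1.0 - gamma ** n) for n in class_counts]


# Hypothetical two-class example with a head class and a tail class.
counts = [5000, 50]
early = drw_class_weights(epoch=99, switch_epoch=100, class_counts=counts)
late = drw_class_weights(epoch=100, switch_epoch=100, class_counts=counts)
print(early, late)
```

Setting `switch_epoch` to the first epoch recovers the plain re-weighted loss $\mathcal{L}_{RW}$, while a later switch gives $\mathcal{L}_{DRW}$.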

The third and most computationally demanding method involves conducting a direct search on the target dataset using the delayed re-weighted loss $\mathcal{L}_{DRW}$. This approach aims to identify the optimal architecture and weights from scratch, using the following optimization:

$$w_{\alpha,\mathcal{D}_{t}}^{*},w_{\theta,\mathcal{D}_{t}}^{*}=\min_{w_{\alpha},w_{\theta}}\underset{\alpha\sim\mathcal{A}}{\mathbb{E}}\left(\mathcal{L}_{DRW}\left(\phi\left(w_{\theta}\right),\phi\left(\alpha,w_{\alpha}\right);\mathcal{D}_{t}\right)\right). \tag{8}$$

Although this method incurs high computational costs, it potentially yields the most tailored architecture for the imbalanced dataset. The no-adaptation baseline and the three adaptation procedures are summarized in Table I.

TABLE I: Summary of rank adaptation procedures.

Symbol	Eqn	Details
P0	(5)	No adaptation.
P1	(6)	Freeze backbone, retrain classifier on $\mathcal{D}_{t}$.
P2	(7)	Fine-tune backbone and retrain classifier on $\mathcal{D}_{t}$.
P3	(8)	Re-train backbone and classifier on $\mathcal{D}_{t}$.
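Procedure P1 can be illustrated with a small NumPy sketch: features come from a fixed function that is never updated, and only a linear softmax head is fitted with a class-weighted cross-entropy. The random-projection "backbone", the synthetic two-class data, $\gamma=0.9999$, the weight normalization, and the optimizer settings are all assumptions for illustration, not the super-network or training recipe used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the frozen backbone: a fixed random projection
# followed by tanh. It is created once and never updated below.
BACKBONE = rng.normal(size=(2, 8))

def extract_features(x):
    return np.tanh(x @ BACKBONE)

def class_balanced_weights(counts, gamma=0.9999):
    counts = np.asarray(counts, dtype=float)
    w = (1.0 - gamma) / (1.0 - gamma ** counts)
    return w / w.sum() * len(counts)  # normalisation is an assumption

def weighted_ce(feats, labels, W, b, class_w):
    """Class-weighted cross-entropy of a linear head on fixed features."""
    logits = feats @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    s = class_w[labels]
    return float(-(s * log_probs[np.arange(len(labels)), labels]).sum() / s.sum())

def retrain_head(feats, labels, num_classes, class_w, lr=0.1, steps=300):
    """Gradient descent on the head parameters (W, b) only."""
    W = np.zeros((feats.shape[1], num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    s = class_w[labels][:, None]
    for _ in range(steps):
        logits = feats @ W + b
        logits = logits - logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = s * (probs - onehot) / s.sum()  # d(loss) / d(logits)
        W -= lr * feats.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

# Toy imbalanced target set: 500 head samples versus 20 tail samples.
counts = [500, 20]
x = np.vstack([rng.normal(-1.0, 1.0, size=(counts[0], 2)),
               rng.normal(+1.0, 1.0, size=(counts[1], 2))])
y = np.array([0] * counts[0] + [1] * counts[1])

backbone_before = BACKBONE.copy()
feats = extract_features(x)
cw = class_balanced_weights(counts)
W, b = retrain_head(feats, y, num_classes=2, class_w=cw)
final_loss = weighted_ce(feats, y, W, b, cw)
print(f"weighted CE after retraining the head: {final_loss:.4f}")
```

Because the features are computed once and reused, the cost of this procedure is dominated by fitting a single linear layer, which is the efficiency argument made for P1.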

4 Experiment Evaluation

4.1 Implementation details

Datasets. To simulate real-world class imbalance scenarios, we constructed imbalanced versions of the CIFAR-10 and CIFAR-100 datasets by sub-sampling from their original training splits. Concretely, CIFAR-10 contains 60,000 color images of size 32 $\times$ 32 $\times$ 3, classified into 10 distinct classes and divided into 50,000 training images and 10,000 test images. CIFAR-100 has the same number of images as CIFAR-10 but consists of 100 classes, which makes it more challenging to train models for classification. For the resulting CIFAR-10-LT and CIFAR-100-LT datasets, we explored three types of imbalance following [31]:

• Balance: Every class keeps all of its samples from the original training split, i.e., 5,000 per class for CIFAR-10.

• Step: The latter half of the classes is sub-sampled by a given factor, so that each of these classes has $5000\times\text{factor}$ samples.

• Exponential: For each class $i\in\{0,1,\ldots,9\}$, the number of samples is set to $5000\times\text{factor}^{i/9}$.
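The three imbalance types listed above can be reproduced with a few lines of Python. This is a sketch for CIFAR-10-sized classes; rounding to the nearest integer and the generalization of the exponent $i/9$ to $i/(C-1)$ for $C$ classes are assumptions made here.

```python
def class_counts(num_classes, n_max, factor, kind):
    """Number of training samples kept per class for each imbalance type."""
    if kind == "balance":
        return [n_max] * num_classes
    if kind == "step":
        half = num_classes // 2
        # First half untouched, latter half sub-sampled by `factor`.
        return [n_max] * half + [round(n_max * factor)] * (num_classes - half)
    if kind == "exp":
        # Class i keeps n_max * factor ** (i / (C - 1)) samples.
        return [round(n_max * factor ** (i / (num_classes - 1)))
                for i in range(num_classes)]
    raise ValueError(f"unknown imbalance type: {kind}")


for kind in ("balance", "step", "exp"):
    print(kind, class_counts(10, 5000, 0.01, kind))
```

For a factor of 0.01 the largest class keeps 5,000 samples and the smallest keeps 50 under both the step and the exponential profile; the two differ only in how the intermediate classes are sized.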

Specifically, our experiments involve four different label distributions. Taking CIFAR-10 as an example, its sample distribution is illustrated in Fig. 3. Figs. 3(b)-3(c) show that the smaller the imbalance factor $\mu$, the more pronounced the long-tail characteristics of the label distribution.

Sub-network and super-network training details. Following IMB-NAS, we trained a network on balanced CIFAR-10 and CIFAR-100 datasets for 200 epochs. The training process began with an initial learning rate of 0.1, which was decayed by a factor of 0.01 at epochs 160 and 180. The cross-entropy loss function was used to optimize the model. For the imbalanced versions of these datasets, an effective re-weighting strategy inspired by [19] is implemented.

We trained a super-network for 500 epochs, starting with an initial learning rate of 0.1. The learning rate was reduced by a factor of 0.01 at epochs 300 and 400. For imbalanced datasets, re-weighting was applied at epoch 350 to address class imbalance. To identify the best sub-network, we employed the evolutionary search strategy described in [4]. This involved 20 generations with a population of 50, a crossover number of 25, a mutation number of 25, a mutation probability of 0.1, and a top-k of 10 for the final architecture selection.
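The search loop with the hyperparameters listed above can be sketched as follows. Only the numeric settings (20 generations, population 50, 25 crossovers, 25 mutations, mutation probability 0.1, top-k 10) come from the text; the architecture encoding, the toy fitness function, and the exact mutation and crossover operators are hypothetical and are not claimed to match the procedure of [4].

```python
import random

NUM_LAYERS = 8    # hypothetical number of searchable layers
NUM_CHOICES = 4   # hypothetical number of candidate operations per layer


def toy_fitness(arch):
    """Hypothetical score in [0, 1]. A real run would instead return the
    validation accuracy of the sub-network with super-network weights."""
    return sum(op == i % NUM_CHOICES for i, op in enumerate(arch)) / NUM_LAYERS


def mutate(arch, prob, rng):
    """Resample each layer's operation independently with probability `prob`."""
    return tuple(rng.randrange(NUM_CHOICES) if rng.random() < prob else op
                 for op in arch)


def crossover(a, b, rng):
    """Pick each layer's operation from one of the two parents at random."""
    return tuple(rng.choice(pair) for pair in zip(a, b))


def evolutionary_search(fitness, generations=20, population=50, n_crossover=25,
                        n_mutation=25, mutation_prob=0.1, top_k=10, seed=0):
    rng = random.Random(seed)
    candidates = [tuple(rng.randrange(NUM_CHOICES) for _ in range(NUM_LAYERS))
                  for _ in range(population)]
    top = []
    for _ in range(generations):
        # Keep the top-k architectures seen so far (elitist selection).
        top = sorted(set(top) | set(candidates), key=fitness, reverse=True)[:top_k]
        mutants = [mutate(rng.choice(top), mutation_prob, rng)
                   for _ in range(n_mutation)]
        crosses = [crossover(rng.choice(top), rng.choice(top), rng)
                   for _ in range(n_crossover)]
        candidates = mutants + crosses
    # Include the last generation of children in the final ranking.
    return sorted(set(top) | set(candidates), key=fitness, reverse=True)[:top_k]


best = evolutionary_search(toy_fitness)
print(best[0], toy_fitness(best[0]))
```

Since each candidate is scored with weights inherited from the super-network, the cost of the search is dominated by inference on the validation set rather than by training.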

Adaptation Strategies. To adapt a super-network from a balanced to an imbalanced dataset, as described in [5], we fine-tuned the network for 200 epochs with an initial learning rate of 0.01, which was decayed by a factor of 0.01 at epoch 100. For P1, we introduced re-weighting at epoch 1. For P2, we delayed the re-weighting to epoch 100. For P3, we followed the NAS strategy detailed above.

Experiment Setup. We performed our experiments using an AMAX deep learning workstation. This setup is equipped with an Intel(R) Xeon(R) Gold 5218R CPU, 8 NVIDIA GeForce RTX 3090 GPUs, and 256 GB RAM. All model training and evaluation were implemented using the PyTorch framework [32], ensuring a robust and scalable environment for our extensive simulations.

4.2 IMB-NAS Evaluation

Given a NAS super-network trained on a source dataset $\mathcal{D}_{s}$, the goal is to adapt it efficiently to a target dataset $\mathcal{D}_{t}$. Table II presents the results for scenarios where $\mathcal{D}_{s}$ is CIFAR-10 and $\mathcal{D}_{t}$ is CIFAR-100 with varying levels of imbalance. The baseline (P0) retrains on $\mathcal{D}_{t}$ the best sub-networks found on $\mathcal{D}_{s}$. The paragon (P3) entails a direct search and training on $\mathcal{D}_{t}$ and is intended as an upper reference for accuracy.

The adaptation procedures (P1 and P2) are displayed in the middle rows. Both adaptation methods surpass the baseline at higher imbalance levels, indicating that architectures optimized on $\mathcal{D}_{s}$ do not transfer well to imbalanced $\mathcal{D}_{t}$. Consistent with IMB-NAS, we find that P1 outperforms P2 in every setting. This result is unexpected since P2 involves adapting the NAS backbone with the target data, whereas P1 retains the backbone from the source dataset. This suggests that class imbalance poses a greater challenge for NAS backbone optimization than the domain differences between CIFAR-10 and CIFAR-100. We also find that P1 and P2 achieve accuracy close to that of P3, with P1 marginally above P3 at factor 0.01, while avoiding much of the computational cost associated with P3. This efficiency is crucial for practical applications where computational resources are limited. These findings underscore the importance of selective adaptation strategies in handling imbalanced datasets, highlighting the necessity for tailored approaches in NAS applications.

TABLE II: CIFAR-10 (balanced) $\longrightarrow$ CIFAR-100.

		Imbalance ratio (factor)
	Symbol	0 (Balance)	0.1 (Exp)	0.01 (Exp)	0.01 (Step)
baseline	P0	52.83	48.43	41.18	38.43
	P1	52.76	49.39	42.37	39.58
	P2	52.71	49.05	42.02	39.21
paragon	P3	52.86	49.54	42.32	39.49

5 Conclusions

This work aims to improve performance on class-imbalanced datasets by optimizing the backbone architecture. We begin by reviewing related works on NAS and various techniques for handling long-tailed datasets, providing a thorough understanding of the current landscape. Our investigation revealed that an architecture’s performance on balanced datasets does not consistently predict its effectiveness on imbalanced datasets. This insight implies that re-running NAS for each target dataset might be necessary. To avoid the substantial computational cost of re-running NAS, we explored an existing approach called IMB-NAS. This method proposes adapting a NAS super-network trained on balanced datasets to imbalanced ones. IMB-NAS introduces several adaptation methods and finds that re-training the linear classification head while keeping the NAS super-network backbone frozen outperforms other adaptation strategies. To further understand IMB-NAS, we conducted a series of experiments on the long-tailed dataset to evaluate its performance. Our experimental results generally aligned with our expectations, confirming the effectiveness of IMB-NAS.

References

[1] B. Kang, S. Xie, M. Rohrbach, Z. Yan, A. Gordo, J. Feng, and Y. Kalantidis, “Decoupling representation and classifier for long-tailed recognition,” arXiv preprint arXiv:1910.09217, 2019.
[2] B. Zhou, Q. Cui, X.-S. Wei, and Z.-M. Chen, “Bbn: Bilateral-branch network with cumulative learning for long-tailed visual recognition,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 9719–9728.
[3] R. Duggal, S. Freitas, S. Dhamnani, D. H. Chau, and J. Sun, “Har: Hardness aware reweighting for imbalanced datasets,” in 2021 IEEE International Conference on Big Data (Big Data). IEEE, 2021, pp. 735–745.
[4] Z. Guo, X. Zhang, H. Mu, W. Heng, Z. Liu, Y. Wei, and J. Sun, “Single path one-shot neural architecture search with uniform sampling,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI 16. Springer, 2020, pp. 544–560.
[5] R. Duggal, S. Peng, H. Zhou, and D. H. Chau, “Imb-nas: Neural architecture search for imbalanced datasets,” arXiv preprint arXiv:2210.00136, 2022.
[6] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” 2009.
[7] H. Liu, K. Simonyan, and Y. Yang, “Darts: Differentiable architecture search,” arXiv preprint arXiv:1806.09055, 2018.
[8] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferable architectures for scalable image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 8697–8710.
[9] X. Zhang, X. Zhou, M. Lin, and J. Sun, “Shufflenet: An extremely efficient convolutional neural network for mobile devices,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6848–6856.
[10] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
[11] B. Baker, O. Gupta, N. Naik, and R. Raskar, “Designing neural network architectures using reinforcement learning,” arXiv preprint arXiv:1611.02167, 2016.
[12] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le, and A. Kurakin, “Large-scale evolution of image classifiers,” in International conference on machine learning. PMLR, 2017, pp. 2902–2911.
[13] R. Duggal, H. Zhou, S. Yang, Y. Xiong, W. Xia, Z. Tu, and S. Soatto, “Compatibility-aware heterogeneous visual search,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 10723–10732.
[14] B. Baker, O. Gupta, R. Raskar, and N. Naik, “Accelerating neural architecture search using performance prediction,” arXiv preprint arXiv:1705.10823, 2017.
[15] S. Falkner, A. Klein, and F. Hutter, “Bohb: Robust and efficient hyperparameter optimization at scale,” in International conference on machine learning. PMLR, 2018, pp. 1437–1446.
[16] Y. Zhang, B. Kang, B. Hooi, S. Yan, and J. Feng, “Deep long-tailed learning: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[17] H. He, Y. Bai, E. A. Garcia, and S. Li, “Adasyn: Adaptive synthetic sampling approach for imbalanced learning,” in 2008 IEEE international joint conference on neural networks (IEEE world congress on computational intelligence). IEEE, 2008, pp. 1322–1328.
[18] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “Smote: synthetic minority over-sampling technique,” Journal of artificial intelligence research, vol. 16, pp. 321–357, 2002.
[19] Y. Cui, M. Jia, T.-Y. Lin, Y. Song, and S. Belongie, “Class-balanced loss based on effective number of samples,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 9268–9277.
[20] R. Duggal, S. Freitas, S. Dhamnani, D. H. Chau, and J. Sun, “Elf: An early-exiting framework for long-tailed classification,” arXiv preprint arXiv:2006.11979, 2020.
[21] A. K. Menon, S. Jayasumana, A. S. Rawat, H. Jain, A. Veit, and S. Kumar, “Long-tail learning via logit adjustment,” arXiv preprint arXiv:2007.07314, 2020.
[22] J. Tian, Y.-C. Liu, N. Glaser, Y.-C. Hsu, and Z. Kira, “Posterior re-calibration for imbalanced datasets,” Advances in neural information processing systems, vol. 33, pp. 8101–8113, 2020.
[23] S. Zhang, Z. Li, S. Yan, X. He, and J. Sun, “Distribution alignment: A unified framework for long-tail visual recognition,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 2361–2370.
[24] Y.-X. Wang, D. Ramanan, and M. Hebert, “Learning to model the tail,” Advances in neural information processing systems, vol. 30, 2017.
[25] X. Yin, X. Yu, K. Sohn, X. Liu, and M. Chandraker, “Feature transfer learning for face recognition with under-represented data,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 5704–5713.
[26] P. Chu, X. Bian, S. Liu, and H. Ling, “Feature space augmentation for long-tailed data,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIX 16. Springer, 2020, pp. 694–710.
[27] Z. Liu, Z. Miao, X. Zhan, J. Wang, B. Gong, and S. X. Yu, “Large-scale long-tailed recognition in an open world,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 2537–2546.
[28] T.-Y. Wu, P. Morgado, P. Wang, C.-H. Ho, and N. Vasconcelos, “Solving long-tailed recognition with deep realistic taxonomic classifier,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VIII 16. Springer, 2020, pp. 171–189.
[29] Z. Lu, G. Sreekumar, E. Goodman, W. Banzhaf, K. Deb, and V. N. Boddeti, “Neural architecture transfer,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 9, pp. 2971–2989, 2021.
[30] R. Panda, M. Merler, M. S. Jaiswal, H. Wu, K. Ramakrishnan, U. Finkler, C.-F. R. Chen, M. Cho, R. Feris, D. Kung et al., “Nastransfer: Analyzing architecture transferability in large scale neural architecture search,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 10, 2021, pp. 9294–9302.
[31] K. Cao, C. Wei, A. Gaidon, N. Arechiga, and T. Ma, “Learning imbalanced datasets with label-distribution-aware margin loss,” Advances in neural information processing systems, vol. 32, 2019.
[32] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” Advances in neural information processing systems, vol. 32, 2019.