License: arXiv.org perpetual non-exclusive license
arXiv:2312.09598v2 [cs.CV] 24 Dec 2023

CLAF: Contrastive Learning with Augmented Features
for Imbalanced Semi-Supervised Learning

Abstract

Due to the advantages of leveraging unlabeled data and learning meaningful representations, semi-supervised learning and contrastive learning have been progressively combined to achieve better performances in popular applications with few labeled data and abundant unlabeled data. One common manner is assigning pseudo-labels to unlabeled samples and selecting positive and negative samples from pseudo-labeled samples to apply contrastive learning. However, the real-world data may be imbalanced, causing pseudo-labels to be biased toward the majority classes and further undermining the effectiveness of contrastive learning. To address the challenge, we propose Contrastive Learning with Augmented Features (CLAF). We design a class-dependent feature augmentation module to alleviate the scarcity of minority class samples in contrastive learning. For each pseudo-labeled sample, we select positive and negative samples from labeled data instead of unlabeled data to compute contrastive loss. Comprehensive experiments on imbalanced image classification datasets demonstrate the effectiveness of CLAF in the context of imbalanced semi-supervised learning.

Index Terms—  imbalance, semi-supervised learning, contrastive learning, feature augmentation

1 Introduction

Semi-supervised learning (SSL) has attracted much attention in recent years, owing to its potential to mitigate the demand for labeled data by leveraging unlabeled data. The primary challenge in SSL lies in learning valuable information from a large amount of unlabeled data. Representation learning empowers the capture of rich insights from labeled data, thereby reducing the difficulty of utilizing unlabeled data. Contrastive learning is an effective way to learn strong visual representations in an unsupervised manner and has been extended to supervised learning [1], making it a promising approach for integration into SSL. A general pipeline of incorporating contrastive learning into SSL involves producing pseudo-labels for unlabeled data and utilizing them in a manner of pseudo-label-based contrastive learning (PCL). For a pseudo-labeled sample, PCL selects unlabeled samples sharing the same pseudo-label as positive samples and regards unlabeled samples with different pseudo-labels as negative samples. The central idea of PCL is to bring positive samples closer while pushing negative samples further apart. Through the integration of PCL, most of the existing SSL algorithms have achieved exceptional performance [2, 3, 4].

Fig. 1: Illustration of PCL (left) and CLAF (right) in imbalanced SSL. Compared with PCL, CLAF adopts class-dependent feature augmentation and selects positive and negative samples from both labeled samples and augmented samples to reduce the pushing and pulling effects of samples with biased pseudo-labels on other pseudo-labeled samples.

Although contrastive learning has demonstrated its efficacy in learning strong representations under SSL, these algorithms often assume class-balanced data, while many real-world data exhibit imbalanced distributions. Contrastive learning faces the risk of biased pseudo-labels and scarcity of minority class samples under imbalanced SSL. With class-imbalanced data, the class distribution of pseudo-labels from unlabeled data tends to skew towards the majority classes due to confirmation bias [5]. Many pseudo-labels of majority classes are assigned to unlabeled samples that may not genuinely belong to those classes. Methods incorporating PCL tend to cluster instances with the same pseudo-labels from a specific majority class, potentially contradicting the actual relationships among unlabeled data. Additionally, the scarcity of minority class samples results in relatively poor representations of minority classes. These problems significantly constrain the representation learning capacity of contrastive learning in imbalanced SSL. In essence, the imbalanced data distribution leads to inaccurate pseudo-labels, subsequently undermining the precision of positive and negative samples.

In this paper, we propose a method called Contrastive Learning with Augmented Features (CLAF) devised to tackle the aforementioned challenges. First, we design a class-dependent feature augmentation module to alleviate the scarcity of labeled data in minority classes. Second, in contrast to conventional PCL that exclusively selects sample pairs from unlabeled data, CLAF selects both positive and negative samples from labeled data for each pseudo-labeled sample to reduce the influence of biased pseudo-labels as shown in Fig. 1.

2 Related Works

Semi-supervised learning (SSL): SSL learns from labeled data in conjunction with a large number of unlabeled data. Pseudo-labeling is a widely used SSL method, which uses the model’s predictions to label data and retrains the model with the artificial labels [5]. FixMatch [6] integrates consistency regularization and pseudo-labeling to align the predictions between weakly and strongly augmented unlabeled images.

Contrastive learning under SSL: Most previous contrastive-based SSL works are two-stage. SelfMatch [7] adopts contrastive learning to pre-train a backbone and then fine-tunes it based on augmentation consistency regularization. Existing SSL methods that build upon FixMatch mostly utilize pseudo-labels for contrastive learning [3]. To make use of the features learned by different loss functions and class-specific priors, SsCL [4] adopts the pseudo-labeling strategy with cross-entropy loss and instance discrimination with contrastive loss, jointly optimizing the two losses with a shared backbone in an end-to-end way. To address the confirmation bias due to the noise contained in pseudo-labels, CCSSL [2] introduces a class-aware contrastive module and focuses learning on unlabeled samples with pseudo-labels.

3 Preliminary

3.1 Problem Setup

For a $K$-class semi-supervised image classification task, we are given labeled data $\mathcal{X}=\{(x_n,y_n)\}_{n=1}^{N}$ and unlabeled data $\mathcal{U}=\{u_m\}_{m=1}^{M}$ to train a model $f$ comprising a feature encoder $f^{\text{enc}}_{\theta}$ followed by a linear classifier $f^{\text{cls}}_{\phi}$, where $\theta$ and $\phi$ are the parameters of $f^{\text{enc}}_{\theta}$ and $f^{\text{cls}}_{\phi}$, respectively. For labeled data, the prediction $f(x)$ of an image $x$ is learned from $\mathcal{L}_{\text{cls}}$ (e.g., cross-entropy) and its label $y$.
For unlabeled data, a pseudo-label $\hat{p}\in\mathbb{R}^{K}$ is utilized in the unsupervised loss $\mathcal{L}_u=\Phi_u\left(\hat{p},f(u)\right)$, where $\Phi_u$ can be implemented via entropy [8] or consistency regularization [9], depending on the SSL method adopted.

Taking FixMatch [6] as an example, the pseudo-label $\hat{p}=\text{OneHot}\left(\mathop{\mathrm{argmax}}_k p^{(w)}_k\right)$ with $p^{(w)}=f\left(\mathcal{A}_w(u)\right)$ provides the target for the prediction $p^{(s)}=f\left(\mathcal{A}_s(u)\right)$, and only confident pseudo-labels contribute to the cross-entropy loss $\mathcal{H}$ as follows:

$$\Phi_u(\hat{p},\,p^{(s)})=\mathbbm{1}\left(\max_k p^{(w)}_k\geq\tau\right)\,\mathcal{H}\left(\hat{p},\,p^{(s)}\right), \tag{1}$$

where $\mathcal{A}_w$ and $\mathcal{A}_s$ correspond to weak augmentation and strong augmentation [10], respectively.
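
The masked cross-entropy in Eq. (1) can be illustrated with a minimal sketch for a single unlabeled sample. It assumes that $f$ returns softmax probabilities over raw logits; the function names, the example inputs, and the default threshold `tau=0.95` are illustrative choices, not values stated in this paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fixmatch_unsup_loss(weak_logits, strong_logits, tau=0.95):
    """Eq. (1) for one unlabeled sample: confidence-masked cross-entropy.

    weak_logits / strong_logits are hypothetical classifier outputs for the
    weakly and strongly augmented views of the same image.
    """
    p_w = softmax(weak_logits)
    p_s = softmax(strong_logits)
    k_hat = max(range(len(p_w)), key=p_w.__getitem__)  # argmax_k p^(w)_k
    if p_w[k_hat] < tau:
        # The indicator 1(max_k p^(w)_k >= tau) is zero: no loss contribution.
        return 0.0
    # H(one-hot(k_hat), p^(s)) reduces to -log p^(s)_{k_hat}.
    return -math.log(p_s[k_hat])
```

A confident weak view yields a positive loss on the strong view, while an unconfident one is masked out and contributes zero.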

3.2 DASO

DASO [11] is a comprehensive framework for imbalanced SSL incorporating distribution-aware blending for both linear and semantic pseudo-labels. The linear and semantic pseudo-labels, $\hat{p}$ and $\hat{q}$, are generated by passing $z^{(w)}=f^{\text{enc}}_{\theta}(\mathcal{A}_w(u))$ through a linear and a similarity-based classifier, respectively. Subsequently, the final pseudo-label $\hat{p}^{\prime}$ is derived through the fusion of $\hat{p}$ and $\hat{q}$ and serves as the target in $\mathcal{L}_u=\Phi_u(\hat{p}^{\prime},\,p)$.

The linear pseudo-label $\hat{p}$ is obtained by applying the softmax function to the output of the linear classifier: $\hat{p}=\sigma(f^{\text{cls}}_{\phi}(z^{(w)}))$. The semantic pseudo-label $\hat{q}$ is derived from a similarity-based classifier. Specifically, DASO constructs a set of class prototypes $\mathbf{C}=\{c_k\}_{k=1}^{K}$ from $\mathcal{X}$ and a queue $\mathbf{Q}=\{Q_k\}_{k=1}^{K}$, where $Q_k$ denotes a feature queue for class $k$ with a fixed size $|Q_k|$. The class prototype $c_k$ for each class $k$ can be obtained simply by averaging the feature points in the feature queue $Q_k$. DASO measures the per-class similarity between a feature point and class prototypes:

$$q=\sigma(\text{sim}(z^{(w)},\mathbf{C})/T_{\text{proto}}), \tag{2}$$

where $\text{sim}(\cdot,\cdot)$ represents cosine similarity and $T_{\text{proto}}$ is a temperature hyper-parameter. To prevent an imbalanced prototype representation arising from class-imbalanced labeled data, DASO fixes the size of $Q_k$ for all classes to the same amount, which can compensate for the prototypes of the minority classes with earlier samples remaining in the queue. To stabilize the movement of class prototypes in feature space during training, DASO employs a momentum encoder $f^{\text{enc}}_{\theta^{\prime}}$ with the same architecture as $f^{\text{enc}}_{\theta}$, where $\theta^{\prime}$ is the exponential moving average (EMA) of $\theta$ with momentum ratio $\rho$: $\theta^{\prime}\leftarrow\rho\theta^{\prime}+(1-\rho)\theta$.
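
The prototype construction, the similarity-based classifier of Eq. (2), and the EMA update can be sketched as follows. This is an illustrative sketch rather than DASO's implementation: features are plain Python lists, and the defaults `t_proto=0.05` and `rho=0.999` are assumed values chosen only for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def class_prototypes(queues):
    """c_k: average of the features stored in each per-class queue Q_k."""
    protos = []
    for feats in queues:
        dim = len(feats[0])
        protos.append([sum(f[d] for f in feats) / len(feats) for d in range(dim)])
    return protos


def semantic_pseudo_label(z_w, protos, t_proto=0.05):
    """Eq. (2): softmax over temperature-scaled cosine similarities."""
    sims = [cosine(z_w, c) / t_proto for c in protos]
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


def ema_update(theta_ema, theta, rho=0.999):
    """theta' <- rho * theta' + (1 - rho) * theta, element-wise."""
    return [rho * te + (1.0 - rho) * t for te, t in zip(theta_ema, theta)]
```

A feature that lies close to the prototype of class 0 receives most of the probability mass for class 0, and a smaller `t_proto` sharpens that distribution.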

4 Method

4.1 Class-dependent Feature Augmentation

DASO introduces a balanced queue to ensure equilibrium between minority and majority class samples. Notably, a significant portion of minority class features in the queue is generated from the same labeled data. To enhance data diversity and alleviate the scarcity of labeled data in minority classes, we employ feature augmentation (FA) within a batch, inspired by [12, 13, 14]. FA increases the count of labeled features for minority classes by blending unlabeled data features with labeled data features while preserving the label of the original labeled sample. The augmented feature is generated as:

$$z_{\text{aug}}=\lambda z^{\prime(l)}+(1-\lambda)z^{\prime(w)}, \tag{3}$$

where $z^{\prime(l)}=f^{\text{enc}}_{\theta^{\prime}}(x)$ and $z^{\prime(w)}=f^{\text{enc}}_{\theta^{\prime}}(\mathcal{A}_w(u))$. $\lambda$ is the mixture coefficient sampled from a Beta distribution denoted as $\text{Beta}(\alpha,\alpha)$. To ensure the validity of the label for the augmented feature, we constrain $\lambda$ to be at least $\mu$: $\lambda=\max(\lambda,1-\lambda,\mu)$. FA is applied with a probability that depends on the count of labeled data for each class. Consequently, the more labeled data a class has, the fewer augmented features are synthesized. Formally, given a labeled sample from class $k$, we apply FA with probability $P_k$ defined as:

$$P_k=\frac{N_1-N_k}{N_1}, \tag{4}$$

where $N_k$ is the number of samples of class $k$ and $N_1$ is the number of samples of the class with the most labeled data. The class-dependent probability encourages more augmented features for minority classes.
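
A minimal sketch of Eqs. (3) and (4) is given below. The helper names and the default values `alpha=1.0` and `mu=0.8` are hypothetical and chosen only for illustration; features are plain lists, and `class_counts[k]` plays the role of $N_k$.

```python
import random


def fa_probability(class_counts, k):
    """Eq. (4): P_k = (N_1 - N_k) / N_1, with N_1 the largest class count."""
    n1 = max(class_counts)
    return (n1 - class_counts[k]) / n1


def sample_lambda(alpha, mu, rng):
    """Draw lambda ~ Beta(alpha, alpha), then apply max(lambda, 1 - lambda, mu)."""
    lam = rng.betavariate(alpha, alpha)
    return max(lam, 1.0 - lam, mu)


def augment_feature(z_l, z_w, lam):
    """Eq. (3): z_aug = lam * z_l + (1 - lam) * z_w; the label of z_l is kept."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(z_l, z_w)]


def maybe_augment(z_l, label, z_w, class_counts, alpha=1.0, mu=0.8, rng=random):
    """Return (z_aug, lam) with probability P_label, otherwise None."""
    if rng.random() >= fa_probability(class_counts, label):
        return None
    lam = sample_lambda(alpha, mu, rng)
    return augment_feature(z_l, z_w, lam), lam
```

With class counts of 500, 50, and 5, the head class is never augmented ($P_1=0$), while the tail class is augmented with probability $(500-5)/500=0.99$. Because $\lambda\geq\mu$, the labeled feature always dominates the mixture, which is what justifies keeping its label.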

We perform concurrent updates of $Q_k$ for all classes by pushing new labeled features and augmented features within the batch and removing the oldest ones when $Q_k$ is full.
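
The first-in-first-out behaviour described above can be sketched with a bounded `collections.deque` per class. The helper names and the capacity used in the usage note are hypothetical.

```python
from collections import deque


def make_queues(num_classes, capacity):
    """One fixed-size FIFO queue per class; a full deque drops its oldest item."""
    return [deque(maxlen=capacity) for _ in range(num_classes)]


def push_features(queues, features, labels):
    """Append each feature (raw or augmented) to the queue of its class."""
    for feat, k in zip(features, labels):
        queues[k].append(feat)
```

For instance, pushing three features of class 0 into queues of capacity 2 leaves only the two most recent ones in `queues[0]`.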

4.2 Contrastive Learning with Augmented Features

To reduce the impact of biased pseudo-labels and utilize unlabeled data, we apply contrastive learning using both unlabeled and labeled data. For an unlabeled sample with a pseudo-label, we bring it close to labeled samples sharing the same label as the pseudo-label and push it away from labeled samples with different labels from the pseudo-label.

Following the common approaches in contrastive learning [15], we adopt the encoder-projection head structure in our method. Both raw features and augmented features are passed through the projection head to obtain the corresponding embeddings $e$. We construct an extra embedding queue $\mathbf{E}$ to store embeddings for features with labels, which is updated simultaneously with the feature queue $\mathbf{Q}$. For unlabeled samples, we establish a confidence vector $\mathbf{s}$ based on the confidence scores of the model's predictions. Each element $s_i$ in $\mathbf{s}$ is defined as:

$$s_i=\begin{cases}\max(\hat{p}^{\prime}_i), & \text{if } \max(\hat{p}^{\prime}_i)>\tau,\\ 0, & \text{otherwise},\end{cases} \tag{5}$$

where $i$ is the index of the unlabeled sample. Given the presence of embeddings from augmented features, we construct a label confidence vector $\mathbf{v}$ based on the mixture coefficient:

$$v_i=\begin{cases}\lambda_i, & \text{if } e_i \text{ corresponds to an augmented feature},\\ 1, & \text{otherwise},\end{cases} \tag{6}$$

where $i$ is the index of the embedding in the embedding queue. To weight positive pairs in the contrastive loss function, we obtain a weight matrix $W$ by multiplying elements of $\mathbf{s}$ and $\mathbf{v}$. Each element $w_{ij}$ in $W$ is defined as $w_{ij}=s_i\cdot v_j$, where $i$ and $j$ index the unlabeled samples in a batch and the embeddings in the embedding queue of the pseudo-label class, respectively. The contrastive loss $\mathcal{L}_c$ is defined as:

$$\mathcal{L}_c=\frac{1}{B}\sum_{i=1}^{B}\mathcal{L}_{c,i}, \tag{7}$$

where $B$ is the batch size of unlabeled samples. $\mathcal{L}_{c,i}$ has the following form:

$$\mathcal{L}_{c,i}=-\frac{1}{|E_{p_i}|}\sum_{p=1}^{|E_{p_i}|}w_{ip}\cdot\log\frac{\exp(\text{sim}(e^{(s)}_i,e^{(l)}_p)/t)}{\sum_{k=1}^{K}\sum_{j=1}^{|E_k|}\exp(\text{sim}(e^{(s)}_i,e^{(l)}_j)/t)}, \tag{8}$$

where $E_{p_i}$ denotes the embedding queue of the pseudo-label class and $|E_{p_i}|$ represents the capacity of $E_{p_i}$. $e^{(s)}_i$ and $e^{(l)}_p$ are embeddings from $z_i^{(s)}=f^{\text{enc}}_{\theta}(\mathcal{A}_s(u_i))$ and $E_{p_i}$, respectively, and $t$ is the temperature hyper-parameter.
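
Eqs. (5) to (8) can be combined into one illustrative routine. This is a sketch under stated assumptions, not the authors' implementation: embeddings are plain lists, every class queue is assumed to be non-empty, and the argument names and the defaults `tau=0.95` and `t=0.1` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def claf_contrastive_loss(e_s, pseudo_probs, queues, queue_lams, tau=0.95, t=0.1):
    """Batch contrastive loss following Eqs. (5)-(8).

    e_s:          B embeddings of strongly augmented unlabeled samples.
    pseudo_probs: B final pseudo-label distributions.
    queues:       K lists of labeled/augmented embeddings, one per class.
    queue_lams:   K lists holding v_j: lambda for augmented entries, 1.0 otherwise.
    """
    total = 0.0
    for e_i, p_i in zip(e_s, pseudo_probs):
        conf = max(p_i)
        s_i = conf if conf > tau else 0.0          # Eq. (5)
        if s_i == 0.0:
            continue  # all weights w_ip vanish, so this sample adds nothing
        k_hat = p_i.index(conf)                    # pseudo-label class
        # Denominator of Eq. (8): every embedding in every class queue.
        denom = sum(math.exp(cosine(e_i, e_j) / t) for q in queues for e_j in q)
        weighted = 0.0
        for e_p, v_p in zip(queues[k_hat], queue_lams[k_hat]):
            w_ip = s_i * v_p                       # w_ij = s_i * v_j, with v from Eq. (6)
            weighted += w_ip * math.log(math.exp(cosine(e_i, e_p) / t) / denom)
        total += -weighted / len(queues[k_hat])    # Eq. (8)
    return total / len(e_s)                        # Eq. (7)
```

In this sketch, positives and negatives come only from the labeled embedding queues, so a wrongly pseudo-labeled sample cannot pull or push other unlabeled samples directly; its own contribution is additionally scaled down by its confidence $s_i$.
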
We calculate the total loss as a weighted sum of the supervised loss $\mathcal{L}_{\text{cls}}$, the semi-supervised loss $\mathcal{L}_u$, the semantic alignment loss $\mathcal{L}_{\text{align}}$, and the contrastive loss $\mathcal{L}_c$. The final CLAF objective is:

$$\mathcal{L}_{\text{CLAF}}=\mathcal{L}_{\text{cls}}+\lambda_u\mathcal{L}_u+\lambda_{\text{align}}\mathcal{L}_{\text{align}}+\lambda_c\mathcal{L}_c, \tag{9}$$

where both $\mathcal{L}_{\text{cls}}$ and $\mathcal{L}_u$ with $\lambda_u$ come from the base SSL learner, and $\mathcal{L}_{\text{align}}$ is introduced from DASO. $\lambda_c$ is the weight for the contrastive loss.

5 Experiments

CIFAR10-LT CIFAR100-LT
γ=100𝛾100\gamma=100italic_γ = 100 γ=150𝛾150\gamma=150italic_γ = 150 γ=10𝛾10\gamma=10italic_γ = 10 γ=20𝛾20\gamma=20italic_γ = 20
Algorithms N1=500subscript𝑁1500N_{1}=500italic_N start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 500 N1=1500subscript𝑁11500N_{1}=1500italic_N start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 1500 N1=500subscript𝑁1500N_{1}=500italic_N start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 500 N1=1500subscript𝑁11500N_{1}=1500italic_N start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 1500 N1=50subscript𝑁150N_{1}=50italic_N start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 50 N1=150subscript𝑁1150N_{1}=150italic_N start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 150 N1=50subscript𝑁150N_{1}=50italic_N start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 50 N1=150subscript𝑁1150N_{1}=150italic_N start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 150
$M_1=4000$ $M_1=3000$ $M_1=4000$ $M_1=3000$ $M_1=400$ $M_1=300$ $M_1=400$ $M_1=300$
Supervised* 47.3±0.95 61.9±0.41 44.2±0.33 58.2±0.29 29.6±0.57 46.9±0.22 25.1±1.14 41.2±0.15
  w/ LA* [16] 53.3±0.44 70.6±0.21 49.5±0.40 67.1±0.78 30.2±0.44 48.7±0.89 26.5±1.31 44.1±0.42
FixMatch* [6] 67.8±1.13 77.5±1.32 62.9±0.36 72.4±1.03 45.2±0.55 56.5±0.06 40.0±0.96 50.7±0.25
  w/ DARP* [17] 74.5±0.78 77.8±0.63 67.2±0.32 73.6±0.73 49.4±0.20 58.1±0.44 43.4±0.87 52.2±0.66
  w/ CReST+* [18] 76.3±0.86 78.1±0.42 67.5±0.45 73.7±0.34 44.5±0.94 57.4±0.18 40.1±1.28 52.1±0.21
  w/ DASO* [11] 76.0±0.37 79.1±0.75 70.1±1.81 75.1±0.77 49.8±0.24 59.2±0.35 43.6±0.09 52.9±0.42
  w/ CLAF (Ours) 76.4±0.46 80.6±0.65 72.0±0.74 75.9±0.29 50.9±0.11 59.8±0.29 44.5±0.83 54.1±0.28
FixMatch+LA* [16] 75.3±2.45 82.0±0.36 67.0±2.49 78.0±0.91 47.3±0.42 58.6±0.36 41.4±0.93 53.4±0.32
  w/ DASO* [11] 77.9±0.88 82.5±0.08 70.1±1.68 79.0±2.23 50.7±0.51 60.6±0.71 44.1±0.61 55.1±0.72
  w/ CLAF (Ours) 78.8±0.59 83.1±0.32 72.8±1.39 79.3±0.33 51.1±0.25 60.9±0.22 46.1±0.19 55.6±0.51
Table 1: Comparison of accuracy (%) of different methods on CIFAR10-LT and CIFAR100-LT under various imbalance ratios $\gamma$. *: reported by [11]. We mark the best results in bold. Our method CLAF consistently outperforms all the baselines across the settings.

5.1 Experimental Setup

5.1.1 Datasets

Following common practice [11], we create CIFAR10-LT and CIFAR100-LT for imbalanced SSL by exponentially decreasing the number of images from the head class to the tail class. We denote the head class size as $N_1$ for labeled data ($M_1$ for unlabeled data) and the imbalance ratio as $\gamma$. For a dataset with $K$ classes, we set $N_k = N_1 \cdot \gamma^{-\frac{k-1}{K-1}}$ for labeled data and $M_k = M_1 \cdot \gamma^{-\frac{k-1}{K-1}}$ for unlabeled data. Following the common settings [11], we set $N_1=500$, $M_1=4000$ and $N_1=1500$, $M_1=3000$ for CIFAR10-LT, and $N_1=50$, $M_1=400$ and $N_1=150$, $M_1=300$ for CIFAR100-LT. We report results for imbalance ratios $\gamma=100$ and $150$ on CIFAR10-LT and $\gamma=10$ and $20$ on CIFAR100-LT.
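As a rough illustration of the class-size formula above, the following Python sketch computes per-class counts for a long-tailed split. The function name `long_tailed_counts` and the rounding to the nearest integer are assumptions made here: the text does not state how fractional counts are handled, and the DASO codebase [11] may treat them differently.

```python
def long_tailed_counts(head_size, gamma, num_classes):
    """Return [N_1, ..., N_K] with N_k = N_1 * gamma ** (-(k - 1) / (K - 1))."""
    if num_classes < 2:
        raise ValueError("need at least two classes")
    counts = []
    for k in range(1, num_classes + 1):
        exponent = -(k - 1) / (num_classes - 1)
        # Rounding to the nearest integer is an assumption, not taken from the paper.
        counts.append(round(head_size * gamma ** exponent))
    return counts


# CIFAR10-LT with gamma = 100: labeled head size N_1 = 500, unlabeled head size M_1 = 4000.
labeled = long_tailed_counts(500, 100, 10)
unlabeled = long_tailed_counts(4000, 100, 10)
```

By construction the tail class holds $1/\gamma$ of the head class, so with $\gamma=100$ the last entries are 5 labeled and 40 unlabeled images.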

5.1.2 Training and evaluation

We conduct experiments using the same codebase as DASO [11] for a fair comparison. We adopt Wide ResNet-28-2 [19] as the backbone on CIFAR10-LT and CIFAR100-LT. We apply FA in the last 20% of iterations and set $\mu$ to 0.8 to meet FA's requirement for a structured representation space. $\lambda_c$ and $t$ are set to 1.0 and 0.07 for all experiments. All other hyper-parameters and training details follow DASO [11]. For evaluation, we use the EMA network, whose parameters are updated at every training step [11]. We measure the top-1 accuracy on test images every 500 iterations and report the median accuracy of the last 20 evaluations. We report the mean and standard deviation over three independent runs.
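The reporting protocol described above can be sketched in a few lines of Python. This is only an illustration: the helper names `run_score` and `summarize_runs` are hypothetical, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the text does not say whether the sample or population form is reported.

```python
from statistics import mean, median, stdev


def run_score(accuracies, last_n=20):
    """Median top-1 accuracy over the final `last_n` evaluations of one run."""
    if len(accuracies) < last_n:
        raise ValueError("not enough evaluations recorded")
    return median(accuracies[-last_n:])


def summarize_runs(runs, last_n=20):
    """Mean and sample standard deviation of the per-run scores."""
    scores = [run_score(accuracies, last_n) for accuracies in runs]
    # Sample standard deviation is an assumption; the paper does not specify it.
    return mean(scores), stdev(scores)


# Three toy runs; each list holds one accuracy per evaluation (every 500 iterations).
toy_runs = [
    [0.0] * 5 + [70.0] * 20,
    [0.0] * 5 + [72.0] * 20,
    [0.0] * 5 + [74.0] * 20,
]
toy_mean, toy_std = summarize_runs(toy_runs)
```

Taking the median over the last evaluations makes the per-run score less sensitive to a single unusually good or bad checkpoint near the end of training.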

5.2 Results on CIFAR10/CIFAR100-LT

We report the results of CLAF on CIFAR10-LT and CIFAR100-LT under various settings in Table 1. We compare CLAF with DARP [17], CReST+ [18] and DASO [11], all built on FixMatch [6]. CLAF achieves the highest accuracy among these baselines in every setting. When FixMatch is re-balanced via LA [16], the accuracy of CLAF improves further, which shows that CLAF benefits from debiased pseudo-labels. Notably, CLAF improves over DASO in all cases, which supports the effectiveness of CLAF for representation learning in imbalanced SSL.

5.3 Ablation Study

We perform an ablation study on CIFAR10-LT to investigate the impact of FA. We report the results of CLAF and CLAF without FA in Table 2. As previously discussed, FA mainly contributes by augmenting features for minority classes and supplying those features to contrastive learning. The performance gap between CLAF and CLAF without FA indicates that naive contrastive learning alone brings only marginal improvements and that FA is beneficial for contrastive learning in imbalanced SSL.

| Algorithm | $\gamma=100$, $N_1=500$, $M_1=4000$ | $\gamma=100$, $N_1=1500$, $M_1=3000$ | $\gamma=150$, $N_1=500$, $M_1=4000$ | $\gamma=150$, $N_1=1500$, $M_1=3000$ |
| --- | --- | --- | --- | --- |
| CLAF | 76.4±0.46 | 80.6±0.65 | 72.0±0.74 | 75.9±0.29 |
| w/o FA | 76.1±0.25 | 79.9±0.24 | 70.8±2.15 | 75.5±0.41 |
Table 2: Ablation study of FA in CLAF on CIFAR10-LT.

5.4 Analysis

To assess the representation learning capacity of contrastive learning, we present a t-SNE [20] visualization of CIFAR10-LT test data features obtained from DASO and CLAF. As shown in Fig. 2, tail class features in CLAF exhibit distinct decision boundaries, whereas in DASO they lie close to majority class features. CLAF achieves an accuracy of 65.3% on the three least common classes, compared with 60.8% for DASO. The results suggest that CLAF learns better representations for minority classes than DASO.
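For readers who want to reproduce this kind of tail-class number, a minimal Python sketch follows. The function name `tail_class_accuracy` is hypothetical, and pooling all test samples of the selected classes (rather than averaging per-class accuracies) is an assumption; the two coincide when the test set is class-balanced.

```python
def tail_class_accuracy(y_true, y_pred, train_class_counts, k=3):
    """Accuracy restricted to test samples of the k least common training classes."""
    classes = range(len(train_class_counts))
    # Select the k classes with the fewest training samples.
    tail = set(sorted(classes, key=lambda c: train_class_counts[c])[:k])
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t in tail]
    if not pairs:
        raise ValueError("no test samples belong to the tail classes")
    return sum(t == p for t, p in pairs) / len(pairs)


# Toy example with five classes; classes 2, 3 and 4 are the three rarest.
toy_counts = [500, 100, 50, 10, 5]
toy_true = [0, 2, 3, 4, 4]
toy_pred = [0, 2, 0, 4, 3]
toy_tail_acc = tail_class_accuracy(toy_true, toy_pred, toy_counts)
```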

Fig. 2: Feature space visualization of CIFAR10-LT test data using DASO (left) and CLAF (right). CLAF has clear decision boundaries for tail classes in feature space.

6 Conclusion

We propose Contrastive Learning with Augmented Features (CLAF) to apply contrastive learning in imbalanced SSL. We design a class-dependent feature augmentation module to alleviate the scarcity of minority class samples. In contrast to conventional PCL, we select positive and negative samples from labeled data to reduce the impact of biased pseudo-labels. Our experimental results demonstrate that CLAF outperforms the baselines on imbalanced image datasets under various settings, confirming that CLAF exhibits a remarkable capacity for representation learning in imbalanced SSL.

References

  • [1] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan, “Supervised contrastive learning,” in Advances in Neural Information Processing Systems, 2020.
  • [2] Fan Yang, Kai Wu, Shuyi Zhang, Guannan Jiang, Yong Liu, Feng Zheng, Wei Zhang, Chengjie Wang, and Long Zeng, “Class-aware contrastive semi-supervised learning,” in Conference on Computer Vision and Pattern Recognition, 2022, pp. 14401–14410.
  • [3] Junnan Li, Caiming Xiong, and Steven C. H. Hoi, “CoMatch: Semi-supervised learning with contrastive graph regularization,” in International Conference on Computer Vision, 2021, pp. 9455–9464.
  • [4] Yuhang Zhang, Xiaopeng Zhang, Jie Li, Robert C. Qiu, Haohang Xu, and Qi Tian, “Semi-supervised contrastive learning with similarity co-calibration,” IEEE Trans. Multim., vol. 25, pp. 1749–1759, 2023.
  • [5] Eric Arazo, Diego Ortego, Paul Albert, Noel E. O’Connor, and Kevin McGuinness, “Pseudo-labeling and confirmation bias in deep semi-supervised learning,” in International Joint Conference on Neural Networks, 2020, pp. 1–8.
  • [6] Kihyuk Sohn, David Berthelot, Nicholas Carlini, Zizhao Zhang, Han Zhang, Colin Raffel, Ekin Dogus Cubuk, Alexey Kurakin, and Chun-Liang Li, “FixMatch: Simplifying semi-supervised learning with consistency and confidence,” in Advances in Neural Information Processing Systems, 2020.
  • [7] Byoungjip Kim, Jinho Choo, Yeong-Dae Kwon, Seongho Joe, Seungjai Min, and Youngjune Gwon, “SelfMatch: Combining contrastive self-supervision and consistency for semi-supervised learning,” CoRR, vol. abs/2101.06480, 2021.
  • [8] Yves Grandvalet and Yoshua Bengio, “Semi-supervised learning by entropy minimization,” in Advances in Neural Information Processing Systems, 2004, pp. 529–536.
  • [9] Antti Tarvainen and Harri Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” in Advances in Neural Information Processing Systems, 2017, pp. 1195–1204.
  • [10] Ekin Dogus Cubuk, Barret Zoph, Jonathon Shlens, and Quoc Le, “Randaugment: Practical automated data augmentation with a reduced search space,” in Advances in Neural Information Processing Systems, 2020.
  • [11] Youngtaek Oh, Dong-Jin Kim, and In So Kweon, “DASO: Distribution-aware semantics-oriented pseudo-label for imbalanced semi-supervised learning,” in Conference on Computer Vision and Pattern Recognition, 2022, pp. 9776–9786.
  • [12] Yue Fan, Dengxin Dai, Anna Kukleva, and Bernt Schiele, “Cossl: Co-learning of representation and classifier for imbalanced semi-supervised learning,” in Conference on Computer Vision and Pattern Recognition, 2022, pp. 14554–14564.
  • [13] Hongyi Zhang, Moustapha Cissé, Yann N. Dauphin, and David Lopez-Paz, “mixup: Beyond empirical risk minimization,” in International Conference on Learning Representations, 2018.
  • [14] Han-Jia Ye, De-Chuan Zhan, and Wei-Lun Chao, “Procrustean training for imbalanced deep learning,” in International Conference on Computer Vision, 2021, pp. 92–102.
  • [15] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton, “A simple framework for contrastive learning of visual representations,” in Proceedings of the 37th International Conference on Machine Learning, 2020, vol. 119, pp. 1597–1607.
  • [16] Aditya Krishna Menon, Sadeep Jayasumana, Ankit Singh Rawat, Himanshu Jain, Andreas Veit, and Sanjiv Kumar, “Long-tail learning via logit adjustment,” in International Conference on Learning Representations, 2021.
  • [17] Jaehyung Kim, Youngbum Hur, Sejun Park, Eunho Yang, Sung Ju Hwang, and Jinwoo Shin, “Distribution aligning refinery of pseudo-label for imbalanced semi-supervised learning,” in Advances in Neural Information Processing Systems, 2020.
  • [18] Chen Wei, Kihyuk Sohn, Clayton Mellina, Alan L. Yuille, and Fan Yang, “CReST: A class-rebalancing self-training framework for imbalanced semi-supervised learning,” in Conference on Computer Vision and Pattern Recognition, 2021, pp. 10857–10866.
  • [19] Sergey Zagoruyko and Nikos Komodakis, “Wide residual networks,” in Proceedings of the British Machine Vision Conference, 2016.
  • [20] Laurens van der Maaten and Geoffrey Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. 11, 2008.