SOOD++: Leveraging Unlabeled Data to Boost Oriented Object Detection

Dingkang Liang, Wei Hua, Chunsheng Shi, Zhikang Zou, Xiaoqing Ye, Xiang Bai🖂 Dingkang Liang, Chunsheng Shi and Xiang Bai are with the School of Artificial Intelligence and Automation, Huazhong University of Science and Technology. (dkliang, csshi, xbai)@hust.edu.cn Wei Hua is with the School of Electronic Information and Communications, Huazhong University of Science and Technology. [email protected] Zhikang Zou and Xiaoqing Ye are with Baidu Inc., China. 🖂 Corresponding author.
Abstract

Semi-supervised object detection (SSOD), which leverages unlabeled data to boost object detectors, has recently become a hot topic. However, existing SSOD approaches mainly focus on horizontal objects, leaving the multi-oriented objects common in aerial images unexplored. At the same time, the annotation cost of multi-oriented objects is significantly higher than that of their horizontal counterparts. Therefore, in this paper, we propose a simple yet effective Semi-supervised Oriented Object Detection method termed SOOD++. Specifically, we observe that objects in aerial images usually have arbitrary orientations, small scales, and dense aggregation, which inspires the following core designs: a Simple Instance-aware Dense Sampling (SIDS) strategy is used to generate comprehensive dense pseudo-labels; the Geometry-aware Adaptive Weighting (GAW) loss dynamically modulates the importance of each pair between pseudo-label and corresponding prediction by leveraging the intricate geometric information of aerial objects; we treat aerial images as global layouts and explicitly build the many-to-many relationship between the sets of pseudo-labels and predictions via the proposed Noise-driven Global Consistency (NGC). Extensive experiments conducted on various multi-oriented object datasets under various labeled settings demonstrate the effectiveness of our method. For example, on the DOTA-V1.5 benchmark, the proposed method outperforms the previous state-of-the-art (SOTA) by a large margin (+2.92, +2.39, and +2.57 mAP under 10%, 20%, and 30% labeled data settings, respectively) with single-scale training and testing. More importantly, it still improves upon a strong supervised baseline with 70.66 mAP, trained using the full DOTA-V1.5 train-val set, by +1.82 mAP, resulting in 72.48 mAP and setting a new state of the art. Moreover, our method demonstrates stable generalization ability across different oriented detectors, even for multi-view oriented 3D object detectors.
The code will be made available.

Index Terms:
Oriented object detection, Multi-oriented object detection, Aerial scenes, Semi-supervised learning

1 Introduction

Sufficient labeled data is essential to achieve satisfactory performance for fully-supervised object detection. However, the data labeling process is expensive and laborious. To alleviate this problem, many Semi-Supervised Object Detection (SSOD) methods [39, 64, 76, 28], aiming to learn from labeled data as well as easy-to-obtain unlabeled data, have been proposed recently. By leveraging the potentially useful information from the unlabeled data, the SSOD methods can achieve promising improvement compared with the supervised baselines, i.e., methods only trained from the limited labeled data.

The recent advanced SSOD methods [39, 64, 76, 28] mainly focus on detecting objects with horizontal bounding boxes in general scenes. However, in more complex scenes, notably aerial scenes, we need to utilize oriented bounding boxes to precisely describe the objects. Annotating objects with oriented bounding boxes is considerably more expensive than annotating their horizontal counterparts: the cost of annotating an oriented box is approximately 36.5% higher than that of a horizontal box ($86 vs. $63 per 1k; see https://cloud.google.com/ai-platform/data-labeling/pricing). Thus, considering the higher annotation cost of oriented boxes, semi-supervised oriented object detection is worth studying.

Figure 1: (a) Comparison of general scenes and aerial scenes. Unlike general objects, objects in aerial scenes have arbitrary orientations and are small and dense (aggregated). Correspondingly, the annotation cost of aerial objects is higher than that of objects in general scenes. (b) The proposed SOOD++ outperforms the supervised baseline and SOTA methods [35, 76] by a large margin on the DOTA-V1.5 benchmark (val set) under various settings.

Compared with general scenes, objects in aerial scenes (referred to as aerial objects) are distinguished by three main characteristics: arbitrary orientations, small scales, and dense distribution, as shown in Fig. 1(a). The mainstream SSOD methods are based on the pseudo-labeling framework [5, 55, 54, 64], which consists of a teacher model and a student model. The teacher model operates as an Exponential Moving Average (EMA) of the student model, aggregated over historical training iterations. It generates pseudo-labels for images in the unlabeled dataset, thereby allowing the student model to learn from a combination of labeled and unlabeled data.

However, through pilot experiments, we find that extending the SSOD framework to oriented object detection is a highly non-trivial problem. We think the following two aspects need to be considered: 1) First, the geometry information (e.g., orientation and aspect ratio) is an essential cue for representing the multi-oriented objects. How to use the geometry information for the interaction between teacher and student in a semi-supervised framework is crucial. In other words, when applying SSOD methods in such a scenario, it is necessary to consider the geometry information of pseudo-labels and predictions. 2) Second, aerial objects are often dense and regularly distributed in an image. These properties lead to a challenge for obtaining accurate pseudo-labels from the teacher, resulting in inaccurate knowledge transfer towards the student and damaging the learning effectiveness within a teacher-student framework. A simple yet effective solution is to consider the global perspective. The global appearance of pseudo-labels and predictions can naturally reflect their patterns and suppress local noise, benefiting a more consistent and steady learning process.

Based on these observations and analysis, in this paper, we propose a practical Semi-supervised Oriented Object Detection method termed SOOD++. SOOD++ is built upon the dense pseudo-labeling framework, where the pseudo-labels are filtered from the pixel-wise predictions (including box coordinates and confidence scores). The key designs are from three aspects: a Simple Instance-aware Dense Sampling (SIDS) strategy, Geometry-aware Adaptive Weighting (GAW) loss, and Noise-driven Global Consistency (NGC). The SIDS is used to construct comprehensive dense pseudo-labels from the teacher, similar to pre-processing. GAW and NGC aim to measure the discrepancy between the teacher and student from a one-to-one and many-to-many (global) perspective, respectively.

Specifically, existing dense pseudo-labeling methods [76, 35] directly sample a fixed number of pseudo-labels from the predicted maps, which easily overlooks the dense, small, and ambiguous objects that usually exist in aerial scenes. To construct more comprehensive pseudo-labels, we propose a Simple Instance-aware Dense Sampling (SIDS) strategy that samples a dynamic number of easy and hard cases (e.g., objects intermingled with the background) from the teacher’s output. Second, considering that the pseudo-label-prediction pairs are not equally informative, we propose the Geometry-aware Adaptive Weighting (GAW) loss. It utilizes the intrinsic geometry information (e.g., orientation gap and aspect ratio) of each teacher-student pair, reflecting the difficulty of the sample by weighting the corresponding loss dynamically. In this way, we can softly pick the more useful supervision signals to guide the learning of the student.

Furthermore, we treat the outputs of the teacher and student as two independent global distributions. These distributions implicitly reflect the characteristics of the overall scenario, such as object density and spatial layout across the image. This naturally inspires us to construct global constraints, leading to the proposal of Noise-driven Global Consistency (NGC). Specifically, we first add random noise to disturb the teacher and student outputs (global distributions). Then, we use optimal transport (OT) to align the original teacher’s distribution with the original student’s distribution and, similarly, to align the disturbed teacher’s distribution with the disturbed student’s distribution. In addition, the transport plans from these two alignments reflect the intricate relationships within the distributions, so we propose aligning them to further evaluate the impact of noise and provide auxiliary guidance for the model. Applying such multi-perspective alignments increases the models’ tolerance to minor input variations, enabling them to better capture essential features of the data rather than relying excessively on noise or specific details in the unlabeled data. By promoting global consistency, our NGC not only relieves the negative effect of inaccurate pseudo-labels in the unsupervised training stage but also encourages a holistic learning approach that considers objects’ spatial and relational dynamics.

Extensive experiments conducted on four oriented object detection datasets in various data settings demonstrate the effectiveness of our methods. For example, as shown in Fig. 1(b), on the challenging large-scale DOTA-V1.5 (val set), we achieve 50.48, 57.44, and 61.51 mAP under 10%, 20%, and 30% labeled data settings, significantly surpassing the state-of-the-art SSOD method [35] (using the same supervised baseline) by 2.92, 2.39, and 2.57 mAP, respectively. More importantly, it can still improve a 70.66 mAP strong supervised baseline trained using the full DOTA-V1.5 train-val set by +1.82 mAP, resulting in 72.48 mAP on the DOTA-V1.5 (test set) with single-scale training and testing. Moreover, we also achieve consistent improvement on the DOTA-V1.0, HRSC2016, and even nuScenes dataset (a representative multi-view oriented 3D object detection dataset).

In conclusion, this paper presents an early and solid exploration of semi-supervised learning for oriented object detection. By analyzing the distinct characteristics of oriented objects from general objects, we customize the key components to adapt the pseudo-labeling framework to this task. We hope this work will establish a robust foundation for the burgeoning field of semi-supervised oriented object detection and offer an effective benchmark.

This paper is an extended version of our conference paper, published in CVPR 2023 [23], where we make the following new contributions: 1) We propose a Simple Instance-aware Dense Sampling strategy, which not only considers the easily obtained dense pseudo-labels from the foreground but also mines the hard cases merged with the background. This design significantly improves the quality of dense pseudo-labels. 2) We propose the Geometry-aware Adaptive Weighting (GAW) loss, which leverages the intrinsic geometry information (e.g., orientation gap and aspect ratio) of aerial objects to dynamically weight the importance of a given teacher-student pair. 3) We propose Noise-driven Global Consistency (NGC) to achieve multi-perspective global alignments, encompassing the alignments between the original teacher and original student, the disturbed teacher and disturbed student, and the transport plans generated from the former alignments. 4) Through the above substantial technical improvements, our SOOD++ significantly outperforms its conference version SOOD [23] by +1.85, +1.86, and +2.28 mAP on the DOTA-V1.5 dataset under the 10%, 20%, and 30% labeled settings, respectively. 5) We have significantly improved the quality of the manuscript in various aspects, e.g., by conducting experiments on more benchmarks (even 3D scenes), providing more profound insights into performance improvements, performing more thorough ablation studies, adding more illustrative figures, and reorganizing the contents and tables to better present the paper.

2 Related works

Semi-Supervised Object Detection.

In recent years, semi-supervised learning (SSL) [52, 1] has demonstrated remarkable performance in image classification. These methods exploit unlabeled data through various techniques such as pseudo-labeling [27, 18, 61, 29], consistency regularization [55, 60, 1], data augmentation [51, 6], and adversarial training [44]. Unlike semi-supervised image classification, semi-supervised object detection (SSOD) necessitates instance-level predictions and the additional task of bounding box regression, thereby increasing its complexity. In [48, 77, 53, 25], pseudo-labels are generated by combining predictions from various augmentations.

Subsequent studies [39, 54, 64, 67] have incorporated EMA from Mean Teacher [55] to update the teacher model after each training iteration. ISMT [67] integrates current pseudo-labels with historical ones. Unbiased Teacher [39] addresses the class-imbalance issue by replacing the cross-entropy loss with focal loss [34]. Humble Teacher [54] utilizes soft pseudo-labels as the student’s training targets, enabling the student to distill richer information from the teacher. Soft Teacher [64] adaptively weights the loss of each pseudo-box based on classification scores and introduces box jittering to select reliable pseudo-labels. Unbiased Teacher v2 [40] employs an anchor-free detector and uses uncertainty predictions to select pseudo-labels for the regression branch. Dense Teacher [76] replaces post-processed instance-level pseudo-labels with dense pixel-level pseudo-labels, effectively eliminating the influence of post-processing hyper-parameters. To ensure both label-level and feature-level consistency, PseCo [28] proposes Multi-view Scale-invariant Learning, which aligns shifted feature pyramids between two images with identical content but different scales. Consistent Teacher [57] combines adaptive anchor assignment, a 3D feature alignment module, and a Gaussian Mixture Model to reduce matching and feature inconsistencies during training. MixTeacher [36] explicitly enhances the quality of pseudo-labels to handle the scale variation of objects. The Semi-DETR [75] method is specifically designed for DETR-based detectors, mitigating the training inefficiencies caused by bipartite matching with noisy pseudo-labels. To address assignment ambiguity, ARSL [35] proposes combining Joint-Confidence Estimation and Task-Separation Assignment strategies.

However, all the methods mentioned above are designed for general scenes. How to unleash the potential of the semi-supervised framework in aerial scenes has yet to be fully explored. This paper aims to fill this gap and offer a solid baseline for future research.

Oriented Object Detection.

Unlike general object detectors [17, 50, 37, 49], representing objects with Horizontal Bounding Boxes (HBBs), oriented object detectors use Oriented Bounding Boxes (OBBs) to capture the orientation of objects, which is practical for detecting aerial objects. In recent years, numerous methods [70, 32, 43, 10, 65, 42, 14, 73, 7, 8] have been developed to enhance the performance of oriented object detection. For instance, CSL [69] addresses the out-of-bound issue by transforming the angle regression problem into a classification task. R3Det [70] improves detection speed by predicting HBBs in the first stage and then aligning features in the second stage to detect oriented objects. Oriented R-CNN [62] introduces a simplified multi-oriented region proposal network and uses midpoint offsets to represent arbitrarily oriented objects. ReDet [21] incorporates rotation-equivariant networks into the detector to extract rotation-equivariant features. Oriented RepPoints [31] incorporates a quality assessment module and a sample assignment scheme for adaptive point learning, which helps obtain non-axis features from neighboring objects while ignoring background noise. LSKNet [32] extends the large and selective kernel mechanisms to improve the performance. The recent method COBB [59] employs nine parameters derived from continuous functions based on the outer HBB and OBB area.

The above methods typically focus on the fully supervised paradigm, requiring expensive labeling costs. Thus, several weakly supervised object detectors have been proposed recently, such as point-based methods [42] or HBB-based methods [74, 71]. However, they are either inferior to their fully supervised counterparts or fail to leverage unlabeled data for performance improvement. In this paper, we explore semi-supervised oriented object detection, which reduces annotation costs and boosts detectors, even those well-trained on large-scale labeled data, by utilizing additional unlabeled data.

3 Preliminary

This section revisits the mainstream pseudo-labeling paradigm in SSOD and then introduces the concept of Monge-Kantorovich optimal transport theory [47].

3.1 Pseudo-labeling Paradigm

Pseudo-labeling frameworks [64, 40, 76, 35] inherited designs from the Mean Teacher [55], which consists of two parts, i.e., a teacher model and a student model. The teacher is an Exponential Moving Average (EMA) of the student. They are learned iteratively by the following steps: 1) Generate pseudo-labels for the unlabeled data in a batch. The pseudo-labels are filtered from the teacher’s predictions, e.g., the box coordinates and the classification scores. Meanwhile, the student makes predictions for labeled and unlabeled data in the batch. 2) Compute loss for the student’s predictions. It consists of two parts: the unsupervised loss $\mathcal{L}_{u}$ and the supervised loss $\mathcal{L}_{s}$. They are computed for the unlabeled data with the pseudo-labels and the labeled data with the ground truth (GT) labels, respectively. The overall loss $\mathcal{L}$ is the sum of them. 3) Update the parameters of the student according to the overall loss. The teacher is updated simultaneously in an EMA manner.
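The teacher update and the loss combination in steps 2) and 3) can be illustrated with a minimal sketch. It assumes the parameters are stored in a plain dictionary of floats and that the EMA momentum is a hypothetical value; neither the data layout nor the momentum is specified in this section.

```python
def ema_update(teacher, student, momentum=0.99):
    """Return teacher parameters moved toward the student by an EMA step.

    `teacher` and `student` map parameter names to floats (a toy stand-in
    for network weights). `momentum` is an assumed value, not from the paper.
    """
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


def overall_loss(loss_s, loss_u):
    """Overall loss as the sum of the supervised and unsupervised parts."""
    return loss_s + loss_u


# Toy usage: one EMA step after the student has been updated.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 2.0, "b": 1.0}
teacher = ema_update(teacher, student, momentum=0.9)
total = overall_loss(loss_s=0.7, loss_u=0.3)
```

Because the teacher only changes through this averaging, it evolves more smoothly than the student, which is what makes it a stable source of pseudo-labels.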

Based on the sparsity of pseudo-labels, pseudo-labeling frameworks can be further categorized into sparse pseudo-labeling [64, 40] and dense pseudo-labeling [76, 35], termed SPL and DPL, respectively. The SPL selects the teacher’s predictions after the post-processing operations (e.g., NMS and score filtering), shown in Fig. 3(b). It obtains sparse labels like bounding boxes and categories to supervise the student. The DPL directly samples the pixel-level post-sigmoid logits predicted by the teacher, which are dense and informative, shown in Fig. 3(c).

In this paper, we consider the limitations of both sparse and dense pseudo-label sampling strategies when dealing with small-scale, densely distributed remote-sensing objects. Specifically, the SPL may fail to detect small objects, resulting in insufficient supervisory information. Conversely, the DPL, despite providing abundant labels, may introduce substantial noise because dense small objects are hard to distinguish from the background. Thus, we propose a Simple Instance-aware Dense Sampling (SIDS) strategy that samples dense pseudo-labels from the predicted boxes of the foreground as well as from the background, generating more comprehensive dense pseudo-labels.

3.2 Optimal Transport

The Monge-Kantorovich Optimal Transport (OT) [47] aims to solve the problem of simultaneously moving items from one set to another set with minimum cost. The mathematical formulations of OT are described in detail below.

Let $\boldsymbol{X}=\{\boldsymbol{x}_{i}\,|\,\boldsymbol{x}_{i}\in\mathbb{R}^{d}\}_{i=1}^{n}$ and $\boldsymbol{Y}=\{\boldsymbol{y}_{j}\,|\,\boldsymbol{y}_{j}\in\mathbb{R}^{d}\}_{j=1}^{n}$ denote two sets of $d$-dimensional vectors, and assume that $\hat{\boldsymbol{x}},\hat{\boldsymbol{y}}\in\mathbb{R}^{n}$ are two probability measures on $\boldsymbol{X}$ and $\boldsymbol{Y}$. The set $\Gamma$ of all possible transport plans from $\boldsymbol{X}$ to $\boldsymbol{Y}$ is formed as:

$$\Gamma=\{\boldsymbol{P}\in\mathbb{R}^{n\times n}\,|\,\boldsymbol{P}\mathbf{1}_{n}=\hat{\boldsymbol{x}},\ \boldsymbol{P}^{T}\mathbf{1}_{n}=\hat{\boldsymbol{y}}\}, \tag{1}$$

where $\mathbf{1}_{n}$ is an $n$-dimensional column vector with all elements equal to 1. The constraints ($\boldsymbol{P}\mathbf{1}_{n}=\hat{\boldsymbol{x}}$ and $\boldsymbol{P}^{T}\mathbf{1}_{n}=\hat{\boldsymbol{y}}$) ensure that the probability mass from each point in the source distribution $\hat{\boldsymbol{x}}$ is accurately transported to the target distribution $\hat{\boldsymbol{y}}$, keeping the total mass invariant. The Monge-Kantorovich Optimal Transport (OT) cost between $\hat{\boldsymbol{x}}$ and $\hat{\boldsymbol{y}}$ is then defined as:

$$\mathcal{W}_{ot}(\hat{\boldsymbol{x}},\hat{\boldsymbol{y}})=\min_{\boldsymbol{P}\in\Gamma}\left\langle\boldsymbol{C},\boldsymbol{P}\right\rangle, \tag{2}$$

where $\boldsymbol{C}\in\mathbb{R}^{n\times n}$ represents the cost matrix between the two sets, and $\left\langle\cdot\right\rangle$ represents the inner product. Commonly, the OT problem is solved via the following dual formulation:

$$\begin{split}\mathcal{W}_{ot}(\hat{\boldsymbol{x}},\hat{\boldsymbol{y}})=\max_{\boldsymbol{\lambda},\boldsymbol{\mu}\in\mathbb{R}^{n}}\left\{\left\langle\boldsymbol{\lambda},\hat{\boldsymbol{x}}\right\rangle+\left\langle\boldsymbol{\mu},\hat{\boldsymbol{y}}\right\rangle\right\},\\ \mathrm{s.t.}~~\boldsymbol{\lambda}_{i}+\boldsymbol{\mu}_{j}\leq\boldsymbol{C}_{i,j},~\forall i,j\leq n,\end{split} \tag{3}$$

where $\boldsymbol{\lambda}$ and $\boldsymbol{\mu}$ are the solutions of the dual OT problem, which can be approximated in an iterative manner [9].
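The text states only that the OT solution can be approximated iteratively. One common iterative scheme is the Sinkhorn algorithm for entropy-regularized OT; the minimal sketch below assumes that scheme, which may differ from the exact solver used by the authors. The regularization strength `eps`, the iteration count, and the toy cost matrix are illustrative assumptions.

```python
import math


def sinkhorn(cost, a, b, eps=0.5, n_iters=200):
    """Approximate an OT plan between histograms a and b (entropic OT).

    cost: n x m list of lists (cost matrix C); a, b: probability vectors.
    eps and n_iters are assumed hyper-parameters for this sketch.
    Returns (plan, transport_cost), where transport_cost = <C, P>.
    """
    n, m = len(a), len(b)
    # Gibbs kernel K = exp(-C / eps).
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    transport_cost = sum(cost[i][j] * plan[i][j] for i in range(n) for j in range(m))
    return plan, transport_cost


# Toy usage: two points per set, unit cost for moving between different points.
cost = [[0.0, 1.0], [1.0, 0.0]]
plan, w_ot = sinkhorn(cost, a=[0.5, 0.5], b=[0.25, 0.75])
```

The row and column sums of the returned plan approximate the marginal constraints of Eq. (1), and the returned cost upper-bounds the exact value of Eq. (2) up to numerical tolerance, since the regularized plan is feasible but generally not optimal for the unregularized problem.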

In this paper, we extend optimal transport to construct the global consistency for the teacher-student pairs, which not only enhances the robustness of the detection process against inaccuracies in pseudo-labeling but also encourages a holistic learning approach where both spatial information and relational context are taken into consideration.

4 Method

Fig. 2 illustrates the pipeline of our SOOD++. Specifically, SOOD++ is a dense pseudo-labeling framework consisting of three key components: a Simple Instance-aware Dense Sampling (SIDS) strategy, used to construct high-quality dense pseudo-labels, and the Geometry-aware Adaptive Weighting (GAW) loss and Noise-driven Global Consistency (NGC), which measure the discrepancy between the teacher and student from one-to-one and set-to-set (global) perspectives, respectively. We describe the overall framework in Sec. 4.1 and introduce the key designs of the proposed components, i.e., SIDS, GAW, and NGC, in Sec. 4.2, Sec. 4.3, and Sec. 4.4, respectively.

Figure 2: The pipeline of our SOOD++. Each training batch includes labeled and unlabeled images, omitting the regular supervised part for simplicity. For the unsupervised part, we use the Simple Instance-aware Dense Sampling (SIDS) strategy to create high-quality pseudo-labels and then pair them with the student’s predictions. The Geometry-aware Adaptive Weighting (GAW) loss dynamically weighs each pair’s unsupervised loss based on geometry information. Additionally, we treat the pseudo-labels and student predictions as global discrete distributions, measuring their similarity via Noise-driven Global Consistency (NGC). Yellow points indicate sampled dense pseudo-labels and predictions.

4.1 The Overall Framework

Since aerial objects are usually dense and small, the sparse pseudo-labeling paradigm might miss many potential objects. Thus, we choose the dense pseudo-labeling (DPL) paradigm. The training process includes the supervised and unsupervised parts. For the supervised part, the student is trained regularly with labeled data. For the unsupervised part, we adopt the following steps:

  • First, given the output of the teacher, we utilize a Simple Instance-aware Dense Sampling (SIDS) strategy to generate the comprehensive dense pseudo-labels $\mathcal{P}^{t}$. We also select the predictions $\mathcal{P}^{s}$ at the same positions of the student. Therefore, we obtain the teacher-student pairs (i.e., $\mathcal{P}^{t}$ and $\mathcal{P}^{s}$).

  • Next, we use the proposed Geometry-aware Adaptive Weighting (GAW) loss to dynamically weigh each teacher-student pair’s unsupervised loss by leveraging the intrinsic geometric information, including the orientation gap and aspect ratio.

  • Then, our proposed Noise-driven Global Consistency (NGC) will construct a relaxed constraint by viewing the $\mathcal{P}^{t}$ and $\mathcal{P}^{s}$ as two global distributions. The random noise is used to disturb the global distributions of both the teacher and student. Following this, we achieve multi-perspective global alignments, encompassing the alignments between the original teacher and student, the perturbed teacher and perturbed student, and the transport plans generated from the former alignments.

We adopt the widely-used rotated version of FCOS [56] as the teacher and student. Note that at each pixel-level point, rotated-FCOS will predict the confidence score, center-ness, and bounding box. The basic unsupervised loss contains classification loss, regression loss, and center-ness loss, corresponding to the output of FCOS. We adopt smooth $\mathcal{L}_{1}$ loss for regression loss, Binary Cross Entropy (BCE) loss for classification and center-ness losses. Based on these, our GAW and NGC will be used to measure the consistency of the teacher and student from different perspectives.
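The basic unsupervised loss described above can be sketched in a few lines. This is a toy illustration under several assumptions that are not stated in the text: a single class per point, equal weights for the three terms, averaging over sampled points, a smooth L1 transition point `beta=1.0`, and a hypothetical per-point layout with keys `score`, `centerness`, and `box`.

```python
import math


def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss for one scalar; quadratic below beta, linear above."""
    diff = abs(pred - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


def bce(prob, target, eps=1e-7):
    """Binary cross entropy for one probability and a (possibly soft) target."""
    p = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def basic_unsup_loss(student_pts, teacher_pts):
    """Sum of classification, center-ness, and regression terms.

    Each point is a dict with 'score', 'centerness' (both in [0, 1]) and
    'box' (a tuple of regression values); this layout is hypothetical.
    """
    n = len(student_pts)
    pairs = list(zip(student_pts, teacher_pts))
    cls = sum(bce(s["score"], t["score"]) for s, t in pairs) / n
    ctr = sum(bce(s["centerness"], t["centerness"]) for s, t in pairs) / n
    reg = sum(
        smooth_l1(sv, tv) for s, t in pairs for sv, tv in zip(s["box"], t["box"])
    ) / n
    return cls + ctr + reg


# Toy usage: one sampled point from the teacher and the student.
teacher_pts = [{"score": 0.9, "centerness": 0.8, "box": (4.0, 2.0, 4.0, 2.0, 0.3)}]
student_pts = [{"score": 0.6, "centerness": 0.7, "box": (3.5, 2.5, 4.0, 1.0, 0.1)}]
loss_u = basic_unsup_loss(student_pts, teacher_pts)
```

In this sketch every teacher-student pair contributes equally; the GAW loss described in Sec. 4.3 is what replaces that uniform treatment with geometry-dependent weights.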

4.2 Simple Instance-aware Dense Sampling Strategy

Figure 3: (a) GT box. (b) Sparse pseudo-label (bounding box) [64, 40]. (c) Previous methods [76, 35] sample dense pseudo-labels from the entire score map using Top-K. (d) We sample dense pseudo-labels from the predicted instance (foreground) and background with dynamic numbers.

Constructing the pseudo-labels is a crucial pre-processing step for the semi-supervised framework. As discussed in Sec. 3.1, since objects in aerial scenes are usually small and dense, using a dense pseudo-labeling paradigm can better identify potential objects in the unlabeled data compared with the sparse pseudo-box paradigm.

Previous dense pseudo-labeling methods [76, 35] typically extract a fixed number (i.e., Top-K) of dense pseudo-labels from the predicted score map (Fig. 3(c)). In contrast, we dynamically sample a variable number of dense pseudo-labels from the predicted instance (box) areas and background (Fig. 3(d)). Specifically, we categorize the pseudo-labels from unlabeled data into two types: the easy case obtained from the predicted results and the hard case, which often blends seamlessly into the background. Given the predicted results from the teacher after post-processing (e.g., NMS), considered as the foreground (easy case), we randomly sample pixel-level prediction results within the region indicated by these predicted boxes with a sample ratio $\mathcal{R}$, resulting in $\mathcal{P}^{e}$. Note that due to the dynamic nature of the number of instances, our sampling number varies across different scenes, better matching the distribution of the objects.

However, merely adopting the above strategy to sample dense pseudo-labels from the foreground often overlooks dense, small objects that blend seamlessly into the background. To address this issue, we propose an intuitive approach to extract potential dense pseudo-labels from the background. Specifically, we empirically find that those potential high-quality positive samples, despite their low confidence, align well with the target location. Consequently, we can use the Intersection over Union (IoU) score between the predicted box and ground-truth box as a criterion for discerning pseudo-labels amidst background noise. However, given the lack of ground truth in the unlabeled dataset, explicit IoU scores are unattainable. Therefore, we introduce an IoU estimation branch trained with the labeled data that predicts the IoU of each pixel for the unlabeled data. Then, we apply threshold $\mathcal{T}_{h}$ to choose the pixel-level points with high IoU scores from the background, resulting in dense hard pseudo-labels $\mathcal{P}^{h}$. We merge $\mathcal{P}^{h}$ and $\mathcal{P}^{e}$ to establish the final dense pseudo-labels $\mathcal{P}^{t}$, which will serve as the supervisory signal for the unsupervised training process. Based on $\mathcal{P}^{t}$, we sample the predictions from the student at the same coordinates, defined as $\mathcal{P}^{s}$.
After obtaining $\mathcal{P}^{t}$ and $\mathcal{P}^{s}$, the semi-supervised framework aims to make $\mathcal{P}^{s}$ approach $\mathcal{P}^{t}$, and we will describe how to construct one-to-one and many-to-many constraints in the following sections.
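The sampling procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `sample_dense_pseudo_labels`, the boolean mask `fg_mask` (assumed to be rasterized from the teacher's boxes after NMS), and the per-pixel `iou_map` (assumed to come from the IoU estimation branch) are hypothetical names introduced here. The default values follow Table I, and the strict comparison against the threshold is a choice of this sketch.

```python
import numpy as np

def sample_dense_pseudo_labels(fg_mask, iou_map, ratio=0.25, iou_thresh=0.1, rng=None):
    """Return an (N, 2) array of (row, col) coordinates of dense pseudo-labels.

    fg_mask : bool array (H, W), True inside teacher boxes kept after NMS (assumed input).
    iou_map : float array (H, W), per-pixel IoU from the IoU branch (assumed input).
    """
    rng = np.random.default_rng() if rng is None else rng

    # Easy case P^e: keep a random fraction `ratio` of the foreground pixels, so
    # the number of samples grows with the number and size of the instances.
    fg_coords = np.argwhere(fg_mask)
    n_easy = int(round(ratio * len(fg_coords)))
    keep = rng.choice(len(fg_coords), size=n_easy, replace=False)
    easy = fg_coords[keep]

    # Hard case P^h: background pixels whose estimated IoU exceeds the threshold.
    hard = np.argwhere(~fg_mask & (iou_map > iou_thresh))

    # Final dense pseudo-labels P^t are the union of P^e and P^h; the student
    # predictions P^s are read at the same coordinates.
    return np.concatenate([easy, hard], axis=0)
```

Because the easy samples are drawn as a fraction of the foreground area rather than as a fixed Top-K, a crowded scene yields more pseudo-labels than a sparse one.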

Admittedly, IoU estimation is a widely used technique. However, our objective fundamentally differs from that of previous methods [13, 26, 30], which primarily focus on box refinement. In contrast, our method aims to mine potential dense pseudo-labels from the background in a semi-supervised framework. It should be noted that this paper does not claim the novelty of the proposed SIDS but rather presents a simple and effective sampling strategy that yields high-quality dense pseudo-labels.

4.3 Geometry-aware Adaptive Weighting Loss

Orientation is an essential property of an oriented object. As shown in Fig. 1(a), the orientations of objects remain clear even when the objects are dense and small. Some oriented object detection methods [46, 68, 72] already exploit this property in their loss calculation. These works assume that the angles of the labels are reliable, so strictly forcing the prediction close to the ground truth is natural.

However, this assumption does not hold under the semi-supervised setting, as the pseudo-labels may be inaccurate. Simply forcing the student to align with the teacher may lead to noise accumulation, adversely affecting the model’s training. Additionally, the impact of orientation varies significantly for oriented objects with diverse aspect ratios. For instance, as illustrated in Fig. 4, an oriented object with a large aspect ratio is highly sensitive to rotation changes, and even a slight misalignment can cause a significant change in IoU.
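To make this sensitivity concrete, the following sketch numerically approximates the IoU between a rectangle and a copy of itself rotated about the shared center. The helper `rotated_self_iou` and its grid-sampling approach are illustrative assumptions introduced here (a coarse approximation, not an exact polygon intersection, and not the procedure used to produce Fig. 4).

```python
import math

def rotated_self_iou(aspect_ratio, angle_deg, grid=400):
    """Approximate IoU between a w x h box and the same box rotated by angle_deg."""
    w, h = float(aspect_ratio), 1.0
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    inter = 0
    # Sample cell centres over the unrotated box; every sample lies inside it.
    for i in range(grid):
        x = (i + 0.5) / grid * w - w / 2
        for j in range(grid):
            y = (j + 0.5) / grid * h - h / 2
            # Express the point in the rotated box's frame and test containment.
            xr = cos_t * x + sin_t * y
            yr = -sin_t * x + cos_t * y
            if abs(xr) <= w / 2 and abs(yr) <= h / 2:
                inter += 1
    inter_frac = inter / (grid * grid)  # intersection area divided by box area
    # Both boxes have the same area A, so IoU = I / (2A - I).
    return inter_frac / (2.0 - inter_frac)
```

Comparing `rotated_self_iou(1.5, 10)` with `rotated_self_iou(8, 10)` shows that, for the same 10-degree rotation, the elongated box retains noticeably less overlap than the compact one.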

Based on the above observations, we propose to softly utilize orientation information and aspect ratios, enabling the semi-supervised framework to better understand the geometric priors of oriented objects. Specifically, the difference in orientation between a prediction and a pseudo-label can indicate the difficulty of the instance to some extent and can be used to adjust the unsupervised loss dynamically. Moreover, since oriented objects with different aspect ratios exhibit varying sensitivities to rotation changes, this can be leveraged to modulate the importance of orientation differences in indicating instance difficulty. Therefore, we propose a geometry-aware modulating factor based on these two critical geometric priors. Similar to focal loss [34], this factor dynamically weights the loss of each pseudo-label-prediction pair by considering its intrinsic geometric information (e.g., orientation difference and aspect ratio).

Specifically, the geometry-aware modulating factor $\omega_{i}^{geo}$ for the $i$-th pair is defined as:

$$\omega_{i}^{geo} = 1 + \sigma_{i}, \tag{4}$$
$$\sigma_{i} = \psi \frac{\left|r_{i}^{t} - r_{i}^{s}\right|}{\pi} \frac{(a_{i}^{t} + a_{i}^{s})}{2}, \quad r_{i}^{t}, r_{i}^{s} \in \left[-\frac{\pi}{2}, \frac{\pi}{2}\right), \quad a_{i}^{t}, a_{i}^{s} \geq 1, \tag{5}$$

where $r_{i}^{t}$ and $r_{i}^{s}$ represent the rotation angles in radians of the $i$-th pseudo-label and prediction, respectively. $a_{i}^{t}$ and $a_{i}^{s}$ denote their aspect ratios. Adding a constant to $\sigma_{i}$ ensures that the original unsupervised loss is preserved when the orientations of the pseudo-label and prediction coincide. The modulating factor $\sigma_{i}$ considers both orientation differences and average aspect ratios. A toy example is that when the difference in orientation is marginal, but the object’s aspect ratio is large, the proposed factor can still efficiently highlight such a case as a possible hard example and help improve the learning process, and vice versa.

With the geometry-aware modulating factor, the overall Geometry-aware Adaptive Weighting (GAW) loss is formulated as:

$$\mathcal{L}_{GAW} = \sum_{i=1}^{N_{p}} \omega_{i}^{geo} \mathcal{L}_{u}^{i}, \tag{6}$$

where $N_{p}$ is the number of pseudo-labels and $\mathcal{L}_{u}^{i}$ is the basic unsupervised loss of the $i$-th pseudo-label-prediction pair. Using the geometry-aware modulating factor, the GAW loss provides a clear direction for optimization, facilitating the semi-supervised learning process.
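Eqs. (4)-(6) can be sketched in a few lines. The function names `gaw_weight` and `gaw_loss` are hypothetical, the per-pair base losses are assumed to be given as plain numbers, and the default $\psi = 50$ is taken from Table I; this is an illustration of the formulas rather than the authors' code.

```python
import math

def gaw_weight(r_t, r_s, a_t, a_s, psi=50.0):
    """Geometry-aware modulating factor of Eqs. (4)-(5) for one pair.

    r_t, r_s : rotation angles (radians) of the pseudo-label and the prediction.
    a_t, a_s : their aspect ratios (each >= 1).
    """
    sigma = psi * abs(r_t - r_s) / math.pi * (a_t + a_s) / 2.0
    return 1.0 + sigma

def gaw_loss(pairs, unsup_losses, psi=50.0):
    """Eq. (6): sum of per-pair unsupervised losses weighted by the factor.

    pairs        : list of (r_t, r_s, a_t, a_s) tuples.
    unsup_losses : list of per-pair base losses (assumed to be precomputed).
    """
    return sum(gaw_weight(*p, psi=psi) * l for p, l in zip(pairs, unsup_losses))
```

When the two angles coincide, the weight reduces to 1 and the base loss is left unchanged; for a fixed angular gap, the weight grows linearly with the average aspect ratio.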

Figure 4: IoU vs. Rotation angle over different aspect-ratios.

4.4 Noise-driven Global Consistency

The objects in an aerial image are usually dense and regularly distributed, as shown in Fig. 1(a). We argue that such an arrangement is similar to the structured layout of the text within a document. The spatial configuration of the object set is implicitly referred to as the layout, conveying the inter-object relationships and the overarching pattern intrinsic to the image. Ideally, the layout consistency between the student’s and the teacher’s predictions will be preserved if each pseudo-label-prediction pair is aligned. However, such a condition is too strict and may hurt performance when the pseudo-labels are noisy. Therefore, it is reasonable to add the consistency between layouts as an additional relaxed optimization objective, encouraging the student to learn information from the teacher from a global perspective. Besides, by treating the predictions from the student as a global distribution, the relations between these predictions can be regularized implicitly, which provides an implicit constraint on the student to some extent.

Figure 5: The details of our proposed Noise-driven Global Consistency (NGC). Random noise is added to the outputs of the teacher and the student, and optimal transport is then used to construct global consistency both before and after the disturbance. Furthermore, we align the two transport plans.

The details of our Noise-driven Global Consistency (NGC) are shown in Fig. 5. Given the output of teacher and student, we will treat them as two global distributions. First, we add different random noises to disturb both the teacher and the student distributions. Then, we propose to make multi-perspective alignments from three aspects: 1) aligning distributions between the original teacher and original student; 2) aligning distributions between the disturbed teacher and disturbed student; 3) aligning the transport plans from 1) and 2). Applying the multi-perspective alignments is beneficial for maintaining consistency and stability during the learning process. In other words, this global constraint makes the output distribution of the student model as close as possible to that of the teacher model despite disturbance. It will increase the models’ tolerance to minor variations in inputs, i.e., allowing them to more effectively capture the essential features of the unlabeled data without depending on noise or specific details.

Specifically, we leverage optimal transport theory to measure the global similarity of layouts between the teacher’s and the student’s predictions. Let $K$ denote the number of classes. We define the global distributions of classification scores predicted by the teacher ($\mathbf{s}^{t} \in \mathbb{R}^{N_{p} \times K}$) and the student ($\mathbf{s}^{s} \in \mathbb{R}^{N_{p} \times K}$) as follows:

$$\mathbf{d}_{i}^{t} = e^{\mathbf{s}_{i,cls(i)}^{t}}, \quad \mathbf{d}_{i}^{s} = e^{\mathbf{s}_{i,cls(i)}^{s}}, \tag{7}$$

where $cls(i) \in \{0, 1, \cdots, K-1\}$ is the class index with the largest score for the $i$-th pseudo-label, and $\mathbf{s}_{i,cls(i)}^{t}$ indicates the score of the $cls(i)$-th class for the $i$-th pseudo-label. $\mathbf{s}_{i,cls(i)}^{s}$ has a similar definition but refers to the predictions of the student. The exponential used here is empirical, aiming for numerical stability and ensuring the value is greater than 0.

To make the model generate robust consistency, we further add noises to disturb the distributions:

$$\widetilde{\mathbf{d}}^{t} = \mathbf{d}^{t} + \beta \boldsymbol{\epsilon}^{t}, \quad \widetilde{\mathbf{d}}^{s} = \mathbf{d}^{s} + \beta \boldsymbol{\epsilon}^{s}, \tag{8}$$

where $\boldsymbol{\epsilon}^{t}, \boldsymbol{\epsilon}^{s} \sim \mathcal{N}(0, 1)$ and $\beta$ is a hyper-parameter used to control the noise intensity. The original and disturbed global distributions both reflect the many-to-many relationships between the teacher and the student, highlighting the characteristics of objects in the aerial scenario. To construct the cost map for solving the OT problem, we consider the spatial distance and the score difference of each possible pair. Specifically, we measure the transport costs between pseudo-labels and predictions as follows:

$$\boldsymbol{C}_{i,j} = \boldsymbol{C}_{i,j}^{dist} + \boldsymbol{C}_{i,j}^{score}, \tag{9}$$
$$\boldsymbol{C}_{i,j}^{dist} = \frac{\|\mathbf{z}_{i}^{t} - \mathbf{z}_{j}^{s}\|_{2}}{\max_{1 \leq a, b \leq N_{p}} \|\mathbf{z}_{a}^{t} - \mathbf{z}_{b}^{s}\|_{2}}, \tag{10}$$
$$\boldsymbol{C}_{i,j}^{score} = \frac{\|\mathbf{s}_{i,cls(i)}^{t} - \mathbf{s}_{j,cls(j)}^{s}\|_{1}}{\max_{1 \leq a, b \leq N_{p}} \|\mathbf{s}_{a,cls(a)}^{t} - \mathbf{s}_{b,cls(b)}^{s}\|_{1}}, \tag{11}$$

where $\|\cdot\|_{1}$ is the $\ell^{1}$ norm of a vector and $\|\cdot\|_{2}$ represents the Euclidean distance. $\mathbf{z}_{i}^{t}$ and $\mathbf{z}_{j}^{s}$ are pixel-level 2D coordinates of the $i$-th sample in the teacher and the $j$-th sample in the student, respectively.
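As an illustration of Eqs. (7)-(11), the sketch below builds the global distributions, their disturbed versions, and the cost matrix with NumPy. The function names, the argument layout, and the small `eps` guard against division by zero are assumptions of this sketch and are not taken from the paper; the default `beta` follows Table I.

```python
import numpy as np

def global_distribution(scores, cls_idx):
    """Eq. (7): d_i = exp(s_{i, cls(i)}) for a score matrix of shape (N_p, K)."""
    return np.exp(scores[np.arange(len(scores)), cls_idx])

def disturb(d, beta=0.3, rng=None):
    """Eq. (8): add standard Gaussian noise scaled by beta."""
    rng = np.random.default_rng() if rng is None else rng
    return d + beta * rng.standard_normal(d.shape)

def cost_matrix(z_t, z_s, s_t, s_s, eps=1e-12):
    """Eqs. (9)-(11): normalized spatial cost plus normalized score cost.

    z_t, z_s : (N_p, 2) pixel coordinates of the teacher / student samples.
    s_t, s_s : (N_p,) scores s_{i, cls(i)} of the teacher / student.
    """
    # Pairwise Euclidean distances between all teacher and student samples.
    dist = np.linalg.norm(z_t[:, None, :] - z_s[None, :, :], axis=-1)
    # Pairwise absolute score differences.
    score = np.abs(s_t[:, None] - s_s[None, :])
    # Each term is divided by its maximum over all pairs; eps is added only in
    # this sketch to avoid division by zero in degenerate cases.
    return dist / (dist.max() + eps) + score / (score.max() + eps)
```

Because each term is normalized by its maximum, every entry of the resulting cost matrix lies in $[0, 2]$.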

Here, we define the global consistencies between the original (resp. noisy) teacher and original (resp. noisy) student as follows:

$$\begin{aligned} \mathcal{L}_{GC}(\mathbf{d}^{t}, \mathbf{d}^{s}) &= \mathcal{W}_{ot}(\mathbf{d}^{t}, \mathbf{d}^{s}) \\ &= \left\langle \boldsymbol{\lambda}^{*}, \frac{\mathbf{d}^{t}}{\|\mathbf{d}^{t}\|_{1}} \right\rangle + \left\langle \boldsymbol{\mu}^{*}, \frac{\mathbf{d}^{s}}{\|\mathbf{d}^{s}\|_{1}} \right\rangle, \end{aligned} \tag{12}$$
$$\begin{aligned} \mathcal{L}_{GC}(\widetilde{\mathbf{d}}^{t}, \widetilde{\mathbf{d}}^{s}) &= \mathcal{W}_{ot}(\widetilde{\mathbf{d}}^{t}, \widetilde{\mathbf{d}}^{s}) \\ &= \left\langle \widetilde{\boldsymbol{\lambda}}^{*}, \frac{\widetilde{\mathbf{d}}^{t}}{\|\widetilde{\mathbf{d}}^{t}\|_{1}} \right\rangle + \left\langle \widetilde{\boldsymbol{\mu}}^{*}, \frac{\widetilde{\mathbf{d}}^{s}}{\|\widetilde{\mathbf{d}}^{s}\|_{1}} \right\rangle, \end{aligned} \tag{13}$$

where ($\boldsymbol{\lambda}^{*}, \boldsymbol{\mu}^{*}$) and ($\widetilde{\boldsymbol{\lambda}}^{*}, \widetilde{\boldsymbol{\mu}}^{*}$) are the solutions of Eq. 3. We solve the OT problem with a fast Sinkhorn distances algorithm [9], obtaining an approximate solution. Based on the defined $\mathcal{L}_{GC}(\mathbf{d}^{t}, \mathbf{d}^{s})$ and $\mathcal{L}_{GC}(\widetilde{\mathbf{d}}^{t}, \widetilde{\mathbf{d}}^{s})$, their gradients with respect to $\mathbf{d}^{s}$ and $\widetilde{\mathbf{d}}^{s}$ are as follows (note that under the semi-supervised learning framework, only the predictions of the student model have gradients):

$$\frac{\partial \mathcal{L}_{GC}(\mathbf{d}^{t}, \mathbf{d}^{s})}{\partial \mathbf{d}^{s}} = \frac{\boldsymbol{\mu}^{*}}{\|\mathbf{d}^{s}\|_{1}} - \frac{\left\langle \boldsymbol{\mu}^{*}, \mathbf{d}^{s} \right\rangle}{\|\mathbf{d}^{s}\|_{1}^{2}}, \tag{14}$$
$$\frac{\partial \mathcal{L}_{GC}(\widetilde{\mathbf{d}}^{t}, \widetilde{\mathbf{d}}^{s})}{\partial \widetilde{\mathbf{d}}^{s}} = \frac{\widetilde{\boldsymbol{\mu}}^{*}}{\|\widetilde{\mathbf{d}}^{s}\|_{1}} - \frac{\left\langle \widetilde{\boldsymbol{\mu}}^{*}, \widetilde{\mathbf{d}}^{s} \right\rangle}{\|\widetilde{\mathbf{d}}^{s}\|_{1}^{2}}. \tag{15}$$

These gradients can be back-propagated to learn the parameters of the detector. In addition, there is an OT plan $\boldsymbol{P} \in \mathbb{R}^{N_{p} \times N_{p}}$ (resp. $\widetilde{\boldsymbol{P}} \in \mathbb{R}^{N_{p} \times N_{p}}$) between the original (resp. noisy) teacher and the original (resp. noisy) student (refer to Eq. 2). The OT plans reflect the mapping from the source distribution to the target one. We suggest aligning the transport plans obtained without and with disturbance, which can evaluate the impact of noise and offer auxiliary guidance for the model (Fig. 5). Given the OT plan, we multiply it with the student distribution to be better aware of the global prior and then utilize the $\mathrm{MSE}$ to measure the difference between the two types of OT plans:

$$\mathcal{L}_{plan} = \mathrm{MSE}(\boldsymbol{P} - \widetilde{\boldsymbol{P}}) \tag{16}$$

Overall, we obtain the training objective of NGC:

$$\mathcal{L}_{NGC} = \mathcal{L}_{GC}(\mathbf{d}^{t}, \mathbf{d}^{s}) + \mathcal{L}_{GC}(\widetilde{\mathbf{d}}^{t}, \widetilde{\mathbf{d}}^{s}) + \mathcal{L}_{plan}. \tag{17}$$

When this OT-based loss is optimized, the predictions converge closely to the pseudo-labels from a global perspective. We argue that promoting such multiple global consistencies not only relieves the negative effect of inaccurate pseudo-labels during detection, but also encourages a holistic learning approach that considers objects’ spatial and relational dynamics.
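The pieces of the NGC objective can be illustrated with a small NumPy sketch. It uses the standard entropy-regularized Sinkhorn iteration as a stand-in for the solver of [9]; the function names, the regularization strength `reg`, and the iteration count `n_iters` are arbitrary assumptions of this sketch, and the inputs are assumed to be strictly positive vectors. It is not the authors' implementation.

```python
import numpy as np

def sinkhorn(d_t, d_s, C, reg=0.5, n_iters=500):
    """Entropy-regularized OT between two positive vectors (illustrative).

    Returns the transport plan P and the dual potentials (lam, mu).
    """
    a = d_t / d_t.sum()          # l1-normalized teacher distribution
    b = d_s / d_s.sum()          # l1-normalized student distribution
    K = np.exp(-C / reg)         # Gibbs kernel of the cost matrix
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)          # rescale rows towards marginal a
        v = b / (K.T @ u)        # rescale columns towards marginal b
    P = u[:, None] * K * v[None, :]
    return P, reg * np.log(u), reg * np.log(v)

def gc_loss(lam, mu, d_t, d_s):
    """Dual-form value of Eq. (12): <lam, d_t/|d_t|_1> + <mu, d_s/|d_s|_1>."""
    return float(lam @ (d_t / d_t.sum()) + mu @ (d_s / d_s.sum()))

def gc_grad_student(mu, d_s):
    """Eq. (14): gradient with respect to d_s, with mu treated as a constant."""
    total = d_s.sum()
    return mu / total - (mu @ d_s) / total ** 2

def plan_alignment_loss(P, P_noisy):
    """Eq. (16), read here as the mean squared difference of the two plans."""
    return float(np.mean((P - P_noisy) ** 2))
```

Running `sinkhorn` once on the original distributions and once on the disturbed ones yields the two plans compared by `plan_alignment_loss`; summing the two `gc_loss` values and the plan term gives a value of the form of Eq. (17). Since $\mathbf{d}^{s}$ is positive, $\|\mathbf{d}^{s}\|_{1}$ is simply the sum of its entries, which is what makes the closed-form gradient in `gc_grad_student` a direct application of the quotient rule.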

It should be noted that the feature representation is considered an ideal global distribution only when supported by a sufficient number of samples (e.g., treating a few sample points as a global distribution is unrealistic). If the sample number $N_{p}$ is smaller than a fixed number $\mathcal{T}_{g}$, we do not treat $\mathbf{s}^{t}$ and $\mathbf{s}^{s}$ as global distributions. Accordingly, the semi-supervised framework will not adopt the $\mathcal{L}_{NGC}$.

Although optimal transport theory has been explored in other methods [15, 45, 16, 66], our goal of using OT is significantly different. They mainly focus on utilizing OT to implement neural architecture search [66, 45], label assignment [16, 4], and image matching [38]. However, the goal of our NGC is to establish the many-to-many relationship between the teacher and the student under the semi-supervised setting, providing robust global consistency, which is complementary to the GAW loss.

4.5 Training Objective

SOOD++ is trained with the proposed GAW and NGC for unlabeled data as well as the supervised loss for labeled data. The overall loss $\mathcal{L}$ is defined as:

$$\mathcal{L} = \underbrace{\mathcal{L}_{GAW} + \mathcal{L}_{NGC}}_{\mathcal{L}_{u}} + \mathcal{L}_{s}, \tag{18}$$

where $\mathcal{L}_{u}$ and $\mathcal{L}_{s}$ indicate the unsupervised and supervised loss, respectively. Note that the supervised loss $\mathcal{L}_{s}$ is the same as defined in the supervised baseline (rotated-FCOS), and our designs only modify the unsupervised part.
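The way the terms combine, including the sample-count condition on the NGC term described in Sec. 4.4, can be summarized by the following sketch. The function name `total_loss` and its scalar arguments are hypothetical; the default threshold follows Table I.

```python
def total_loss(loss_sup, loss_gaw, loss_ngc, num_samples, global_thresh=150):
    """Eq. (18) with the sample-count gate on the NGC term.

    All losses are assumed to be precomputed scalars; num_samples is N_p.
    """
    # With fewer than `global_thresh` samples, the predictions are not treated
    # as a global distribution, so the NGC term is dropped.
    use_ngc = num_samples >= global_thresh
    loss_unsup = loss_gaw + (loss_ngc if use_ngc else 0.0)
    return loss_unsup + loss_sup
```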

5 Experiments

5.1 Dataset and Evaluation Metric

Following conventions in SSOD, for each dataset, we consider two experimental settings, partially labeled data and fully labeled data, to validate the performance of a method on limited and abundant labeled data, respectively.

TABLE I: The detailed hyper-parameters of our method.
| Hyper-parameter | Value |
| --- | --- |
| $\psi$ | 50 |
| Sample ratio $\mathcal{R}$ | 0.25 |
| IoU threshold $\mathcal{T}_{h}$ | 0.1 |
| Noisy intensity $\beta$ | 0.3 |
| Global number threshold $\mathcal{T}_{g}$ | 150 |

DOTA-V1.5 [58] contains 2,806 large aerial images and 402,089 annotated oriented objects. It includes three subsets: DOTA-V1.5-train, DOTA-V1.5-val, and DOTA-V1.5-test, containing 1,411, 458, and 937 images, respectively. The annotations of the DOTA-V1.5-test are not released. There are 16 categories in this dataset. Compared with its previous version, DOTA-V1.0, DOTA-V1.5 contains more small instances (less than 10 pixels), which makes it more challenging. For the partially labeled data setting, we randomly sample 10%, 20%, and 30% of the images from DOTA-V1.5-train as labeled data and set the remaining images as unlabeled data. For the fully labeled data setting, we set DOTA-V1.5-train as labeled data and DOTA-V1.5-test as unlabeled data. We report the performance with the standard mean average precision (mAP).
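The partially labeled protocol above can be sketched as a seeded random split of the training image list. This is only an illustration: the function name `split_labeled_unlabeled`, the fixed seed, the placeholder image identifiers, and the rounding of the labeled count are assumptions, not details given in the paper.

```python
import random


def split_labeled_unlabeled(image_ids, labeled_ratio, seed=0):
    """Randomly pick a fraction of images as labeled; the rest are unlabeled."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    ids = sorted(image_ids)        # sort first so the result ignores input order
    rng.shuffle(ids)
    n_labeled = round(len(ids) * labeled_ratio)
    return ids[:n_labeled], ids[n_labeled:]


# DOTA-V1.5-train has 1,411 images; identifiers here are placeholders.
train_ids = [f"img_{i:04d}" for i in range(1411)]
for ratio in (0.10, 0.20, 0.30):
    labeled, unlabeled = split_labeled_unlabeled(train_ids, ratio)
    print(ratio, len(labeled), len(unlabeled))
```

Reusing one fixed split across all compared methods keeps the comparison fair, since every method then sees the same labeled images.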

DOTA-V1.0 [58] contains the same images and data splits as DOTA-V1.5, including 1,411 training images, 458 validation images, and 937 test images. As an earlier version of the DOTA-series datasets, it contains 15 categories, identical to those of DOTA-V1.5 except that container-crane is not annotated. For the partially labeled data setting, we use the same labeled and unlabeled splits as in DOTA-V1.5, including 10%, 20%, and 30%. For the fully labeled data setting, we set DOTA-V1.0-train as labeled data and DOTA-V1.0-test as unlabeled data. The mAP is used as the evaluation metric.

HRSC2016 [41] is a widely used dataset for arbitrarily oriented ship detection, consisting of 436 training images, 181 validation images, and 444 test images, with sizes ranging from 300×300 to 1,500×900. For the partially labeled data setting, we randomly sample 10%, 20%, and 30% of the images from the training set as labeled data and set the remaining images as unlabeled data. For the fully labeled data setting, we set the training set as labeled data and the test set as unlabeled data. We report the AP85 and mAP50:90 performance on the validation set.

The nuScenes Dataset [2] is a large-scale oriented 3D object detection dataset containing 1,000 distinct scenarios and 1.4M oriented 3D bounding boxes from 10 categories. These scenarios are recorded with a 32-beam LiDAR, six surround-view cameras, and five radars. The scenes are officially split into 700/150/150 scenes for the training/validation/test set. Each scene is a 20-second video and is fully annotated with 3D bounding boxes every 0.5 seconds. For the partially labeled data setting, we randomly sample 1/12, 1/6, and 1/3 of the annotated frames from the training set as labeled data and set the remaining training frames as unlabeled data. For the fully labeled data setting, we set all annotated frames from the training set as labeled data and unannotated frames as unlabeled data. We use the mAP and NDS as the primary metrics. Higher values indicate better performance, and the detailed information can be found in [2].
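For readers unfamiliar with NDS, the sketch below shows the nuScenes detection score as defined by the benchmark: a weighted combination of mAP and five mean true-positive error metrics (translation, scale, orientation, velocity, attribute), each clipped to [0, 1]. The function name `nds` and the input values are hypothetical; inputs are fractions in [0, 1], whereas the tables in this paper report percentages.

```python
def nds(m_ap, tp_errors):
    """nuScenes detection score.

    m_ap      : mean average precision in [0, 1]
    tp_errors : the five mean TP errors (mATE, mASE, mAOE, mAVE, mAAE)
    """
    if len(tp_errors) != 5:
        raise ValueError("expected exactly five TP error metrics")
    # Each error is turned into a score in [0, 1]; errors above 1 score 0.
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors]
    # mAP carries weight 5 and each TP score weight 1, normalised by 10.
    return (5.0 * m_ap + sum(tp_scores)) / 10.0


# Hypothetical values, for illustration only.
print(nds(0.30, [0.5, 0.5, 0.5, 0.5, 0.5]))
```

Because half of the score comes from the error terms, a method can improve NDS without improving mAP, which is why both metrics are reported.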

5.2 Implementation Details

For the DOTA-V1.5, DOTA-V1.0, and HRSC2016 datasets, we take FCOS [56] as the representative anchor-free detector (with ResNet-50 [22] and FPN [33] as the backbone), while BEVDet [24] is adopted for the nuScenes dataset. Following previous works [58, 20, 21], for the DOTA-series datasets we crop the original images into 1,024$\times$1,024 patches with a stride of 824, i.e., two adjacent patches overlap by 200 pixels. We utilize asymmetric data augmentation for unlabeled data: strong augmentation (random flipping, color jittering, grayscale, and Gaussian blur) for the student model and weak augmentation (random flipping) for the teacher model. The models are trained for 180k iterations on 2$\times$ RTX 3090 GPUs (except for the nuScenes dataset, which uses 8$\times$ V100 GPUs). With the SGD optimizer, the initial learning rate of 0.0025 is divided by ten at 120k and 160k iterations. The momentum and the weight decay are set to 0.9 and 0.0001, respectively. Each GPU takes three images as input, where the proportion between unlabeled and labeled data is set to 1:2. Following previous SSOD works [76, 39], we use the “burn-in” strategy to initialize the teacher model. The detailed hyper-parameters are shown in Tab. I.
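The cropping step can be illustrated with a short sketch that computes the top-left corners of 1,024-pixel windows placed with a stride of 824. The function names are hypothetical, and the border handling (shifting the last window back so that it ends at the image border) is an assumption commonly made by DOTA splitting tools, not something the text specifies.

```python
def window_starts(length, patch=1024, stride=824):
    """Start offsets of sliding windows along one image axis."""
    if length <= patch:
        return [0]                      # a single (possibly padded) window
    starts = list(range(0, length - patch, stride))
    starts.append(length - patch)       # last window is flush with the border
    return starts


def patch_boxes(width, height, patch=1024, stride=824):
    """All (x0, y0, x1, y1) crop boxes for an image of the given size."""
    return [(x, y, x + patch, y + patch)
            for y in window_starts(height, patch, stride)
            for x in window_starts(width, patch, stride)]


print(window_starts(1848))          # two windows with a 200-pixel overlap
print(len(patch_boxes(4000, 3000)))
```

With a stride of 824 and a patch size of 1,024, interior windows share 1,024 − 824 = 200 pixels, so objects cut by one patch boundary are usually fully visible in the neighboring patch.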

TABLE II: Experimental results on the DOTA-V1.5 (val set) with single-scale training and testing under 10%, 20%, and 30% partially labeled data. Unbiased Teacher and Soft Teacher are implemented with rotated-Faster R-CNN; the other semi-supervised methods are implemented with rotated-FCOS. Numbers in parentheses show the performance improvement over the previous conference version (also one of the SOTA methods).
| Setting | Method | Publication | 10% | 20% | 30% |
| --- | --- | --- | --- | --- | --- |
| Supervised-baseline | Rotated-Faster R-CNN [50] | NeurIPS 2016 | 43.43 | 51.32 | 53.14 |
| Supervised-baseline | Rotated-FCOS [56] | ICCV 2019 | 42.78 | 50.11 | 54.79 |
| Semi-supervised | Unbiased Teacher [39] | ICLR 2021 | 44.51 | 52.80 | 53.33 |
| Semi-supervised | Soft Teacher [64] | ICCV 2021 | 48.46 | 54.89 | 57.83 |
| Semi-supervised | Dense Teacher [76] | ECCV 2022 | 46.90 | 53.93 | 57.86 |
| Semi-supervised | ARSL [35] | CVPR 2023 | 47.56 | 55.05 | 58.94 |
| Semi-supervised | SOOD [23] (conference version) | CVPR 2023 | 48.63 | 55.58 | 59.23 |
| Semi-supervised | SOOD++ (ours) | - | 50.48 (+1.85) | 57.44 (+1.86) | 61.51 (+2.28) |
TABLE III: Experimental results on DOTA-V1.5 (val set) with single-scale training and testing under fully labeled data. Numbers before the arrow are the supervised baseline (66.12 for rotated-Faster R-CNN and 65.46 for rotated-FCOS).
| Method | Publication | mAP |
| --- | --- | --- |
| Unbiased Teacher [39] | ICLR 2021 | 66.12 $\xrightarrow{-1.27}$ 64.85 |
| Soft Teacher [64] | ICCV 2021 | 66.12 $\xrightarrow{+0.28}$ 66.40 |
| Dense Teacher [76] | ECCV 2022 | 65.46 $\xrightarrow{+0.92}$ 66.38 |
| ARSL [35] | CVPR 2023 | 65.46 $\xrightarrow{+1.22}$ 66.68 |
| SOOD++ (ours) | - | 65.46 $\xrightarrow{+3.58}$ 69.04 |

6 Results and Analysis

In this section, we compare our method with the state-of-the-art SSOD methods [76, 64, 39, 35] on several challenging datasets. For a fair comparison, we re-implement these methods on oriented object detectors with the same augmentation setting and carefully tune their hyper-parameters to obtain their best performance.

6.1 Main Results

6.1.1 Results on the DOTA-V1.5 dataset

We first evaluate our method on the large-scale, challenging, and most convincing oriented object detection dataset, i.e., the DOTA-V1.5 dataset.

Partially labeled data. We conduct a comprehensive evaluation of our SOOD++ across various proportions of labeled data, as shown in Tab. II. We can make the following observations: 1) Our SOOD++ achieves state-of-the-art performance under all proportions, achieving 50.48, 57.44, and 61.51 mAP on 10%, 20%, and 30% proportions, respectively, surpassing our supervised baseline rotated-FCOS [56] by 7.70, 7.33, and 6.72 mAP. 2) Compared with the state-of-the-art anchor-free method ARSL [35] and our conference version [23], we outperform them by 2.92/1.85, 2.39/1.86, and 2.57/2.28 mAP, demonstrating the substantial improvements we made over the conference version. 3) Compared with Unbiased Teacher [39] and Soft Teacher [64], two representative anchor-based methods, our SOOD++ achieves higher performance on the 10% and 20% proportions even though our supervised baseline (rotated-FCOS) is weaker than the one they use (rotated-Faster R-CNN). Under the 30% data proportion, SOOD++ surpasses Unbiased Teacher [39] and Soft Teacher [64] by a large margin, i.e., 8.18 and 3.68 mAP, respectively.
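The margins quoted above follow directly from Tab. II; the short check below recomputes them from the table entries. The dictionary keys and the helper name `gains` are chosen here for illustration only.

```python
# mAP under 10%, 20%, and 30% labeled data, copied from Tab. II.
table_ii = {
    "Rotated-FCOS": (42.78, 50.11, 54.79),
    "ARSL": (47.56, 55.05, 58.94),
    "SOOD": (48.63, 55.58, 59.23),
    "SOOD++": (50.48, 57.44, 61.51),
}


def gains(ours, other):
    """Per-setting mAP difference, rounded to two decimals."""
    return tuple(round(a - b, 2) for a, b in zip(ours, other))


for name in ("Rotated-FCOS", "ARSL", "SOOD"):
    print(name, gains(table_ii["SOOD++"], table_ii[name]))
```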

Fully labeled data. We further compare our SOOD++ with other semi-supervised object detection (SSOD) approaches under the fully labeled data setting, which is more challenging. The goal of this setting is to use additional unlabeled data to enhance the performance of a detector that is already well trained on large-scale labeled data. As shown in Tab. III, we surpass previous SSOD methods [76, 64, 39, 35] by at least 2.36 mAP. Besides, even though our baseline has been well trained on massive labeled data, we still achieve a 3.58 mAP improvement over it, demonstrating our method’s ability to learn from unlabeled data. An interesting phenomenon is that the performance of Unbiased Teacher [39] drops after adding unlabeled data. We argue the reason might be that Unbiased Teacher does not apply unsupervised losses for bounding box regression, which is essential for oriented object detection.

Figure 6: Experimental results on the DOTA-V1.0 (val set) with single-scale training and testing under 10%, 20%, and 30% partially labeled data. All methods are implemented with rotated-FCOS.
| Setting | Method | Publication | 10% | 20% | 30% |
| --- | --- | --- | --- | --- | --- |
| Supervised-baseline | Rotated-FCOS [56] | ICCV 2019 | 46.20 | 53.88 | 57.30 |
| Semi-supervised | Dense Teacher [76] | ECCV 2022 | 50.87 | 58.02 | 60.91 |
| Semi-supervised | ARSL [35] | CVPR 2023 | 52.89 | 59.84 | 62.92 |
| Semi-supervised | SOOD++ (ours) | - | 54.17 | 60.53 | 64.93 |
Figure 7: Experimental results on DOTA-V1.0 (val set) with single-scale training and testing under fully labeled data. Numbers before the arrow are the supervised baseline (rotated-FCOS).
| Method | Publication | mAP |
| --- | --- | --- |
| Dense Teacher [76] | ECCV 2022 | 70.22 $\xrightarrow{+0.98}$ 71.20 |
| ARSL [35] | CVPR 2023 | 70.22 $\xrightarrow{+1.41}$ 71.63 |
| SOOD++ (ours) | - | 70.22 $\xrightarrow{+2.18}$ 72.40 |
TABLE IV: Experimental results on HRSC2016 (val set) with single-scale training and testing under 10%, 20%, and 30% partially labeled data. All methods are implemented with rotated-FCOS.
| Setting | Method | Publication | 10% AP85(%) | 10% mAP50:90(%) | 20% AP85(%) | 20% mAP50:90(%) | 30% AP85(%) | 30% mAP50:90(%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Supervised | Rotated-FCOS [56] | ICCV 2019 | 7.30 | 37.49 | 30.20 | 52.57 | 37.40 | 58.78 |
| Semi-supervised | Dense Teacher [76] | ECCV 2022 | 21.90 | 49.93 | 32.80 | 59.63 | 41.40 | 63.08 |
| Semi-supervised | ARSL [35] | CVPR 2023 | 21.50 | 49.17 | 33.60 | 59.76 | 41.10 | 63.35 |
| Semi-supervised | SOOD++ (ours) | - | 22.90 | 52.62 | 40.30 | 62.14 | 47.80 | 65.27 |

6.1.2 Results on the DOTA-V1.0 dataset

We then conduct the experiments on the DOTA-V1.0 dataset, which has the same images as the DOTA-V1.5 dataset but provides coarse annotations (e.g., many tiny objects with fewer pixels are not annotated).

Partially labeled data. We evaluate our method under different labeled data proportions on the DOTA-V1.0 dataset, as shown in Tab. 6. Similar to the comparison on the DOTA-V1.5 dataset, our SOOD++ achieves state-of-the-art performance under all proportions. Specifically, it surpasses the SOTA method [35] by 1.28, 0.69, and 2.01 mAP under 10%, 20%, and 30% labeled data settings, respectively. These results demonstrate that our method handles the coarsely annotated dataset well.

Fully labeled data. Under this setting, we can explore the potential of the proposed semi-supervised framework even when the detector has been trained with extensive labeled data. As shown in Tab. 7, although the supervised baseline has already achieved high performance, our SOOD++ still improves it by 2.18 mAP by using the unlabeled data. In addition, our method performs better than other semi-supervised methods [76, 35] with the same baseline.

6.1.3 Results on the HRSC2016 dataset

This part evaluates the proposed method on another representative aerial scenes dataset, i.e., HRSC2016 [41]. Notably, the commonly utilized mAP(07) metric fails to adequately show the performance discrepancies among different methodologies. Consequently, we use more stringent evaluation metrics to provide a more precise performance evaluation, including AP85 and mAP50:90.
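To make the two metrics concrete, the sketch below treats AP85 as the AP at an IoU threshold of 0.85 and mAP50:90 as the mean AP over IoU thresholds from 0.50 to 0.90. The step of 0.05 between thresholds is an assumption borrowed from the usual COCO-style convention; the paper does not state it. The function names and the AP values are hypothetical.

```python
def iou_thresholds(start=0.50, stop=0.90, step=0.05):
    """IoU thresholds from start to stop (inclusive), assuming a fixed step."""
    n = round((stop - start) / step) + 1
    return [round(start + i * step, 2) for i in range(n)]


def mean_ap(ap_by_threshold, thresholds=None):
    """Average the per-threshold AP values over the given IoU thresholds."""
    thresholds = thresholds or iou_thresholds()
    return sum(ap_by_threshold[t] for t in thresholds) / len(thresholds)


thresholds = iou_thresholds()
print(thresholds)

# Hypothetical per-threshold AP values that decay as the IoU threshold tightens.
ap = {t: 90.0 - 100.0 * (t - 0.5) for t in thresholds}
print("AP85:", ap[0.85])
print("mAP50:90:", mean_ap(ap))
```

A stricter threshold such as 0.85 rewards accurate localization and orientation, which is where differences between methods become visible on HRSC2016.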

Partially labeled data. As shown in Tab. IV, our SOOD++ consistently outperforms other semi-supervised methods [35, 76] in various settings. Especially under the rigorous metric (i.e., AP85), our method shows a more pronounced improvement, surpassing ARSL [35] by 6.70 AP85 under both the 20% and 30% labeled data settings, indicating the effectiveness of our method.

Fully labeled data. Similar to the above datasets, we also compare the performance under the fully labeled data setting. As shown in Tab. V, our method outperforms previous SSOD methods [35, 76] and surpasses our supervised baseline by a large margin of 8.90 AP85.

Figure 8: Some typical examples from the DOTA-V1.5 dataset. From left to right: ground truth, the supervised baseline (rotated-FCOS [56]), the SOTA method ARSL [35], and our proposed SOOD++. Row 2 shows a case with extreme aspect ratios. Rows 3 and 4 are two extremely dense scenes. The dashed and solid red circles represent false negatives and false positives, respectively.
TABLE V: Experimental results on the HRSC2016 (val set) dataset with single-scale training and testing under fully labeled data. Numbers before the arrow are the supervised baseline (rotated-FCOS).
| Method | Publication | AP85 |
| --- | --- | --- |
| Dense Teacher [76] | ECCV 2022 | 49.50 $\xrightarrow{+3.20}$ 52.70 |
| ARSL [35] | CVPR 2023 | 49.50 $\xrightarrow{+5.00}$ 54.50 |
| SOOD++ (ours) | - | 49.50 $\xrightarrow{+8.90}$ 58.40 |
TABLE VI: The generalization ability of our SOOD++. The results are reported with single-scale training and testing.
(a) The effectiveness of SOOD++ on different methods under DOTA-V1.5 (val set) fully labeled data.
| Detector | Publication | Method | mAP |
| --- | --- | --- | --- |
| CFA [19] | CVPR 2021 | Supervised | 65.75 |
| CFA [19] | CVPR 2021 | SOOD++ (ours) | 68.49 |
| Oriented R-CNN [62] | ICCV 2021 | Supervised | 67.26 |
| Oriented R-CNN [62] | ICCV 2021 | SOOD++ (ours) | 69.50 |
(b) Experimental results on DOTA-V1.5 (test set). * indicates the strong supervised baseline we implemented.
| Setting | Method | Publication | mAP |
| --- | --- | --- | --- |
| Supervised SOTA | DCFL [63] | CVPR 23 | 71.03 |
| Supervised SOTA | LSKNet [32] | ICCV 23 | 70.26 |
| Supervised SOTA | PKINet [3] | CVPR 24 | 71.47 |
| Supervised | Strong Oriented R-CNN* | - | 70.66 |
| Semi-supervised | SOOD++ (ours) | - | 72.48 |

6.2 Visualization analysis

We further visualize the qualitative results in Fig. 8. These samples are selected from typical scenes on the DOTA-V1.5 dataset (validation set). Specifically, the first sample is a sparse and relatively simple case, on which most semi-supervised methods work well. However, in challenging cases, e.g., the large-aspect-ratio case in the second sample and the dense, small-object cases in the third and fourth samples, our method produces fewer false negatives and false positives than the SOTA method [35]. We believe this is because SOOD++ can exploit more potential semantic information from unlabeled data by leveraging the geometry prior and global consistency. These qualitative results further highlight the effectiveness of our method.

6.3 The generalization of our method

This section explores the generalization capabilities of our method from two perspectives: 1) We evaluate our method on different detectors, e.g., CFA [19] and Oriented R-CNN [62]; 2) We conduct experiments on the semi-supervised multi-view oriented 3D object detection to demonstrate the generalization on 3D scenes.

6.3.1 The generalization on different detectors

To examine the adaptability of our method, we conduct experiments on a suite of representative oriented object detectors under the fully labeled data setting. Specifically, we choose CFA [19] and Oriented R-CNN [62] for evaluation, as shown in Tab. VI(a). When trained with the full DOTA-V1.5 training data, the fully supervised Oriented R-CNN [62] achieves 67.26 mAP, which is already a strong result. Even so, our method improves it by 2.24 mAP and reaches 69.50 mAP by leveraging the unlabeled data. In addition, our method also reports a 2.74 mAP improvement on CFA [19]. These results demonstrate our method’s ability to integrate with diverse detectors and effectively leverage the strengths of the proposed key designs.

We further implement a strong supervised baseline (Oriented R-CNN with EVA pre-trained weights [12]) and combine it with our SOOD++ to compare with SOTA oriented detectors [32, 63, 3], as shown in Tab. VI(b). Although the strong baseline achieves 70.66 mAP on the DOTA-V1.5 test set, our approach still brings a 1.82 mAP improvement by leveraging unlabeled data (we utilize the DOTA-V2.0 [11] dataset as the unlabeled data in this setting). It outperforms PKINet [3] by 1.01 mAP, setting a new state of the art.

TABLE VII: The generalization on the oriented 3D object detection dataset (nuScenes [2] val set). All semi-supervised methods use the same detector (BEVDet [24]).
| Setting | Method | 1/12 data mAP | 1/12 data NDS | 1/6 data mAP | 1/6 data NDS | 1/3 data mAP | 1/3 data NDS | Full data mAP | Full data NDS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Supervised | BEVDet [24] | 8.29 | 11.47 | 13.75 | 19.52 | 20.31 | 25.52 | 25.97 | 33.02 |
| Semi-supervised | ARSL [35] | 8.39 | 11.90 | 15.58 | 20.91 | 21.56 | 27.13 | 27.00 | 34.57 |
| Semi-supervised | SOOD++ (ours) | 8.63 | 13.26 | 17.42 | 22.03 | 23.24 | 28.24 | 29.66 | 36.32 |
TABLE VIII: The effectiveness of our key components: Simple Instance-aware Dense Sampling (SIDS) strategy, Geometry-aware Adaptive Weighting (GAW) loss, and Noise-driven Global Consistency (NGC).
| Sampling strategy | GAW | NGC | mAP (10%) | mAP (20%) | mAP (30%) |
| --- | --- | --- | --- | --- | --- |
| From Dense Teacher [76] | - | - | 46.90 | 53.93 | 57.86 |
| From SOOD [23] | - | - | 47.24 | 54.07 | 57.74 |
| SIDS | - | - | 48.21 | 55.54 | 58.69 |
| SIDS | ✓ | - | 49.27 | 56.45 | 60.11 |
| SIDS | - | ✓ | 49.32 | 56.63 | 59.76 |
| SIDS | ✓ | ✓ | 50.48 | 57.44 | 61.51 |
TABLE IX: The analysis of our Simple Instance-aware Dense Sampling (SIDS) strategy. The experiments are equipped with GAW and NGC.
(a) The effect of sample ratio $\mathcal{R}$.
| Sample Ratio $\mathcal{R}$ | mAP |
| --- | --- |
| 0.125 | 50.13 |
| 0.25 | 50.48 |
| 0.5 | 49.92 |
| 1.0 | 49.11 |
(b) The effect of different filtering criteria adopted in the background.
| Criteria | mAP |
| --- | --- |
| Classification | 49.41 |
| Centerness | 49.43 |
| IoU | 50.48 |
(c) The effect of IoU threshold $\mathcal{T}_{h}$.
| Threshold $\mathcal{T}_{h}$ | mAP |
| --- | --- |
| 0.05 | 50.39 |
| 0.1 | 50.48 |
| 0.2 | 50.25 |
| 0.3 | 50.17 |
Figure 9: Qualitative results of our method on semi-supervised multi-view oriented 3D object detection task. The above multi-view images show the 3D detection results (green boxes). We also draw the predicted results on the BEV view (the lower part) for better visualizations.

6.3.2 The generalization on semi-supervised multi-view oriented 3D object detection

We find that the multi-view oriented 3D objects are very similar to the objects from aerial scenes, as both of them have arbitrary orientations and are usually dense. In addition, the advanced multi-view oriented 3D object detection methods usually detect 3D objects by transforming the perspective from the camera view to the Bird’s Eye View (BEV), similar to aerial scenes captured from a BEV. Thus, to further demonstrate the generalization of our method, we conduct experiments on the most convincing large-scale multi-view oriented 3D object detection dataset, i.e., nuScenes [2]. As shown in Tab. VII, our method reports superior performance, consistently outperforming the supervised baseline and ARSL [35] by a large margin in terms of mAP and NDS metrics under both partially and fully labeled data settings.

In addition, we provide some visualization to present the effectiveness of our method, as shown in Fig. 9. These quantitative and qualitative results prove SOOD++’s ability to be generalized to semi-supervised multi-view oriented 3D object detection.

6.4 Analysis and Ablation Study

The following experiments are conducted on the large-scale DOTA-V1.5 dataset (val set). Unless specified, all the ablation experiments are performed using 10% of labeled data with single-scale training and testing.

6.4.1 The effectiveness of key components

We first study the effects of the proposed key components, including the Simple Instance-aware Dense Sampling (SIDS) strategy, Geometry-aware Adaptive Weighting (GAW) loss, and Noise-driven Global Consistency (NGC), with the results listed in Tab. VIII.

There are various dense sampling strategies for constructing dense pseudo-labels. Dense Teacher [76] and our conference version [23] sample dense pseudo-labels from the entire predicted score maps and from the box area of the foreground, respectively. This work not only considers the easy pseudo-labels of the foreground but also mines the potential pseudo-labels of the background, leading to the proposal of SIDS. Although our SIDS is relatively simple, it notably improves over the strategies in [76, 23] (e.g., 48.21 vs. 46.90 and 47.24 mAP with 10% labeled data), which indicates that it constructs higher-quality pseudo-labels.

Based on SIDS, when GAW or NGC is individually incorporated, an increase in mAP is observed. It demonstrates the effectiveness of constructing the one-to-one or many-to-many (global) consistency. Finally, when combining all our key components (NGC, GAW, and SIDS), the improvement will be amplified, achieving satisfactory 50.48, 57.44, and 61.51 mAP under the 10%, 20%, and 30% labeled data settings, respectively. It is reasonable, as the one-to-one and many-to-many relationships are highly complementary. These detailed ablation studies prove the consistent improvements and the value of each component.

6.4.2 The analysis of Simple Instance-aware Dense Sampling (SIDS) strategy

In this part, we discuss the design of our Simple Instance-aware Dense Sampling (SIDS) strategy, which is used to construct high-quality dense pseudo-labels.

The effect of sample ratio $\mathcal{R}$. Tab. IX(a) studies the effect of the sample ratio $\mathcal{R}$ used in the foreground. The larger the ratio, the more pseudo-labels are sampled, and vice versa. We can see that the best performance, 50.48 mAP, is achieved when the sample ratio $\mathcal{R}$ is set to 0.25. We hypothesize that this value ensures a good balance between positive and negative pseudo-labels. Increasing it introduces more noisy pseudo-labels that harm the training process, and decreasing it discards some positive pseudo-labels and hinders learning. When it is set to 1.0, all pixel-level predictions within the instance box are sampled, which inevitably introduces many noisy dense pseudo-labels, leading to poor performance.

The filtering criteria used in the background. As discussed in Sec. 4.2, there are many potential hard objects that merge with the background, which are hard to extract as pseudo-labels. Thus, to construct comprehensive dense pseudo-labels, we utilize the predicted IoU score as an auxiliary metric to help sample dense pseudo-labels from the background. We study the effect of using different auxiliary metrics to mine the potential labels, including the classification and centerness scores of the default rotated-FCOS [56], as shown in Tab. IX(b). The results show that using IoU as the filtering criterion is the most effective for mining hard cases in the background. We argue the main reason is that the IoU score directly reflects the alignment between the predicted box and the ground truth, making it a more comprehensive metric than the other two.

The effect of threshold $\mathcal{T}_{h}$. We also discuss the effect of the IoU threshold $\mathcal{T}_{h}$, as shown in Tab. IX(c). The performance is insensitive to the chosen value, and we empirically find that setting the threshold to 0.1 achieves the best performance.
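The roles of $\mathcal{R}$ and $\mathcal{T}_{h}$ discussed above can be illustrated with a rough sketch over a flattened list of pixel-level predictions. This is not the authors' implementation: the function name `sids_select` is hypothetical, ranking foreground pixels by classification score is an assumption (the exact rule is given in Sec. 4.2, not here), and real predictions would be per-level feature maps rather than short lists.

```python
def sids_select(cls_score, iou_score, in_box, ratio=0.25, iou_thr=0.1):
    """Pick dense pseudo-label indices from foreground and background pixels.

    cls_score : per-pixel classification score from the teacher
    iou_score : per-pixel predicted IoU score from the teacher
    in_box    : True if the pixel lies inside a coarse instance box
    ratio     : sample ratio R applied to foreground pixels
    iou_thr   : IoU threshold T_h applied to background pixels
    """
    fg = [i for i, inside in enumerate(in_box) if inside]
    bg = [i for i, inside in enumerate(in_box) if not inside]

    # Foreground: keep the top-R fraction (assumed ranking: classification score).
    k = max(1, round(len(fg) * ratio)) if fg else 0
    fg_keep = sorted(fg, key=lambda i: cls_score[i], reverse=True)[:k]

    # Background: keep pixels whose predicted IoU exceeds the threshold.
    bg_keep = [i for i in bg if iou_score[i] > iou_thr]
    return sorted(fg_keep), bg_keep


cls = [0.9, 0.8, 0.2, 0.1, 0.05, 0.3, 0.02, 0.6]
iou = [0.7, 0.6, 0.3, 0.2, 0.05, 0.4, 0.0, 0.15]
inside = [True, True, True, True, False, False, False, False]
print(sids_select(cls, iou, inside))
```

With `ratio=1.0` every foreground pixel is kept, which corresponds to the noisy setting that performs worst in Tab. IX(a).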

TABLE X: The analysis of our Geometry-aware Adaptive Weighting (GAW) loss. The experiments are equipped with SIDS and NGC.
(a) The effect of used geometry information.
Orientation gap Aspect ratio mAP
- - 49.32
- 49.81
- 49.76
50.48
(b) The effect of $\psi$.
| $\psi$ | mAP |
| --- | --- |
| w/o GAW | 49.32 |
| 1 | 49.83 |
| 10 | 49.95 |
| 50 | 50.48 |
| 100 | 50.13 |
TABLE XI: The analysis of our Noise-driven Global Consistency (NGC). The experiments are equipped with SIDS strategy and GAW loss.
(a) The effect of different global alignments.
| Components | mAP |
| --- | --- |
| - | 49.27 |
| $\mathcal{L}_{GC}(\mathbf{d}^{t},\mathbf{d}^{s})$ | 49.83 |
| $\mathcal{L}_{GC}(\mathbf{d}^{t},\mathbf{d}^{s})+\mathcal{L}_{GC}(\widetilde{\mathbf{d}}^{t},\widetilde{\mathbf{d}}^{s})$ | 50.17 |
| $\mathcal{L}_{GC}(\mathbf{d}^{t},\mathbf{d}^{s})+\mathcal{L}_{GC}(\widetilde{\mathbf{d}}^{t},\widetilde{\mathbf{d}}^{s})+\mathcal{L}_{plan}$ | 50.48 |
(b) The effect of different cost maps.
Distance $\boldsymbol{C}_{i,j}^{dist}$ Score $\boldsymbol{C}_{i,j}^{score}$ mAP
- - 49.27
- 49.80
- 50.09
50.48
(c) The effect of noisy intensity $\beta$.
| $\beta$ | mAP |
| --- | --- |
| 0.1 | 50.31 |
| 0.2 | 50.38 |
| 0.3 | 50.48 |
| 0.4 | 50.13 |
| 0.5 | 50.21 |
(d) The effect of $\mathcal{T}_{g}$.
| $\mathcal{T}_{g}$ | mAP |
| --- | --- |
| 1 | 49.06 |
| 100 | 49.83 |
| 150 | 50.48 |
| 200 | 49.95 |
| 250 | 49.68 |

6.4.3 The analysis of Geometry-aware Adaptive Weighting (GAW) loss

The effect of geometry information in the GAW loss. In GAW, we utilize the intrinsic geometry information (e.g., orientation gap and aspect ratios) to construct the geometry-aware modulating factor (Eq. 5). To investigate their influence, we conduct experiments on different combinations of the geometry information in the GAW loss. As shown in Tab. X(a), without the GAW, we only achieve 49.32 mAP. Individually utilizing the orientation gap or the aspect ratio to construct the GAW yields similar performance gains. When we combine the two types of information, the performance reaches 50.48 mAP. This improvement confirms that using both types of geometric information together provides a more thorough understanding of the oriented objects’ characteristics, leading to better detection performance.

The effect of hyper-parameter $\psi$. We also study the influence of the hyper-parameter $\psi$ in GAW. As shown in Tab. X(b), setting $\psi$ to 1 gives 49.83 mAP. The performance of our method improves as $\psi$ increases from 1 to 50. However, further increasing it to 100 slightly hurts the performance. Therefore, we set it to 50 by default. For this observation, we conjecture that increasing the weight $\psi$ enlarges the influence of geometry information but also amplifies the impact of the teacher’s inaccurate labels. Notably, GAW consistently provides performance gains regardless of the $\psi$ value.

6.4.4 The analysis of Noise-driven Global Consistency (NGC)

Our NGC aims to establish a relaxed global constraint for the teacher-student pair, and this part will ablate it.

The effect of different global alignments. As mentioned in Sec. 4.4, we treat the outputs of the teacher and the student as two global distributions and propose to use random noise to disturb the distributions. Then, we perform alignment from multiple perspectives, including the alignment ($\mathcal{L}_{GC}(\mathbf{d}^{t},\mathbf{d}^{s})$) between the original teacher and the original student, the alignment ($\mathcal{L}_{GC}(\widetilde{\mathbf{d}}^{t},\widetilde{\mathbf{d}}^{s})$) between the disturbed teacher and the disturbed student, and the alignment ($\mathcal{L}_{plan}$) between the two OT plans generated by the two former alignments. Thus, we first study the effectiveness of using different alignments. As shown in Tab. XI(a), using only $\mathcal{L}_{GC}(\mathbf{d}^{t},\mathbf{d}^{s})$, we achieve 49.83 mAP, a 0.56 mAP gain over the method equipped with only the SIDS strategy and GAW loss, indicating the necessity of the many-to-many relationship. As expected, by further aligning the two disturbed distributions ($\mathcal{L}_{GC}(\widetilde{\mathbf{d}}^{t},\widetilde{\mathbf{d}}^{s})$) and the two transport plans ($\mathcal{L}_{plan}$), we obtain a noticeable improvement, finally resulting in 50.48 mAP. We argue that using random noise to disturb the global distribution mitigates the detrimental effects of label noise and reinforces the model’s ability to learn robust features from the unlabeled data.

Figure 10: An intuitive illustration of the distribution of the absolute difference between the teacher's and the student's classification scores. Lower values indicate that the teacher and the student have more consistent distributions. (Footnote 4: Our method, with and without NGC, results in different numbers of pseudo-labels. Therefore, we randomly selected 500 pixel-level samples for visualization.)

The effect of different cost maps. To construct the cost map for solving the OT problem, we consider the spatial distance and the score difference of each possible matching pair. Thus, this part studies the effects of different cost maps for optimal transport in NGC. As shown in Tab. XI(b), using spatial distance or score difference alone yields a maximum improvement of 0.82 mAP, indicating that relying on a single type of information is insufficient for capturing the comprehensive global prior necessary for effective learning. When both spatial distance and score difference are integrated into the cost map, the improvement rises to 1.21 mAP, which highlights the mutually beneficial relationship between these two factors. With their help, we effectively model the many-to-many relationship between the teacher and the student, providing informative guidance to the model.
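As a rough illustration of how a cost map can combine the two cues, the sketch below builds a pairwise cost from normalised spatial distance plus absolute score difference and solves the resulting entropy-regularised OT problem with Sinkhorn iterations. It is a sketch under stated assumptions rather than the paper's exact formulation: the equal weights `w_dist` and `w_score`, the max-normalisation of distances, the regularisation strength `eps`, and the function names are all hypothetical.

```python
import numpy as np

def build_cost_map(coords, teacher_scores, student_scores, w_dist=1.0, w_score=1.0):
    """Cost of matching teacher sample i to student sample j."""
    # Pairwise Euclidean distance between sample locations, scaled to [0, 1].
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    dist = dist / (dist.max() + 1e-8)
    # Pairwise absolute difference between teacher and student scores.
    score = np.abs(teacher_scores[:, None] - student_scores[None, :])
    return w_dist * dist + w_score * score

def sinkhorn(a, b, cost, eps=0.1, n_iters=200):
    """Entropy-regularised OT plan between histograms a and b."""
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 32.0, size=(6, 2))   # hypothetical sample locations
t = rng.uniform(0.1, 1.0, size=6)              # hypothetical teacher scores
s = rng.uniform(0.1, 1.0, size=6)              # hypothetical student scores
a, b = t / t.sum(), s / s.sum()                # normalise scores to distributions
cost = build_cost_map(coords, t, s)
plan = sinkhorn(a, b, cost)
ot_cost = float((plan * cost).sum())           # transport cost under this plan
```

Setting `w_dist=0` or `w_score=0` gives the single-cue variants compared in the ablation. With both terms active, a match is cheap only when the two samples are close in space and agree in score.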

The effect of $\beta$ and $\mathcal{T}_{g}$. The former hyper-parameter $\beta$ controls the intensity of the added noise, and the latter $\mathcal{T}_{g}$ determines how long a sequence can be considered as a global distribution. As listed in Tab. XI(c) and Tab. XI(d), we empirically find that the best performance is achieved when $\beta$ and $\mathcal{T}_{g}$ are set to 0.3 and 150, respectively.

Qualitative visualizations. We further provide qualitative visualizations to analyze the effect of the proposed NGC. Specifically, we visualize detection results with and without NGC in Fig. 4, along with the distribution of absolute difference values between the teacher and the student’s predictions. For the distribution map, lower values indicate that the teacher and the student’s distributions are more consistent. Fig. 4 clearly demonstrates that NGC can improve the consistency between the teacher and the student, leading to better prediction results.

6.5 Limitation and Related Discussion

This paper provides a simple yet solid framework for semi-supervised oriented object detection. Although our method achieves promising results, its use of aerial objects' characteristics is limited. Apart from orientation and global layout, many other properties of aerial objects should be considered, e.g., scale variations. In addition, this paper has explored generalization to semi-supervised oriented 3D object detection. However, some other oriented detection tasks, such as scene text detection, still need to be studied, and we leave these for future work.

7 Conclusion

This paper proposes a practical solution named SOOD++ for semi-supervised oriented object detection. Focusing on oriented objects’ characteristics in aerial scenes, we customize Geometry-aware Adaptive Weighting (GAW) loss and Noise-driven Global Consistency (NGC). The former considers the importance of geometry information for oriented objects. The latter introduces the global layout concept to SSOD, measuring the global similarity between the teacher and the student in a many-to-many manner. Additionally, a Simple Instance-aware Dense Sampling (SIDS) strategy is proposed to construct high-quality pseudo-labels. To validate the effectiveness of our method, we have conducted extensive experiments on several challenging datasets. SOOD++ achieves consistent improvements under partially and fully labeled data settings, compared with SOTA methods. We hope this work will encourage the community to pay attention to semi-supervised oriented object detection and facilitate future research.

References

  • [1] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A Raffel. Mixmatch: A holistic approach to semi-supervised learning. Proc. of Advances in Neural Information Processing Systems, 32, 2019.
  • [2] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 11621–11631, 2020.
  • [3] Xinhao Cai, Qiuxia Lai, Yuwei Wang, Wenguan Wang, Zeren Sun, and Yazhou Yao. Poly kernel inception network for remote sensing detection. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 27706–27716, 2024.
  • [4] Wanxing Chang, Ye Shi, and Jingya Wang. Csot: Curriculum and structure-aware optimal transport for learning with noisy labels. Proc. of Advances in Neural Information Processing Systems, 36:8528–8541, 2023.
  • [5] Binghui Chen, Pengyu Li, Xiang Chen, Biao Wang, Lei Zhang, and Xian-Sheng Hua. Dense learning based semi-supervised object detection. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 4815–4824, 2022.
  • [6] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In Proc. of Intl. Conf. on Machine Learning, pages 1597–1607. PMLR, 2020.
  • [7] Gong Cheng, Jiabao Wang, Ke Li, Xingxing Xie, Chunbo Lang, Yanqing Yao, and Junwei Han. Anchor-free oriented proposal generator for object detection. IEEE Transactions on Geoscience and Remote Sensing, 60:1–11, 2022.
  • [8] Gong Cheng, Xiang Yuan, Xiwen Yao, Kebing Yan, Qinghua Zeng, Xingxing Xie, and Junwei Han. Towards large-scale small object detection: Survey and benchmarks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • [9] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. Proc. of Advances in Neural Information Processing Systems, 26, 2013.
  • [10] Linhui Dai, Hong Liu, Hao Tang, Zhiwei Wu, and Pinhao Song. Ao2-detr: Arbitrary-oriented object detection transformer. IEEE Transactions on Circuits and Systems for Video Technology, 2022.
  • [11] Jian Ding, Nan Xue, Gui-Song Xia, Xiang Bai, Wen Yang, Michael Ying Yang, Serge Belongie, Jiebo Luo, Mihai Datcu, Marcello Pelillo, et al. Object detection in aerial images: A large-scale benchmark and challenges. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):7778–7796, 2021.
  • [12] Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva-02: A visual representation for neon genesis. arXiv preprint arXiv:2303.11331, 2023.
  • [13] Chengjian Feng, Yujie Zhong, Yu Gao, Matthew R Scott, and Weilin Huang. Tood: Task-aligned one-stage object detection. In Proc. of IEEE Intl. Conf. on Computer Vision, pages 3490–3499. IEEE Computer Society, 2021.
  • [14] Xiaoxu Feng, Xiwen Yao, Gong Cheng, and Junwei Han. Weakly supervised rotation-invariant aerial object detection network. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 14146–14155, 2022.
  • [15] Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya, and Tomaso A Poggio. Learning with a wasserstein loss. Proc. of Advances in Neural Information Processing Systems, 28, 2015.
  • [16] Zheng Ge, Songtao Liu, Zeming Li, Osamu Yoshie, and Jian Sun. Ota: Optimal transport assignment for object detection. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 303–312, 2021.
  • [17] Ross Girshick. Fast r-cnn. In Proc. of IEEE Intl. Conf. on Computer Vision, pages 1440–1448, 2015.
  • [18] Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. Proc. of Advances in Neural Information Processing Systems, 17, 2004.
  • [19] Zonghao Guo, Chang Liu, Xiaosong Zhang, Jianbin Jiao, Xiangyang Ji, and Qixiang Ye. Beyond bounding-box: Convex-hull feature adaptation for oriented and densely packed object detection. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 8792–8801, 2021.
  • [20] Jiaming Han, Jian Ding, Jie Li, and Gui-Song Xia. Align deep features for oriented object detection. IEEE Transactions on Geoscience and Remote Sensing, 60:1–11, 2021.
  • [21] Jiaming Han, Jian Ding, Nan Xue, and Gui-Song Xia. Redet: A rotation-equivariant detector for aerial object detection. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 2786–2795, 2021.
  • [22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [23] Wei Hua, Dingkang Liang, Jingyu Li, Xiaolong Liu, Zhikang Zou, Xiaoqing Ye, and Xiang Bai. Sood: Towards semi-supervised oriented object detection. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 15558–15567, 2023.
  • [24] Junjie Huang, Guan Huang, Zheng Zhu, Yun Ye, and Dalong Du. Bevdet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv preprint arXiv:2112.11790, 2021.
  • [25] Jisoo Jeong, Seungeui Lee, Jeesoo Kim, and Nojun Kwak. Consistency-based semi-supervised learning for object detection. Proc. of Advances in Neural Information Processing Systems, 32, 2019.
  • [26] Borui Jiang, Ruixuan Luo, Jiayuan Mao, Tete Xiao, and Yuning Jiang. Acquisition of localization confidence for accurate object detection. In Proc. of European Conference on Computer Vision, pages 784–799, 2018.
  • [27] Dong-Hyun Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Proc. of Intl. Conf. on Machine Learning, volume 3, page 896, 2013.
  • [28] Gang Li, Xiang Li, Yujie Wang, Shanshan Zhang, Yichao Wu, and Ding Liang. Pseco: Pseudo labeling and consistency training for semi-supervised object detection. In Proc. of European Conference on Computer Vision, 2022.
  • [29] Jingyu Li, Zhe Liu, Jinghua Hou, and Dingkang Liang. Dds3d: Dense pseudo-labels with dynamic threshold for semi-supervised 3d object detection. Proc. of IEEE Intl. Conf. on Robotics and Automation, 2023.
  • [30] Jiale Li, Shujie Luo, Ziqi Zhu, Hang Dai, Andrey S Krylov, Yong Ding, and Ling Shao. 3d iou-net: Iou guided 3d object detector for point clouds. arXiv preprint arXiv:2004.04962, 2020.
  • [31] Wentong Li, Yijie Chen, Kaixuan Hu, and Jianke Zhu. Oriented reppoints for aerial object detection. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 1829–1838, 2022.
  • [32] Yuxuan Li, Qibin Hou, Zhaohui Zheng, Ming-Ming Cheng, Jian Yang, and Xiang Li. Large selective kernel network for remote sensing object detection. In Proc. of IEEE Intl. Conf. on Computer Vision, pages 16794–16805, 2023.
  • [33] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.
  • [34] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proc. of IEEE Intl. Conf. on Computer Vision, pages 2980–2988, 2017.
  • [35] Chang Liu, Weiming Zhang, Xiangru Lin, Wei Zhang, Xiao Tan, Junyu Han, Xiaomao Li, Errui Ding, and Jingdong Wang. Ambiguity-resistant semi-supervised learning for dense object detection. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 15579–15588, 2023.
  • [36] Liang Liu, Boshen Zhang, Jiangning Zhang, Wuhao Zhang, Zhenye Gan, Guanzhong Tian, Wenbing Zhu, Yabiao Wang, and Chengjie Wang. Mixteacher: Mining promising labels with mixed scale teacher for semi-supervised object detection. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 7370–7379, 2023.
  • [37] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In Proc. of European Conference on Computer Vision, pages 21–37. Springer, 2016.
  • [38] Yanbin Liu, Linchao Zhu, Makoto Yamada, and Yi Yang. Semantic correspondence as an optimal transport problem. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 4463–4472, 2020.
  • [39] Yen-Cheng Liu, Chih-Yao Ma, Zijian He, Chia-Wen Kuo, Kan Chen, Peizhao Zhang, Bichen Wu, Zsolt Kira, and Peter Vajda. Unbiased teacher for semi-supervised object detection. In Proc. of International Conference on Learning Representations, 2021.
  • [40] Yen-Cheng Liu, Chih-Yao Ma, and Zsolt Kira. Unbiased teacher v2: Semi-supervised object detection for anchor-free and anchor-based detectors. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 9819–9828, 2022.
  • [41] Zikun Liu, Hongzhen Wang, Lubin Weng, and Yiping Yang. Ship rotated bounding box space for ship extraction from high-resolution optical satellite images with complex backgrounds. IEEE Geoscience and Remote Sensing Letters, 13(8):1074–1078, 2016.
  • [42] Junwei Luo, Xue Yang, Yi Yu, Qingyun Li, Junchi Yan, and Yansheng Li. Pointobb: Learning oriented object detection via single point supervision. Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, 2024.
  • [43] Chengqi Lyu, Wenwei Zhang, Haian Huang, Yue Zhou, Yudong Wang, Yanyi Liu, Shilong Zhang, and Kai Chen. Rtmdet: An empirical study of designing real-time object detectors. arXiv preprint arXiv:2212.07784, 2022.
  • [44] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):1979–1993, 2018.
  • [45] Vu Nguyen, Tam Le, Makoto Yamada, and Michael A Osborne. Optimal transport kernels for sequential and parallel neural architecture search. In Proc. of Intl. Conf. on Machine Learning, pages 8084–8095. PMLR, 2021.
  • [46] Wen Qian, Xue Yang, Silong Peng, Junchi Yan, and Yue Guo. Learning modulated loss for rotated object detection. In Proc. of the AAAI Conf. on Artificial Intelligence, volume 35, pages 2458–2466, 2021.
  • [47] Svetlozar T Rachev. The monge–kantorovich mass transference problem and its stochastic applications. Theory of Probability & Its Applications, 29(4):647–676, 1985.
  • [48] Ilija Radosavovic, Piotr Dollár, Ross Girshick, Georgia Gkioxari, and Kaiming He. Data distillation: Towards omni-supervised learning. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 4119–4128, 2018.
  • [49] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 779–788, 2016.
  • [50] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Proc. of Advances in Neural Information Processing Systems, 28, 2015.
  • [51] Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. Proc. of Advances in Neural Information Processing Systems, 29, 2016.
  • [52] Kihyuk Sohn, David Berthelot, Nicholas Carlini, Zizhao Zhang, Han Zhang, Colin A Raffel, Ekin Dogus Cubuk, Alexey Kurakin, and Chun-Liang Li. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. Proc. of Advances in Neural Information Processing Systems, 33:596–608, 2020.
  • [53] Kihyuk Sohn, Zizhao Zhang, Chun-Liang Li, Han Zhang, Chen-Yu Lee, and Tomas Pfister. A simple semi-supervised learning framework for object detection. arXiv preprint arXiv:2005.04757, 2020.
  • [54] Yihe Tang, Weifeng Chen, Yijun Luo, and Yuting Zhang. Humble teachers teach better students for semi-supervised object detection. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 3132–3141, 2021.
  • [55] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Proc. of Advances in Neural Information Processing Systems, 30, 2017.
  • [56] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. Fcos: Fully convolutional one-stage object detection. In Proc. of IEEE Intl. Conf. on Computer Vision, pages 9627–9636, 2019.
  • [57] Xinjiang Wang, Xingyi Yang, Shilong Zhang, Yijiang Li, Litong Feng, Shijie Fang, Chengqi Lyu, Kai Chen, and Wayne Zhang. Consistent-teacher: Towards reducing inconsistent pseudo-targets in semi-supervised object detection. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 3240–3249, 2023.
  • [58] Gui-Song Xia, Xiang Bai, Jian Ding, Zhen Zhu, Serge Belongie, Jiebo Luo, Mihai Datcu, Marcello Pelillo, and Liangpei Zhang. Dota: A large-scale dataset for object detection in aerial images. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 3974–3983, 2018.
  • [59] Zikai Xiao, Guoye Yang, Xue Yang, Taijiang Mu, Junchi Yan, and Shimin Hu. Theoretically achieving continuous representation of oriented bounding boxes. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 16912–16922, 2024.
  • [60] Qizhe Xie, Zihang Dai, Eduard Hovy, Thang Luong, and Quoc Le. Unsupervised data augmentation for consistency training. Proc. of Advances in Neural Information Processing Systems, 33:6256–6268, 2020.
  • [61] Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V Le. Self-training with noisy student improves imagenet classification. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 10687–10698, 2020.
  • [62] Xingxing Xie, Gong Cheng, Jiabao Wang, Xiwen Yao, and Junwei Han. Oriented r-cnn for object detection. In Proc. of IEEE Intl. Conf. on Computer Vision, pages 3520–3529, 2021.
  • [63] Chang Xu, Jian Ding, Jinwang Wang, Wen Yang, Huai Yu, Lei Yu, and Gui-Song Xia. Dynamic coarse-to-fine learning for oriented tiny object detection. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 7318–7328, 2023.
  • [64] Mengde Xu, Zheng Zhang, Han Hu, Jianfeng Wang, Lijuan Wang, Fangyun Wei, Xiang Bai, and Zicheng Liu. End-to-end semi-supervised object detection with soft teacher. In Proc. of IEEE Intl. Conf. on Computer Vision, pages 3060–3069, 2021.
  • [65] Yongchao Xu, Mingtao Fu, Qimeng Wang, Yukang Wang, Kai Chen, Gui-Song Xia, and Xiang Bai. Gliding vertex on the horizontal bounding box for multi-oriented object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(4):1452–1459, 2020.
  • [66] Jiechao Yang, Yong Liu, and Hongteng Xu. Hotnas: Hierarchical optimal transport for neural architecture search. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 11990–12000, 2023.
  • [67] Qize Yang, Xihan Wei, Biao Wang, Xian-Sheng Hua, and Lei Zhang. Interactive self-training with mean teachers for semi-supervised object detection. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 5941–5950, 2021.
  • [68] Xue Yang, Liping Hou, Yue Zhou, Wentao Wang, and Junchi Yan. Dense label encoding for boundary discontinuity free rotation detection. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 15819–15829, 2021.
  • [69] Xue Yang and Junchi Yan. On the arbitrary-oriented object detection: Classification based approaches revisited. International Journal of Computer Vision, 130(5):1340–1365, 2022.
  • [70] Xue Yang, Junchi Yan, Ziming Feng, and Tao He. R3det: Refined single-stage detector with feature refinement for rotating object. In Proc. of the AAAI Conf. on Artificial Intelligence, volume 35, pages 3163–3171, 2021.
  • [71] Xue Yang, Gefan Zhang, Wentong Li, Xuehui Wang, Yue Zhou, and Junchi Yan. H2rbox: Horizontal box annotation is all you need for oriented object detection. Proc. of International Conference on Learning Representations, 2023.
  • [72] Xue Yang, Gefan Zhang, Xiaojiang Yang, Yue Zhou, Wentao Wang, Jin Tang, Tao He, and Junchi Yan. Detecting rotated objects as gaussian distributions and its 3-d generalization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):4335–4354, 2022.
  • [73] Hongtian Yu, Yunjie Tian, Qixiang Ye, and Yunfan Liu. Spatial transform decoupling for oriented object detection. In Proc. of the AAAI Conf. on Artificial Intelligence, volume 38, pages 6782–6790, 2024.
  • [74] Yi Yu, Xue Yang, Qingyun Li, Yue Zhou, Feipeng Da, and Junchi Yan. H2rbox-v2: Incorporating symmetry for boosting horizontal box supervised oriented object detection. Proc. of Advances in Neural Information Processing Systems, 36, 2023.
  • [75] Jiacheng Zhang, Xiangru Lin, Wei Zhang, Kuo Wang, Xiao Tan, Junyu Han, Errui Ding, Jingdong Wang, and Guanbin Li. Semi-detr: Semi-supervised object detection with detection transformers. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 23809–23818, 2023.
  • [76] Hongyu Zhou, Zheng Ge, Songtao Liu, Weixin Mao, Zeming Li, Haiyan Yu, and Jian Sun. Dense teacher: Dense pseudo-labels for semi-supervised object detection. In Proc. of European Conference on Computer Vision, 2022.
  • [77] Barret Zoph, Golnaz Ghiasi, Tsung-Yi Lin, Yin Cui, Hanxiao Liu, Ekin Dogus Cubuk, and Quoc Le. Rethinking pre-training and self-training. Proc. of Advances in Neural Information Processing Systems, 33:3833–3845, 2020.