
¹ Tsinghua University, Haidian District, Beijing, 100084, China

Improving Detection in Aerial Images by Capturing Inter-Object Relationships

Botao Ren¹, Botian Xu¹, Yifan Pu¹, Jingyi Wang¹, Zhidong Deng¹ (corresponding author)

Zhidong Deng is with Beijing National Research Center for Information Science and Technology (BNRist), Institute for Artificial Intelligence at Tsinghua University (THUAI), Department of Computer Science, State Key Laboratory of Intelligent Technology and Systems, Tsinghua University, Beijing 100084, China.

Abstract

In many image domains, the spatial distribution of objects in a scene exhibits meaningful patterns governed by their semantic relationships. In most modern detection pipelines, however, the detection proposals are processed independently, overlooking the underlying relationships between objects. In this work, we introduce a transformer-based approach to capture these inter-object relationships to refine classification and regression outcomes for detected objects. Building on two-stage detectors, we tokenize the region of interest (RoI) proposals to be processed by a transformer encoder. Specific spatial and geometric relations are incorporated into the attention weights and adaptively modulated and regularized. Experimental results demonstrate that the proposed method achieves consistent performance improvement on three benchmarks including DOTA-v1.0, DOTA-v1.5, and HRSC 2016, especially ranking first on both DOTA-v1.5 and HRSC 2016. Specifically, our new method has an increase of 1.59 mAP on DOTA-v1.0, 4.88 mAP on DOTA-v1.5, and 2.1 mAP on HRSC 2016, respectively, compared to the baselines.

Keywords: Aerial image processing · Object detection · Inter-object relationship utilization

1 Introduction

Object detection has been one of the most studied problems in computer vision due to its great value in practical applications ranging from surveillance and autonomous driving to natural disaster management. The field has seen impressive advancements due to novel models and training techniques developed in the past few years [17, 22, 1]. Among various domains, object detection in aerial images stands out with characteristics and challenges different from those presented in natural images: objects are distributed with drastically varying scales, orientations, and spatial densities.

To tackle this challenge, prior works proposed to improve the detection performance from different perspectives, achieving various degrees of success. Many efforts have focused on learning more appropriate features by exploiting the geometric properties, e.g., symmetry and rotational invariance, leading to novel architectures and data augmentation techniques [10, 27, 32, 30, 21]. Others have developed metrics and objectives [33, 31, 34] that better capture the nuances of aerial object detection.

Nevertheless, despite these advancements, most present-day detection models classify and localize objects independently [10, 27, 32], possibly due to the lack of an effective tool for modeling the co-presence of an arbitrary number of objects in an image. In other words, the spatial and semantic relationships among objects are not fully captured, often leading to false detections that overlook contextual dependencies and inter-object dynamics. As a motivating example, Fig. 1 illustrates the challenge of detecting each object instance based solely on its features, without considering these critical relationships. Aerial images, in particular, offer a unique setting where objects generally share the same plane, with little occlusion and perspective distortion, and therefore have stable inter-object relationships.

Addressing the aforementioned gap, in this paper, we propose a Transformer-based model on top of two-stage detectors to effectively capture and leverage the inter-object relationships. Concretely, we organize the Region of Interest (RoI) [8, 23] feature maps proposed in the first stage and the independent detection results on them into embeddings. The embeddings are then fed into a transformer where the features of candidate detections interact and aggregate. However, the self-attention module in ordinary Transformers, which computes the pairwise attention weights as dot products of embeddings, does not capture the spatial and geometric relationship directly. To overcome this, we design and incorporate additional encodings and attention functions, weighing the mutual influence between objects according to distances. The attention functions are adaptive to the scales and densities of the object distribution, which is crucial for the model to generalize across different image scenarios. We conduct comprehensive experiments on DOTA-v1.0, DOTA-v1.5 [26], and HRSC2016 [18] to evaluate the efficacy of our model, achieving an improvement of 1.59, 4.88, and 2.1 mAP over the baseline.

Our main contributions can be summarized as follows:

• We introduce a novel Transformer-based model that extends the capability of two-stage detectors, enabling the effective encapsulation and utilization of inter-object relationships in aerial image detection.

• Our model innovatively incorporates additional encodings and attention mechanisms that directly address spatial and geometric relationships, enhancing adaptability to the varying scales and densities in object distribution, a critical step forward for generalization in diverse aerial scenarios.

2 Related Works

2.1 Object Detection in Aerial Images

Extensive research has addressed the challenges posed by the diverse characteristics of aerial imagery, with both single-stage and two-stage methods. Notable two-stage methods include ReDet [10], which focuses on handling scale, orientation, and aspect ratio variations; Oriented RCNN [27], which introduces improved representations of oriented bounding boxes; and SCRDet [32], which is designed for dense clusters of small objects. SASM [13], Gliding vertex [29], and Region of Interest (RoI) Transformer [6] have further advanced two-stage approaches. Single-stage methods such as R³Det [30], S²ANet [9], and DAL [20] have also been developed. These methods often incorporate modified convolution layers, as in ReDet [10] and ARC [21], which explicitly cope with rotation; novel loss functions such as GWD [31], KLD [33], and KFIoU [34]; and multi-scale training and testing strategies that improve the robustness of object detection in aerial imagery.

2.2 Capturing Inter-Object Relationships

Common detection systems handle candidate object instances individually. They implicitly assume that the objects in an image are distributed conditionally independently, which is generally false in reality. Transformer-based architectures [25, 7] have shown impressive capability in relational modeling across multiple domains. To address this oversight, [14] introduced an object relation module that computes attention [25] weights from both the geometric and appearance features of object proposals. The module also learns duplicate removal in place of Non-Maximum Suppression (NMS), leading to an end-to-end detector. More recently, DETR [2, 12] formulates detection as a set-prediction problem and sets up object queries that interact with each other in a Transformer decoder [25]. Its successors [4, 24] improved the framework’s efficiency by operating directly on features instead of object queries. Graph Neural Networks have also been explored as a powerful alternative for relation modeling in object detection and scene understanding. Typically, one constructs a graph with the objects as nodes and the spatial relations as edges [15, 35, 5], whereas [28] models region-to-region relations with learned edges; these methods thus differ in how the edges are obtained. In comparison to prior works, our method focuses on aerial images, where the inter-object relationships are stable, and adopts a more explicit design.

3 Methodology

We build our method on the two-stage object detection framework presented in [10]. We start by following the original pipeline to obtain features and preliminary detections, which are then transformed into embeddings and input into the Transformer with additional encodings. To better leverage the Transformer, we introduce a novel attention function on top of the common scaled dot product and a set of spatial relations. It aims to reflect the degree of correlation between objects based on distances in the image, emphasizing neighboring detections while being aware of object scale and density. Eventually, we perform another detection on the features given by the Transformer to obtain the final results. The overview of our model is shown in Fig. 2.

3.1 RoI Tokens

In a two-stage detector, the Region Proposal Network (RPN) proposes for each image $N$ RoIs from which we extract features $\{\mathbf{f}_{i}\}_{i=1}^{N}$ . We apply the standard detection objective, namely classification and bounding box regression, on the features to obtain for each RoI a class label $c_{i}$ and a bounding box pose $(x_{i},y_{i},w_{i},h_{i},\alpha_{i})$ representing the center coordinates, width, height and orientation, respectively.

Subsequently, we map $w_{i},h_{i}$ through linear layers to high-dimensional embeddings $\mathbf{w}_{i},\mathbf{h}_{i}$ . They are then concatenated with the logits of the class distribution $\mathbf{c}_{i}$ to form the RoI Token:

$$\text{Token}_{i}=(\mathbf{f}_{i}\oplus\mathbf{c}_{i}\oplus\mathbf{w}_{i}\oplus\mathbf{h}_{i})+\text{pos}(x_{i},y_{i}) \tag{1}$$

where $\oplus$ denotes concatenation and the position encoding $\text{pos}(x_{i},y_{i})$ is computed as in [7] and added to enforce spatial information.
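The token construction in Eq. 1 can be sketched in NumPy as follows. This is only an illustration under stated assumptions: `build_roi_tokens`, `sincos_2d`, the weight matrices `W_w` and `W_h`, and the `temperature` constant are hypothetical names, the linear layers are written without bias, and the sinusoidal layout (sine and cosine of $x$ followed by sine and cosine of $y$) is one common choice rather than the exact encoding used in the paper.

```python
import numpy as np

def sincos_2d(x, y, dim, temperature=10000.0):
    """Sinusoidal encoding of 2D centers: half of the channels for x, half for y."""
    assert dim % 4 == 0, "dim must be divisible by 4"
    quarter = dim // 4
    freqs = 1.0 / temperature ** (np.arange(quarter) / quarter)   # (quarter,)
    ax = x[:, None] * freqs[None, :]                               # (N, quarter)
    ay = y[:, None] * freqs[None, :]
    return np.concatenate([np.sin(ax), np.cos(ax), np.sin(ay), np.cos(ay)], axis=1)

def build_roi_tokens(feats, logits, boxes, W_w, W_h):
    """feats: (N, Df) RoI features; logits: (N, C) class logits;
    boxes: (N, 5) as (x, y, w, h, alpha);
    W_w, W_h: (1, De) weights of hypothetical bias-free linear layers."""
    x, y, w, h = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    w_emb = w[:, None] @ W_w                                       # (N, De)
    h_emb = h[:, None] @ W_h                                       # (N, De)
    token = np.concatenate([feats, logits, w_emb, h_emb], axis=1)  # f ⊕ c ⊕ w ⊕ h
    return token + sincos_2d(x, y, token.shape[1])                 # add pos(x, y)
```

Because the position encoding is added rather than concatenated, the token width stays at Df + C + 2·De, which this sketch requires to be divisible by 4.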

We will show in the experiments that having two detection phases (preliminary and final) is vital to the success of our model. This also distinguishes our work from prior work [14].

3.2 Spatial and Geometric Relations

Our method uses an encoder-only Transformer to capture the relationships between objects. However, the dot-product self-attention computed between tokens in Transformers reflects their semantic similarity rather than their spatial relations. Therefore, we introduce a series of $k$ relations $\{P^{i}\}_{i=1}^{k}$ accounting for the relative position and geometry between the preliminary detections, as listed in Tab. 1. Similar to self-attention, each relation is computed in a pair-wise manner, i.e., $P^{i}\in\mathbb{R}^{N\times N}$. We concatenate them into an $N\times N\times k$ tensor and aggregate them to $N\times N\times 1$ by passing them through a linear layer:

Table 1: Spatial and geometric relations considered in computing the attention weights between two RoIs.

Name	Formula	Description
$dx$	$x_{2}-x_{1}$	X-axis distance
$dy$	$y_{2}-y_{1}$	Y-axis distance
dist	$\sqrt{dx^{2}+dy^{2}}$	Euclidean distance
$d\alpha$	$\alpha_{2}-\alpha_{1}$	Angular difference
IoU	$intersect/union$	Intersection over Union
area	$(w_{1}h_{1})/(w_{2}h_{2})$	Relative area ratio

$$P=\text{linear}(\text{stack}(P^{1},\dots,P^{k})). \tag{2}$$
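As a rough sketch of Tab. 1 and Eq. 2, the pairwise relations can be computed by broadcasting and then reduced over the last axis by a linear map. The function names, the convention that entry `[i, j]` treats RoI `i` as object 1 and RoI `j` as object 2, and the `weight`/`bias` parameters are assumptions made for illustration. The rotated IoU matrix is taken as a precomputed input because its computation depends on the detection library in use.

```python
import numpy as np

def pairwise_relations(boxes, iou):
    """boxes: (N, 5) as (x, y, w, h, alpha); iou: (N, N) precomputed rotated IoU.
    Returns an (N, N, 6) tensor; entry [i, j] relates RoI i (object 1) to RoI j (object 2)."""
    x, y, w, h, a = boxes.T
    dx = x[None, :] - x[:, None]              # x_2 - x_1
    dy = y[None, :] - y[:, None]              # y_2 - y_1
    dist = np.sqrt(dx**2 + dy**2)             # Euclidean distance
    da = a[None, :] - a[:, None]              # alpha_2 - alpha_1
    area = w * h
    ratio = area[:, None] / area[None, :]     # (w_1 h_1) / (w_2 h_2)
    return np.stack([dx, dy, dist, da, iou, ratio], axis=-1)

def aggregate(relations, weight, bias=0.0):
    """Linear layer over the relation axis: (N, N, k) -> (N, N)."""
    return relations @ weight + bias
```

In a trained model `weight` and `bias` would be learned; here they only show how the $N\times N\times k$ stack collapses into the single matrix $P$.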

3.3 Adaptive Attention Weights

An inherent challenge in aerial images is that objects in a scene can vary drastically in size, orientation, and aspect ratio. Also, certain object types tend to cluster densely (like cars in parking lots) or align in specific patterns (such as parallel tennis courts). Thus the relationship between an object and others should be highly specific to the object instance and the contextual information around it. Based on this observation, we devise a novel scheme to adaptively adjust the attention weights with the following considerations.

3.3.1 Spatial Distance

The scale of aerial images could be several kilometers. To focus more on the neighboring objects that are expected to be more relevant, we introduce a distance-decaying coefficient for each object pair:

$$A_{ij}=\underbrace{\exp(-(\epsilon_{i}d_{ij})^{2}/\sigma^{2})}_{\text{distance, scale and density}}\circ\ \underbrace{\mathbf{1}\{\text{IoU}_{ij}<\delta\}}_{\text{overlapping}} \tag{3}$$

where $d_{ij}=\sqrt{(x_{i}-x_{j})^{2}+(y_{i}-y_{j})^{2}}$ is the pair-wise distance, $\circ$ the element-wise product, and $\mathbf{1}\{\cdot\}$ the indicator function. $\epsilon_{i}$ is detailed next. $\sigma$ is a hyperparameter.

3.3.2 Object Scale and Density

It is a natural intuition that the influence of one object on another relates to the scale (size) of the object, e.g., smaller objects tend to be more influenced by nearby objects, whereas larger objects need to capture the impact at longer ranges. The density around a proposed detection is another important factor. We assume that when there are fewer other RoIs around a detection (i.e., lower density), it should capture the influence of RoIs from further away. In contrast, in areas with high density (many RoIs), it should mainly interact with nearby RoIs. To quantitatively model these factors, we compute $\epsilon_{i}$ as:

$$\epsilon_{i}=\frac{S}{\sqrt{w_{i}h_{i}}}\times\exp(\bar{\rho}_{i}) \tag{4}$$

where $S$ is a global (dataset-wide) scale factor determined by the input image.

To account for the density around an RoI, we first calculate for the $i$ -th RoI:

$$\rho_{i}=\sum_{j}w_{i}h_{i}\exp\left(-\left(\frac{S}{\sqrt{w_{i}h_{i}}}d_{ij}\right)^{2}/\sigma^{2}\right), \tag{5}$$

then image-wise normalize $\rho_{i}$ and map them into $(-1,1)$ :

$$\bar{\rho}_{i}=\tanh((\rho_{i}-\text{mean}(\rho))/\text{std}(\rho)). \tag{6}$$
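Eqs. 4 to 6 can be sketched as below. Several points are assumptions rather than details given in the text: the sum in Eq. 5 is read as running over the other RoI index $j$ (including $j=i$), the area factor is kept as $w_{i}h_{i}$ exactly as printed, the standard deviation is the population one, and the zero-variance guard is added only to keep the sketch well defined. The function name `adaptive_scale` is hypothetical.

```python
import numpy as np

def adaptive_scale(boxes, S, sigma):
    """boxes: (N, 5) as (x, y, w, h, alpha). Returns epsilon with shape (N,)."""
    x, y, w, h = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    d = np.hypot(x[:, None] - x[None, :], y[:, None] - y[None, :])   # d_ij
    base = S / np.sqrt(w * h)                                         # S / sqrt(w_i h_i)
    decay = np.exp(-((base[:, None] * d) ** 2) / sigma**2)
    rho = (w * h) * decay.sum(axis=1)          # Eq. 5, sum over j (assumed reading)
    std = rho.std()
    # Eq. 6; fall back to zeros when all densities coincide (guard is an assumption)
    rho_bar = np.tanh((rho - rho.mean()) / std) if std > 0 else np.zeros_like(rho)
    return base * np.exp(rho_bar)              # Eq. 4
```

Under this reading, an RoI in a crowded neighborhood receives a larger $\epsilon_{i}$ and therefore a faster distance decay in Eq. 3, which matches the stated intuition that dense regions should mainly interact with nearby RoIs.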

3.3.3 RoI Overlapping

In addition to the aforementioned aspects, it is also necessary to mitigate the self-influence among multiple overlapping RoIs corresponding to the same object. Specifically, if we do not exclude these closely overlapping RoIs, their proximity to each other could lead to them being overly emphasized in the attention calculation while neglecting the interactions between RoIs of different objects. Therefore, we mask the attention weights to only consider RoIs with IoU below a certain threshold $\delta$ .

The overall attention weights are calculated as

$$A\circ\text{softmax}(Q^{\text{T}}K+P) \tag{7}$$

where $P$ is the aggregated spatial and geometric relations computed in Eq. 2.
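A minimal NumPy sketch of Eq. 3 and Eq. 7 follows. It assumes queries and keys are stored row-wise, so the pairwise scores are computed as `Q @ K.T`, and it follows Eq. 7 literally without the usual $1/\sqrt{D}$ scaling. The function names are hypothetical, and multi-head splitting and value aggregation are omitted.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_attention(Q, K, P, d, iou, eps, sigma, delta):
    """Q, K: (N, D) row-wise queries and keys; P, d, iou: (N, N); eps: (N,)."""
    # Eq. 3: distance decay scaled per query RoI, times the overlap mask
    A = np.exp(-((eps[:, None] * d) ** 2) / sigma**2) * (iou < delta)
    # Eq. 7: modulate the softmax attention element-wise
    return A * softmax(Q @ K.T + P, axis=-1)
```

Two consequences of the formulas as written are worth noting: since the IoU of an RoI with itself is 1, the diagonal is masked whenever $\delta<1$, and because $A$ is applied after the softmax, the rows of the result no longer sum to one.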

4 Experiment

4.1 Dataset

DOTA is a large-scale aerial object detection dataset.

DOTA-v1.0 contains 2,806 images, with sizes ranging from $800\times 800$ to $4000\times 4000$ pixels. It includes 188,282 instances across 15 categories, annotated as: Plane (PL), Baseball Diamond (BD), Bridge (BR), Ground Track Field (GTF), Small Vehicle (SV), Large Vehicle (LV), Ship (SH), Tennis Court (TC), Basketball Court (BC), Storage Tank (ST), Soccer Ball Field (SBF), Roundabout (RA), Harbor (HA), Swimming Pool (SP), and Helicopter (HC). Following the common practice [10], we use both the training and validation sets for training and the test set for testing. We report mAP in PASCAL VOC2007 format and submit the testing result on the official dataset server.

DOTA-v1.5 uses the same image set but with increased annotations. This version features 402,089 instances and introduces an additional category, Container Crane (CC), broadening the dataset’s applicability in aerial object detection.

HRSC2016 focuses on ship detection in aerial images, containing 1,061 images with a total of 2,976 instances. Image sizes in this dataset range from 300×300 to 1500×900 pixels. The dataset is divided into training, validation, and test sets with 436, 181, and 444 images, respectively.
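For reference, the PASCAL VOC2007 format mentioned above computes the per-class average precision with 11-point interpolation: the precision is sampled at recall thresholds 0, 0.1, ..., 1.0, taking at each threshold the maximum precision among points whose recall is at least that threshold. The sketch below assumes the precision-recall points have already been obtained by matching detections to ground truth; `voc07_ap` is a hypothetical helper name, not part of any evaluation toolkit.

```python
import numpy as np

def voc07_ap(recall, precision):
    """11-point interpolated AP from matched precision-recall points."""
    recall = np.asarray(recall, dtype=float)
    precision = np.asarray(precision, dtype=float)
    ap = 0.0
    for t in np.linspace(0.0, 1.0, 11):
        mask = recall >= t
        # Interpolated precision: best precision at recall >= t, or 0 if unreachable
        ap += precision[mask].max() if mask.any() else 0.0
    return ap / 11.0
```

The mAP is then the mean of this quantity over all categories.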

4.2 Implementation Details

Our implementation is based on the MMRotate [36] library and adopts ReDet’s framework and hyperparameter settings. We train our model for $12$ epochs using the AdamW [19] optimizer with an initial learning rate of $10^{-4}$, reduced to $10^{-5}$ and $10^{-6}$ at epochs 8 and 11. We also use a weight decay of 0.05. The experiments were conducted using two RTX 3090 GPUs.
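The learning-rate schedule described above is a step schedule with a decay factor of 0.1 at two milestones. A small sketch, assuming `epoch` counts completed epochs starting from 0 (the text does not state the indexing convention) and using the hypothetical helper name `learning_rate`:

```python
def learning_rate(epoch, base_lr=1e-4, milestones=(8, 11), gamma=0.1):
    """Step schedule: multiply by gamma once for each milestone already reached."""
    drops = sum(epoch >= m for m in milestones)
    return base_lr * gamma ** drops
```

For the 12-epoch run, this keeps the initial rate for the first eight epochs, then trains three epochs at the first reduced rate and one epoch at the second.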

The Transformer module consists of 6 encoder layers, similar to the ViT structure, and integrates sinusoidal two-dimensional absolute position encoding. The hyperparameter $\sigma$ is set to 4. A dropout rate of 0.1 is employed during the training phase of the Transformer.

Table 2: Results of each object class on the DOTA-v1.5 dataset. We highlight the best in blue.

Method	PL	BD	BR	GTF	SV	LV	SH	TC	BC	ST	SBF	RA	HA	SP	HC	CC	mAP
RetinaNet-O [16]	71.43	77.64	42.12	64.65	44.53	56.79	73.31	90.84	76.02	59.96	46.95	69.24	59.65	64.52	48.06	0.83	59.16
RF. R-CNN [23]	72.20	76.43	47.58	69.91	51.99	70.52	80.27	90.87	79.16	68.63	59.57	72.34	66.44	66.07	55.29	6.87	64.63
Mask R-CNN [11]	76.84	73.51	49.90	57.80	51.31	71.34	79.75	90.46	74.21	66.07	46.21	70.61	63.07	64.46	57.81	9.42	62.67
HTC [3]	77.80	73.67	51.40	63.99	51.54	73.31	80.31	90.48	75.21	67.34	48.51	70.63	64.84	64.48	55.87	5.15	63.40
RoI-Trans. [6]	72.27	81.95	54.47	70.02	52.49	76.31	81.03	90.90	84.19	69.12	62.85	72.73	68.67	65.89	57.09	7.12	66.69
ReDet [10]	79.20	82.81	51.92	71.41	52.38	75.73	80.92	90.83	75.81	68.64	49.29	72.03	73.36	70.55	63.33	11.53	66.86
Ours	80.79	85.58	53.80	71.74	52.72	77.84	88.71	90.89	86.24	74.73	64.97	73.31	76.76	72.99	73.84	22.87	71.74

Table 3: More experiments on DOTA-v1.5. We highlight our method’s improvements in blue.

Method	mAP
RoI Transformer	66.69
RoI Transformer + Ours	68.61(+1.92)
Oriented RCNN	65.94
Oriented RCNN + Ours	66.38(+0.44)
Rotated Faster RCNN	64.63
Rotated Faster RCNN + Ours	65.28(+0.65)
ReDet	66.86
ReDet + Ours	71.74(+4.88)

4.3 Comparison with Baselines

First, we evaluate our model against the baselines on DOTA-v1.0, DOTA-v1.5, and HRSC2016 to demonstrate the efficacy of the proposed method. The results are shown in Tab. 2 and Tab. 4. To further illustrate the versatility of our approach, we conducted additional experiments on DOTA-v1.5 with our method applied to various models, yielding consistent improvements, as shown in Tab. 3. These results demonstrate that our method consistently outperforms the baselines across different datasets. Notably, however, the improvement achieved on HRSC2016 is marginal compared to that on DOTA-v1.5. This is possibly because a single image in HRSC2016 contains far fewer instances (typically fewer than 4), leaving limited opportunities to leverage the inter-object relationships. These findings suggest that our model’s strengths are most pronounced in scenarios rich in object interactions and contextual dynamics, aligning with our design’s focus on capturing and utilizing inter-object relationships.

Table 4: Results in COCO style on DOTA-v1.0 and HRSC2016. We highlight our method’s improvements in blue.

Dataset	Method	AP50	AP75	mAP
DOTA-v1.0	ReDet	76.25	50.86	47.11
DOTA-v1.0	Ours	77.84(+1.59)	51.42(+0.56)	48.32(+1.21)
HRSC2016	ReDet	90.46	89.46	70.41
HRSC2016	Ours	90.49(+0.03)	89.67(+0.21)	72.51(+2.10)

Table 5: Detection performance with and without preliminary detection training.

pre cls supervision	mAP
w/o	69.86
w	71.74

Table 6: Effect of the components in computing the attention weights via Eq. 7.

Transformer	Relations ( $P$ )	Ada. Weight ( $A$ )	mAP
			66.86
✓			69.02
✓	✓		71.09
✓		✓	70.87
✓	✓	✓	71.74

Table 7: Design choices of the relations and adaptive weights.

(a) Effect of the spatial and geometric relations.

Method	mAP
$dx$, $dy$, $d\alpha$	70.84
+ dist	70.96
+ IoU	71.51
+ area	71.74

(b) Effect of the factors in computing $\beta$.

Attention Weights ( $\beta$ )	mAP
baseline (ReDet)	66.86
- scale ( $\epsilon=\sqrt{S}=32$ )	70.35(+3.49)
- overlap ( $\delta=1$ )	67.32(+0.46)
- density ( $\bar{\rho}_{i}=0$ )	69.85(+2.99)
Ours	71.74(+4.88)

4.4 Ablation Study

To understand the effect of each design choice in our model, we conduct a series of experiments to shed light on the following:

4.4.1 Preliminary Detection Phase

Compared to the standard detection pipeline, our model incorporates two detection heads, placed before and after the Transformer module. The output from the initial detection phase, termed ‘Preliminary Detections’, includes a classification result (parameterized as a softmax distribution) from the first head, which forms a component of the RoI token. We posit that knowing the class information with uncertainties would help with reasoning about the inter-object relationships. To empirically validate this hypothesis, we compared the performance of our model with and without training the first detection head. As Tab. 6 shows, although solely incorporating the Transformer offers an mAP improvement over the baseline, omitting the preliminary detection leads to a notable decline in performance. This suggests that relying only on the Transformer for RoIs to interact lacks efficacy. In contrast, the explicit inclusion of preliminary classification data, despite its potential inaccuracies, enhances the model’s ability to reason about semantic and contextual relationships. The results underscore the value of early classification cues in guiding the relational reasoning process within our proposed architecture.

4.4.2 Spatial and Geometric Relations

The different terms presented in Tab. 1 characterize various aspects of the spatial and geometric relationships among objects (RoI Tokens) within an image. In this section, we aim to empirically evaluate the individual and collective contributions of these spatial and geometric relational terms to the overall performance of our detection model. As shown in Tab. 7(a), IoU and relative area contribute the most. Intuitively, they are particularly helpful when reasoning about the co-occurrence and spatial arrangement of objects. For example, IoU helps to disambiguate overlapping, potentially duplicate or conflicting detections. Similarly, relative area aids in discerning the size relationship between objects. Consequently, our method can effectively solve the problem in the motivating example. See Sec. 5.2 for details.

4.4.3 Adaptive Attention Weights

As mentioned in Sec. 3.3, making the attention weights adaptive to specific RoI Tokens is essential to cope with the diversity and complexities in a scene. We evaluate our density- and scale-aware attention weighting scheme which is designed to augment the scaled-dot-product self-attention and allow the model to dynamically adjust its focus based on the scale of objects and their surrounding density. Findings in Tab. 7(b) indicate that masking the influence of overlapping RoIs plays a crucial role. This observation aligns with our initial understanding that indiscriminately emphasizing neighboring RoIs, without considering overlap, could lead to skewed attention distributions and potentially impair the model’s ability to accurately discern between distinct objects.

5 Analysis

To gain insights into how inter-object relationships have improved detection performance, we collect and analyze dataset-wise statistics and specific examples.

5.1 Evidential Statistics

By examining the data, we found that many false detections deviate far from the typical scales associated with their respective categories. To investigate this observation, we compute for each category the mean and standard deviation of object scale $\sqrt{w_{i}h_{i}}$ using detections with confidence $>0.9$ on the test set. We then identify outliers as those detections deviating from the mean by more than three times the standard deviation. This method provides a rough measure of the frequency of incorrect scale detections. As shown in Fig. 4, the detections produced by our method have substantially fewer outliers compared to the baseline. This result suggests that our model better maintains scale consistency across different object categories. This improvement is particularly vital in aerial image analysis, where scale variance is substantial and often indicative of the detection model’s reliability and robustness.
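The outlier count described above can be sketched as follows. The function name and argument layout are hypothetical, and the use of the population standard deviation is an assumption, since the text does not specify the estimator.

```python
import numpy as np

def count_scale_outliers(labels, widths, heights, scores, min_score=0.9, k=3.0):
    """Per category, count confident detections whose scale sqrt(w*h)
    deviates from the category mean by more than k standard deviations."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    scale = np.sqrt(np.asarray(widths, dtype=float) * np.asarray(heights, dtype=float))
    keep = scores > min_score                      # confidence > 0.9
    counts = {}
    for c in np.unique(labels[keep]):
        s = scale[keep & (labels == c)]
        mu, sd = s.mean(), s.std()
        counts[c.item()] = int(np.sum(np.abs(s - mu) > k * sd))
    return counts
```

Note that the statistics are computed from the detections themselves, so a category with very few confident detections can never produce a three-sigma outlier; the count is meaningful only as the rough measure the text describes.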

Additionally, our visual analysis revealed a common misclassification of many land-based objects as Ship. To quantify this observation, we compute the average chamfer distance between certain categories $S_{1}$ and $S_{2}$ in an image:

$$d(S_{1},S_{2})=\frac{1}{2}\left(\frac{1}{|S_{1}|}\sum_{i\in S_{1}}\min_{j\in S_{2}}\|\text{dist}_{ij}\|^{2}_{2}+\frac{1}{|S_{2}|}\sum_{j\in S_{2}}\min_{i\in S_{1}}\|\text{dist}_{ij}\|^{2}_{2}\right). \tag{8}$$
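Eq. 8 can be computed directly from the detection centers of the two categories in an image. The sketch below follows the equation as written, using squared Euclidean distances; `chamfer_distance` and its argument names are chosen for illustration, and both sets are assumed to be non-empty.

```python
import numpy as np

def chamfer_distance(centers_1, centers_2):
    """Symmetric Chamfer distance between two sets of 2D centers (Eq. 8)."""
    a = np.asarray(centers_1, dtype=float)                      # (|S1|, 2)
    b = np.asarray(centers_2, dtype=float)                      # (|S2|, 2)
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)    # squared distances
    # Average nearest-neighbor cost in each direction, then take the mean of the two
    return 0.5 * (sq.min(axis=1).mean() + sq.min(axis=0).mean())
```

Averaging this quantity over the images in which both categories appear gives a dataset-level figure comparable in spirit to the table that follows.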

Table 8: Average Chamfer distance between detections of ship, small-vehicle, plane, and harbor.

Categories	Baseline	Ours
Ship $\Leftrightarrow$ Small Vehicle	504.68	845.17 $\uparrow$
Ship $\Leftrightarrow$ Plane	1000.48	1699.70 $\uparrow$
Harbor $\Leftrightarrow$ Ship	266.54	234.90 $\downarrow$

As Table 8 shows, the results are in line with the logical expectation that Ship instances should be found in water, near Harbor, but distant from Small Vehicle. This finding underscores our model’s effectiveness in accurately understanding the spatial arrangement of objects, further validating the benefits of our approach in handling complex aerial imagery.

5.2 Qualitative Comparison

5.2.1 Conflict Removal

Perhaps the most obvious advantage of considering inter-object relationships is that it helps to ensure the arrangement of detected objects aligns with real-world expectations and dataset distribution. For example, a basketball or tennis court should be no larger than an airplane. Two airplanes in the same image should have consistent sizes. And most importantly, the nose of a plane, despite the resemblance in some way, should not be classified as a ship. These examples (Fig. 1, Fig. 5) underscore the model’s proficiency in removing conflicts and ensuring logical and physically feasible detections.

5.2.2 Co-occurrence Recognition

Beyond resolving conflicts, our model exhibits a remarkable ability to leverage patterns of co-occurrence for enhanced detection accuracy. An illustrative case is the identification of a fleet of aligned planes (Fig. 5, third column). This aspect of co-occurrence pattern recognition is particularly beneficial in complex aerial images where objects often appear in structured groups or formations.

5.2.3 Failure Case

Unfortunately, there are cases where incorrect understanding of relationships could lead to even worse results, as depicted in Fig. 6.

6 Conclusion

In conclusion, this study presents a novel approach to object detection in aerial imagery, leveraging a Transformer-based architecture augmented with a two-stage detection process and adaptive attention mechanisms. Our model effectively captures and utilizes the spatial and geometric relationships among objects, as evidenced by its improved performance in handling diverse and complex scenes. The introduction of preliminary detection heads and the innovative use of scale- and density-aware attention weighting schemes have been shown to be particularly effective in enhancing detection accuracy.

Despite its strengths, our model has certain limitations. Firstly, the inclusion of the Transformer module notably increases computational demands, although the impact on training time is less pronounced. Secondly, inaccuracies in understanding object relationships can occasionally degrade detection results, as highlighted in some failure cases. Lastly, the model’s design incorporates several choices that may be specific to the dataset and application domain. Future efforts should aim to mitigate these limitations, striving towards a more universally applicable framework.

References

[1] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: European conference on computer vision. pp. 213–229. Springer (2020)
[2] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: European conference on computer vision. pp. 213–229. Springer (2020)
[3] Chen, K., Pang, J., Wang, J., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Shi, J., Ouyang, W., et al.: Hybrid task cascade for instance segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4974–4983 (2019)
[4] Chen, P., Zhang, M., Shen, Y., Sheng, K., Gao, Y., Sun, X., Li, K., Shen, C.: Efficient decoder-free object detection with transformers. In: European Conference on Computer Vision. pp. 70–86. Springer (2022)
[5] Chen, S., Li, Z., Huang, F., Zhang, C., Ma, H.: Object detection using dual graph network. 2020 25th International Conference on Pattern Recognition (ICPR) pp. 3280–3287 (2021), https://api.semanticscholar.org/CorpusID:233876987
[6] Ding, J., Xue, N., Long, Y., Xia, G.S., Lu, Q.: Learning roi transformer for oriented object detection in aerial images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2849–2858 (2019)
[7] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
[8] Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 580–587 (2014)
[9] Han, J., Ding, J., Li, J., Xia, G.S.: Align deep features for oriented object detection. IEEE Transactions on Geoscience and Remote Sensing 60, 1–11 (2021)
[10] Han, J., Ding, J., Xue, N., Xia, G.S.: Redet: A rotation-equivariant detector for aerial object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2786–2795 (2021)
[11] He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: Proceedings of the IEEE international conference on computer vision. pp. 2961–2969 (2017)
[12] He, L., Todorovic, S.: Destr: Object detection with split transformer. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9377–9386 (2022)
[13] Hou, L., Lu, K., Xue, J., Li, Y.: Shape-adaptive selection and measurement for oriented object detection. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 36, pp. 923–932 (2022)
[14] Hu, H., Gu, J., Zhang, Z., Dai, J., Wei, Y.: Relation networks for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3588–3597 (2018)
[15] Kim, J., Baek, J., Hwang, S.J.: Object detection in aerial images with uncertainty-aware graph network. In: European Conference on Computer Vision. pp. 521–536. Springer (2022)
[16] Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision. pp. 2980–2988 (2017)
[17] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: Ssd: Single shot multibox detector. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14. pp. 21–37. Springer (2016)
[18] Liu, Z., Yuan, L., Weng, L., Yang, Y.: A high resolution optical satellite image dataset for ship recognition and some new baselines. In: ICPRAM. pp. 324–331 (2017)
[19] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International Conference on Learning Representations (2019), https://openreview.net/forum?id=Bkg6RiCqY7
[20] Ming, Q., Zhou, Z., Miao, L., Zhang, H., Li, L.: Dynamic anchor learning for arbitrary-oriented object detection. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35, pp. 2355–2363 (2021)
[21] Pu, Y., Wang, Y., Xia, Z., Han, Y., Wang, Y., Gan, W., Wang, Z., Song, S., Huang, G.: Adaptive rotated convolution for rotated object detection. arXiv preprint arXiv:2303.07820 (2023)
[22] Redmon, J., Farhadi, A.: Yolov3: An incremental improvement (2018)
[23] Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems 28 (2015)
[24] Sun, Z., Cao, S., Yang, Y., Kitani, K.M.: Rethinking transformer-based set prediction for object detection. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 3611–3620 (2021)
[25] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)
[26] Xia, G.S., Bai, X., Ding, J., Zhu, Z., Belongie, S., Luo, J., Datcu, M., Pelillo, M., Zhang, L.: Dota: A large-scale dataset for object detection in aerial images. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3974–3983 (2018)
[27] Xie, X., Cheng, G., Wang, J., Yao, X., Han, J.: Oriented r-cnn for object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3520–3529 (2021)
[28] Xu, H., Jiang, C., Liang, X., Li, Z.: Spatial-aware graph relation network for large-scale object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9298–9307 (2019)
[29] Xu, Y., Fu, M., Wang, Q., Wang, Y., Chen, K., Xia, G.S., Bai, X.: Gliding vertex on the horizontal bounding box for multi-oriented object detection. IEEE transactions on pattern analysis and machine intelligence 43(4), 1452–1459 (2020)
[30] Yang, X., Yan, J., Feng, Z., He, T.: R3det: Refined single-stage detector with feature refinement for rotating object. In: Proceedings of the AAAI conference on artificial intelligence. vol. 35, pp. 3163–3171 (2021)
[31] Yang, X., Yan, J., Ming, Q., Wang, W., Zhang, X., Tian, Q.: Rethinking rotated object detection with gaussian wasserstein distance loss. In: International Conference on Machine Learning. pp. 11830–11841. PMLR (2021)
[32] Yang, X., Yang, J., Yan, J., Zhang, Y., Zhang, T., Guo, Z., Sun, X., Fu, K.: Scrdet: Towards more robust detection for small, cluttered and rotated objects. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8232–8241 (2019)
[33] Yang, X., Yang, X., Yang, J., Ming, Q., Wang, W., Tian, Q., Yan, J.: Learning high-precision bounding box for rotated object detection via kullback-leibler divergence. Advances in Neural Information Processing Systems 34, 18381–18394 (2021)
[34] Yang, X., Zhou, Y., Zhang, G., Yang, J., Wang, W., Yan, J., Zhang, X., Tian, Q.: The kfiou loss for rotated object detection. arXiv preprint arXiv:2201.12558 (2022)
[35] Zhao, J., Chu, J., Leng, L., Pan, C., Jia, T.: Rgrn: Relation-aware graph reasoning network for object detection. Neural Computing and Applications 35, 16671 – 16688 (2023), https://api.semanticscholar.org/CorpusID:258271811
[36] Zhou, Y., Yang, X., Zhang, G., Wang, J., Liu, Y., Hou, L., Jiang, X., Liu, X., Yan, J., Lyu, C., et al.: Mmrotate: A rotated object detection benchmark using pytorch. In: Proceedings of the 30th ACM International Conference on Multimedia. pp. 7331–7334 (2022)