License: CC BY 4.0
arXiv:2404.04140v1 [cs.CV] 05 Apr 2024

Tsinghua University, Haidian District, Beijing, 100084, China

Improving Detection in Aerial Images by Capturing Inter-Object Relationships

Botao Ren, Botian Xu, Yifan Pu, Jingyi Wang, Zhidong Deng (corresponding author)

Zhidong Deng is with Beijing National Research Center for Information Science and Technology (BNRist), Institute for Artificial Intelligence at Tsinghua University (THUAI), Department of Computer Science, State Key Laboratory of Intelligent Technology and Systems, Tsinghua University, Beijing 100084, China.

Abstract

In many image domains, the spatial distribution of objects in a scene exhibits meaningful patterns governed by their semantic relationships. In most modern detection pipelines, however, the detection proposals are processed independently, overlooking the underlying relationships between objects. In this work, we introduce a transformer-based approach to capture these inter-object relationships to refine classification and regression outcomes for detected objects. Building on two-stage detectors, we tokenize the region of interest (RoI) proposals to be processed by a transformer encoder. Specific spatial and geometric relations are incorporated into the attention weights and adaptively modulated and regularized. Experimental results demonstrate that the proposed method achieves consistent performance improvement on three benchmarks including DOTA-v1.0, DOTA-v1.5, and HRSC 2016, especially ranking first on both DOTA-v1.5 and HRSC 2016. Specifically, our new method has an increase of 1.59 mAP on DOTA-v1.0, 4.88 mAP on DOTA-v1.5, and 2.1 mAP on HRSC 2016, respectively, compared to the baselines.

Keywords: Aerial image processing, Object detection, Inter-object relationship utilization

1 Introduction

Figure 1: Visualization of a motivating example. (a) Baseline: detections obtained by the baseline [10] contain erroneous identifications. The upper left image shows a false detection of a ship on top of an airplane; the bottom left image shows an incorrect airplane detection with an unrealistic size. (b) Our method: improved results in which the false positives are effectively addressed, highlighting the importance of considering inter-object relationships in detection.

Object detection has been one of the most studied problems in computer vision due to its great value in practical applications ranging from surveillance and autonomous driving to natural disaster management. The field has seen impressive advancements due to novel models and training techniques developed in the past few years [17, 22, 1]. Among various domains, object detection in aerial images stands out with characteristics and challenges different from those presented in natural images: objects are distributed with drastically varying scales, orientations, and spatial densities.

To tackle this challenge, prior works proposed to improve the detection performance from different perspectives, achieving various degrees of success. Many efforts have focused on learning more appropriate features by exploiting the geometric properties, e.g., symmetry and rotational invariance, leading to novel architectures and data augmentation techniques [10, 27, 32, 30, 21]. Others have developed metrics and objectives [33, 31, 34] that better capture the nuances of aerial object detection.

Nevertheless, despite these advancements, most present-day detection models classify and localize objects independently [10, 27, 32], possibly due to the lack of an effective tool for modeling the co-presence of an arbitrary number of objects in an image. In other words, the spatial and semantic relationships among objects are not fully captured, often leading to false detections that overlook contextual dependencies and inter-object dynamics. As a motivating example, Fig. 1 illustrates the challenge of detecting each object instance based solely on its features, without considering these critical relationships. Aerial images, in particular, offer a unique setting where objects generally share the same plane, with little occlusion and perspective distortion, and therefore have stable inter-object relationships.

Addressing the aforementioned gap, in this paper, we propose a Transformer-based model on top of two-stage detectors to effectively capture and leverage the inter-object relationships. Concretely, we organize the Region of Interest (RoI) [8, 23] feature maps proposed in the first stage and the independent detection results on them into embeddings. The embeddings are then fed into a transformer where the features of candidate detections interact and aggregate. However, the self-attention module in ordinary Transformers, which computes the pairwise attention weights as dot products of embeddings, does not capture the spatial and geometric relationship directly. To overcome this, we design and incorporate additional encodings and attention functions, weighing the mutual influence between objects according to distances. The attention functions are adaptive to the scales and densities of the object distribution, which is crucial for the model to generalize across different image scenarios. We conduct comprehensive experiments on DOTA-v1.0, DOTA-v1.5 [26], and HRSC2016 [18] to evaluate the efficacy of our model, achieving an improvement of 1.59, 4.88, and 2.1 mAP over the baseline.

Our main contributions can be summarized as follows:

  • We introduce a novel Transformer-based model that extends the capability of two-stage detectors, enabling the effective encapsulation and utilization of inter-object relationships in aerial image detection.

  • Our model innovatively incorporates additional encodings and attention mechanisms that directly address spatial and geometric relationships, enhancing adaptability to the varying scales and densities in object distribution, a critical step forward for generalization in diverse aerial scenarios.

2 Related Works

2.1 Object Detection in Aerial Images

In the realm of aerial object detection, extensive research has been conducted to tackle the unique challenges posed by the diverse characteristics of aerial imagery. Numerous studies have explored both single-stage and two-stage methodologies. Notable two-stage methods include ReDet [10], which focuses on handling scale, orientation, and aspect ratio variations; Oriented RCNN [27], which introduces improved representations of oriented bounding boxes; and SCRDet [32], which is designed for dense clusters of small objects. Additionally, SASM [13], Gliding vertex [29], and Region of Interest (RoI) Transformer [6] have contributed to the advancement of two-stage approaches. On the other hand, single-stage methods such as R3Det [30], S2ANet [9], and DAL [20] have been developed, demonstrating the diversity of strategies employed in the pursuit of efficient aerial object detection. These methodologies often incorporate modifications to convolution layers, novel loss functions such as GWD [31], KLD [33], and KFIoU [34], as well as multi-scale training and testing strategies to enhance the robustness of object detection in aerial imagery. In particular, ReDet [10] and ARC [21] modify the convolution layers to explicitly cope with rotation. The evolving landscape of aerial object detection research reflects the ongoing efforts to address the complex challenges inherent in this field.

2.2 Capturing Inter-Object Relationships

Common detection systems handle candidate object instances individually. They implicitly assume that the objects in an image are distributed conditionally independently of one another, which is generally false in reality. Transformer-based architectures [25, 7] have shown impressive capability in relational modeling across multiple domains. To address the oversight mentioned above, [14] introduced an object relation module that computes attention [25] weights from both geometric and appearance features of object proposals. The module is also responsible for learning duplicate removal in place of Non-Maximum Suppression (NMS), leading to an end-to-end detector. More recently, DETR [2, 12] formulates detection as a set-prediction problem and sets up object queries that interact with each other in a Transformer decoder [25]. Its successors [4, 24] improved the framework’s efficiency by operating directly on features instead of object queries. Graph Neural Networks have also been explored as a powerful alternative in relation modeling for object detection and scene understanding. Typically, one constructs the graph with the objects being the nodes and the spatial relations as edges [15, 35, 5], whereas [28] models region-to-region relations with learned edges. These approaches thus differ mainly in how the edges are obtained. In comparison to prior works, our method focuses on aerial images, where the inter-object relationships are stable, and adopts a more explicit design.

3 Methodology

We build our method on the two-stage object detection framework presented in [10]. We start by following the original pipeline to obtain features and preliminary detections, which are then transformed into embeddings and input into the Transformer with additional encodings. To better leverage the Transformer, we introduce a novel attention function on top of the common scaled dot product and a set of spatial relations. It aims to reflect the degree of correlation between objects based on distances in the image, emphasizing neighboring detections while being aware of object scale and density. Eventually, we perform another detection on the features given by the Transformer to obtain the final results. The overview of our model is shown in Fig. 2.

Figure 2: Overview of our model. Left: the detection pipeline where the Transformer Encoder is trained to capture the relationships between RoI Tokens. Right: calculation of the adaptive attention weights. Slashes indicate gradient stopping.

3.1 RoI Tokens

In a two-stage detector, the Region Proposal Network (RPN) proposes $N$ RoIs for each image, from which we extract features $\{\mathbf{f}_i\}_{i=1}^{N}$. We apply the standard detection objective, namely classification and bounding box regression, on the features to obtain for each RoI a class label $c_i$ and a bounding box pose $(x_i, y_i, w_i, h_i, \alpha_i)$ representing the center coordinates, width, height and orientation, respectively.

Subsequently, we map $w_i, h_i$ through linear layers to high-dimensional embeddings $\mathbf{w}_i, \mathbf{h}_i$. They are then concatenated with the logits of the class distribution $\mathbf{c}_i$ to form the RoI Token:

$$\text{Token}_i = (\mathbf{f}_i \oplus \mathbf{c}_i \oplus \mathbf{w}_i \oplus \mathbf{h}_i) + \text{pos}(x_i, y_i) \tag{1}$$

where $\oplus$ denotes concatenation and the position encoding $\text{pos}(x_i, y_i)$ is computed as in [7] and added to enforce spatial information.

We will show in the experiments that having two detection phases (preliminary and final) is vital to the success of our model. This also distinguishes our work from prior work [14].
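As an illustration of Eq. 1, the following is a minimal NumPy sketch of how RoI Tokens could be assembled. It is not the authors' implementation: the function names `sincos_2d` and `build_roi_tokens`, the bias-free linear maps `W_w` and `W_h`, the feature sizes, and the exact form of the sinusoidal encoding are all assumptions made for the example.

```python
import numpy as np

def sincos_2d(x, y, dim, temperature=10000.0):
    """Sinusoidal 2D position encoding (assumed form): half of the channels
    encode x, the other half encode y."""
    assert dim % 4 == 0, "dim must be divisible by 4"
    quarter = dim // 4
    freqs = 1.0 / temperature ** (np.arange(quarter) / quarter)  # (dim/4,)
    xs = x[:, None] * freqs[None, :]
    ys = y[:, None] * freqs[None, :]
    return np.concatenate([np.sin(xs), np.cos(xs), np.sin(ys), np.cos(ys)], axis=1)

def build_roi_tokens(f, c, w, h, x, y, W_w, W_h):
    """Token_i = (f_i concat c_i concat w_i concat h_i) + pos(x_i, y_i), following Eq. 1."""
    w_emb = w[:, None] @ W_w  # (N, 1) @ (1, d_geo) -> (N, d_geo); bias omitted for brevity
    h_emb = h[:, None] @ W_h
    tokens = np.concatenate([f, c, w_emb, h_emb], axis=1)
    return tokens + sincos_2d(x, y, tokens.shape[1])

# Hypothetical sizes: 5 RoIs, 32-d features, 16 class logits, 8-d width/height embeddings.
rng = np.random.default_rng(0)
N, d_feat, n_cls, d_geo = 5, 32, 16, 8
f = rng.normal(size=(N, d_feat))            # RoI features f_i
c = rng.normal(size=(N, n_cls))             # class logits c_i from the preliminary head
w, h = rng.uniform(10, 100, N), rng.uniform(10, 100, N)
x, y = rng.uniform(0, 1024, N), rng.uniform(0, 1024, N)
W_w, W_h = rng.normal(size=(1, d_geo)), rng.normal(size=(1, d_geo))
tokens = build_roi_tokens(f, c, w, h, x, y, W_w, W_h)  # shape (5, 64)
```

In a trained model the projection matrices would be learned jointly with the detector; random values are used here only so that the sketch runs on its own.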

3.2 Spatial and Geometric Relations

Our method uses an encoder-only Transformer to capture the relationships between objects. However, the dot-product self-attention computed between tokens in Transformers reflects their semantic similarity rather than their spatial relations. Therefore, we introduce a series of $k$ relations $\{P^i\}_{i=1}^{k}$ accounting for the relative position and geometry between the preliminary detections, as listed in Tab. 1. Similar to self-attention, each relation is computed in a pair-wise manner, i.e., $P^i \in \mathbb{R}^{N \times N}$. We concatenate them into an $N \times N \times k$ tensor and aggregate them to $N \times N \times 1$ by passing through a linear layer:

$$P = \text{linear}(\text{stack}(P^1, \dots, P^k)). \tag{2}$$

Table 1: Spatial and geometric relations considered in computing the attention weights between two RoIs.
Name         Formula                     Description
$dx$         $x_2 - x_1$                 X-axis distance
$dy$         $y_2 - y_1$                 Y-axis distance
dist         $\sqrt{dx^2 + dy^2}$        Euclidean distance
$d\alpha$    $\alpha_2 - \alpha_1$       Angular difference
IoU          intersect / union           Intersection over Union
area         $(w_1 h_1)/(w_2 h_2)$       Relative area ratio
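The pairwise relations of Tab. 1 and their aggregation in Eq. 2 can be sketched as follows. This is a hedged illustration, not the paper's code: the function names are hypothetical, the linear layer is reduced to a weight vector plus a scalar bias, and the IoU is computed on axis-aligned boxes (orientation ignored) purely to keep the example short, whereas a real implementation would use a rotated-box IoU.

```python
import numpy as np

def pairwise_relations(x, y, w, h, alpha):
    """Return an (N, N, 6) tensor holding the relations of Tab. 1 for every RoI pair.
    Entry [i, j] treats RoI i as box 1 and RoI j as box 2."""
    dx = x[None, :] - x[:, None]
    dy = y[None, :] - y[:, None]
    dist = np.sqrt(dx ** 2 + dy ** 2)
    dalpha = alpha[None, :] - alpha[:, None]
    area = w * h
    area_ratio = area[:, None] / area[None, :]
    # Assumption: axis-aligned IoU as a stand-in for the rotated-box IoU.
    x1, x2 = x - w / 2, x + w / 2
    y1, y2 = y - h / 2, y + h / 2
    iw = np.clip(np.minimum(x2[:, None], x2[None, :]) - np.maximum(x1[:, None], x1[None, :]), 0.0, None)
    ih = np.clip(np.minimum(y2[:, None], y2[None, :]) - np.maximum(y1[:, None], y1[None, :]), 0.0, None)
    inter = iw * ih
    iou = inter / (area[:, None] + area[None, :] - inter)
    return np.stack([dx, dy, dist, dalpha, iou, area_ratio], axis=-1)

def aggregate_relations(relations, weight, bias=0.0):
    """Linear layer over the last axis: (N, N, k) -> (N, N), as in Eq. 2."""
    return relations @ weight + bias

# Three toy boxes of identical size; weights of the linear layer are placeholders.
x = np.array([0.0, 10.0, 0.0])
y = np.array([0.0, 0.0, 5.0])
w = np.full(3, 4.0)
h = np.full(3, 2.0)
alpha = np.zeros(3)
relations = pairwise_relations(x, y, w, h, alpha)
P = aggregate_relations(relations, weight=np.ones(6))  # shape (3, 3)
```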

3.3 Adaptive Attention Weights

An inherent challenge in aerial images is that objects in a scene can vary drastically in size, orientation, and aspect ratio. Also, certain object types tend to cluster densely (like cars in parking lots) or align in specific patterns (such as parallel tennis courts). Thus the relationship between an object and others should be highly specific to the object instance and the contextual information around it. Based on this observation, we devise a novel scheme to adaptively adjust the attention weights with the following considerations.

3.3.1 Spatial Distance

The scale of aerial images could be several kilometers. To focus more on the neighboring objects that are expected to be more relevant, we introduce a distance-decaying coefficient for each object pair:

$$A_{ij} = \underbrace{\exp\left(-(\epsilon_i d_{ij})^2 / \sigma^2\right)}_{\text{distance, scale and density}} \circ \underbrace{\mathbf{1}\{\text{IoU}_{ij} < \delta\}}_{\text{overlapping}} \tag{3}$$

where $d_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$ is the pair-wise distance, $\circ$ the element-wise product, and $\mathbf{1}\{\cdot\}$ the indicator function. $\epsilon_i$ is detailed next, and $\sigma$ is a hyperparameter.

3.3.2 Object Scale and Density

It is a natural intuition that the influence of one object on another relates to the scale (size) of the object, e.g., smaller objects tend to be more influenced by nearby objects, whereas larger objects need to capture the impact at longer ranges. The density around a proposed detection is another important factor. We assume that when there are fewer other RoIs around a detection (i.e., lower density), it should capture the influence of RoIs from further away. In contrast, in areas with high density (many RoIs), it should mainly interact with nearby RoIs. To qualitatively model these factors, we compute $\epsilon_i$ as:

$$\epsilon_i = \frac{S}{\sqrt{w_i h_i}} \times \exp(\bar{\rho}_i) \tag{4}$$

where $S$ is a global (dataset-wide) scale factor determined by the input image.

To account for the density around an RoI, we first calculate for the $i$-th RoI:

$$\rho_i = \sum_{j} w_i h_i \exp\left(-\left(\frac{S}{\sqrt{w_i h_i}} d_{ij}\right)^2 / \sigma^2\right), \tag{5}$$

then image-wise normalize $\rho_i$ and map them into $(-1, 1)$:

$$\bar{\rho}_i = \tanh\left((\rho_i - \text{mean}(\rho)) / \text{std}(\rho)\right). \tag{6}$$

Figure 3: Illustration of the adaptive attention weights w.r.t. one RoI. Areas with high opacity indicate higher weights $\epsilon_i$. In the second row, the left image demonstrates a tennis court with a larger area and fewer surrounding objects, resulting in a smaller $\epsilon_i$. This leads to a slower decay in Eq. 3 as distance increases, enabling the model to capture relationships with more distant objects. Conversely, the right image depicts a car with a smaller area and a higher density of surrounding objects, which entails a larger $\epsilon_i$, resulting in a faster decay of Eq. 3 and thus a focus on objects in closer proximity.
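To make Eqs. 4 to 6 concrete, here is a small NumPy sketch of the scale- and density-aware coefficient $\epsilon_i$. Several points are assumptions rather than facts from the paper: the sum in Eq. 5 is read as running over the other index $j$, coordinates and box sizes are taken to be normalized to the unit square, `S=1.0` is a placeholder value, and a small constant guards against a zero standard deviation. The function name `scale_density_coeff` is hypothetical.

```python
import numpy as np

def scale_density_coeff(x, y, w, h, S=1.0, sigma=4.0):
    """Compute eps_i (Eq. 4) from the normalized density rho_bar_i (Eqs. 5 and 6)."""
    scale = np.sqrt(w * h)                                   # object scale sqrt(w_i h_i)
    d = np.hypot(x[:, None] - x[None, :], y[:, None] - y[None, :])  # pairwise distance d_ij
    # Eq. 5, summing over j for every RoI i (assumed reading of the index).
    kernel = np.exp(-((S / scale)[:, None] * d) ** 2 / sigma ** 2)
    rho = (w * h) * kernel.sum(axis=1)
    # Eq. 6: image-wise standardization followed by tanh, giving values in (-1, 1).
    rho_bar = np.tanh((rho - rho.mean()) / (rho.std() + 1e-12))
    # Eq. 4: small or densely surrounded RoIs get a larger eps, hence a faster decay in Eq. 3.
    eps = S / scale * np.exp(rho_bar)
    return eps, rho_bar

# Toy scene: three small, tightly clustered boxes and one large isolated box.
x = np.array([0.10, 0.12, 0.14, 0.80])
y = np.array([0.10, 0.10, 0.10, 0.80])
w = np.array([0.02, 0.02, 0.02, 0.20])
h = np.array([0.01, 0.01, 0.01, 0.10])
eps, rho_bar = scale_density_coeff(x, y, w, h)
```

In this toy scene the large isolated box receives a smaller $\epsilon_i$ than the small clustered boxes, so its attention decays more slowly with distance, which matches the behaviour described for Fig. 3.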

3.3.3 RoI Overlapping

In addition to the aforementioned aspects, it is also necessary to mitigate the self-influence among multiple overlapping RoIs corresponding to the same object. Specifically, if we do not exclude these closely overlapping RoIs, their proximity to each other could lead to them being overly emphasized in the attention calculation while neglecting the interactions between RoIs of different objects. Therefore, we mask the attention weights to only consider RoIs with IoU below a certain threshold $\delta$.

The overall attention weights are calculated as

$$A \circ \text{softmax}(Q^{\text{T}} K + P) \tag{7}$$

where $P$ is the aggregated spatial and geometric relations computed in Eq. 2.
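The way Eq. 3 and Eq. 7 fit together can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' code: tokens are stored as rows, so $Q^{\text{T}} K$ is written as `Q @ K.T`; no $1/\sqrt{d}$ scaling or renormalization is applied because Eq. 7 does not show any; the threshold `delta=0.5` is a placeholder, since the text does not give its value; and the function names are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_attention(Q, K, P, eps, d, iou, sigma=4.0, delta=0.5):
    """Attention weights A * softmax(Q K^T + P), with A from Eq. 3.

    Q, K: (N, D) query and key matrices (one row per RoI Token)
    P:    (N, N) aggregated relations from Eq. 2
    eps:  (N,)   per-RoI coefficient from Eq. 4
    d:    (N, N) pairwise center distances
    iou:  (N, N) pairwise IoU
    """
    decay = np.exp(-(eps[:, None] * d) ** 2 / sigma ** 2)  # distance, scale and density term
    mask = (iou < delta).astype(decay.dtype)               # drop heavily overlapping RoIs
    A = decay * mask                                       # Eq. 3
    return A * softmax(Q @ K.T + P, axis=-1)               # Eq. 7

# Toy inputs with random queries and keys.
rng = np.random.default_rng(0)
N, D = 4, 8
Q, K = rng.normal(size=(N, D)), rng.normal(size=(N, D))
centers = rng.uniform(size=(N, 2))
d = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
iou = np.eye(N)  # each RoI overlaps only with itself in this toy case
weights = adaptive_attention(Q, K, np.zeros((N, N)), np.ones(N), d, iou)
```

Because the IoU of an RoI with itself is 1, the mask also removes the diagonal, so under this reading each token attends only to other, sufficiently distinct RoIs.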

4 Experiment

4.1 Dataset

DOTA is a large-scale aerial object detection dataset.

DOTA-v1.0 contains 2,806 images, with sizes ranging from $800 \times 800$ to $4000 \times 4000$ pixels. It includes 188,282 instances across 15 categories, annotated as: Plane (PL), Baseball Diamond (BD), Bridge (BR), Ground Track Field (GTF), Small Vehicle (SV), Large Vehicle (LV), Ship (SH), Tennis Court (TC), Basketball Court (BC), Storage Tank (ST), Soccer Ball Field (SBF), Roundabout (RA), Harbor (HA), Swimming Pool (SP), and Helicopter (HC). Following the common practice [10], we use both the training and validation sets for training and the test set for testing. We report mAP in PASCAL VOC2007 format and submit the testing result on the official dataset server.

DOTA-v1.5 uses the same image set but with increased annotations. This version features 402,089 instances and introduces an additional category, Container Crane (CC), broadening the dataset’s applicability in aerial object detection.

HRSC2016 focuses on ship detection in aerial images, containing 1,061 images with a total of 2,976 instances. Image sizes in this dataset range from 300×300 to 1500×900 pixels. The dataset is divided into training, validation, and test sets with 436, 181, and 444 images, respectively.

4.2 Implementation Details

Our implementation is based on the MMRotate [36] library and adopts ReDet’s framework and hyperparameter settings. We train our model for 12 epochs using the AdamW [19] optimizer with an initial learning rate of 1e-4, reduced to 1e-5 and 1e-6 at epochs 8 and 11, respectively. We also use a weight decay of 0.05. The experiments were conducted using two RTX 3090 GPUs.

The Transformer module consists of 6 encoder layers, similar to the ViT structure, and integrates sinusoidal two-dimensional absolute position encoding. The hyperparameter $\sigma$ is set to 4. A dropout rate of 0.1 is employed during the training phase of the Transformer.
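The step learning-rate schedule described above can be written as a short function. This is only a sketch: the name `learning_rate` is hypothetical, and it assumes that each reduction takes effect from the stated epoch onward, which the text does not specify.

```python
import math

def learning_rate(epoch, base_lr=1e-4, milestones=(8, 11), gamma=0.1):
    """Step schedule: multiply the rate by gamma once per milestone already reached."""
    drops = sum(epoch >= m for m in milestones)
    return base_lr * gamma ** drops

# Learning rate for each of the 12 training epochs (epochs counted from 0 here).
schedule = [learning_rate(e) for e in range(12)]
```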

Table 2: Results of each object class on the DOTA-v1.5 dataset. We highlight the best in blue.

Method            PL     BD     BR     GTF    SV     LV     SH     TC     BC     ST     SBF    RA     HA     SP     HC     CC     mAP
RetinaNet-O [16]  71.43  77.64  42.12  64.65  44.53  56.79  73.31  90.84  76.02  59.96  46.95  69.24  59.65  64.52  48.06  0.83   59.16
RF. R-CNN [23]    72.20  76.43  47.58  69.91  51.99  70.52  80.27  90.87  79.16  68.63  59.57  72.34  66.44  66.07  55.29  6.87   64.63
Mask R-CNN [11]   76.84  73.51  49.90  57.80  51.31  71.34  79.75  90.46  74.21  66.07  46.21  70.61  63.07  64.46  57.81  9.42   62.67
HTC [3]           77.80  73.67  51.40  63.99  51.54  73.31  80.31  90.48  75.21  67.34  48.51  70.63  64.84  64.48  55.87  5.15   63.40
RoI-Trans. [6]    72.27  81.95  54.47  70.02  52.49  76.31  81.03  90.90  84.19  69.12  62.85  72.73  68.67  65.89  57.09  7.12   66.69
ReDet [10]        79.20  82.81  51.92  71.41  52.38  75.73  80.92  90.83  75.81  68.64  49.29  72.03  73.36  70.55  63.33  11.53  66.86
Ours              80.79  85.58  53.80  71.74  52.72  77.84  88.71  90.89  86.24  74.73  64.97  73.31  76.76  72.99  73.84  22.87  71.74

Table 3: More experiments on DOTA-v1.5. We highlight our method’s improvements in blue.

Method                       mAP
RoI Transformer              66.69
RoI Transformer + Ours       68.61 (+1.92)
Oriented RCNN                65.94
Oriented RCNN + Ours         66.38 (+0.44)
Rotated Faster RCNN          64.63
Rotated Faster RCNN + Ours   65.28 (+0.65)
ReDet                        66.86
ReDet + Ours                 71.74 (+4.88)

4.3 Comparison with Baselines

First, we evaluate our model against the baselines on DOTA-v1.0, DOTA-v1.5, and HRSC2016 to demonstrate the efficacy of the proposed method. The results are shown in Tab. 2 (DOTA-v1.5) and Tab. 4 (DOTA-v1.0 and HRSC2016). To further illustrate the versatility of our approach, we conducted additional experiments on DOTA-v1.5 with our method applied to various models, yielding consistent improvements, as shown in Tab. 3. These results demonstrate that our method consistently outperforms the baselines across different datasets. Notably, however, the improvement achieved on HRSC2016 is marginal compared to that on DOTA-v1.5. This is possibly because a single image in HRSC2016 contains far fewer instances (typically fewer than 4), leaving limited opportunities to leverage the inter-object relationships. These findings suggest that our model’s strengths are most pronounced in scenarios rich in object interactions and contextual dynamics, aligning with our design’s focus on capturing and utilizing inter-object relationships.

Table 4: Results in COCO style on DOTA-v1.0 and HRSC2016. We highlight our method’s improvements in blue.

Dataset     Method   AP50            AP75            mAP
DOTA-v1.0   ReDet    76.25           50.86           47.11
DOTA-v1.0   Ours     77.84 (+1.59)   51.42 (+0.56)   48.32 (+1.21)
HRSC2016    ReDet    90.46           89.46           70.41
HRSC2016    Ours     90.49 (+0.03)   89.67 (+0.21)   72.51 (+2.10)

Table 5: Detection performance with and without preliminary detection training.
pre cls supervision   mAP
w/o                   69.86
w                     71.74

Table 6: Effect of the components in computing the attention weights via Eq. 7.
Transformer   Relations ($P$)   Ada. Weight ($A$)   mAP
                                                    66.86
                                                    69.02
                                                    71.09
                                                    70.87
                                                    71.74
Table 7: Design choices of the relations and adaptive weights.

(a) Effect of the spatial and geometric relations.
Method                    mAP
$dx$, $dy$, $d\alpha$     70.84
  + dist                  70.96
    + IoU                 71.51
      + area              71.74

(b) Effect of the factors in computing $\beta$.
Attention Weights ($\beta$)              mAP
baseline (ReDet)                         66.86
- scale ($\epsilon = \sqrt{S} = 32$)     70.35 (+3.49)
- overlap ($\delta = 1$)                 67.32 (+0.46)
- density ($\bar{\rho}_i = 0$)           69.85 (+2.99)
Ours                                     71.74 (+4.88)

4.4 Ablation Study

To understand the effect of each design choice in our model, we conduct a series of experiments to shed light on the following:

4.4.1 Preliminary Detection Phase

Compared to the standard detection pipeline, our model incorporates two detection heads, placed before and after the Transformer module. The output from the initial detection phase, termed "preliminary detections", includes a classification result (parameterized as a softmax distribution) from the first head, which forms a component of the RoI token. We posit that knowing the class information with uncertainties would help with reasoning about the inter-object relationships. To empirically validate this hypothesis, we compared the performance of our model with and without training the first detection head. As Tab. 5 shows, although solely incorporating the Transformer improves mAP over the baseline, omitting the preliminary detection leads to a notable decline in performance (69.86 vs. 71.74 mAP). This suggests that relying only on the Transformer for RoIs to interact lacks efficacy. In contrast, the explicit inclusion of preliminary classification data, despite its potential inaccuracies, enhances the model’s ability to reason about semantic and contextual relationships. The results underscore the value of early classification cues in guiding the relational reasoning process within our proposed architecture.

4.4.2 Spatial and Geometric Relations

The different terms presented in Tab. 1 characterize various aspects of the spatial and geometric relationships among objects (RoI Tokens) within an image. In this section, we aim to empirically evaluate the individual and collective contributions of these spatial and geometric relational terms to the overall performance of our detection model. As shown in Tab. 7(a), IoU and relative area contribute the most. Intuitively, they are particularly helpful when reasoning about the co-occurrence and spatial arrangement of objects. For example, IoU helps to disambiguate overlapping, potentially duplicate or conflicting detections. Similarly, relative area aids in discerning the size relationship between objects. Consequently, our method can effectively solve the problem in the motivating example. See Sec. 5.2 for details.

4.4.3 Adaptive Attention Weights

As mentioned in Sec. 3.3, making the attention weights adaptive to specific RoI Tokens is essential to cope with the diversity and complexities in a scene. We evaluate our density- and scale-aware attention weighting scheme, which is designed to augment the scaled-dot-product self-attention and allow the model to dynamically adjust its focus based on the scale of objects and their surrounding density. Findings in Tab. 7(b) indicate that masking the influence of overlapping RoIs plays a crucial role. This observation aligns with our initial understanding that indiscriminately emphasizing neighboring RoIs, without considering overlap, could lead to skewed attention distributions and potentially impair the model’s ability to accurately discern between distinct objects.

5 Analysis

To gain insights into how inter-object relationships have improved detection performance, we collect and analyze dataset-wise statistics and specific examples.

5.1 Evidential Statistics

By examining the data, we found that many false detections deviate far from the typical scales associated with their respective categories. To investigate this observation, we compute for each category the mean and standard deviation of the object scale $\sqrt{w_i h_i}$ using detections with confidence $> 0.9$ on the test set. We then identify outliers as those detections deviating from the mean by more than three times the standard deviation. This method provides a rough measure of the frequency of incorrect scale detections. As shown in Fig. 4, the detections produced by our method have substantially fewer outliers than those of the baseline. This result suggests that our model better maintains scale consistency across different object categories. This improvement is particularly vital in aerial image analysis, where scale variance is substantial and often indicative of the detection model's reliability and robustness.

Figure 4: Count of outliers (in log-scale) for each category on the test dataset.
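The outlier count described above can be sketched in a few lines. The input format (tuples of category, width, height, confidence) and the function name are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev


def count_scale_outliers(detections, min_conf=0.9, k=3.0):
    """Count per-category scale outliers among confident detections.

    detections: iterable of (category, width, height, confidence) tuples.
    A detection is an outlier if its scale sqrt(w * h) deviates from the
    category mean by more than k standard deviations.
    """
    scales = defaultdict(list)
    for category, w, h, conf in detections:
        if conf > min_conf:  # keep only high-confidence detections
            scales[category].append(math.sqrt(w * h))

    counts = {}
    for category, values in scales.items():
        mu = mean(values)
        sigma = pstdev(values)  # population std; an assumption, see lead-in
        counts[category] = sum(1 for s in values if abs(s - mu) > k * sigma)
    return counts
```

Note that with a three-sigma rule a single extreme detection can only be flagged when the category has enough samples, so the count is most meaningful for well-populated categories.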

Additionally, our visual analysis revealed a common misclassification of many land-based objects as Ship. To quantify this observation, we compute the average Chamfer distance between certain categories $S_1$ and $S_2$ in an image:

$$d(S_1, S_2) = \frac{1}{2}\left(\frac{1}{|S_1|}\sum_{i \in S_1}\min_{j \in S_2}\|\text{dist}_{ij}\|_2^2 + \frac{1}{|S_2|}\sum_{j \in S_2}\min_{i \in S_1}\|\text{dist}_{ij}\|_2^2\right). \tag{8}$$

Table 8: Average Chamfer distance between detections of ship, small-vehicle, plane, and harbor.
Categories | Baseline | Ours
Ship ⇔ Small Vehicle | 504.68 | 845.17 ↑
Ship ⇔ Plane | 1000.48 | 1699.70 ↑
Harbor ⇔ Ship | 266.54 | 234.90 ↓

As Table 8 shows, the results are in line with the logical expectation that Ship instances should be found in water, near Harbor, but distant from Small Vehicle. This finding underscores our model’s effectiveness in accurately understanding the spatial arrangement of objects, further validating the benefits of our approach in handling complex aerial imagery.
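For reference, Eq. (8) can be computed as in the sketch below. It assumes that each detection is represented by a 2-D point such as its box center and that $\text{dist}_{ij}$ is the displacement between two such points; the text above does not spell this out, so both the representation and the function name are assumptions.

```python
def chamfer_distance(s1, s2):
    """Symmetric Chamfer distance of Eq. (8) between two sets of 2-D points.

    s1, s2: non-empty sequences of (x, y) points, assumed here to be the
    centers of the detections belonging to two categories in one image.
    """
    if not s1 or not s2:
        raise ValueError("both point sets must be non-empty")

    def squared_distance(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

    def one_sided(src, dst):
        # Mean over src of the squared distance to the nearest point in dst.
        return sum(min(squared_distance(p, q) for q in dst) for p in src) / len(src)

    return 0.5 * (one_sided(s1, s2) + one_sided(s2, s1))
```

A larger value means the two categories tend to lie farther apart in the image, which is the direction expected for Ship versus Small Vehicle, and a smaller value is expected for Harbor versus Ship.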

5.2 Qualitative Comparison

Figure 5: Qualitative comparison of our model and ReDet. Our method effectively reduces semantically incompatible detections, e.g., Small Vehicle and Ship on tennis courts.

5.2.1 Conflict Removal

Perhaps the most obvious advantage of considering inter-object relationships is that it helps to ensure the arrangement of detected objects aligns with real-world expectations and the dataset distribution. For example, a basketball or tennis court should be no larger than an airplane, and two airplanes in the same image should have consistent sizes. Most importantly, the nose of a plane, despite some visual resemblance, should not be classified as a ship. These examples (Fig. 1, Fig. 5) underscore the model's proficiency in removing conflicts and ensuring logical and physically feasible detections.

5.2.2 Co-occurrence Recognition

Beyond resolving conflicts, our model exhibits a remarkable ability to leverage patterns of co-occurrence for enhanced detection accuracy. An illustrative case is the identification of a fleet of aligned planes (Fig. 5, third column). This aspect of co-occurrence pattern recognition is particularly beneficial in complex aerial images where objects often appear in structured groups or formations.

5.2.3 Failure Case

Unfortunately, there are cases where incorrect understanding of relationships could lead to even worse results, as depicted in Fig. 6.

Figure 6: An example failure case. The harbor and ship detections are all incorrect but reinforce each other.

6 Conclusion

In conclusion, this study presents a novel approach to object detection in aerial imagery, leveraging a Transformer-based architecture augmented with a two-stage detection process and adaptive attention mechanisms. Our model effectively captures and utilizes the spatial and geometric relationships among objects, as evidenced by its improved performance in handling diverse and complex scenes. The introduction of preliminary detection heads and the innovative use of scale- and density-aware attention weighting schemes have been shown to be particularly effective in enhancing detection accuracy.

Despite its strengths, our model has certain limitations. Firstly, the inclusion of the Transformer module notably increases computational demands, although the impact on training time is less pronounced. Secondly, inaccuracies in understanding object relationships can occasionally degrade detection results, as highlighted in some failure cases. Lastly, the model’s design incorporates several choices that may be specific to the dataset and application domain. Future efforts should aim to mitigate these limitations, striving towards a more universally applicable framework.

References

  • [1] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: European conference on computer vision. pp. 213–229. Springer (2020)
  • [2] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: European conference on computer vision. pp. 213–229. Springer (2020)
  • [3] Chen, K., Pang, J., Wang, J., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Shi, J., Ouyang, W., et al.: Hybrid task cascade for instance segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4974–4983 (2019)
  • [4] Chen, P., Zhang, M., Shen, Y., Sheng, K., Gao, Y., Sun, X., Li, K., Shen, C.: Efficient decoder-free object detection with transformers. In: European Conference on Computer Vision. pp. 70–86. Springer (2022)
  • [5] Chen, S., Li, Z., Huang, F., Zhang, C., Ma, H.: Object detection using dual graph network. 2020 25th International Conference on Pattern Recognition (ICPR) pp. 3280–3287 (2021), https://api.semanticscholar.org/CorpusID:233876987
  • [6] Ding, J., Xue, N., Long, Y., Xia, G.S., Lu, Q.: Learning roi transformer for oriented object detection in aerial images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2849–2858 (2019)
  • [7] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  • [8] Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 580–587 (2014)
  • [9] Han, J., Ding, J., Li, J., Xia, G.S.: Align deep features for oriented object detection. IEEE Transactions on Geoscience and Remote Sensing 60, 1–11 (2021)
  • [10] Han, J., Ding, J., Xue, N., Xia, G.S.: Redet: A rotation-equivariant detector for aerial object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2786–2795 (2021)
  • [11] He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: Proceedings of the IEEE international conference on computer vision. pp. 2961–2969 (2017)
  • [12] He, L., Todorovic, S.: Destr: Object detection with split transformer. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9377–9386 (2022)
  • [13] Hou, L., Lu, K., Xue, J., Li, Y.: Shape-adaptive selection and measurement for oriented object detection. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 36, pp. 923–932 (2022)
  • [14] Hu, H., Gu, J., Zhang, Z., Dai, J., Wei, Y.: Relation networks for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3588–3597 (2018)
  • [15] Kim, J., Baek, J., Hwang, S.J.: Object detection in aerial images with uncertainty-aware graph network. In: European Conference on Computer Vision. pp. 521–536. Springer (2022)
  • [16] Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision. pp. 2980–2988 (2017)
  • [17] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: Ssd: Single shot multibox detector. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14. pp. 21–37. Springer (2016)
  • [18] Liu, Z., Yuan, L., Weng, L., Yang, Y.: A high resolution optical satellite image dataset for ship recognition and some new baselines. In: ICPRAM. pp. 324–331 (2017)
  • [19] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International Conference on Learning Representations (2019), https://openreview.net/forum?id=Bkg6RiCqY7
  • [20] Ming, Q., Zhou, Z., Miao, L., Zhang, H., Li, L.: Dynamic anchor learning for arbitrary-oriented object detection. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35, pp. 2355–2363 (2021)
  • [21] Pu, Y., Wang, Y., Xia, Z., Han, Y., Wang, Y., Gan, W., Wang, Z., Song, S., Huang, G.: Adaptive rotated convolution for rotated object detection. arXiv preprint arXiv:2303.07820 (2023)
  • [22] Redmon, J., Farhadi, A.: Yolov3: An incremental improvement (2018)
  • [23] Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems 28 (2015)
  • [24] Sun, Z., Cao, S., Yang, Y., Kitani, K.M.: Rethinking transformer-based set prediction for object detection. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 3611–3620 (2021)
  • [25] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)
  • [26] Xia, G.S., Bai, X., Ding, J., Zhu, Z., Belongie, S., Luo, J., Datcu, M., Pelillo, M., Zhang, L.: Dota: A large-scale dataset for object detection in aerial images. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3974–3983 (2018)
  • [27] Xie, X., Cheng, G., Wang, J., Yao, X., Han, J.: Oriented r-cnn for object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3520–3529 (2021)
  • [28] Xu, H., Jiang, C., Liang, X., Li, Z.: Spatial-aware graph relation network for large-scale object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9298–9307 (2019)
  • [29] Xu, Y., Fu, M., Wang, Q., Wang, Y., Chen, K., Xia, G.S., Bai, X.: Gliding vertex on the horizontal bounding box for multi-oriented object detection. IEEE transactions on pattern analysis and machine intelligence 43(4), 1452–1459 (2020)
  • [30] Yang, X., Yan, J., Feng, Z., He, T.: R3det: Refined single-stage detector with feature refinement for rotating object. In: Proceedings of the AAAI conference on artificial intelligence. vol. 35, pp. 3163–3171 (2021)
  • [31] Yang, X., Yan, J., Ming, Q., Wang, W., Zhang, X., Tian, Q.: Rethinking rotated object detection with gaussian wasserstein distance loss. In: International Conference on Machine Learning. pp. 11830–11841. PMLR (2021)
  • [32] Yang, X., Yang, J., Yan, J., Zhang, Y., Zhang, T., Guo, Z., Sun, X., Fu, K.: Scrdet: Towards more robust detection for small, cluttered and rotated objects. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8232–8241 (2019)
  • [33] Yang, X., Yang, X., Yang, J., Ming, Q., Wang, W., Tian, Q., Yan, J.: Learning high-precision bounding box for rotated object detection via kullback-leibler divergence. Advances in Neural Information Processing Systems 34, 18381–18394 (2021)
  • [34] Yang, X., Zhou, Y., Zhang, G., Yang, J., Wang, W., Yan, J., Zhang, X., Tian, Q.: The kfiou loss for rotated object detection. arXiv preprint arXiv:2201.12558 (2022)
  • [35] Zhao, J., Chu, J., Leng, L., Pan, C., Jia, T.: Rgrn: Relation-aware graph reasoning network for object detection. Neural Computing and Applications 35, 16671 – 16688 (2023), https://api.semanticscholar.org/CorpusID:258271811
  • [36] Zhou, Y., Yang, X., Zhang, G., Wang, J., Liu, Y., Hou, L., Jiang, X., Liu, X., Yan, J., Lyu, C., et al.: Mmrotate: A rotated object detection benchmark using pytorch. In: Proceedings of the 30th ACM International Conference on Multimedia. pp. 7331–7334 (2022)