Deep Transformer Network for Monocular
Pose Estimation of Ship-Based UAV

Maneesha Wickramasuriya111Ph.D. Candidate, Department of Mechanical and Aerospace Engineering, The George Washington University The George Washington University, 800 22nd St NW, Washington DC 20052 Taeyoung Lee222Professor, Department of Mechanical and Aerospace Engineering, The George Washington University, 800 22nd St NW, Washington DC 20052 The George Washington University, 800 22nd St NW, Washington DC 20052 Murray Snyder333Professor Emeritus, Department of Mechanical and Aerospace Engineering, The George Washington University, 800 22nd St NW, Washington DC 20052 The George Washington University, 800 22nd St NW, Washington DC 20052
Abstract

This paper introduces a deep transformer network for estimating the relative 6D pose of a Unmanned Aerial Vehicle (UAV) with respect to a ship using monocular images. A synthetic dataset of ship images is created and annotated with 2D keypoints of multiple ship parts. A Transformer Neural Network model is trained to detect these keypoints and estimate the 6D pose of each part. The estimates are integrated using Bayesian fusion. The model is tested on synthetic data and in-situ flight experiments, demonstrating robustness and accuracy in various lighting conditions. The position estimation error is approximately 0.8% and 1.0% of the distance to the ship for the synthetic data and the flight experiments, respectively. The method has potential applications for ship-based autonomous UAV landing and navigation. Source code : fdcl-gwu/TNN-MO, Video : https://youtu.be/ZG_zVVS8xw8

1 Introduction

Unmanned Aerial Vehicles (UAVs) have seen a surge in usage across a multitude of industries, such as aerial photography, military operations, agriculture, map**, and surveying. The advantages of UAVs over traditional manned aircraft are numerous, including cost-effectiveness, enhanced safety, and superior flexibility. However, the autonomous operation of UAVs, particularly their ability to land on moving platforms like ships, poses a crucial challenge. This capability is of significant importance for industries that depend on maritime transportation or offshore operations.

A primary challenge in this context is the estimation of the UAV’s relative pose with respect to the ship, which is vital for precise control of the UAV’s movements and ensuring a safe landing. Conventionally, the relative pose has been determined using the Real-Time Kinematic (RTK) Global Positioning System (GPS). To receive RTK-GPS, a communication link between the ship and the UAV must be maintained at all times, typically via radio. As a result, the UAV cannot estimate its relative position independently. In military surveillance operations, the broadcasting of radio signals to maintain this communication link could potentially reveal the ship’s position, thereby increasing its vulnerability. Also, reliance on GPS can lead to issues in situations where GPS signals may be weak or unavailable, and it is susceptible to malicious activities, such as jamming or spoofing, which can disrupt the operation of the UAV.

An alternative to GPS for estimating the relative pose of a UAV could involve the use of a monocular camera mounted on the UAV. Then, captured Red-Green-Blue (RGB) camera images can be processed to estimate the UAV’s relative pose. This method provides a potential solution for independent and secure pose estimation. Traditionally, ArUco markers have been employed to estimate the relative pose of the camera. ArUco markers are a type of fiducial marker, which are square markers with a specific binary pattern that can be easily detected and identified by a computer [1, 2]. However, the use of ArUco markers comes with certain disadvantages, such as the need for clear visibility and specific lighting conditions, their susceptibility to occlusion, and the requirement for the markers to be within the camera’s field of view. These limitations can affect the accuracy and reliability of pose estimation.

In our preceding studies, we accomplished autonomous flight for shipboard launch and landing using feature-based Visual-Inertial Navigation (VIN) [3]. However, it turned out that the accuracy of estimating the relative pose depends on the proximity of the camera to the landing pad. The quality and quantity of features significantly influence the accuracy of the relative pose. Moreover, the number of features extracted under variable lighting conditions can fluctuate and features extracted from the environment outside the ship, such as the sky and ocean, adversely affect the accuracy of relative pose estimation.

Recent advancements in object pose estimation with deep learning, particularly the use of Convolutional Neural Networks (CNNs), have shown significant progress. Multiple convolution layers for feature extraction, object detection, and/or segmentation are utilized for object silhouette prediction to estimate the 6D pose of an object [4, 5]. Recently, a transformer architecture based on the self-attention mechanism, originally introduced for Natural Language Processing (NLP), has been successfully applied to various computer vision tasks. The DEtection TRansformer (DETR) is one such transformer-based neural network architecture for object detection [6]. This has been extended with the YOLOPose architecture [7] to perform multi-object 6D pose estimation using both direct regression and keypoint regression.

In one of our recent studies, we introduced a Transformer Neural Network for Single Object (TNN-SO) model to estimate the 6D pose from the ship [8], where we took the advantage of tracking a part of a ship near the landing pad as a single object rather than tracking the entire ship. However, despite the robustness of the TNN-SO model for variable lighting and the higher accuracy closer to the landing pad, the accuracy of pose estimation significantly decreased for longer ranges approximately greater than 6m from the center of the landing pad. This is because the visibility diminished as the UAV flies away from the ship.

Refer to caption
Figure 1: We propose a Transformer Neural Network for Multi Object (TNN-MO) for 6D pose estimation from RGB image, which exhibits excellent accuracy and shows promising potential for vision-based UAV landing.

To address this issue, in this paper, we propose an innovative method that facilitates multi-object detection using the Transformer Neural Network (TNN). In this approach, the ship is decomposed into multiple parts with varying sizes and locations, as opposed to the previous single object detection method. We model a Transformer Neural Network for Multi Object (TNN-MO) by utilizing the DETR architecture to estimate the 6D pose of a UAV relative to the selected parts of the ship. The multiple pose estimates with respect to those parts are integrated in a Bayesian fashion.

More specifically, the proposed TNN-MO is implemented as follows. The performance of deep neural networks is constrained by the quality and quantity of data used in the training and validation steps. However, the absence of a large labeled dataset containing dynamically changing environments for a real-world ship poses a significant challenge in effectively training the transformer neural network model, especially when there are multiple objects to be detected. To tackle the challenges in constructing a rich dataset in the real world, we generate synthetic images by rendering a 3D model of the ship with randomly distributed camera poses under various textured environments and backgrounds to train the TNN-MO model. After training the TNN-MO model, it is able to detect keypoints of multiple parts of the ship.

To estimate the 6D poses, we utilize the EPnP algorithm [9], leveraging the 2D-3D correspondences of the predicted 2D keypoints of multiple parts of the ship and its known 3D keypoints. Then, the resulting pose estimates of multiple ship parts are filtered when the object class confidence is greater than 0.9, and the filtered pose estimates are integrated using Bayesian fusion. Upon training both the TNN-SO [8] and TNN-MO models, we observed that the position estimation accuracy of the TNN-MO model surpassed that of the TNN-SO model, even at extended ranges, when evaluated with the synthetic test dataset.

To assess the performance of our models in real-world scenarios, we collected images from the camera-attached data collection system (DCS) [10] on an octocopter UAV launched from the YP689 vessel, operated by the United States Naval Academy (USNA) [11]. The DCS is also equipped with RTK GPS, along with a base station mounted at the flight deck of the ship and a dedicated wireless communication link, to measure the relative position up to a centimeter-level accuracy, which is considered as a reference. It is verified that the position error of the proposed TNN-MO in the real world is approximately 2% of the range.

Our proposed approach successfully estimates the relative pose with monocular real-world images, even under challenging conditions such as varying light and different viewpoints. Notably, the TNN-MO model, compared to the TNN-SO model, demonstrates remarkable pose estimation accuracy over a longer range by leveraging the benefits of multi-object detection, while maintaining high accuracy near the landing area. This demonstrates promising potential for vision-based UAV landing and navigation by eliminating the need for GPS, highlighting the effectiveness and precision enhancement offered by our method.

This paper is organized as follows. We present a virtual environment for a ship model in Section 2, from which synthetic data for training and validation are generated in Section 3. Next, Section 4 presents the training procedure and the Bayesian fusion scheme, followed by validations with the synthetic data and with flight experiments, respectively in Section 5 and Section 6.

2 Virtual Environments for Synthetic Data

This paper focuses on a specific scenario of pose estimation relative to a research vessel operated by the United States Naval Academy, specifically YP689 (or identical prior research vessel YP700), over Chesapeake Bay, Maryland. However, it is impractical to obtain a large dataset of real-world images for a vessel in the ocean with the exact pose under dynamically changing environments, due to the substantial time and cost required. To overcome this, we generate a synthetic dataset from a 3D CAD model using open-source rendering software, with which the proposed network is trained. This section outlines the detailed steps involved in creating the virtual environments, including a CAD model of the ship, texture, and illumination conditions.

2.1 Development of the 3D CAD Model

First, we create a CAD model by capturing point clouds of the vessel. We captured a footage of the entire ship using a monocular RGB camera mounted on an octocopter [10] [11]. Utilizing this video, we generated a 3D point cloud of the ship using structure from motion techniques in computer vision [12]. The resulting 3D point cloud is corrupted by noise and it incomplete as the octocopter is not allowed to fly near the bow side of the ship. It is first was manually filtered to remove outliers, then the cleaned point cloud was imported into an open-source rendering software, namely Blender, to create a CAD model [13]. During this step, detailed structures on the ship not captured by the point cloud, such as the ladders, ship bow, and fence, were manually added too. Furthermore, the CAD model was rescaled according to the actual size of the ship. To enhance the realistic view of the scene, we included other objects such as the ocean, sun, and humans. Also, to make ocean more realistic, we randomly generated wave patterns for every scene using Blender wave modifier tool.

Refer to caption
Figure 2: 3D model based on the filtered point cloud and a CAD model obtained from the point cloud

2.2 Assignment of Textures

Refer to caption
(a)
Refer to caption
(b)
Refer to caption
(c)
Refer to caption
(d)
Refer to caption
(e)
Figure 3: Texture Categories: (a) Ocean-like, (b) Sky-like, (c) Vessel skin, (d) Landing pad, and (e) Landing pad markings. These texture categories are utilized for texture assignment in our study.

When operating a UAV in the vicinity of a ship, the environment is subject to frequent and dynamic changes. Real images are often influenced by variations in lighting, weather conditions, changes in cloud and wave patterns, occlusions, and other factors. These variables make the detection, recognition, and pose estimation of vessels a challenging task. Training a neural network using synthetic images generated in a static environment may not yield successful results due to the discrepancy between the simulated and real environments, a phenomenon referred to as the sim-to-real gap.

To address this issue, we generate synthetic images using domain randomization, where non-photo-realistic and realistic textures are randomly assigned to each object in Blender [14]. Texture assignment, which involves map** necessary textures onto 3D objects, is a crucial step in the process of generating synthetic images. In this work, we utilized two sources to create texture categories: the Describable Textures Dataset (DTD) [15] and the open-source platform OpenGameArt [16]. The categorization of these textures was based on their image frequency and the characteristics of the textures. Specifically, textures with high frequencies or those resembling oceans were grouped together as "Ocean-like textures". Textures selected with low frequencies were categorized as "Vessel skin textures". Similarly, textures with low frequencies or those resembling the sky were grouped together as "Sky-like textures". To define the landing pad and its markings, we grouped together dark, asphalt-like textures as "Landing pad textures", while whitish color textures were categorized as "Landing pad markings textures", as shown in Figure 3.

Randomly assigning different textures to objects can enhance object detectability in dynamic environments, which is beneficial for improving the performance of the neural network (NN) model, especially in real-world environments, thereby bridging the sim-to-real gap. Texture categorization is also important as it allows the NN to recognize the shape of the 3-D object (vessel) instead of solely relying on object features.

2.3 Environmental Illumination and Occlusions

Refer to caption
Figure 4: Illustration of randomly varied ambient lighting conditions and human models

In the ocean environments, the lighting conditions may vary drastically. Furthermore, the illumination of objects in an image can be affected by changes in the orientation of the objects or the camera, even under consistent lighting conditions. For instance, the orientation of an object or the camera may result in the appearance of shadows under direct sunlight. Most cameras are equipped with an automatic exposure control (AE) system that adjusts the brightness of the image based on the lighting conditions. As a result, the lighting conditions in the captured image can vary depending on the adjustments made by the camera’s built-in AE system. This variability in lighting conditions poses challenges for object detection, tracking, and pose estimation when using real images. As such, randomizing object texture may not fully mitigate these challenges. To address this, we have enhanced the scene by introducing randomly varying lighting conditions, as illustrated in Figure 4.

In real-world scenarios, human operators and other objects are present in the vessel, leading to occlusions that cannot be overlooked. For example, during any UAV operation to collect real test data, the operator must be positioned on the landing pad. The occlusions created by such instances become critical during the UAV’s landing and take-off stages. To account for these occlusions, we have incorporated human models and other objects randomly during data generation.

3 Data Generation in Synthetic Environments

In this section, we incorporate a virtual camera object into the scene to create synthetic images. Importantly, the camera’s properties were configured to match those of the actual camera used for image collection, particularly in terms of field of view and resolution. This step is vital in ensuring the accuracy of the pose estimation for real images.

The camera can be positioned anywhere within the 3D space by defining its position and orientation. One of the primary objectives of this work is to estimate the pose of the camera with data-driven learning. Therefore, it is crucial to incorporate various random camera poses in the dataset. The distribution of the camera pose chosen for the training data should ideally reflect the distribution of the pose encountered during nominal shipboard operations for launch and recovery. To ensure this, we formulate a probability density function (PDF) for the camera pose. This PDF is designed such that the density value increases near the landing pad when the camera is pointed toward the bow side. This distribution is then dispersed to cover the flight envelope of nominal operations. The training dataset is generated by sampling the camera pose from this distribution and generating the corresponding image in the synthetic environment, as discussed in the previous section. The process of sampling the relative pose of the camera involves three steps: defining the distribution of the camera pose, denoted by C𝐶Citalic_C, defining the distribution of the camera is pointed toward, denoted by F𝐹Fitalic_F, and formulating the attitude of the camera based on these distributions [8].

3.1 Formulating Probability Density for Position

We first present the probability distribution for C𝐶Citalic_C and F𝐹Fitalic_F. Let the base frame B𝐵Bitalic_B be fixed to the flight deck of the ship. The x𝑥xitalic_x-axis of this frame points to the starboard side of the ship (right side of the helmsman), the y𝑦yitalic_y-axis points toward the bow (front) of the ship, and the z𝑧zitalic_z-axis points upwards. These axes are denoted by xBsubscript𝑥𝐵x_{B}italic_x start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT, yBsubscript𝑦𝐵y_{B}italic_y start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT, and zBsubscript𝑧𝐵z_{B}italic_z start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT, as illustrated in Figure 5. The location of C𝐶Citalic_C or F𝐹Fitalic_F is defined by the spherical coordinates (r,θB,ϕB)𝑟subscript𝜃𝐵subscriptitalic-ϕ𝐵(r,\theta_{B},\phi_{B})( italic_r , italic_θ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT , italic_ϕ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT ), where r𝑟r\in\mathbb{R}italic_r ∈ blackboard_R represents the distance, and θB[π,π],ϕB[0,π2]formulae-sequencesubscript𝜃𝐵𝜋𝜋subscriptitalic-ϕ𝐵0𝜋2\theta_{B}\in[-\pi,\pi],\phi_{B}\in[0,\frac{\pi}{2}]italic_θ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT ∈ [ - italic_π , italic_π ] , italic_ϕ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT ∈ [ 0 , divide start_ARG italic_π end_ARG start_ARG 2 end_ARG ] represent rotation around the z𝑧zitalic_z-axis and rotation around the x𝑥xitalic_x-axis respectively.

For the position of the camera C𝐶Citalic_C, the angle θBsubscript𝜃𝐵\theta_{B}italic_θ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT is sampled from a uniform distribution on the range [π,0]𝜋0[-\pi,0][ - italic_π , 0 ], representing the area behind the flight deck. Similarly, ϕBsubscriptitalic-ϕ𝐵\phi_{B}italic_ϕ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT is uniformly sampled from the range [0,π3]0𝜋3[0,\frac{\pi}{3}][ 0 , divide start_ARG italic_π end_ARG start_ARG 3 end_ARG ], corresponding to the area above the flight deck. This choice is motivated by our desire to generate realistic camera orientations that resemble those of a horizontally fixed camera during UAV flight. We initially experimented with increasing the maximum range of ϕBsubscriptitalic-ϕ𝐵\phi_{B}italic_ϕ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT closer to π2𝜋2\frac{\pi}{2}divide start_ARG italic_π end_ARG start_ARG 2 end_ARG, but found that this led to a number of unrealistic camera orientations. To mitigate this, we decided to set the maximum range for ϕBsubscriptitalic-ϕ𝐵\phi_{B}italic_ϕ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT at π3𝜋3\frac{\pi}{3}divide start_ARG italic_π end_ARG start_ARG 3 end_ARG. This decision has proven effective in maintaining the realism of our generated camera orientations. The distance r𝑟ritalic_r is sampled from a truncated normal distribution, which is a one-dimensional Gaussian distribution truncated to the range of [0,L]0𝐿[0,L][ 0 , italic_L ] and rescaled accordingly. Here, the maximum range L𝐿Litalic_L is chosen as 25m25𝑚25m25 italic_m, and the mean and the standard deviation of the distribution are 1111 and 40404040, respectively. Next, the distribution for the target point F𝐹Fitalic_F is sampled in a similar manner. The range for each of θBsubscript𝜃𝐵\theta_{B}italic_θ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT, ϕBsubscriptitalic-ϕ𝐵\phi_{B}italic_ϕ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT, and r𝑟ritalic_r is [0,2π]02𝜋[0,2\pi][ 0 , 2 italic_π ], [π2,π2]𝜋2𝜋2[-\frac{\pi}{2},\frac{\pi}{2}][ - divide start_ARG italic_π end_ARG start_ARG 2 end_ARG , divide start_ARG italic_π end_ARG start_ARG 2 end_ARG ], and [0,15]015[0,15][ 0 , 15 ], respectively. The mean and the standard deviation for the truncated normal distribution are 00 and 1111. The resulting set of sampled C𝐶Citalic_C and F𝐹Fitalic_F is presented at Figure 6.

Refer to captionF𝐹Fitalic_Fr𝑟ritalic_rθBsubscript𝜃𝐵\theta_{B}italic_θ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPTxBsubscript𝑥𝐵x_{B}italic_x start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPTC𝐶Citalic_Cψ𝜓\psiitalic_ψyCsubscript𝑦𝐶y_{C}italic_y start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPTxCsubscript𝑥𝐶x_{C}italic_x start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPTzCsubscript𝑧𝐶z_{C}italic_z start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPTϕBsubscriptitalic-ϕ𝐵\phi_{B}italic_ϕ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPTB𝐵Bitalic_ByBsubscript𝑦𝐵y_{B}italic_y start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPTzBsubscript𝑧𝐵z_{B}italic_z start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT
Figure 5: The base frame and the camera frame

3.2 Attitude of Camera

Once the points C𝐶Citalic_C and F𝐹Fitalic_F are sampled, the camera attitude is determined as follows. Let the camera frame be defined such that its origin coincides with C𝐶Citalic_C, and the positive x-axis points to the right (along the direction of the image width), the positive y-axis points upward (along the direction of the image height), and the positive z-axis is opposite to the line of sight, as illustrated in Figure 5. The orientation of the camera frame with respect to the flight deck frame is specified by a rotation matrix R𝖲𝖮(𝟥)={R3×3|RTR=I,det[R]=1}𝑅𝖲𝖮3conditional-set𝑅superscript33formulae-sequencesuperscript𝑅𝑇𝑅𝐼detdelimited-[]𝑅1R\in\mathsf{SO(3)}=\{R\in\mathbb{R}^{3\times 3}\,|\,R^{T}R=I,\;\mathrm{det}[R]% =1\}italic_R ∈ sansserif_SO ( sansserif_3 ) = { italic_R ∈ blackboard_R start_POSTSUPERSCRIPT 3 × 3 end_POSTSUPERSCRIPT | italic_R start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_R = italic_I , roman_det [ italic_R ] = 1 }.

The points of C𝐶Citalic_C and F𝐹Fitalic_F are sampled as described above, and they are reordered randomly to create a set of pairs of (C,F)𝐶𝐹(C,F)( italic_C , italic_F ). Each pair defines the line of sight of the camera as \vvCF\vv𝐶𝐹\vv{CF}italic_C italic_F. We choose the corresponding rotation matrix R𝖲𝖮(𝟥)superscript𝑅𝖲𝖮3R^{\prime}\in\mathsf{SO(3)}italic_R start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ sansserif_SO ( sansserif_3 ) such that its third axis is opposite to \vvCF\vv𝐶𝐹\vv{CF}italic_C italic_F and its x𝑥xitalic_x-axis is level to the ground. More specifically, Rsuperscript𝑅R^{\prime}italic_R start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT is given by

R=[r1,r2,r3],superscript𝑅subscriptsuperscript𝑟1subscriptsuperscript𝑟2subscriptsuperscript𝑟3\displaystyle R^{\prime}=[r^{\prime}_{1},r^{\prime}_{2},r^{\prime}_{3}],italic_R start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = [ italic_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ] ,

where the unit-vectors r1,r2,r33subscriptsuperscript𝑟1subscriptsuperscript𝑟2subscriptsuperscript𝑟3superscript3r^{\prime}_{1},r^{\prime}_{2},r^{\prime}_{3}\in\mathbb{R}^{3}italic_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT are obtained by

r3=\vvCF\vvCF,r1=e3×r3e3×r3,r2=r3×r1formulae-sequencesubscriptsuperscript𝑟3\vv𝐶𝐹norm\vv𝐶𝐹formulae-sequencesubscriptsuperscript𝑟1subscript𝑒3subscriptsuperscript𝑟3normsubscript𝑒3subscriptsuperscript𝑟3subscriptsuperscript𝑟2subscriptsuperscript𝑟3subscriptsuperscript𝑟1\displaystyle r^{\prime}_{3}=-\frac{\vv{{CF}}}{\|\vv{CF}\|},\quad r^{\prime}_{% 1}=\frac{e_{3}\times r^{\prime}_{3}}{\|e_{3}\times r^{\prime}_{3}\|},\quad r^{% \prime}_{2}=r^{\prime}_{3}\times r^{\prime}_{1}italic_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT = - divide start_ARG italic_C italic_F end_ARG start_ARG ∥ italic_C italic_F ∥ end_ARG , italic_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = divide start_ARG italic_e start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT × italic_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_ARG start_ARG ∥ italic_e start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT × italic_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ∥ end_ARG , italic_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = italic_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT × italic_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT

with e3=[0,0,1]3subscript𝑒3001superscript3e_{3}=[0,0,1]\in\mathbb{R}^{3}italic_e start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT = [ 0 , 0 , 1 ] ∈ blackboard_R start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT. Then, the camera attitude is obtained by rotating Rsuperscript𝑅R^{\prime}italic_R start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT about the z𝑧zitalic_z-axis by an angle ψ𝜓\psiitalic_ψ, which is sampled from the uniform distribution in the range [π6,π6]𝜋6𝜋6[-\frac{\pi}{6},\frac{\pi}{6}][ - divide start_ARG italic_π end_ARG start_ARG 6 end_ARG , divide start_ARG italic_π end_ARG start_ARG 6 end_ARG ], i.e., R=Rexp(ψe^3)𝑅superscript𝑅𝜓subscript^𝑒3R=R^{\prime}\exp(\psi\hat{e}_{3})italic_R = italic_R start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT roman_exp ( start_ARG italic_ψ over^ start_ARG italic_e end_ARG start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_ARG ), where the hat map ^:3×3×3:^superscript3superscript33\hat{\cdot}:\mathbb{R}^{3}\times\mathbb{R}^{3\times 3}over^ start_ARG ⋅ end_ARG : blackboard_R start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT × blackboard_R start_POSTSUPERSCRIPT 3 × 3 end_POSTSUPERSCRIPT is defined such that x^y=x×y^𝑥𝑦𝑥𝑦\hat{x}y=x\times yover^ start_ARG italic_x end_ARG italic_y = italic_x × italic_y and x^=x^T^𝑥superscript^𝑥𝑇\hat{x}=-\hat{x}^{T}over^ start_ARG italic_x end_ARG = - over^ start_ARG italic_x end_ARG start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT for any x,y3𝑥𝑦superscript3x,y\in\mathbb{R}^{3}italic_x , italic_y ∈ blackboard_R start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT. The angle ψ𝜓\psiitalic_ψ represents the possible rolling motion of the camera.

Refer to caption
Figure 6: Sampled points for the camera location C𝐶Citalic_C (blue) and the points that the camera is pointed toward F𝐹Fitalic_F (red).

3.3 Generated Synthetic Images

The selected camera pose is transferred to the rendering software, Blender to generate the synthetic images with varying textures, wave patterns, and lightening conditions. Each image is paired with the correct pose (C,R)𝐶𝑅(C,R)( italic_C , italic_R ) to be used for training and verification. The selected samples from the synthetic images are illustrated in Figure 7.

Refer to caption
Figure 7: Generated synthetic images with the presence of sea horizon

In the virtual environment, the sea is modeled as an infinitely large plane, leading to the prominence of the sea horizon (the line where the sea meets the sky) in the background of most synthetic images. However, in real-world scenarios, the visibility of the sea horizon can be obstructed by various objects or geographical features. For instance, when the ship is in a bay, the shoreline or the land may alter the view of the horizon. As such, a dataset with infinitely extending sea may not represent the real-world scenario properly. We have observed that a network trained over such dataset may exhibit a bias especially in estimating the height of the camera.

To address this, we have opted to generate a distinct synthetic image dataset without the horizon. This is achieved by substituting the ship’s background with a randomly generated image. The random image is produced using the random image generator as proposed by [17]. Then, we mix the above dataset with the synthetic images without any horizon. This approach ensures a more diverse and realistic representation in our synthetic image dataset, while capturing the importation information provided by a horizon.

Refer to caption
Figure 8: Generated synthetic images without sea horizon by replacing background with random image

3.4 Decomposition of the Ship into Multiple Parts and Keypoint Selection

The above images constitute the input to the neural network formulated in the next section. The desired outputs, or the target data correspond to the pixel location in the image for the 3D keypoints defining a pre-defined bounding box of the ship. Then, the camera pose can be estimated by the resulting correspondence between the 2D image coordinate and the 3D location of the keypoints.

However, defining the entire ship as a single object is not desirable, as the visibility might be obscured depending on the position and the orientation of the camera. For example, during the landing, the keypoints at the stern of the ship may not be visible. To address this issue, the ship is decomposed into multiple parts for detection, rather than being detected as a whole. This method also helps mitigate the impact of certain camera orientations and specific lighting conditions that might result in only partial visibility of certain areas of the ship, potentially affecting the accuracy of the model’s estimations. Specifically, the ship is decomposed into six parts: the whole ship, stern, superstructure, and three parts of the flight structure immediately forward of the flight deck, referred to as the dog house,, as illustrated at Figure 9.

Refer to caption
Figure 9: Selected specific components of the ship and 3D keypoints

For each of these six parts, we define a 3D bounding box by specifying eight corners following [7]. Each of twelve edges of the bounding box is divided into three sections by adding two intermediate points. As such, a bounding box is defined by thirty two keypoints. The location of each keypoint is calculated by the perspective projection from the camera model. In short, the target for each synthetic image is the 2D location of thirty two keypoints for each of six parts of the ship.

4 Relative 6D Object Pose Estimation

In this section, we present a deep neural network model for pose estimation and a Bayesian fusion to integrate the estimate pose relative to multiple parts of the ship. For the neural network architecture, we employ a Transformer Neural Network (TNN) as presented in [7]. In this method, we estimate the 2D keypoints associated with each object and the object class confidence of the RGB image. Then, we recover the corresponding relative 6D poses of each object by solving the 2D to 3D correspondence of the keypoints using Efficient Point-n-Perspective (EPnP) [9]. Finally, we integrate the resulting pose estimations using Bayesian fusion to obtain the most probable pose estimate.

4.1 Transformer Neural Network for Keypoints Estimation

Refer to captionRefer to captionCNNEPnPEPnPEPnPEPnPBackbonePositional EncodingObject QueriesPrediction HeadsFeatures transformer
decoder
(6 layers) FFNFFNFFNFFNFFNo2subscript𝑜2{}_{o_{2}}start_FLOATSUBSCRIPT italic_o start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_FLOATSUBSCRIPTR2,t2subscript𝑅2subscript𝑡2{}_{R_{2},\ t_{2}}start_FLOATSUBSCRIPT italic_R start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_FLOATSUBSCRIPTo1subscript𝑜1{}_{o_{1}}start_FLOATSUBSCRIPT italic_o start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_FLOATSUBSCRIPTR1,t1subscript𝑅1subscript𝑡1{}_{R_{1},\ t_{1}}start_FLOATSUBSCRIPT italic_R start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_FLOATSUBSCRIPTo4subscript𝑜4{}_{o_{4}}start_FLOATSUBSCRIPT italic_o start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT end_FLOATSUBSCRIPTR4,t4subscript𝑅4subscript𝑡4{}_{R_{4},\ t_{4}}start_FLOATSUBSCRIPT italic_R start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT end_FLOATSUBSCRIPTo3subscript𝑜3{}_{o_{3}}start_FLOATSUBSCRIPT italic_o start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_FLOATSUBSCRIPTR3,t3subscript𝑅3subscript𝑡3{}_{R_{3},\ t_{3}}start_FLOATSUBSCRIPT italic_R start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_FLOATSUBSCRIPT BF (Bayesian Fusion) R,t𝑅𝑡R,\ titalic_R , italic_tclassprobabilitieskeypointsclassprobabilitieskeypointsclassprobabilitieskeypointsclassprobabilitieskeypoints\emptysetObject transformer
encoder
(6 layers)
Figure 10: TNN-MO model architecture to estimate 6D pose

The proposed transformer neural network architecture is described as follows. First, for a given RGB image, we extract image features using a CNN backbone. Specifically, the input RGB image has a size of H×W=480×640𝐻𝑊480640H\times W=480\times 640italic_H × italic_W = 480 × 640. We employ ResNet50 to generate a lower-resolution feature fC×H0×W0𝑓superscript𝐶subscript𝐻0subscript𝑊0f\in\mathbb{R}^{C\times H_{0}\times W_{0}}italic_f ∈ blackboard_R start_POSTSUPERSCRIPT italic_C × italic_H start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT × italic_W start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT, where C=2048𝐶2048C=2048italic_C = 2048 is the channel dimension and H0,W0subscript𝐻0subscript𝑊0H_{0},W_{0}italic_H start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_W start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT are defined as H32=15,W32=20formulae-sequence𝐻3215𝑊3220\frac{H}{32}=15,\frac{W}{32}=20divide start_ARG italic_H end_ARG start_ARG 32 end_ARG = 15 , divide start_ARG italic_W end_ARG start_ARG 32 end_ARG = 20 respectively, following the DETR [6]. Then, we use a 1×1111\times 11 × 1 convolution to reduce the feature dimension to d=256𝑑256d=256italic_d = 256. The extracted features, which are in the dimension of d×H0×W0𝑑subscript𝐻0subscript𝑊0d\times H_{0}\times W_{0}italic_d × italic_H start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT × italic_W start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, are flattened to a dimension of d×H0W0𝑑subscript𝐻0subscript𝑊0d\times H_{0}W_{0}italic_d × italic_H start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_W start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT. Subsequently, a fixed positional encoding is added to the features to generate input embeddings, which have the dimension of d×H0W0𝑑subscript𝐻0subscript𝑊0d\times H_{0}W_{0}italic_d × italic_H start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_W start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT. Then, each of these embeddings is used as an input for the transformer encoder [18].

The transformer encoder has six standard encoder layers with skip connections. The output from the encoder is passed to the decoder module which also has six standard decoder layers with skip connections, accompanied by Q𝑄Qitalic_Q object queries [7]. These object queries are a set of learnable positional embeddings with the dimension d×Q𝑑𝑄d\times Qitalic_d × italic_Q, representing the position of each object class. These vectors are learned during the training process of the model, to help the model understand and represent the positional relationships between different objects in the data. Let N𝑁Nitalic_N be the cardinality of the object classes including the case when no object is detected, which is denoted by \varnothing. Since we are considering six parts of the ship as objects which defined in Figure 9, we have N=7𝑁7N=7italic_N = 7. In the presented TNN-MO model, we set Q=N𝑄𝑁Q=Nitalic_Q = italic_N, implying that the number of object queries is equal to the object classes.

The resulting decoder output embeddings are processed with N𝑁Nitalic_N feed-forward networks (FFNs). Each FFN is composed of two parts: linear layers with an input size of 256 and an output size of N𝑁Nitalic_N for class prediction, and a standard three-layer perceptron that has an input size of 256, a hidden layer dimension of 256 with the ReLU activation, and an output size of 64 for keypoint prediction.

These outputs of the i𝑖iitalic_i-th FFN, corresponding to the i𝑖iitalic_i-th object query, are denoted by the class predition c¯iNsubscript¯𝑐𝑖superscript𝑁\bar{c}_{i}\in\mathbb{R}^{N}over¯ start_ARG italic_c end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT and the keypoints K¯i32×2subscript¯𝐾𝑖superscript322\bar{K}_{i}\in\mathbb{R}^{32\times 2}over¯ start_ARG italic_K end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 32 × 2 end_POSTSUPERSCRIPT, respectively. Here, the j𝑗jitalic_j-th variable of c¯isubscript¯𝑐𝑖{\bar{c}}_{i}over¯ start_ARG italic_c end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, namely c¯ijsubscript¯𝑐𝑖𝑗\bar{c}_{ij}over¯ start_ARG italic_c end_ARG start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT corresponds to the class logit (unnormalized score) that the keypoint output of the i𝑖iitalic_i-th FFN, K¯isubscript¯𝐾𝑖\bar{K}_{i}over¯ start_ARG italic_K end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT belongs to the j𝑗jitalic_j-th object class. In other words, the higher the value of c¯ijsubscript¯𝑐𝑖𝑗\bar{c}_{ij}over¯ start_ARG italic_c end_ARG start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT is, the more likely it is that the keypoints K¯isubscript¯𝐾𝑖{\bar{K}}_{i}over¯ start_ARG italic_K end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT are of the j𝑗jitalic_j-th object. These logits are passed through a softmax function to obtain the predicted class probability p¯i[0,1]Nsubscript¯𝑝𝑖superscript01𝑁\bar{p}_{i}\in[0,1]^{N}over¯ start_ARG italic_p end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ [ 0 , 1 ] start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT of the i𝑖iitalic_i-th FFN, satisfying j=1Np¯ij=1superscriptsubscript𝑗1𝑁subscript¯𝑝𝑖𝑗1\sum_{j=1}^{N}\bar{p}_{ij}=1∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT over¯ start_ARG italic_p end_ARG start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT = 1. These are concatenated into a stochastic matrix P¯=[p¯1T,p¯2T,,p¯NT]T[0,1]N×N¯𝑃superscriptsuperscriptsubscript¯𝑝1𝑇superscriptsubscript¯𝑝2𝑇superscriptsubscript¯𝑝𝑁𝑇𝑇superscript01𝑁𝑁\bar{P}=[\bar{p}_{1}^{T},\bar{p}_{2}^{T},\cdots,\bar{p}_{N}^{T}]^{T}\in[0,1]^{% N\times N}over¯ start_ARG italic_P end_ARG = [ over¯ start_ARG italic_p end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT , over¯ start_ARG italic_p end_ARG start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT , ⋯ , over¯ start_ARG italic_p end_ARG start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ] start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ∈ [ 0 , 1 ] start_POSTSUPERSCRIPT italic_N × italic_N end_POSTSUPERSCRIPT, where p¯ijsubscript¯𝑝𝑖𝑗\bar{p}_{ij}over¯ start_ARG italic_p end_ARG start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT denotes the predicted probability that the i𝑖iitalic_i-th set of keypoints K¯isubscript¯𝐾𝑖\bar{K}_{i}over¯ start_ARG italic_K end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT belongs to the j𝑗jitalic_j-th object class.

4.2 Ground Truth Data

In the training input image, the ground truth object class labels are represented by 𝐜g=[cg1,cg2,,cgN]{0,1}Nsubscript𝐜𝑔subscript𝑐subscript𝑔1subscript𝑐subscript𝑔2subscript𝑐subscript𝑔𝑁superscript01𝑁\mathbf{c}_{g}=[c_{g_{1}},c_{g_{2}},\ldots,c_{g_{N}}]\in\{0,1\}^{N}bold_c start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT = [ italic_c start_POSTSUBSCRIPT italic_g start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_c start_POSTSUBSCRIPT italic_g start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT , … , italic_c start_POSTSUBSCRIPT italic_g start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT end_POSTSUBSCRIPT ] ∈ { 0 , 1 } start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT, where cgjsubscript𝑐subscript𝑔𝑗c_{g_{j}}italic_c start_POSTSUBSCRIPT italic_g start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT is a binary variable indicating the presence (or the absence) of the j𝑗jitalic_j-th object. Specifically, if the j𝑗jitalic_j-th object appears in the training input image, we set cgj=1subscript𝑐subscript𝑔𝑗1c_{g_{j}}=1italic_c start_POSTSUBSCRIPT italic_g start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT = 1; otherwise, we set cgj=0subscript𝑐subscript𝑔𝑗0c_{g_{j}}=0italic_c start_POSTSUBSCRIPT italic_g start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT = 0. The N𝑁Nitalic_N-th object corresponds to the \varnothing object, and therefore, cgN=0subscript𝑐subscript𝑔𝑁0c_{g_{N}}=0italic_c start_POSTSUBSCRIPT italic_g start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT end_POSTSUBSCRIPT = 0 always for any image in the training data set. It is important to note that since we are considering a single ship, there is no repetition of objects in the image. For example, following the object classes defined in Figure 9, when the whole ship (5th) and the ship stern (6th) are visible in a specific image, the resulting ground truth object class labels is 𝐜g=[0,0,0,0,1,1,0]subscript𝐜𝑔0000110\mathbf{c}_{g}=[0,0,0,0,1,1,0]bold_c start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT = [ 0 , 0 , 0 , 0 , 1 , 1 , 0 ].

For any i𝑖iitalic_i with cgi0subscript𝑐subscript𝑔𝑖0c_{g_{i}}\neq 0italic_c start_POSTSUBSCRIPT italic_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT ≠ 0, the corresponding location of the keypoints in the image is denoted by Ki32×2subscript𝐾𝑖superscript322K_{i}\in\mathbb{R}^{32\times 2}italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 32 × 2 end_POSTSUPERSCRIPT, and the location of the keypoints in the ship-fixed frame is denoted by q32×3𝑞superscript323q\in\mathbb{R}^{32\times 3}italic_q ∈ blackboard_R start_POSTSUPERSCRIPT 32 × 3 end_POSTSUPERSCRIPT.

4.3 Loss Function for Training

For the given labeled data and the predicted output, the loss function for training is formulated as follows. Our approach to loss computation is inspired by the set prediction task, similar to the methods used in DETR [6] and YOLOPose [7]. It is composed of two steps of identifying the matching pairs and calculating the loss.

First, we perform the matching between the predicted class probability outputs and the ground truth using bipartite matching [6]. Let 𝔖Nsubscript𝔖𝑁\mathfrak{S}_{N}fraktur_S start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT be the set of all possible permutations of {1,,N}1𝑁\{1,\ldots,N\}{ 1 , … , italic_N }. For any sequence σ𝔖𝜎𝔖\sigma\in\mathfrak{S}italic_σ ∈ fraktur_S, the matching cost is defined by

match(σ)=i=1Ncgip¯σ(i)isubscript𝑚𝑎𝑡𝑐𝜎superscriptsubscript𝑖1𝑁subscript𝑐subscript𝑔𝑖subscript¯𝑝𝜎𝑖𝑖\mathcal{L}_{match}(\sigma)=\sum_{i=1}^{N}-c_{g_{i}}\bar{p}_{\sigma(i)i}caligraphic_L start_POSTSUBSCRIPT italic_m italic_a italic_t italic_c italic_h end_POSTSUBSCRIPT ( italic_σ ) = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT - italic_c start_POSTSUBSCRIPT italic_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT over¯ start_ARG italic_p end_ARG start_POSTSUBSCRIPT italic_σ ( italic_i ) italic_i end_POSTSUBSCRIPT (1)

where σ(i)𝜎𝑖\sigma(i)italic_σ ( italic_i ) gives the index of the i𝑖iitalic_i-th element in the permuted sequence. As such, this represents the sum of the negative probability that K¯σ(i)subscript¯𝐾𝜎𝑖\bar{K}_{\sigma(i)}over¯ start_ARG italic_K end_ARG start_POSTSUBSCRIPT italic_σ ( italic_i ) end_POSTSUBSCRIPT belongs to the i𝑖iitalic_i-th object class for all objects present in the training image satisfying cgi0subscript𝑐subscript𝑔𝑖0c_{g_{i}}\neq 0italic_c start_POSTSUBSCRIPT italic_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT ≠ 0. The bipartite matching is to identify the optimal sequence that minimizes the matching cost:

σ¯=argminσ𝔖Nmatch(σ),¯𝜎subscriptargmin𝜎subscript𝔖𝑁subscript𝑚𝑎𝑡𝑐𝜎\bar{\sigma}=\operatorname*{arg\,min}_{\sigma\in\mathfrak{S}_{N}}\mathcal{L}_{% match}(\sigma),over¯ start_ARG italic_σ end_ARG = start_OPERATOR roman_arg roman_min end_OPERATOR start_POSTSUBSCRIPT italic_σ ∈ fraktur_S start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT italic_m italic_a italic_t italic_c italic_h end_POSTSUBSCRIPT ( italic_σ ) , (2)

and it is addressed by the Hungarian method [19].

Next, after the matching pairs are identified, we formulate the keypoint loss as the sum of the negative log-likelihood for the object class prediction, and the error in the prediction of the keypoints, as given by

keypoints=i=1N[logp¯σ¯(i)i+γcgiKiK¯σ¯(i)1],subscript𝑘𝑒𝑦𝑝𝑜𝑖𝑛𝑡𝑠superscriptsubscript𝑖1𝑁delimited-[]subscript¯𝑝¯𝜎𝑖𝑖𝛾subscript𝑐subscript𝑔𝑖subscriptnormsubscript𝐾𝑖subscript¯𝐾¯𝜎𝑖1\mathcal{L}_{keypoints}=\sum_{i=1}^{N}\left[-\log\bar{p}_{\bar{\sigma}(i)i}+% \gamma c_{g_{i}}\norm{K_{i}-\bar{K}_{\bar{\sigma}(i)}}_{1}\right],caligraphic_L start_POSTSUBSCRIPT italic_k italic_e italic_y italic_p italic_o italic_i italic_n italic_t italic_s end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT [ - roman_log over¯ start_ARG italic_p end_ARG start_POSTSUBSCRIPT over¯ start_ARG italic_σ end_ARG ( italic_i ) italic_i end_POSTSUBSCRIPT + italic_γ italic_c start_POSTSUBSCRIPT italic_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∥ start_ARG italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - over¯ start_ARG italic_K end_ARG start_POSTSUBSCRIPT over¯ start_ARG italic_σ end_ARG ( italic_i ) end_POSTSUBSCRIPT end_ARG ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ] , (3)

where γ>0𝛾0\gamma>0italic_γ > 0 is a hyperparameter for relative weighting. The parameters of the proposed deep neural network and the object queries are randomly chosen initially, and they are adjusted to minimize the above loss during training. The detailed implementation of the training is described in the next section.

4.4 Model Inference with Bayesian Fusion

During the inference phase, the proposed TNN returns the probability of the class prediction p¯i[0,1]Nsubscript¯𝑝𝑖superscript01𝑁\bar{p}_{i}\in[0,1]^{N}over¯ start_ARG italic_p end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ [ 0 , 1 ] start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT and the predicted keypoints K¯i32×2subscript¯𝐾𝑖superscript322\bar{K}_{i}\in\mathbb{R}^{32\times 2}over¯ start_ARG italic_K end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 32 × 2 end_POSTSUPERSCRIPT for each object query i{1,,N}𝑖1𝑁i\in\{1,\ldots,N\}italic_i ∈ { 1 , … , italic_N }. The object class corresponding to the i𝑖iitalic_i-th object query, namely σ(i){1,,N}superscript𝜎𝑖1𝑁\sigma^{*}(i)\in\{1,\ldots,N\}italic_σ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_i ) ∈ { 1 , … , italic_N } is identified by

σ(i)=argmax1jN{p¯i1,p¯iN}superscript𝜎𝑖subscriptargmax1𝑗𝑁subscript¯𝑝𝑖1subscript¯𝑝𝑖𝑁\displaystyle\sigma^{*}(i)=\operatorname*{arg\,max}_{1\leq j\leq N}\{\bar{p}_{% i1},\ldots\bar{p}_{iN}\}italic_σ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_i ) = start_OPERATOR roman_arg roman_max end_OPERATOR start_POSTSUBSCRIPT 1 ≤ italic_j ≤ italic_N end_POSTSUBSCRIPT { over¯ start_ARG italic_p end_ARG start_POSTSUBSCRIPT italic_i 1 end_POSTSUBSCRIPT , … over¯ start_ARG italic_p end_ARG start_POSTSUBSCRIPT italic_i italic_N end_POSTSUBSCRIPT }

and the resulting maximum probability is denoted by oi=p¯iσ(i)[0,1]subscript𝑜𝑖subscript¯𝑝𝑖superscript𝜎𝑖01o_{i}=\bar{p}_{i\sigma^{*}(i)}\in[0,1]italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = over¯ start_ARG italic_p end_ARG start_POSTSUBSCRIPT italic_i italic_σ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_i ) end_POSTSUBSCRIPT ∈ [ 0 , 1 ], which is considered as the confidence in the object classification.

Together with the 3D location of the keypoints (or the bounding box) namely qσ(i)32×3subscript𝑞superscript𝜎𝑖superscript323q_{\sigma^{*}(i)}\in\mathbb{R}^{32\times 3}italic_q start_POSTSUBSCRIPT italic_σ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_i ) end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 32 × 3 end_POSTSUPERSCRIPT in the ground truth, the pair (K¯i,qσ(i))subscript¯𝐾𝑖subscript𝑞superscript𝜎𝑖(\bar{K}_{i},q_{\sigma^{*}(i)})( over¯ start_ARG italic_K end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_q start_POSTSUBSCRIPT italic_σ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_i ) end_POSTSUBSCRIPT ) provides a 2D-to-3D correspondence for 32 points on the σ(i)superscript𝜎𝑖\sigma^{*}(i)italic_σ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_i )-th bounding box, which can be used to estimate the camera pose with respect to the ship-fixed frame. Specifically, we use the EPnP algoroithm [9] in conjunction with RANSAC to estimate the pose (Ri,ti)𝖲𝖮(𝟥)×3subscript𝑅𝑖subscript𝑡𝑖𝖲𝖮3superscript3(R_{i},t_{i})\in\mathsf{SO(3)}\times\mathbb{R}^{3}( italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ∈ sansserif_SO ( sansserif_3 ) × blackboard_R start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT from the i𝑖iitalic_i-th object.

As such, we obtain at most N1𝑁1N-1italic_N - 1 pairs of (Ri,ti)subscript𝑅𝑖subscript𝑡𝑖(R_{i},t_{i})( italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) from the object classes defined in Section 3.4, excluding the predictions corresponding to the no object class. Each pose estimate has a varying degree of accuracy and confidence, as one object may be captured clearly than others depending on the perspective of the camera and the lighting condition. Or, it is possible that certain objects are occluded or outside of the field of view.

To address this, the multiple pose estimates are integrated as follows. It has been empirically observed that the class confidence of the object, namely oi=p¯iσ(i)subscript𝑜𝑖subscript¯𝑝𝑖superscript𝜎𝑖o_{i}=\bar{p}_{i\sigma^{*}(i)}italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = over¯ start_ARG italic_p end_ARG start_POSTSUBSCRIPT italic_i italic_σ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_i ) end_POSTSUBSCRIPT is closely related to the accuracy of the pose estimate [20]. To eliminate outliers, the estimated pose is discarded if the object class confidence is less than or equal to 0.9, i.e., oi0.9subscript𝑜𝑖0.9o_{i}\leq 0.9italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ≤ 0.9. Let NinliersN1subscript𝑁𝑖𝑛𝑙𝑖𝑒𝑟𝑠𝑁1N_{inliers}\leq N-1italic_N start_POSTSUBSCRIPT italic_i italic_n italic_l italic_i italic_e italic_r italic_s end_POSTSUBSCRIPT ≤ italic_N - 1 be the number of estimates remaining after removing outliers. Then, the class confidence {oi}1Ninlierssuperscriptsubscriptsubscript𝑜𝑖1subscript𝑁𝑖𝑛𝑙𝑖𝑒𝑟𝑠\{o_{i}\}_{1}^{N_{inliers}}{ italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_i italic_n italic_l italic_i italic_e italic_r italic_s end_POSTSUBSCRIPT end_POSTSUPERSCRIPT for the inliers are filtered with the softmax function to obtain the normalized weight wi[0,1]subscript𝑤𝑖01w_{i}\in[0,1]italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ [ 0 , 1 ] for the inliers, satisfying iwi=1subscript𝑖subscript𝑤𝑖1\sum_{i}w_{i}=1∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1. Each pose estimate (Ri,ti)subscript𝑅𝑖subscript𝑡𝑖(R_{i},t_{i})( italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) is paired with the corresponding weight wisubscript𝑤𝑖w_{i}italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, to be integrated into a single pose estimate.

For the position estimate, the mean μt3subscript𝜇𝑡superscript3\mu_{t}\in\mathbb{R}^{3}italic_μ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT and the covariance matrix Σt3×3subscriptΣ𝑡superscript33\Sigma_{t}\in\mathbb{R}^{3\times 3}roman_Σ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 3 × 3 end_POSTSUPERSCRIPT are obtained by the weighted sum as

μt=iwiti,Σt=iwi(tiμt)(tiμt)T,formulae-sequencesubscript𝜇𝑡subscript𝑖subscript𝑤𝑖subscript𝑡𝑖subscriptΣ𝑡subscript𝑖subscript𝑤𝑖subscript𝑡𝑖subscript𝜇𝑡superscriptsubscript𝑡𝑖subscript𝜇𝑡𝑇\displaystyle\mu_{t}=\sum_{i}w_{i}t_{i},\quad\Sigma_{t}=\sum_{i}w_{i}(t_{i}-% \mu_{t})(t_{i}-\mu_{t})^{T},italic_μ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , roman_Σ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_μ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ( italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_μ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT , (4)

which serve as the position estimate and the corresponding degree of confidence.

Next, for the attitude estimate, it has been shown that the following arithmetic mean E[R]3×3Edelimited-[]𝑅superscript33\mathrm{E}[R]\in\mathbb{R}^{3\times 3}roman_E [ italic_R ] ∈ blackboard_R start_POSTSUPERSCRIPT 3 × 3 end_POSTSUPERSCRIPT, or the first moment of the rotation matrix describes the mode and the degree of dispersion in 𝖲𝖮(𝟥)𝖲𝖮3\mathsf{SO(3)}sansserif_SO ( sansserif_3 ) [21].

E[R]Edelimited-[]𝑅\displaystyle\mathrm{E}[R]roman_E [ italic_R ] =jwjRj.absentsubscript𝑗subscript𝑤𝑗subscript𝑅𝑗\displaystyle=\sum_{j}w_{j}R_{j}.= ∑ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_R start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT . (5)

Note that while each Rjsubscript𝑅𝑗R_{j}italic_R start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT belongs to 𝖲𝖮(𝟥)𝖲𝖮3\mathsf{SO(3)}sansserif_SO ( sansserif_3 ), the weighted sum is not necessarily on 𝖲𝖮(𝟥)𝖲𝖮3\mathsf{SO(3)}sansserif_SO ( sansserif_3 ). Let E[R]=UD(V)TEdelimited-[]𝑅superscript𝑈superscript𝐷superscriptsuperscript𝑉𝑇\mathrm{E}[R]=U^{\prime}D^{\prime}(V^{\prime})^{T}roman_E [ italic_R ] = italic_U start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT italic_D start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_V start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT be the singular value decomposition of E[R]Edelimited-[]𝑅\mathrm{E}[R]roman_E [ italic_R ] with the orthogonal U,V3×3superscript𝑈superscript𝑉superscript33U^{\prime},V^{\prime}\in\mathbb{R}^{3\times 3}italic_U start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_V start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 3 × 3 end_POSTSUPERSCRIPT and the diagonal D3×3superscript𝐷superscript33D^{\prime}\in\mathbb{R}^{3\times 3}italic_D start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 3 × 3 end_POSTSUPERSCRIPT. The proper singular value decomposition of E[R]Edelimited-[]𝑅\mathrm{E}[R]roman_E [ italic_R ] is given by

E[R]=UDVT,Edelimited-[]𝑅𝑈𝐷superscript𝑉𝑇\displaystyle\mathrm{E}[R]=UDV^{T},roman_E [ italic_R ] = italic_U italic_D italic_V start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT , (6)

where the rotation matrices U,V𝖲𝖮(𝟥)𝑈𝑉𝖲𝖮3U,V\in\mathsf{SO(3)}italic_U , italic_V ∈ sansserif_SO ( sansserif_3 ) and the diagonal matrix D=diag[d1,d2,d3]3×3𝐷diagsubscript𝑑1subscript𝑑2subscript𝑑3superscript33D=\mathrm{diag}[d_{1},d_{2},d_{3}]\in\mathbb{R}^{3\times 3}italic_D = roman_diag [ italic_d start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_d start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ] ∈ blackboard_R start_POSTSUPERSCRIPT 3 × 3 end_POSTSUPERSCRIPT are defined as

U𝑈\displaystyle Uitalic_U =Udiag[1,1,det[U]],absentsuperscript𝑈diag11superscript𝑈\displaystyle=U^{\prime}\mathrm{diag}[1,1,\det[U^{\prime}]],= italic_U start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT roman_diag [ 1 , 1 , roman_det [ italic_U start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ] ] ,
D𝐷\displaystyle Ditalic_D =Ddiag[1,1,det[UV]],absentsuperscript𝐷diag11superscript𝑈superscript𝑉\displaystyle=D^{\prime}\mathrm{diag}[1,1,\det[U^{\prime}V^{\prime}]],= italic_D start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT roman_diag [ 1 , 1 , roman_det [ italic_U start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT italic_V start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ] ] ,
V𝑉\displaystyle Vitalic_V =Vdiag[1,1,det[V]].absentsuperscript𝑉diag11superscript𝑉\displaystyle=V^{\prime}\mathrm{diag}[1,1,\det[V^{\prime}]].= italic_V start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT roman_diag [ 1 , 1 , roman_det [ italic_V start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ] ] .

Given the value of E[R]Edelimited-[]𝑅\mathrm{E}[R]roman_E [ italic_R ] and its proper singular value decomposition, the maximum likelihood estimate of the attitude is given by μR=UVT𝖲𝖮(𝟥)subscript𝜇𝑅𝑈superscript𝑉𝑇𝖲𝖮3\mu_{R}=UV^{T}\in\mathsf{SO(3)}italic_μ start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT = italic_U italic_V start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ∈ sansserif_SO ( sansserif_3 ) [21]. The confidence in the attitude estimate can be measured by the diagonal elements of D𝐷Ditalic_D satisfying 1d1d2|d3|01subscript𝑑1subscript𝑑2subscript𝑑301\geq d_{1}\geq d_{2}\geq|d_{3}|\geq 01 ≥ italic_d start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ≥ italic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ≥ | italic_d start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT | ≥ 0. Roughly speaking, as the diagonal elements become closer to one (resp. zero), the confidence level on the attitude estimate μRsubscript𝜇𝑅\mu_{R}italic_μ start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT increases (resp. decreases).

More specifically, the distribution of the estimated attitude can be described by the matrix Fisher distribution, which is the maximum entropy (or most arbitrary) distribution on 𝖲𝖮(𝟥)𝖲𝖮3\mathsf{SO(3)}sansserif_SO ( sansserif_3 ) when conditioned by the fixed value of E[R]Edelimited-[]𝑅\mathrm{E}[R]roman_E [ italic_R ]. From the matrix Fisher distribution, the degree of uncertainties in the attitude estimate can be quantified as follows. For any R𝖲𝖮(𝟥)𝑅𝖲𝖮3R\in\mathsf{SO(3)}italic_R ∈ sansserif_SO ( sansserif_3 ), let η3𝜂superscript3\eta\in\mathbb{R}^{3}italic_η ∈ blackboard_R start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT be defined such that R=Uexp(η^)VT𝑅𝑈^𝜂superscript𝑉𝑇R=U\exp(\hat{\eta})V^{T}italic_R = italic_U roman_exp ( start_ARG over^ start_ARG italic_η end_ARG end_ARG ) italic_V start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT. In other words, η𝜂\etaitalic_η represents the difference between R𝑅Ritalic_R and the estimated attitude μRsubscript𝜇𝑅\mu_{R}italic_μ start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT in the sense that R𝑅Ritalic_R can be obtained by rotating μRsubscript𝜇𝑅\mu_{R}italic_μ start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT about the axis Vη𝑉𝜂V\etaitalic_V italic_η resolved in the ship-fixed frame (or the axis Uη𝑈𝜂U\etaitalic_U italic_η when resolved in the estimated camera frame) by the angle ηnorm𝜂\|\eta\|∥ italic_η ∥. Thus, Vη𝑉𝜂V\etaitalic_V italic_η is the axis of rotation and ηnorm𝜂\|\eta\|∥ italic_η ∥ is the angle of rotation from μRsubscript𝜇𝑅\mu_{R}italic_μ start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT to R𝑅Ritalic_R.

There are two methods for uncertainty quantification. First, when the diagonal elements are identical, i.e., d1=d2=d3=dd_{1}=d_{2}=d_{3}\triangleq=ditalic_d start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = italic_d start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ≜ = italic_d, the probability that the angle of rotation is less that a specific value θ[0,π]𝜃0𝜋\theta\in[0,\pi]italic_θ ∈ [ 0 , italic_π ] can be computed by

[ηθ]=1π(I0(2s)I1(2s))0θexp(2scosρ)(1cosρ)𝑑ρ,delimited-[]norm𝜂𝜃1𝜋subscript𝐼02𝑠subscript𝐼12𝑠superscriptsubscript0𝜃2𝑠𝜌1𝜌differential-d𝜌\displaystyle\mathbb{P}[\|\eta\|\leq\theta]=\frac{1}{\pi(I_{0}(2s)-I_{1}(2s))}% \int_{0}^{\theta}\exp(2s\cos\rho)(1-\cos\rho)d\rho,blackboard_P [ ∥ italic_η ∥ ≤ italic_θ ] = divide start_ARG 1 end_ARG start_ARG italic_π ( italic_I start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( 2 italic_s ) - italic_I start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( 2 italic_s ) ) end_ARG ∫ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_θ end_POSTSUPERSCRIPT roman_exp ( start_ARG 2 italic_s roman_cos italic_ρ end_ARG ) ( 1 - roman_cos italic_ρ ) italic_d italic_ρ , (7)

where I0,I1subscript𝐼0subscript𝐼1I_{0},I_{1}italic_I start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT correspond to the modified Bessel function of the first kind, and s0𝑠0s\geq 0italic_s ≥ 0 is a scalar that can be obtained numerically from d𝑑ditalic_d [21]. The value of the above probability for varying d𝑑ditalic_d and θ𝜃\thetaitalic_θ is illustrated at Figure 11. For example, when d=0.999𝑑0.999d=0.999italic_d = 0.999, the estimation error is less than 5.07superscript5.075.07^{\circ}5.07 start_POSTSUPERSCRIPT ∘ end_POSTSUPERSCRIPT with the probability of 0.950.950.950.95.

Refer to caption

Figure 11: Degree of confidence in the attitude estimate

Second, when the estimated distribution is highly concentrated, or when d31subscript𝑑31d_{3}\rightarrow 1italic_d start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT → 1, the rotation vector η𝜂\etaitalic_η follows a Gaussian distribution 𝒩(0,Ση)𝒩0subscriptΣ𝜂\mathcal{N}(0,\Sigma_{\eta})caligraphic_N ( 0 , roman_Σ start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT ) with

Ση=diag[1+d1d2d3,1d1+d2d3,1d1d2+d3].subscriptΣ𝜂diag1subscript𝑑1subscript𝑑2subscript𝑑31subscript𝑑1subscript𝑑2subscript𝑑31subscript𝑑1subscript𝑑2subscript𝑑3\displaystyle\Sigma_{\eta}=\mathrm{diag}[1+d_{1}-d_{2}-d_{3},1-d_{1}+d_{2}-d_{% 3},1-d_{1}-d_{2}+d_{3}].roman_Σ start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT = roman_diag [ 1 + italic_d start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - italic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT - italic_d start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT , 1 - italic_d start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + italic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT - italic_d start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT , 1 - italic_d start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - italic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_d start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ] . (8)

As it is unlikely to encounter highly concentrated estimates where (8) holds in practice, the following empirical expression can be used instead.

Ση=iwiηiηiT,subscriptΣ𝜂subscript𝑖subscript𝑤𝑖subscript𝜂𝑖superscriptsubscript𝜂𝑖𝑇\displaystyle\Sigma_{\eta}=\sum_{i}w_{i}\eta_{i}\eta_{i}^{T},roman_Σ start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_η start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_η start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT , (9)

where ηi3subscript𝜂𝑖superscript3\eta_{i}\in\mathbb{R}^{3}italic_η start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT is obtained such that η^i=log(UTRiV)subscript^𝜂𝑖superscript𝑈𝑇subscript𝑅𝑖𝑉\hat{\eta}_{i}=\log(U^{T}R_{i}V)over^ start_ARG italic_η end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = roman_log ( start_ARG italic_U start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_V end_ARG ).

In short, the attitude estimate is given by μR=UVTsubscript𝜇𝑅𝑈superscript𝑉𝑇\mu_{R}=UV^{T}italic_μ start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT = italic_U italic_V start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT, and the confidence in the estimated attitude can be analyzed with (7), (8), or (9).

5 Model Training and Testing with Synthetic Data

The above transformer network and the Bayesian fusion constitute the proposed network architecture to estimate the 6D pose from monocular vision. In this section, we describe how the network is trained, including the implementation details and trained model is validation with the synthetic data.

5.1 Model Training with Synthetic Data

The model is trained as follows. The training dataset is composed of 435k synthetic images, including 332k images with a sea horizon background and 102k images with a random image background (without a sea horizon). Given that the synthetic dataset already included randomized situations, effectively covering a broad spectrum of variations and scenarios, image augmentation techniques were not utilized during the training process. The existing dataset was deemed to provide sufficient diversity and complexity for effective learning and adaptation of the model, rendering additional image augmentation unnecessary.

The TNN-MO model was implemented using the PyTorch framework. The model was trained for 350 epochs with the batch size of 48, hyperparameter γ=10𝛾10\gamma=10italic_γ = 10, leveraging the AdamW optimizer [22], using a 20GB Multi-Instance GPU (MIG) partition from NVIDIA A100-PCIE-40GB GPU. The learning rate was initially set to 104superscript10410^{-4}10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT for the first 200 epochs, after which it was reduced to 105superscript10510^{-5}10 start_POSTSUPERSCRIPT - 5 end_POSTSUPERSCRIPT. Gradient clip** was also employed, limiting the maximum gradient norm to 0.1. The resulting training curve is presented at Figure 12.

Refer to caption
Figure 12: Training loss over epochs

5.2 Model Testing with Synthetic Data

Refer to caption
Figure 13: TNN-MO model tested over synthetic data: for object classes with the confidence oj>0.9subscript𝑜𝑗0.9o_{j}>0.9italic_o start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT > 0.9, the key points and the base frame are illustrated
Refer to caption
Figure 14: TNN-SO model (top) vs TNN-MO model (bottom) tested over long range synthetic data: for TNN-MO model, object classes with the confidence oj>0.9subscript𝑜𝑗0.9o_{j}>0.9italic_o start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT > 0.9, the key points are illustrated in colors. Compared to the TNN-SO model, the TNN-MO model exhibits the grater accuracy greater accuracy over a longer range.

The trained model is validated with the synthetic test dataset that were not in the training dataset. For further comparison, we also incorporate the TNN model without the proposed multi-objects framework as developed in [8]. The resulting single object model predicts the relative pose with respect to the dog house structure. These are referred to as the TNN-SO Model and TNN-MO Model, respectively.

We compute the mean absolute error (MAE) for the position estimate and the attitude estimate over synthetic images under various lighting conditions and poses. The results are summarized at Table 1, where it is presented that TNN-MO model exhibits the position error of 0.204 mtimes0.204meter0.204\text{\,}\mathrm{m}start_ARG 0.204 end_ARG start_ARG times end_ARG start_ARG roman_m end_ARG and the attitude error of 0.91 °times0.91degree0.91\text{\,}\mathrm{\SIUnitSymbolDegree}start_ARG 0.91 end_ARG start_ARG times end_ARG start_ARG ° end_ARG. The position error is about 0.8% of the maximum range. While the direct comparison is not possible as they are distinct approaches based on different assumptions, these errors of the proposed approach are comparable or superior to the common visual-inertial odometry techniques [23]. When compared with the TNN-SO model, the proposed TNN-MO model demonstrates improved accuracy, even over long ranges. From Figure 14, we can see that the TNN-SO model is unable to detect the keypoints of the dog house when the camera is far from the ship. However, our TNN-MO model is capable of detecting keypoints of multiple objects and it is not vulnerable to the visibility of a single object. Figure 13 and Figure 14 illustrates the bounding boxes with a higher confidence level computed from the estimated pose, and it demonstrates that the estimated keypoints accurately aligned with the ground truth keypoints.

Table 1: Validation with Synthetic Dataset
TNN type No. Images Max range (L𝐿Litalic_L(m)) MAE Rot. Est. (deg) MAE Pos. Est. (m) MAE / L𝐿Litalic_L (%)
TNN-SO 4.2k 20 1.13 0.281 1.41
TNN-MO 5.5k 25 0.91 0.204 0.82

6 Validation with Flight Experiments

The TNN-MO model, trained on a synthetic dataset, was further tested using real images obtained during flight experiments on a USNA research vessel. The model was trained on a synthetic dataset, as collecting a large labeled dataset of a research vessel in ocean is infeasible. As such, it is critical to evaluate the model in the real world situation to access its capacity to handle the unique attributes of real images, such as such as variations in lighting conditions, object appearances, backgrounds, and other factors that may deviate from the synthetic dataset, referred to as the sim-to-real gap. This will offer important insights into its adaptability and resilience, confirming its efficacy beyond the synthetic training data and its suitability for real-world applications.

6.1 Data Collection System (DCS)

The hardware configuration to capture a validation dataset is as follows. A data collection system (DCS) that was developed for ship air wake measurements [10] is augmented with a camera. It is attached to an octocoper unmanned aerial vehicle that is manually controlled to fly around a USNA research vessel in Chesapeake bay, Maryland.

The detailed configuration of the DCS composed of a base module and a rover module is presented at Figure 15. Specifically, the DCS includes Inertial Measurement Units (IMUs) and Relative Time Kinematic (RTK) GPS, from which an extended Kalman filter is executed to estimate the pose in real time. The rover module of the DCS has been enhanced with an Alvium 1800 C-240 Global Shutter RGB camera and a Jetson Nano single-board computer, connected via an MIPI CSI-2 cable. This setup allows image capture at a rate of 5Hz, synchronized with the RTK GPS via a WIFI connection. This upgraded rover configuration enables synchronization and time-stam** of the images within the RTK GPS data collection process. It allows for the effective collection of real image data along with accurate relative camera 6D poses.

GPS radio transmitterGPS (Piksi Multi)GPS location,velocity, time(UTC)IMU (VN 100)attitude, angular rate,linear accelerationAnemometer(AT-type A)3-axis air velocity,air temperatureWi-Fi extenderHost computerUSB cablesWi-Fi antennaa) Base station configurationGPS radio receiverGPS (Piksi Multi) RTK position,velocity, time (UTC)IMU (VN 100)attitude,angular rate,linear accelerationTriSonica Mini Anemometers3-axis air velocity,air temperatureCustom designedPCBRaspberry pi 4Wi-Fi hot-spotRover vision moduleAlvium 1800 C-240 GS cameramax Res. 1936(H) x 1216 (W) max fps.128 GS (Global shutter)Wi-Fi antennaJetson Nanob) Rover configurationRover GPS/IMU moduleWi-Fi router
Figure 15: Data Collection System [8]
Refer to caption
Figure 16: Base (left) and camera fixed rover (right) setup used of the DCS

6.2 Model Testing with Real Data

The trained model is tested with the real world images captured by the DCS over multiple days. During the in-flight experiments conducted on the USNA research vessel, the model was subjected to real-world conditions where instances of variable lighting and occlusions were encountered. We deliberately selected challenging images for testing because they represent realistic conditions where such situations can occur. Figures 17 to 19 show the keypoints prediction and corresponding re-projected ships coordinate frame from estimated pose by the proposed TNN-MO model under the following three lighting conditions.

  • Overexposed ship: An overexposed ship is one where the image has captured too much light, causing the ship to appear excessively bright.

    This results in a loss of detail, especially in areas that have subtle color variations or textures. The ship’s features become hard to be distinguished because the intense light overwhelms the camera’s sensor, leading to a predominance of white or light areas, particularly on surfaces like the landing pad. It’s as if the ship is caught in a glare, with its details bleached by the brightness.

  • Underexposed ship: An underexposed ship is one where the image has not captured enough light, making the ship appear too dark. This can obscure details and make it challenging to distinguish features, especially in areas that are naturally shadowed or lack reflective surfaces. It is as if the ship is enveloped in shadows, with its details concealed in darkness.

  • Normal ship: A normal ship, in terms of exposure, is one where the lighting conditions are optimal for image analysis, resulting in a balanced image with clear visibility of details. The lighting is neither too intense nor too dim, providing an optimal level of brightness that allows all parts of the ship to be clearly distinguished.

Traditional feature extraction methods and visual marker-based approaches often struggle under challenging circumstances, as there is a lack of visual features on the ship. Furthermore, in this dynamic environment of real-world imaging, human operators and various items are present on the ship, which we can identify as occlusions. These occlusions, along with variable lighting, present significant challenges. However, as illustrated by Figures 17 to 19, the proposed model successfully identified the keypoints and the bounding boxes of multiple parts of the ship. Its robust performance under such conditions is a testament to its capability to provide accurate results, ensuring reliable 6D pose estimation even when faced with obstructions and fluctuating light levels.

Refer to caption
Figure 17: Estimated keypoints and bounding boxes for overexposed images: objects with the confidence oj>0.9subscript𝑜𝑗0.9o_{j}>0.9italic_o start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT > 0.9 are highlighted
Refer to caption
Figure 18: Estimated keypoints and bounding boxes for underexposed images: objects with the confidence oj>0.9subscript𝑜𝑗0.9o_{j}>0.9italic_o start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT > 0.9 are highlighted
Refer to caption
Figure 19: Estimated keypoints and bounding boxes for normal images: objects with the confidence oj>0.9subscript𝑜𝑗0.9o_{j}>0.9italic_o start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT > 0.9 are highlighted

6.3 Validation with RTK GPS

Further, for quantitative validation, the estimated pose is compared with the actual pose determined by the DCS. The DCS integrates the relative attitude, derived from the base rover’s IMUs, and the relative position from the RTK-GPS with an extended Kalman filter. The IMUs, such as the VN-100, provide a magnetic heading accuracy of 2.0superscript2.02.0^{\circ}2.0 start_POSTSUPERSCRIPT ∘ end_POSTSUPERSCRIPT, a gyro in-run bias stability of 5superscript55^{\circ}5 start_POSTSUPERSCRIPT ∘ end_POSTSUPERSCRIPT/hr, and a pitch/roll accuracy of 0.5superscript0.50.5^{\circ}0.5 start_POSTSUPERSCRIPT ∘ end_POSTSUPERSCRIPT under normal conditions. The RTK-GPS, such as the Piksi Multi GNSS, provides a horizontal position accuracy of 0.75 m in SBAS mode, a velocity accuracy of 0.03 m/s RMS, and a time accuracy of 60 ns RMS. In RTK mode, it provides an accuracy of 0.010 m horizontally and 0.015 m vertically. These systems provide a centimeter-level accuracy in nominal conditions for the DCS. Here, the measurements of the RTK-GPS are considered as ground truth. The trajectories estimated by the proposed TNN-MO model and the RTK-GPS under the above three illumination conditions are illustrated in Figures 23 to 25 with the estimation errors and 3σ3𝜎3\sigma3 italic_σ-bounds calculated by (4) and (9), where it is shown that the position and the attitude trajectory estimated by the TNN-MO model are consistent with the IMU and the RTK-GPS.

Table 2: Accuracy of 6D Pose Estimation for TNN-MO Model Under Variable Lighting Conditions
Image Type Max Range, L𝐿Litalic_L (m) MAE / σ𝜎\sigmaitalic_σ / d𝑑ditalic_d of Rot. (deg) MAE / σ𝜎\sigmaitalic_σ of Pos. (m) MAE/L𝐿Litalic_L (%)
Overexposed Ship 13.7 1.8 / 2.32 / 0.999 0.112 / 0.017 0.82
Underexposed Ship 13.5 1.1 / 1.83 / 0.999 0.089 / 0.019 0.66
Normal Ship 18.2 4.0 / 4.54 / 0.999 0.177 / 0.022 0.97

The mean absolute errors are also tabulated at Table 2. The MAE of position estimation ranges from 0.0890.0890.0890.089 to 0.1770.1770.1770.177 meters, which corresponds to 0.66%percent0.660.66\%0.66 % to 0.97%percent0.970.97\%0.97 % of the maximum range L𝐿Litalic_L. The MAE of rotation estimation varies from 1.11.11.11.1 to 4.04.04.04.0 degrees. While these errors are slightly larger than the validation error for the synthetic data as presented in Table 1, it is verified that the proposed TNN-MO successfully overcome the sim-to-real gap. The performance is consistent over the varying illumination conditions captured over multiple days. While the degree of attitude uncertainties measured by the standard deviation is in the reasonable range, the standard deviation for the position estimate is relatively small compared with the error, indicating over-confidence. This is partially because the uncertainties are measured indirectly in (4) by the variations of the estimate over multiple objects, not in the confidence of estimate for the individual object. However, in Figures 23 to 25, the uncertainties increase as the camera is further away from the ship, which is expected. These show that the presented uncertainty model behaves reasonably in the qualitative sense, but not necessarily accurate quantitatively. The exact modeling of uncertainties is considered as one of future directions. This provides promising results for pose estimate that can be utilized for autonomous launch and recovery in the ocean environments.

6.4 Multi-Objects Pose Estimation

The proposed TNN-MO model integrates the multiple pose estimates with respect to six object classes. To investigate the advantages of the multi-objects estimates, we perform an ablation study by studying the pose estimated with respect to the individual object class without fusion.

Refer to caption
(a) Dog house center
Refer to caption
(b) Dog house
Refer to caption
(c) House
Refer to caption
(d) Super structure
Refer to caption
(e) Whole ship
Refer to caption
(f) Ship stern
Figure 20: Pose estimated with respect to the individual object class without fusion for underexposed images

For the second case of the underexposed ship, the trajectory estimated by each object is presented at Figure 20. Upon examining Figures 20(d) to 20(f), it is evident that when the camera is closer to the ship, three object categories of the super structure, the whole ship, and the stern are not within the field-of-view, and the corresponding estimate is degraded. Conversely, for Figures 20(a) to 20(c), object pose estimations are accurate and reliable, as the objects remain visible throughout the flight.

Furthermore, the position estimation error with respect to each object is summarized at Table 3, where it is shown that the proposed multi-object approach yields smaller errors in mean and max values compared to the TNN-SO, illustrating the advantages of fusion.

Table 3: Position estimation error for TNN-MO and TNN-SO
Overexposed Ship Underexposed Ship Normal Ship
Pos. err. (m) Pos. err. (m) Pos. err. (m)
Mean Max Mean Max Mean Max
TNN-MO 0.112 0.485 0.089 0.937 0.177 0.666
TNN-SO (dog house center) 0.129 0.723 0.259 1.669 0.261 1.191
TNN-SO (dog house) 0.131 0.720 0.259 1.738 0.253 1.154
TNN-SO (house) 0.123 0.728 0.258 1.667 0.254 1.220
TNN-SO (super structure) 0.135 0.782 0.347 1.660 0.304 1.169
TNN-SO (whole ship) 0.151 0.682 0.369 1.704 0.355 1.129
TNN-SO (ship stern) 0.152 0.669 0.394 1.664 0.365 1.130

6.5 Attention Map

Refer to caption
Figure 21: Encoder self-attention for a selected set of reference points: The attention map by using 300 (H0×W0subscript𝐻0subscript𝑊0H_{0}\times W_{0}italic_H start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT × italic_W start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT) features from the convoluted input image. Each pixel in the image corresponds to one of these features. We are visualizing how a feature linked to a selected pixel attends to other features. By analyzing the self-attention map, we can see that the encoder is capable of identifying key elements such as the sea, sky, and ship within the image based on the selected points.
Refer to caption
Figure 22: Object keypoints are predicted by TNN-MO in the given real image (top), and the decoder cross-attention maps for the object queries corresponding to the predictions in the first row are shown (bottom): Here, we are visualizing how a encoder output attends to each queries related to the objects defined in Figure 9. However, the cross-attention related to 7777-th query, which is for the \varnothing object class, is not presented in the figure.

To gain insights into how the synthetically trained TNN-MO model performs on real-world images, we analyzed the visualizations of the attention maps for the encoder and the decoder. The self-attention mechanism in the transformer network enables the encoder to understand the context of image elements. This is crucial in object detection, where the relationship between different parts of the image can provide important information. Therefore, we analyzed the transformer attention maps by passing a real-world image into TNN-MO, as shown in Figure 21. We observed that the TNN-MO encoder’s self-attention mechanism allows the model to understand the context of each pixel in relation to the others. For example, the self-attention map for the pixel point on the ship in the image, attention weights gets higher in the similar area of the ship on the map.

Additionally, to gain insights into what the model is focusing on, we visualized the decoder cross-attention map for different queries, as shown in Figure 22. This can help us understand which parts of the image the model finds most relevant for each query, providing valuable insights into the model’s decision-making process. We observed that the area where the cross-attention weight is higher for each query lies in the ship area of the image. This means that our TNN-MO model is able to understand the relevant information coming from the encoder to identify each part of the ship.

7 Conclusions

This study introduces an innovative method for estimating the relative 6D pose of an autonomous UAV in relation to a ship, utilizing monocular RGB camera images. We developed a synthetic dataset incorporating varied textures, lighting conditions, and camera poses, and annotated them with 2D keypoints of ship components in a virtual environment. Subsequently, we present a network architecture based on the transformer network, trained to detect keypoints of multiple parts of the ship, from which the camera pose is estimated using the Efficient Perspective-n-Point (EPnP) algorithm. Then, the estimated 6D poses are filtered, where the object class confidence is greater than 0.9, and they are integrated with Bayesian fusion to provide reliable pose estimations. This process combines pose estimations from multiple objects, disregarding information from unreliable estimates and placing more reliance on credible ones. The trained model has been thoroughly tested and validated on both synthetic and real-world data. The resulting mean absolute errors for position are 0.8% and 1.0% of the maximum flight range, respectively.

Our proposed formulation, based on synthetic data, circumvents challenges associated with collecting a large set of labeled data in the real world, successfully overcoming the discrepancy between virtual and real environments. Furthermore, our proposed multi-object pose estimation and fusion enhance accuracy and robustness under various relative configurations, occlusions, and illumination conditions. These results underscore the suitability of our approach for improving the autonomy and safety of UAV operations, particularly for autonomous landing and takeoff on moving platforms.

Future work includes integrating the pose estimate of the proposed transformer network model with an Inertial Measurement Unit (IMU) for visual-inertial navigation and utilizing it for autonomous flight experiments in ocean environments.

Acknowledgments

This research was conducted on the US Naval Academy’s research vessel YP689, and we extend our gratitude to the US Naval Academy and the crew of YP689. The research has been partially supported by the USNA/NAVSUP (N0016123RC01EA5), NSF (CNS-1837382), AFOSR MURI (FA9550-23-1-0400), and ONR (N00014-23-1-2850). The authors express their gratitude for this support.

References

  • Robaglia et al. [2018] Robaglia, A. E., Libine, S., and Gamagedara, K., Autonomous Landing of an Unmanned Aerial Vehicle on a Moving Ship, 2018. 10.2514/6.2018-1461, URL https://arc.aiaa.org/doi/abs/10.2514/6.2018-1461.
  • Marut et al. [2019] Marut, A., Wojtowicz, K., and Falkowski, K., “ArUco markers pose estimation in UAV landing aid system,” 2019 IEEE 5th International Workshop on Metrology for AeroSpace (MetroAeroSpace), 2019, pp. 261–266. 10.1109/MetroAeroSpace.2019.8869572.
  • Gamagedara et al. [2024] Gamagedara, K., Lee, T., and Snyder, M., “Delayed Kalman filter for vision-based autonomous flight in ocean environments,” Control Engineering Practice, Vol. 143, 2024, p. 105791. https://doi.org/10.1016/j.conengprac.2023.105791, URL https://www.sciencedirect.com/science/article/pii/S096706612300360X.
  • Kang et al. [2020] Kang, J., Liu, W., Tu, W., and Yang, L., “YOLO-6D+: Single Shot 6D Pose Estimation Using Privileged Silhouette Information,” 2020 International Conference on Image Processing and Robotics (ICIP), 2020, pp. 1–6.
  • Wang et al. [2021] Wang, G., Manhardt, F., Tombari, F., and Ji, X., “GDR-Net: Geometry-Guided Direct Regression Network for Monocular 6D Object Pose Estimation,” CoRR, Vol. abs/2102.12145, 2021. URL https://arxiv.longhoe.net/abs/2102.12145.
  • Carion et al. [2020] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S., “End-to-End Object Detection with Transformers,” CoRR, Vol. abs/2005.12872, 2020. URL https://arxiv.longhoe.net/abs/2005.12872.
  • Amini et al. [2022] Amini, A., Periyasamy, A. S., and Behnke, S., “YOLOPose: Transformer-based Multi-Object 6D Pose Estimation using Keypoint Regression,” , 2022. 10.48550/ARXIV.2205.02536, URL https://arxiv.longhoe.net/abs/2205.02536.
  • Wickramasuriya et al. [2024] Wickramasuriya, M., Lee, T., and Snyder, M., “Deep Monocular Relative 6D Pose Estimation for Ship-Based Autonomous UAV,” AIAA SCITECH 2024 Forum, 2024, p. 2877.
  • Lepetit et al. [2009] Lepetit, V., Moreno-Noguer, F., and Fua, P., “EPnP: An accurate O(n) solution to the PnP problem,” International journal of computer vision, Vol. 81, 2009, pp. 155–166.
  • Gamagedara et al. [2019] Gamagedara, K., Lee, T., and Snyder, M. R., Real-time Kinematics GPS Based Telemetry System for Airborne Measurements of Ship Air Wake, 2019. 10.2514/6.2019-2377, URL https://arc.aiaa.org/doi/abs/10.2514/6.2019-2377.
  • Bostock et al. [2023] Bostock, N., Richez, A., Costello, D. H., Webster-Giddings, A., and Wickramasuriya, M., “Verification of YP689 Flow Field Models for Dynamic Interface Flight Test,” AIAA SCITECH 2023 Forum, 2023, p. 0294.
  • Schönberger and Frahm [2016] Schönberger, J. L., and Frahm, J.-M., “Structure-from-Motion Revisited,” Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • Community [2022] Community, B. O., Blender - a 3D modelling and rendering package, Blender Foundation, Stichting Blender Foundation, Amsterdam, 2022. URL http://www.blender.org.
  • Loquercio et al. [2019] Loquercio, A., Kaufmann, E., Ranftl, R., Dosovitskiy, A., Koltun, V., and Scaramuzza, D., “Deep Drone Racing: From Simulation to Reality with Domain Randomization,” IEEE Transactions on Robotics, 2019. 10.1109/TRO.2019.2942989.
  • Cimpoi et al. [2014] Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., , and Vedaldi, A., “Describing Textures in the Wild,” Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014.
  • Ope [n.d.] “OpenGameArt,” https://opengameart.org/, n.d. Accessed May 12, 2023.
  • Budinich [2017] Budinich, R., “A Region Based Easy Path Wavelet Transform For Sparse Image Representation,” , 2017.
  • Vaswani et al. [2023] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I., “Attention Is All You Need,” , 2023.
  • Stewart and Andriluka [2015] Stewart, R., and Andriluka, M., “End-to-end people detection in crowded scenes,” , 2015.
  • Huang et al. [2022] Huang, W.-L., Hung, C.-Y., and Lin, I.-C., “Confidence-Based 6D Object Pose Estimation,” IEEE Transactions on Multimedia, Vol. 24, 2022, pp. 3025–3035. 10.1109/TMM.2021.3092149.
  • Lee [2018] Lee, T., “Bayesian Attitude Estimation with the Matrix Fisher Distribution on SO(3),” IEEE Transactions on Automatic Control, Vol. 63, No. 10, 2018, pp. 3377–3392.
  • Loshchilov and Hutter [2017] Loshchilov, I., and Hutter, F., “Fixing Weight Decay Regularization in Adam,” CoRR, Vol. abs/1711.05101, 2017. URL http://arxiv.longhoe.net/abs/1711.05101.
  • Zhang and Scaramuzza [2018] Zhang, Z., and Scaramuzza, D., “A tutorial on quantitative trajectory evaluation for visual (-inertial) odometry,” 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2018, pp. 7244–7251.
Refer to caption
(a) 3d view of the ship
Refer to caption
(b) Top and side view of the ship
Refer to caption
(c) Position, x𝑥xitalic_x
Refer to caption
(d) Position Error, exsubscript𝑒𝑥e_{x}italic_e start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT
Refer to caption
(e) Attitude, ypr𝑦𝑝𝑟ypritalic_y italic_p italic_r
Refer to caption
(f) Attitude Error, η𝜂\etaitalic_η
Figure 23: Overexposed case: the estimated pose (red) is compared against the RTK-GPS measurements (blue)
Refer to caption
(a) 3d view of the ship
Refer to caption
(b) Top and side view of the ship
Refer to caption
(c) Position, x𝑥xitalic_x
Refer to caption
(d) Position Error, exsubscript𝑒𝑥e_{x}italic_e start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT
Refer to caption
(e) Attitude, ypr𝑦𝑝𝑟ypritalic_y italic_p italic_r
Refer to caption
(f) Attitude Error, η𝜂\etaitalic_η
Figure 24: Underexposed case: the estimated pose (red) is compared against the RTK-GPS measurements (blue)
Refer to caption
(a) 3d view of the ship
Refer to caption
(b) Top and side view of the ship
Refer to caption
(c) Position, x𝑥xitalic_x
Refer to caption
(d) Position Error, exsubscript𝑒𝑥e_{x}italic_e start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT
Refer to caption
(e) Attitude, ypr𝑦𝑝𝑟ypritalic_y italic_p italic_r
Refer to caption
(f) Attitude Error, η𝜂\etaitalic_η
Figure 25: Nominal case: the estimated pose (red) is compared against the RTK-GPS measurements (blue)