Deep Transformer Network for Monocular
Pose Estimation of Ship-Based UAV
Abstract
This paper introduces a deep transformer network for estimating the relative 6D pose of a Unmanned Aerial Vehicle (UAV) with respect to a ship using monocular images. A synthetic dataset of ship images is created and annotated with 2D keypoints of multiple ship parts. A Transformer Neural Network model is trained to detect these keypoints and estimate the 6D pose of each part. The estimates are integrated using Bayesian fusion. The model is tested on synthetic data and in-situ flight experiments, demonstrating robustness and accuracy in various lighting conditions. The position estimation error is approximately 0.8% and 1.0% of the distance to the ship for the synthetic data and the flight experiments, respectively. The method has potential applications for ship-based autonomous UAV landing and navigation. Source code : fdcl-gwu/TNN-MO, Video : https://youtu.be/ZG_zVVS8xw8
1 Introduction
Unmanned Aerial Vehicles (UAVs) have seen a surge in usage across a multitude of industries, such as aerial photography, military operations, agriculture, map**, and surveying. The advantages of UAVs over traditional manned aircraft are numerous, including cost-effectiveness, enhanced safety, and superior flexibility. However, the autonomous operation of UAVs, particularly their ability to land on moving platforms like ships, poses a crucial challenge. This capability is of significant importance for industries that depend on maritime transportation or offshore operations.
A primary challenge in this context is the estimation of the UAV’s relative pose with respect to the ship, which is vital for precise control of the UAV’s movements and ensuring a safe landing. Conventionally, the relative pose has been determined using the Real-Time Kinematic (RTK) Global Positioning System (GPS). To receive RTK-GPS, a communication link between the ship and the UAV must be maintained at all times, typically via radio. As a result, the UAV cannot estimate its relative position independently. In military surveillance operations, the broadcasting of radio signals to maintain this communication link could potentially reveal the ship’s position, thereby increasing its vulnerability. Also, reliance on GPS can lead to issues in situations where GPS signals may be weak or unavailable, and it is susceptible to malicious activities, such as jamming or spoofing, which can disrupt the operation of the UAV.
An alternative to GPS for estimating the relative pose of a UAV could involve the use of a monocular camera mounted on the UAV. Then, captured Red-Green-Blue (RGB) camera images can be processed to estimate the UAV’s relative pose. This method provides a potential solution for independent and secure pose estimation. Traditionally, ArUco markers have been employed to estimate the relative pose of the camera. ArUco markers are a type of fiducial marker, which are square markers with a specific binary pattern that can be easily detected and identified by a computer [1, 2]. However, the use of ArUco markers comes with certain disadvantages, such as the need for clear visibility and specific lighting conditions, their susceptibility to occlusion, and the requirement for the markers to be within the camera’s field of view. These limitations can affect the accuracy and reliability of pose estimation.
In our preceding studies, we accomplished autonomous flight for shipboard launch and landing using feature-based Visual-Inertial Navigation (VIN) [3]. However, it turned out that the accuracy of estimating the relative pose depends on the proximity of the camera to the landing pad. The quality and quantity of features significantly influence the accuracy of the relative pose. Moreover, the number of features extracted under variable lighting conditions can fluctuate and features extracted from the environment outside the ship, such as the sky and ocean, adversely affect the accuracy of relative pose estimation.
Recent advancements in object pose estimation with deep learning, particularly the use of Convolutional Neural Networks (CNNs), have shown significant progress. Multiple convolution layers for feature extraction, object detection, and/or segmentation are utilized for object silhouette prediction to estimate the 6D pose of an object [4, 5]. Recently, a transformer architecture based on the self-attention mechanism, originally introduced for Natural Language Processing (NLP), has been successfully applied to various computer vision tasks. The DEtection TRansformer (DETR) is one such transformer-based neural network architecture for object detection [6]. This has been extended with the YOLOPose architecture [7] to perform multi-object 6D pose estimation using both direct regression and keypoint regression.
In one of our recent studies, we introduced a Transformer Neural Network for Single Object (TNN-SO) model to estimate the 6D pose from the ship [8], where we took the advantage of tracking a part of a ship near the landing pad as a single object rather than tracking the entire ship. However, despite the robustness of the TNN-SO model for variable lighting and the higher accuracy closer to the landing pad, the accuracy of pose estimation significantly decreased for longer ranges approximately greater than 6m from the center of the landing pad. This is because the visibility diminished as the UAV flies away from the ship.
![Refer to caption](extracted/5662906/images/YP689/6dpose.jpg)
To address this issue, in this paper, we propose an innovative method that facilitates multi-object detection using the Transformer Neural Network (TNN). In this approach, the ship is decomposed into multiple parts with varying sizes and locations, as opposed to the previous single object detection method. We model a Transformer Neural Network for Multi Object (TNN-MO) by utilizing the DETR architecture to estimate the 6D pose of a UAV relative to the selected parts of the ship. The multiple pose estimates with respect to those parts are integrated in a Bayesian fashion.
More specifically, the proposed TNN-MO is implemented as follows. The performance of deep neural networks is constrained by the quality and quantity of data used in the training and validation steps. However, the absence of a large labeled dataset containing dynamically changing environments for a real-world ship poses a significant challenge in effectively training the transformer neural network model, especially when there are multiple objects to be detected. To tackle the challenges in constructing a rich dataset in the real world, we generate synthetic images by rendering a 3D model of the ship with randomly distributed camera poses under various textured environments and backgrounds to train the TNN-MO model. After training the TNN-MO model, it is able to detect keypoints of multiple parts of the ship.
To estimate the 6D poses, we utilize the EPnP algorithm [9], leveraging the 2D-3D correspondences of the predicted 2D keypoints of multiple parts of the ship and its known 3D keypoints. Then, the resulting pose estimates of multiple ship parts are filtered when the object class confidence is greater than 0.9, and the filtered pose estimates are integrated using Bayesian fusion. Upon training both the TNN-SO [8] and TNN-MO models, we observed that the position estimation accuracy of the TNN-MO model surpassed that of the TNN-SO model, even at extended ranges, when evaluated with the synthetic test dataset.
To assess the performance of our models in real-world scenarios, we collected images from the camera-attached data collection system (DCS) [10] on an octocopter UAV launched from the YP689 vessel, operated by the United States Naval Academy (USNA) [11]. The DCS is also equipped with RTK GPS, along with a base station mounted at the flight deck of the ship and a dedicated wireless communication link, to measure the relative position up to a centimeter-level accuracy, which is considered as a reference. It is verified that the position error of the proposed TNN-MO in the real world is approximately 2% of the range.
Our proposed approach successfully estimates the relative pose with monocular real-world images, even under challenging conditions such as varying light and different viewpoints. Notably, the TNN-MO model, compared to the TNN-SO model, demonstrates remarkable pose estimation accuracy over a longer range by leveraging the benefits of multi-object detection, while maintaining high accuracy near the landing area. This demonstrates promising potential for vision-based UAV landing and navigation by eliminating the need for GPS, highlighting the effectiveness and precision enhancement offered by our method.
This paper is organized as follows. We present a virtual environment for a ship model in Section 2, from which synthetic data for training and validation are generated in Section 3. Next, Section 4 presents the training procedure and the Bayesian fusion scheme, followed by validations with the synthetic data and with flight experiments, respectively in Section 5 and Section 6.
2 Virtual Environments for Synthetic Data
This paper focuses on a specific scenario of pose estimation relative to a research vessel operated by the United States Naval Academy, specifically YP689 (or identical prior research vessel YP700), over Chesapeake Bay, Maryland. However, it is impractical to obtain a large dataset of real-world images for a vessel in the ocean with the exact pose under dynamically changing environments, due to the substantial time and cost required. To overcome this, we generate a synthetic dataset from a 3D CAD model using open-source rendering software, with which the proposed network is trained. This section outlines the detailed steps involved in creating the virtual environments, including a CAD model of the ship, texture, and illumination conditions.
2.1 Development of the 3D CAD Model
First, we create a CAD model by capturing point clouds of the vessel. We captured a footage of the entire ship using a monocular RGB camera mounted on an octocopter [10] [11]. Utilizing this video, we generated a 3D point cloud of the ship using structure from motion techniques in computer vision [12]. The resulting 3D point cloud is corrupted by noise and it incomplete as the octocopter is not allowed to fly near the bow side of the ship. It is first was manually filtered to remove outliers, then the cleaned point cloud was imported into an open-source rendering software, namely Blender, to create a CAD model [13]. During this step, detailed structures on the ship not captured by the point cloud, such as the ladders, ship bow, and fence, were manually added too. Furthermore, the CAD model was rescaled according to the actual size of the ship. To enhance the realistic view of the scene, we included other objects such as the ocean, sun, and humans. Also, to make ocean more realistic, we randomly generated wave patterns for every scene using Blender wave modifier tool.
![Refer to caption](extracted/5662906/images/cad/Point_Cad.png)
2.2 Assignment of Textures
![Refer to caption](x1.png)
![Refer to caption](x2.png)
![Refer to caption](x3.png)
![Refer to caption](x4.png)
![Refer to caption](x5.png)
When operating a UAV in the vicinity of a ship, the environment is subject to frequent and dynamic changes. Real images are often influenced by variations in lighting, weather conditions, changes in cloud and wave patterns, occlusions, and other factors. These variables make the detection, recognition, and pose estimation of vessels a challenging task. Training a neural network using synthetic images generated in a static environment may not yield successful results due to the discrepancy between the simulated and real environments, a phenomenon referred to as the sim-to-real gap.
To address this issue, we generate synthetic images using domain randomization, where non-photo-realistic and realistic textures are randomly assigned to each object in Blender [14]. Texture assignment, which involves map** necessary textures onto 3D objects, is a crucial step in the process of generating synthetic images. In this work, we utilized two sources to create texture categories: the Describable Textures Dataset (DTD) [15] and the open-source platform OpenGameArt [16]. The categorization of these textures was based on their image frequency and the characteristics of the textures. Specifically, textures with high frequencies or those resembling oceans were grouped together as "Ocean-like textures". Textures selected with low frequencies were categorized as "Vessel skin textures". Similarly, textures with low frequencies or those resembling the sky were grouped together as "Sky-like textures". To define the landing pad and its markings, we grouped together dark, asphalt-like textures as "Landing pad textures", while whitish color textures were categorized as "Landing pad markings textures", as shown in Figure 3.
Randomly assigning different textures to objects can enhance object detectability in dynamic environments, which is beneficial for improving the performance of the neural network (NN) model, especially in real-world environments, thereby bridging the sim-to-real gap. Texture categorization is also important as it allows the NN to recognize the shape of the 3-D object (vessel) instead of solely relying on object features.
2.3 Environmental Illumination and Occlusions
![Refer to caption](extracted/5662906/images/lighting/var_light.jpg)
In the ocean environments, the lighting conditions may vary drastically. Furthermore, the illumination of objects in an image can be affected by changes in the orientation of the objects or the camera, even under consistent lighting conditions. For instance, the orientation of an object or the camera may result in the appearance of shadows under direct sunlight. Most cameras are equipped with an automatic exposure control (AE) system that adjusts the brightness of the image based on the lighting conditions. As a result, the lighting conditions in the captured image can vary depending on the adjustments made by the camera’s built-in AE system. This variability in lighting conditions poses challenges for object detection, tracking, and pose estimation when using real images. As such, randomizing object texture may not fully mitigate these challenges. To address this, we have enhanced the scene by introducing randomly varying lighting conditions, as illustrated in Figure 4.
In real-world scenarios, human operators and other objects are present in the vessel, leading to occlusions that cannot be overlooked. For example, during any UAV operation to collect real test data, the operator must be positioned on the landing pad. The occlusions created by such instances become critical during the UAV’s landing and take-off stages. To account for these occlusions, we have incorporated human models and other objects randomly during data generation.
3 Data Generation in Synthetic Environments
In this section, we incorporate a virtual camera object into the scene to create synthetic images. Importantly, the camera’s properties were configured to match those of the actual camera used for image collection, particularly in terms of field of view and resolution. This step is vital in ensuring the accuracy of the pose estimation for real images.
The camera can be positioned anywhere within the 3D space by defining its position and orientation. One of the primary objectives of this work is to estimate the pose of the camera with data-driven learning. Therefore, it is crucial to incorporate various random camera poses in the dataset. The distribution of the camera pose chosen for the training data should ideally reflect the distribution of the pose encountered during nominal shipboard operations for launch and recovery. To ensure this, we formulate a probability density function (PDF) for the camera pose. This PDF is designed such that the density value increases near the landing pad when the camera is pointed toward the bow side. This distribution is then dispersed to cover the flight envelope of nominal operations. The training dataset is generated by sampling the camera pose from this distribution and generating the corresponding image in the synthetic environment, as discussed in the previous section. The process of sampling the relative pose of the camera involves three steps: defining the distribution of the camera pose, denoted by , defining the distribution of the camera is pointed toward, denoted by , and formulating the attitude of the camera based on these distributions [8].
3.1 Formulating Probability Density for Position
We first present the probability distribution for and . Let the base frame be fixed to the flight deck of the ship. The -axis of this frame points to the starboard side of the ship (right side of the helmsman), the -axis points toward the bow (front) of the ship, and the -axis points upwards. These axes are denoted by , , and , as illustrated in Figure 5. The location of or is defined by the spherical coordinates , where represents the distance, and represent rotation around the -axis and rotation around the -axis respectively.
For the position of the camera , the angle is sampled from a uniform distribution on the range , representing the area behind the flight deck. Similarly, is uniformly sampled from the range , corresponding to the area above the flight deck. This choice is motivated by our desire to generate realistic camera orientations that resemble those of a horizontally fixed camera during UAV flight. We initially experimented with increasing the maximum range of closer to , but found that this led to a number of unrealistic camera orientations. To mitigate this, we decided to set the maximum range for at . This decision has proven effective in maintaining the realism of our generated camera orientations. The distance is sampled from a truncated normal distribution, which is a one-dimensional Gaussian distribution truncated to the range of and rescaled accordingly. Here, the maximum range is chosen as , and the mean and the standard deviation of the distribution are and , respectively. Next, the distribution for the target point is sampled in a similar manner. The range for each of , , and is , , and , respectively. The mean and the standard deviation for the truncated normal distribution are and . The resulting set of sampled and is presented at Figure 6.
3.2 Attitude of Camera
Once the points and are sampled, the camera attitude is determined as follows. Let the camera frame be defined such that its origin coincides with , and the positive x-axis points to the right (along the direction of the image width), the positive y-axis points upward (along the direction of the image height), and the positive z-axis is opposite to the line of sight, as illustrated in Figure 5. The orientation of the camera frame with respect to the flight deck frame is specified by a rotation matrix .
The points of and are sampled as described above, and they are reordered randomly to create a set of pairs of . Each pair defines the line of sight of the camera as . We choose the corresponding rotation matrix such that its third axis is opposite to and its -axis is level to the ground. More specifically, is given by
where the unit-vectors are obtained by
with . Then, the camera attitude is obtained by rotating about the -axis by an angle , which is sampled from the uniform distribution in the range , i.e., , where the hat map is defined such that and for any . The angle represents the possible rolling motion of the camera.
![Refer to caption](x6.png)
3.3 Generated Synthetic Images
The selected camera pose is transferred to the rendering software, Blender to generate the synthetic images with varying textures, wave patterns, and lightening conditions. Each image is paired with the correct pose to be used for training and verification. The selected samples from the synthetic images are illustrated in Figure 7.
![Refer to caption](x7.png)
In the virtual environment, the sea is modeled as an infinitely large plane, leading to the prominence of the sea horizon (the line where the sea meets the sky) in the background of most synthetic images. However, in real-world scenarios, the visibility of the sea horizon can be obstructed by various objects or geographical features. For instance, when the ship is in a bay, the shoreline or the land may alter the view of the horizon. As such, a dataset with infinitely extending sea may not represent the real-world scenario properly. We have observed that a network trained over such dataset may exhibit a bias especially in estimating the height of the camera.
To address this, we have opted to generate a distinct synthetic image dataset without the horizon. This is achieved by substituting the ship’s background with a randomly generated image. The random image is produced using the random image generator as proposed by [17]. Then, we mix the above dataset with the synthetic images without any horizon. This approach ensures a more diverse and realistic representation in our synthetic image dataset, while capturing the importation information provided by a horizon.
![Refer to caption](x8.png)
3.4 Decomposition of the Ship into Multiple Parts and Keypoint Selection
The above images constitute the input to the neural network formulated in the next section. The desired outputs, or the target data correspond to the pixel location in the image for the 3D keypoints defining a pre-defined bounding box of the ship. Then, the camera pose can be estimated by the resulting correspondence between the 2D image coordinate and the 3D location of the keypoints.
However, defining the entire ship as a single object is not desirable, as the visibility might be obscured depending on the position and the orientation of the camera. For example, during the landing, the keypoints at the stern of the ship may not be visible. To address this issue, the ship is decomposed into multiple parts for detection, rather than being detected as a whole. This method also helps mitigate the impact of certain camera orientations and specific lighting conditions that might result in only partial visibility of certain areas of the ship, potentially affecting the accuracy of the model’s estimations. Specifically, the ship is decomposed into six parts: the whole ship, stern, superstructure, and three parts of the flight structure immediately forward of the flight deck, referred to as the dog house,, as illustrated at Figure 9.
![Refer to caption](extracted/5662906/images/3dkeys/3dkeys.png)
For each of these six parts, we define a 3D bounding box by specifying eight corners following [7]. Each of twelve edges of the bounding box is divided into three sections by adding two intermediate points. As such, a bounding box is defined by thirty two keypoints. The location of each keypoint is calculated by the perspective projection from the camera model. In short, the target for each synthetic image is the 2D location of thirty two keypoints for each of six parts of the ship.
4 Relative 6D Object Pose Estimation
In this section, we present a deep neural network model for pose estimation and a Bayesian fusion to integrate the estimate pose relative to multiple parts of the ship. For the neural network architecture, we employ a Transformer Neural Network (TNN) as presented in [7]. In this method, we estimate the 2D keypoints associated with each object and the object class confidence of the RGB image. Then, we recover the corresponding relative 6D poses of each object by solving the 2D to 3D correspondence of the keypoints using Efficient Point-n-Perspective (EPnP) [9]. Finally, we integrate the resulting pose estimations using Bayesian fusion to obtain the most probable pose estimate.
4.1 Transformer Neural Network for Keypoints Estimation
The proposed transformer neural network architecture is described as follows. First, for a given RGB image, we extract image features using a CNN backbone. Specifically, the input RGB image has a size of . We employ ResNet50 to generate a lower-resolution feature , where is the channel dimension and are defined as respectively, following the DETR [6]. Then, we use a convolution to reduce the feature dimension to . The extracted features, which are in the dimension of , are flattened to a dimension of . Subsequently, a fixed positional encoding is added to the features to generate input embeddings, which have the dimension of . Then, each of these embeddings is used as an input for the transformer encoder [18].
The transformer encoder has six standard encoder layers with skip connections. The output from the encoder is passed to the decoder module which also has six standard decoder layers with skip connections, accompanied by object queries [7]. These object queries are a set of learnable positional embeddings with the dimension , representing the position of each object class. These vectors are learned during the training process of the model, to help the model understand and represent the positional relationships between different objects in the data. Let be the cardinality of the object classes including the case when no object is detected, which is denoted by . Since we are considering six parts of the ship as objects which defined in Figure 9, we have . In the presented TNN-MO model, we set , implying that the number of object queries is equal to the object classes.
The resulting decoder output embeddings are processed with feed-forward networks (FFNs). Each FFN is composed of two parts: linear layers with an input size of 256 and an output size of for class prediction, and a standard three-layer perceptron that has an input size of 256, a hidden layer dimension of 256 with the ReLU activation, and an output size of 64 for keypoint prediction.
These outputs of the -th FFN, corresponding to the -th object query, are denoted by the class predition and the keypoints , respectively. Here, the -th variable of , namely corresponds to the class logit (unnormalized score) that the keypoint output of the -th FFN, belongs to the -th object class. In other words, the higher the value of is, the more likely it is that the keypoints are of the -th object. These logits are passed through a softmax function to obtain the predicted class probability of the -th FFN, satisfying . These are concatenated into a stochastic matrix , where denotes the predicted probability that the -th set of keypoints belongs to the -th object class.
4.2 Ground Truth Data
In the training input image, the ground truth object class labels are represented by , where is a binary variable indicating the presence (or the absence) of the -th object. Specifically, if the -th object appears in the training input image, we set ; otherwise, we set . The -th object corresponds to the object, and therefore, always for any image in the training data set. It is important to note that since we are considering a single ship, there is no repetition of objects in the image. For example, following the object classes defined in Figure 9, when the whole ship (5th) and the ship stern (6th) are visible in a specific image, the resulting ground truth object class labels is .
For any with , the corresponding location of the keypoints in the image is denoted by , and the location of the keypoints in the ship-fixed frame is denoted by .
4.3 Loss Function for Training
For the given labeled data and the predicted output, the loss function for training is formulated as follows. Our approach to loss computation is inspired by the set prediction task, similar to the methods used in DETR [6] and YOLOPose [7]. It is composed of two steps of identifying the matching pairs and calculating the loss.
First, we perform the matching between the predicted class probability outputs and the ground truth using bipartite matching [6]. Let be the set of all possible permutations of . For any sequence , the matching cost is defined by
(1) |
where gives the index of the -th element in the permuted sequence. As such, this represents the sum of the negative probability that belongs to the -th object class for all objects present in the training image satisfying . The bipartite matching is to identify the optimal sequence that minimizes the matching cost:
(2) |
and it is addressed by the Hungarian method [19].
Next, after the matching pairs are identified, we formulate the keypoint loss as the sum of the negative log-likelihood for the object class prediction, and the error in the prediction of the keypoints, as given by
(3) |
where is a hyperparameter for relative weighting. The parameters of the proposed deep neural network and the object queries are randomly chosen initially, and they are adjusted to minimize the above loss during training. The detailed implementation of the training is described in the next section.
4.4 Model Inference with Bayesian Fusion
During the inference phase, the proposed TNN returns the probability of the class prediction and the predicted keypoints for each object query . The object class corresponding to the -th object query, namely is identified by
and the resulting maximum probability is denoted by , which is considered as the confidence in the object classification.
Together with the 3D location of the keypoints (or the bounding box) namely in the ground truth, the pair provides a 2D-to-3D correspondence for 32 points on the -th bounding box, which can be used to estimate the camera pose with respect to the ship-fixed frame. Specifically, we use the EPnP algoroithm [9] in conjunction with RANSAC to estimate the pose from the -th object.
As such, we obtain at most pairs of from the object classes defined in Section 3.4, excluding the predictions corresponding to the no object class. Each pose estimate has a varying degree of accuracy and confidence, as one object may be captured clearly than others depending on the perspective of the camera and the lighting condition. Or, it is possible that certain objects are occluded or outside of the field of view.
To address this, the multiple pose estimates are integrated as follows. It has been empirically observed that the class confidence of the object, namely is closely related to the accuracy of the pose estimate [20]. To eliminate outliers, the estimated pose is discarded if the object class confidence is less than or equal to 0.9, i.e., . Let be the number of estimates remaining after removing outliers. Then, the class confidence for the inliers are filtered with the softmax function to obtain the normalized weight for the inliers, satisfying . Each pose estimate is paired with the corresponding weight , to be integrated into a single pose estimate.
For the position estimate, the mean and the covariance matrix are obtained by the weighted sum as
(4) |
which serve as the position estimate and the corresponding degree of confidence.
Next, for the attitude estimate, it has been shown that the following arithmetic mean , or the first moment of the rotation matrix describes the mode and the degree of dispersion in [21].
(5) |
Note that while each belongs to , the weighted sum is not necessarily on . Let be the singular value decomposition of with the orthogonal and the diagonal . The proper singular value decomposition of is given by
(6) |
where the rotation matrices and the diagonal matrix are defined as
Given the value of and its proper singular value decomposition, the maximum likelihood estimate of the attitude is given by [21]. The confidence in the attitude estimate can be measured by the diagonal elements of satisfying . Roughly speaking, as the diagonal elements become closer to one (resp. zero), the confidence level on the attitude estimate increases (resp. decreases).
More specifically, the distribution of the estimated attitude can be described by the matrix Fisher distribution, which is the maximum entropy (or most arbitrary) distribution on when conditioned by the fixed value of . From the matrix Fisher distribution, the degree of uncertainties in the attitude estimate can be quantified as follows. For any , let be defined such that . In other words, represents the difference between and the estimated attitude in the sense that can be obtained by rotating about the axis resolved in the ship-fixed frame (or the axis when resolved in the estimated camera frame) by the angle . Thus, is the axis of rotation and is the angle of rotation from to .
There are two methods for uncertainty quantification. First, when the diagonal elements are identical, i.e., , the probability that the angle of rotation is less that a specific value can be computed by
(7) |
where correspond to the modified Bessel function of the first kind, and is a scalar that can be obtained numerically from [21]. The value of the above probability for varying and is illustrated at Figure 11. For example, when , the estimation error is less than with the probability of .
Second, when the estimated distribution is highly concentrated, or when , the rotation vector follows a Gaussian distribution with
(8) |
As it is unlikely to encounter highly concentrated estimates where (8) holds in practice, the following empirical expression can be used instead.
(9) |
where is obtained such that .
5 Model Training and Testing with Synthetic Data
The above transformer network and the Bayesian fusion constitute the proposed network architecture to estimate the 6D pose from monocular vision. In this section, we describe how the network is trained, including the implementation details and trained model is validation with the synthetic data.
5.1 Model Training with Synthetic Data
The model is trained as follows. The training dataset is composed of 435k synthetic images, including 332k images with a sea horizon background and 102k images with a random image background (without a sea horizon). Given that the synthetic dataset already included randomized situations, effectively covering a broad spectrum of variations and scenarios, image augmentation techniques were not utilized during the training process. The existing dataset was deemed to provide sufficient diversity and complexity for effective learning and adaptation of the model, rendering additional image augmentation unnecessary.
The TNN-MO model was implemented using the PyTorch framework. The model was trained for 350 epochs with the batch size of 48, hyperparameter , leveraging the AdamW optimizer [22], using a 20GB Multi-Instance GPU (MIG) partition from NVIDIA A100-PCIE-40GB GPU. The learning rate was initially set to for the first 200 epochs, after which it was reduced to . Gradient clip** was also employed, limiting the maximum gradient norm to 0.1. The resulting training curve is presented at Figure 12.
![Refer to caption](x10.png)
5.2 Model Testing with Synthetic Data
![Refer to caption](extracted/5662906/images/syn_est/MTNN_synthetic/SynTest4x2.jpg)
![Refer to caption](extracted/5662906/images/syn_est/MultivsSingle/MOvsSO4x2.jpg)
The trained model is validated with the synthetic test dataset that were not in the training dataset. For further comparison, we also incorporate the TNN model without the proposed multi-objects framework as developed in [8]. The resulting single object model predicts the relative pose with respect to the dog house structure. These are referred to as the TNN-SO Model and TNN-MO Model, respectively.
We compute the mean absolute error (MAE) for the position estimate and the attitude estimate over synthetic images under various lighting conditions and poses. The results are summarized at Table 1, where it is presented that TNN-MO model exhibits the position error of and the attitude error of . The position error is about 0.8% of the maximum range. While the direct comparison is not possible as they are distinct approaches based on different assumptions, these errors of the proposed approach are comparable or superior to the common visual-inertial odometry techniques [23]. When compared with the TNN-SO model, the proposed TNN-MO model demonstrates improved accuracy, even over long ranges. From Figure 14, we can see that the TNN-SO model is unable to detect the keypoints of the dog house when the camera is far from the ship. However, our TNN-MO model is capable of detecting keypoints of multiple objects and it is not vulnerable to the visibility of a single object. Figure 13 and Figure 14 illustrates the bounding boxes with a higher confidence level computed from the estimated pose, and it demonstrates that the estimated keypoints accurately aligned with the ground truth keypoints.
TNN type | No. Images | Max range ((m)) | MAE Rot. Est. (deg) | MAE Pos. Est. (m) | MAE / (%) |
---|---|---|---|---|---|
TNN-SO | 4.2k | 20 | 1.13 | 0.281 | 1.41 |
TNN-MO | 5.5k | 25 | 0.91 | 0.204 | 0.82 |
6 Validation with Flight Experiments
The TNN-MO model, trained on a synthetic dataset, was further tested using real images obtained during flight experiments on a USNA research vessel. The model was trained on a synthetic dataset, as collecting a large labeled dataset of a research vessel in ocean is infeasible. As such, it is critical to evaluate the model in the real world situation to access its capacity to handle the unique attributes of real images, such as such as variations in lighting conditions, object appearances, backgrounds, and other factors that may deviate from the synthetic dataset, referred to as the sim-to-real gap. This will offer important insights into its adaptability and resilience, confirming its efficacy beyond the synthetic training data and its suitability for real-world applications.
6.1 Data Collection System (DCS)
The hardware configuration to capture a validation dataset is as follows. A data collection system (DCS) that was developed for ship air wake measurements [10] is augmented with a camera. It is attached to an octocoper unmanned aerial vehicle that is manually controlled to fly around a USNA research vessel in Chesapeake bay, Maryland.
The detailed configuration of the DCS composed of a base module and a rover module is presented at Figure 15. Specifically, the DCS includes Inertial Measurement Units (IMUs) and Relative Time Kinematic (RTK) GPS, from which an extended Kalman filter is executed to estimate the pose in real time. The rover module of the DCS has been enhanced with an Alvium 1800 C-240 Global Shutter RGB camera and a Jetson Nano single-board computer, connected via an MIPI CSI-2 cable. This setup allows image capture at a rate of 5Hz, synchronized with the RTK GPS via a WIFI connection. This upgraded rover configuration enables synchronization and time-stam** of the images within the RTK GPS data collection process. It allows for the effective collection of real image data along with accurate relative camera 6D poses.
![Refer to caption](extracted/5662906/images/DCS/DCS5.jpg)
6.2 Model Testing with Real Data
The trained model is tested with the real world images captured by the DCS over multiple days. During the in-flight experiments conducted on the USNA research vessel, the model was subjected to real-world conditions where instances of variable lighting and occlusions were encountered. We deliberately selected challenging images for testing because they represent realistic conditions where such situations can occur. Figures 17 to 19 show the keypoints prediction and corresponding re-projected ships coordinate frame from estimated pose by the proposed TNN-MO model under the following three lighting conditions.
-
•
Overexposed ship: An overexposed ship is one where the image has captured too much light, causing the ship to appear excessively bright.
This results in a loss of detail, especially in areas that have subtle color variations or textures. The ship’s features become hard to be distinguished because the intense light overwhelms the camera’s sensor, leading to a predominance of white or light areas, particularly on surfaces like the landing pad. It’s as if the ship is caught in a glare, with its details bleached by the brightness.
-
•
Underexposed ship: An underexposed ship is one where the image has not captured enough light, making the ship appear too dark. This can obscure details and make it challenging to distinguish features, especially in areas that are naturally shadowed or lack reflective surfaces. It is as if the ship is enveloped in shadows, with its details concealed in darkness.
-
•
Normal ship: A normal ship, in terms of exposure, is one where the lighting conditions are optimal for image analysis, resulting in a balanced image with clear visibility of details. The lighting is neither too intense nor too dim, providing an optimal level of brightness that allows all parts of the ship to be clearly distinguished.
Traditional feature extraction methods and visual marker-based approaches often struggle under challenging circumstances, as there is a lack of visual features on the ship. Furthermore, in this dynamic environment of real-world imaging, human operators and various items are present on the ship, which we can identify as occlusions. These occlusions, along with variable lighting, present significant challenges. However, as illustrated by Figures 17 to 19, the proposed model successfully identified the keypoints and the bounding boxes of multiple parts of the ship. Its robust performance under such conditions is a testament to its capability to provide accurate results, ensuring reliable 6D pose estimation even when faced with obstructions and fluctuating light levels.
![Refer to caption](extracted/5662906/images/real_est/Real_collage/overexposed/overexposed5x3.jpg)
![Refer to caption](extracted/5662906/images/real_est/Real_collage/underexposed/underexposed5x3.jpg)
![Refer to caption](extracted/5662906/images/real_est/Real_collage/normal/normal5x3.jpg)
6.3 Validation with RTK GPS
Further, for quantitative validation, the estimated pose is compared with the actual pose determined by the DCS. The DCS integrates the relative attitude, derived from the base rover’s IMUs, and the relative position from the RTK-GPS with an extended Kalman filter. The IMUs, such as the VN-100, provide a magnetic heading accuracy of , a gyro in-run bias stability of /hr, and a pitch/roll accuracy of under normal conditions. The RTK-GPS, such as the Piksi Multi GNSS, provides a horizontal position accuracy of 0.75 m in SBAS mode, a velocity accuracy of 0.03 m/s RMS, and a time accuracy of 60 ns RMS. In RTK mode, it provides an accuracy of 0.010 m horizontally and 0.015 m vertically. These systems provide a centimeter-level accuracy in nominal conditions for the DCS. Here, the measurements of the RTK-GPS are considered as ground truth. The trajectories estimated by the proposed TNN-MO model and the RTK-GPS under the above three illumination conditions are illustrated in Figures 23 to 25 with the estimation errors and -bounds calculated by (4) and (9), where it is shown that the position and the attitude trajectory estimated by the TNN-MO model are consistent with the IMU and the RTK-GPS.
Image Type | Max Range, (m) | MAE / / of Rot. (deg) | MAE / of Pos. (m) | MAE/ (%) |
---|---|---|---|---|
Overexposed Ship | 13.7 | 1.8 / 2.32 / 0.999 | 0.112 / 0.017 | 0.82 |
Underexposed Ship | 13.5 | 1.1 / 1.83 / 0.999 | 0.089 / 0.019 | 0.66 |
Normal Ship | 18.2 | 4.0 / 4.54 / 0.999 | 0.177 / 0.022 | 0.97 |
The mean absolute errors are also tabulated at Table 2. The MAE of position estimation ranges from to meters, which corresponds to to of the maximum range . The MAE of rotation estimation varies from to degrees. While these errors are slightly larger than the validation error for the synthetic data as presented in Table 1, it is verified that the proposed TNN-MO successfully overcome the sim-to-real gap. The performance is consistent over the varying illumination conditions captured over multiple days. While the degree of attitude uncertainties measured by the standard deviation is in the reasonable range, the standard deviation for the position estimate is relatively small compared with the error, indicating over-confidence. This is partially because the uncertainties are measured indirectly in (4) by the variations of the estimate over multiple objects, not in the confidence of estimate for the individual object. However, in Figures 23 to 25, the uncertainties increase as the camera is further away from the ship, which is expected. These show that the presented uncertainty model behaves reasonably in the qualitative sense, but not necessarily accurate quantitatively. The exact modeling of uncertainties is considered as one of future directions. This provides promising results for pose estimate that can be utilized for autonomous launch and recovery in the ocean environments.
6.4 Multi-Objects Pose Estimation
The proposed TNN-MO model integrates the multiple pose estimates with respect to six object classes. To investigate the advantages of the multi-objects estimates, we perform an ablation study by studying the pose estimated with respect to the individual object class without fusion.
![Refer to caption](extracted/5662906/images/plots/ablation/0.png)
![Refer to caption](extracted/5662906/images/plots/ablation/1.png)
![Refer to caption](extracted/5662906/images/plots/ablation/2.png)
![Refer to caption](extracted/5662906/images/plots/ablation/3.png)
![Refer to caption](extracted/5662906/images/plots/ablation/4.png)
![Refer to caption](extracted/5662906/images/plots/ablation/5.png)
For the second case of the underexposed ship, the trajectory estimated by each object is presented at Figure 20. Upon examining Figures 20(d) to 20(f), it is evident that when the camera is closer to the ship, three object categories of the super structure, the whole ship, and the stern are not within the field-of-view, and the corresponding estimate is degraded. Conversely, for Figures 20(a) to 20(c), object pose estimations are accurate and reliable, as the objects remain visible throughout the flight.
Furthermore, the position estimation error with respect to each object is summarized at Table 3, where it is shown that the proposed multi-object approach yields smaller errors in mean and max values compared to the TNN-SO, illustrating the advantages of fusion.
Overexposed Ship | Underexposed Ship | Normal Ship | ||||
---|---|---|---|---|---|---|
Pos. err. (m) | Pos. err. (m) | Pos. err. (m) | ||||
Mean | Max | Mean | Max | Mean | Max | |
TNN-MO | 0.112 | 0.485 | 0.089 | 0.937 | 0.177 | 0.666 |
TNN-SO (dog house center) | 0.129 | 0.723 | 0.259 | 1.669 | 0.261 | 1.191 |
TNN-SO (dog house) | 0.131 | 0.720 | 0.259 | 1.738 | 0.253 | 1.154 |
TNN-SO (house) | 0.123 | 0.728 | 0.258 | 1.667 | 0.254 | 1.220 |
TNN-SO (super structure) | 0.135 | 0.782 | 0.347 | 1.660 | 0.304 | 1.169 |
TNN-SO (whole ship) | 0.151 | 0.682 | 0.369 | 1.704 | 0.355 | 1.129 |
TNN-SO (ship stern) | 0.152 | 0.669 | 0.394 | 1.664 | 0.365 | 1.130 |
6.5 Attention Map
![Refer to caption](extracted/5662906/images/TNN/encoder_self_attention_feature.png)
![Refer to caption](extracted/5662906/images/TNN/decoder_cross_attention_arranged.png)
To gain insights into how the synthetically trained TNN-MO model performs on real-world images, we analyzed the visualizations of the attention maps for the encoder and the decoder. The self-attention mechanism in the transformer network enables the encoder to understand the context of image elements. This is crucial in object detection, where the relationship between different parts of the image can provide important information. Therefore, we analyzed the transformer attention maps by passing a real-world image into TNN-MO, as shown in Figure 21. We observed that the TNN-MO encoder’s self-attention mechanism allows the model to understand the context of each pixel in relation to the others. For example, the self-attention map for the pixel point on the ship in the image, attention weights gets higher in the similar area of the ship on the map.
Additionally, to gain insights into what the model is focusing on, we visualized the decoder cross-attention map for different queries, as shown in Figure 22. This can help us understand which parts of the image the model finds most relevant for each query, providing valuable insights into the model’s decision-making process. We observed that the area where the cross-attention weight is higher for each query lies in the ship area of the image. This means that our TNN-MO model is able to understand the relevant information coming from the encoder to identify each part of the ship.
7 Conclusions
This study introduces an innovative method for estimating the relative 6D pose of an autonomous UAV in relation to a ship, utilizing monocular RGB camera images. We developed a synthetic dataset incorporating varied textures, lighting conditions, and camera poses, and annotated them with 2D keypoints of ship components in a virtual environment. Subsequently, we present a network architecture based on the transformer network, trained to detect keypoints of multiple parts of the ship, from which the camera pose is estimated using the Efficient Perspective-n-Point (EPnP) algorithm. Then, the estimated 6D poses are filtered, where the object class confidence is greater than 0.9, and they are integrated with Bayesian fusion to provide reliable pose estimations. This process combines pose estimations from multiple objects, disregarding information from unreliable estimates and placing more reliance on credible ones. The trained model has been thoroughly tested and validated on both synthetic and real-world data. The resulting mean absolute errors for position are 0.8% and 1.0% of the maximum flight range, respectively.
Our proposed formulation, based on synthetic data, circumvents challenges associated with collecting a large set of labeled data in the real world, successfully overcoming the discrepancy between virtual and real environments. Furthermore, our proposed multi-object pose estimation and fusion enhance accuracy and robustness under various relative configurations, occlusions, and illumination conditions. These results underscore the suitability of our approach for improving the autonomy and safety of UAV operations, particularly for autonomous landing and takeoff on moving platforms.
Future work includes integrating the pose estimate of the proposed transformer network model with an Inertial Measurement Unit (IMU) for visual-inertial navigation and utilizing it for autonomous flight experiments in ocean environments.
Acknowledgments
This research was conducted on the US Naval Academy’s research vessel YP689, and we extend our gratitude to the US Naval Academy and the crew of YP689. The research has been partially supported by the USNA/NAVSUP (N0016123RC01EA5), NSF (CNS-1837382), AFOSR MURI (FA9550-23-1-0400), and ONR (N00014-23-1-2850). The authors express their gratitude for this support.
References
- Robaglia et al. [2018] Robaglia, A. E., Libine, S., and Gamagedara, K., Autonomous Landing of an Unmanned Aerial Vehicle on a Moving Ship, 2018. 10.2514/6.2018-1461, URL https://arc.aiaa.org/doi/abs/10.2514/6.2018-1461.
- Marut et al. [2019] Marut, A., Wojtowicz, K., and Falkowski, K., “ArUco markers pose estimation in UAV landing aid system,” 2019 IEEE 5th International Workshop on Metrology for AeroSpace (MetroAeroSpace), 2019, pp. 261–266. 10.1109/MetroAeroSpace.2019.8869572.
- Gamagedara et al. [2024] Gamagedara, K., Lee, T., and Snyder, M., “Delayed Kalman filter for vision-based autonomous flight in ocean environments,” Control Engineering Practice, Vol. 143, 2024, p. 105791. https://doi.org/10.1016/j.conengprac.2023.105791, URL https://www.sciencedirect.com/science/article/pii/S096706612300360X.
- Kang et al. [2020] Kang, J., Liu, W., Tu, W., and Yang, L., “YOLO-6D+: Single Shot 6D Pose Estimation Using Privileged Silhouette Information,” 2020 International Conference on Image Processing and Robotics (ICIP), 2020, pp. 1–6.
- Wang et al. [2021] Wang, G., Manhardt, F., Tombari, F., and Ji, X., “GDR-Net: Geometry-Guided Direct Regression Network for Monocular 6D Object Pose Estimation,” CoRR, Vol. abs/2102.12145, 2021. URL https://arxiv.longhoe.net/abs/2102.12145.
- Carion et al. [2020] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S., “End-to-End Object Detection with Transformers,” CoRR, Vol. abs/2005.12872, 2020. URL https://arxiv.longhoe.net/abs/2005.12872.
- Amini et al. [2022] Amini, A., Periyasamy, A. S., and Behnke, S., “YOLOPose: Transformer-based Multi-Object 6D Pose Estimation using Keypoint Regression,” , 2022. 10.48550/ARXIV.2205.02536, URL https://arxiv.longhoe.net/abs/2205.02536.
- Wickramasuriya et al. [2024] Wickramasuriya, M., Lee, T., and Snyder, M., “Deep Monocular Relative 6D Pose Estimation for Ship-Based Autonomous UAV,” AIAA SCITECH 2024 Forum, 2024, p. 2877.
- Lepetit et al. [2009] Lepetit, V., Moreno-Noguer, F., and Fua, P., “EPnP: An accurate O(n) solution to the PnP problem,” International journal of computer vision, Vol. 81, 2009, pp. 155–166.
- Gamagedara et al. [2019] Gamagedara, K., Lee, T., and Snyder, M. R., Real-time Kinematics GPS Based Telemetry System for Airborne Measurements of Ship Air Wake, 2019. 10.2514/6.2019-2377, URL https://arc.aiaa.org/doi/abs/10.2514/6.2019-2377.
- Bostock et al. [2023] Bostock, N., Richez, A., Costello, D. H., Webster-Giddings, A., and Wickramasuriya, M., “Verification of YP689 Flow Field Models for Dynamic Interface Flight Test,” AIAA SCITECH 2023 Forum, 2023, p. 0294.
- Schönberger and Frahm [2016] Schönberger, J. L., and Frahm, J.-M., “Structure-from-Motion Revisited,” Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- Community [2022] Community, B. O., Blender - a 3D modelling and rendering package, Blender Foundation, Stichting Blender Foundation, Amsterdam, 2022. URL http://www.blender.org.
- Loquercio et al. [2019] Loquercio, A., Kaufmann, E., Ranftl, R., Dosovitskiy, A., Koltun, V., and Scaramuzza, D., “Deep Drone Racing: From Simulation to Reality with Domain Randomization,” IEEE Transactions on Robotics, 2019. 10.1109/TRO.2019.2942989.
- Cimpoi et al. [2014] Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., , and Vedaldi, A., “Describing Textures in the Wild,” Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014.
- Ope [n.d.] “OpenGameArt,” https://opengameart.org/, n.d. Accessed May 12, 2023.
- Budinich [2017] Budinich, R., “A Region Based Easy Path Wavelet Transform For Sparse Image Representation,” , 2017.
- Vaswani et al. [2023] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I., “Attention Is All You Need,” , 2023.
- Stewart and Andriluka [2015] Stewart, R., and Andriluka, M., “End-to-end people detection in crowded scenes,” , 2015.
- Huang et al. [2022] Huang, W.-L., Hung, C.-Y., and Lin, I.-C., “Confidence-Based 6D Object Pose Estimation,” IEEE Transactions on Multimedia, Vol. 24, 2022, pp. 3025–3035. 10.1109/TMM.2021.3092149.
- Lee [2018] Lee, T., “Bayesian Attitude Estimation with the Matrix Fisher Distribution on SO(3),” IEEE Transactions on Automatic Control, Vol. 63, No. 10, 2018, pp. 3377–3392.
- Loshchilov and Hutter [2017] Loshchilov, I., and Hutter, F., “Fixing Weight Decay Regularization in Adam,” CoRR, Vol. abs/1711.05101, 2017. URL http://arxiv.longhoe.net/abs/1711.05101.
- Zhang and Scaramuzza [2018] Zhang, Z., and Scaramuzza, D., “A tutorial on quantitative trajectory evaluation for visual (-inertial) odometry,” 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2018, pp. 7244–7251.
![Refer to caption](extracted/5662906/images/plots/C_Test_6_overexposed_plots/overexposed_type3D.png)
![Refer to caption](extracted/5662906/images/plots/C_Test_6_overexposed_plots/overexposed_typeB.png)
![Refer to caption](x11.png)
![Refer to caption](x12.png)
![Refer to caption](x13.png)
![Refer to caption](x14.png)
![Refer to caption](extracted/5662906/images/plots/Test_51_underexposed_plots/underexposed_type3D.png)
![Refer to caption](extracted/5662906/images/plots/Test_51_underexposed_plots/underexposed_typeB.png)
![Refer to caption](x15.png)
![Refer to caption](x16.png)
![Refer to caption](x17.png)
![Refer to caption](x18.png)
![Refer to caption](extracted/5662906/images/plots/E_Test_5_normal_plots/normal_type3D.png)
![Refer to caption](extracted/5662906/images/plots/E_Test_5_normal_plots/normal_typeB.png)
![Refer to caption](x19.png)
![Refer to caption](x20.png)
![Refer to caption](x21.png)
![Refer to caption](x22.png)