CTS: Sim-to-Real Unsupervised Domain Adaptation on 3D Detection

Meiying Zhang and Weiyuan Peng are co-first authors. Corresponding author: Qi Hao (e-mail: [email protected]).
1 Research Institute of Trustworthy Autonomous Systems, Southern University of Science and Technology (SUSTech), China; 2 Department of Computer Science and Engineering, SUSTech; 3 Kuang-Chi Institute of Advanced Technology, China.
This work is supported by the Science and Technology Innovation Committee of Shenzhen City (No: JCYJ20200109141622964), the National Natural Science Foundation of China (62261160654), and the Shenzhen Key Laboratory of Robotics and Computer Vision (ZDSYS20220330160557001).

Meiying Zhang1, Weiyuan Peng1, Guangyao Ding2, Chenyang Lei2, Chunlin Ji3, Qi Hao2
Abstract

Simulation data can be accurately labeled and have been expected to improve the performance of data-driven algorithms, including object detection. However, due to the various domain inconsistencies from simulation to reality (sim-to-real), cross-domain object detection algorithms usually suffer from dramatic performance drops. While numerous unsupervised domain adaptation (UDA) methods have been developed to address cross-domain tasks between real-world datasets, progress in sim-to-real remains limited. This paper presents a novel Complex-to-Simple (CTS) framework to transfer models from labeled simulation (source) to unlabeled reality (target) domains. Based on a two-stage detector, the novelty of this work is threefold: 1. developing fixed-size anchor heads and RoI augmentation to address size bias and feature diversity between two domains, thereby improving the quality of pseudo-labels; 2. developing a novel corner-format representation of aleatoric uncertainty (AU) for the bounding box, to uniformly quantify pseudo-label quality; 3. developing a noise-aware mean teacher domain adaptation method based on AU, as well as object-level and frame-level sampling strategies, to mitigate the impact of noisy labels. Experimental results demonstrate that our proposed approach significantly enhances the sim-to-real domain adaptation capability of 3D object detection models, outperforming state-of-the-art cross-domain algorithms, which are usually developed for real-to-real UDA tasks.

I Introduction

Unsupervised domain adaptation (UDA) research in 3D object detection has yielded outstanding results on various real-world datasets [1, 2, 3, 4, 5, 6, 7, 8]. By contrast, sim-to-real domain adaptation has not made much progress yet. This is primarily because point clouds generated in commonly used simulation environments, such as CARLA [9], have several limitations: 1. they are ideal and densely collected with minimal noise; 2. they show significant statistical disparities from real-world data, as simulated assets are limited in types and sizes; and 3. they offer insufficient diversity in object features. These limitations degrade the sim-to-real domain adaptation performance in 3D object detection.

Figure 1: An illustration of unsupervised sim-to-real domain adaptation guided by pseudo-labels, which aims to minimize domain shifts arising from the simulator (e.g., CARLA [9]) to the real-world datasets (e.g., KITTI [10], Lyft [11] and TinySUSCape [12]).

Generally, UDA methods in 3D object detection can be divided into two main categories: 1. domain-invariant feature learning[1, 2, 3, 4], which learns domain-invariant features by minimizing the distance of feature distribution between the source and target domains; 2. pseudo-label guided methods[5, 6, 7, 8], which enhance transfer performance by generating pseudo-labels in the target domain and further training using these labels. While the former requires specific feature information of two domains, the latter provides a more general and flexible cross-domain framework. However, these methods are not directly applicable to sim-to-real scenarios. A fully functional pseudo-label guided approach to sim-to-real UDA should be able to address the following issues:

  • Generation of High-quality Pseudo-labels. The object size bias and distribution differences between the simulated and real data, as shown in Fig. 1, easily lead to inconsistent regression results (i.e., low-quality pseudo-labels). How to mitigate these biases in detection is important for generating high-quality pseudo-labels.

  • Uniform Quantification of Pseudo-label Quality. The generated pseudo-labels include true positive (TP), false positive (FP), and false negative (FN), as shown in Fig. 1. In general, TP labels have high quality, while FP ones have low quality and FN ones are missing labels. How to uniformly quantify the quality of pseudo-labels is critical for subsequent sampling of high-quality labels.

  • Target Data Sampling with High-quality Pseudo-labels. In most UDA methods guided by pseudo-labels, all pseudo-labels are packaged into the target domain training stage. However, FP and FN pseudo-labels introduce extra noise into this process and degrade model performance. How to smartly sample the target data with high-quality pseudo-labels is crucial to improve cross-domain performance.

To reduce the domain gap arising from object bias, current methods primarily focus on point cloud preprocessing in the source domain. However, these methods can barely reduce domain inconsistencies between two domains [13, 8, 7]. Furthermore, methods that use a complex two-stage UDA design show limited performance in sim-to-real tasks [13, 6]. Meanwhile, various methods have been proposed to achieve high-quality pseudo-label guidance, including multi-output fusion techniques, such as fusing multi-modality outputs for 2D-3D data [12], or fusing multi-pass outputs to maintain “high stochasticity” [14]. The mean teacher scheme can also generate more accurate pseudo-labels in target domains [14, 6, 15]. However, its performance can be much degraded by the data noise in sim-to-real tasks.

This paper proposes a mean teacher-based Complex-to-Simple (CTS) framework, focusing on the second stage design, for sim-to-real UDA, with novel techniques to mitigate object bias, enhance pseudo-label quality, and optimize target domain data sampling for pseudo-label guidance. The main contributions include:

  • Development of localization refinement techniques including RoI random scaling and fixed-size anchor heads to address domain inconsistencies and produce high-quality pseudo-labels.

  • Development of a uniform corner-format measure for aleatoric uncertainty (AU) estimation to evaluate the quality of pseudo-labels accurately.

  • Development of two sampling strategies based on AU in the mean teacher domain adaptation process to choose those point cloud frames and labels with adequate label quality only.

  • Release of the open source code of CTS, alongside the CARLA3D simulated dataset, for further research (the code of CTS and the CARLA3D dataset are available at https://github.com/tendo518/CTS-UDA).

II Related Work

II-A UDA for 3D object detection

Previous works have explored the usage of UDA in 3D object detection [7, 8, 16, 13, 6, 14]. One common challenge of UDA in 3D object detection is the object size bias across domains. Wang et al. [13] propose statistical normalization (SN) to align object sizes utilizing statistical information from target domain data. ST3D [7] and ST3D++ [8] employ data augmentations during source domain training so that the model is exposed to more diverse size information. Besides mitigating object size bias, pseudo-label guided UDA methods emphasize improving the quality of pseudo-labels. JST [12] enhances pseudo-label quality through 2D and 3D joint refinement, aligning outcomes from both modalities. ST3D [7] integrates an additional IoU regression head to assess prediction quality, facilitating selective updates of the pseudo-label pool. Building upon ST3D, ST3D++ [8] further refines pseudo-labels using a quality-aware denoising pipeline. MLC-Net [6] also employs the mean teacher scheme to ensure target domain consistency between teacher and student modules at both point and instance levels, which is similar to our method but involves higher complexity because it applies a UDA design to both stages. Although they achieve significant improvements in real-to-real tasks, existing UDA methods often experience serious performance degradation in sim-to-real tasks. Therefore, based on the analysis of simulation and reality differences, our study concentrates on the quality enhancement, evaluation and selection of pseudo-labels to achieve higher sim-to-real performance.

Figure 2: An illustration of the CTS framework. In the first stage, the model is trained on the source domain with Anchor Head (Sec IV-A1), RoI Augmentation (Sec IV-A2) and corner-format AU modeling (Sec IV-B). In the second stage, the noise-aware mean teacher approach is applied: the student model is alternately supervised with pseudo-labels on the target domain and ground-truth labels on the source domain; the teacher model's weights are updated using the exponential moving average (EMA). Meanwhile, two noise-aware sampling strategies (Sec IV-C) are implemented using the aleatoric uncertainty indicator: frame-level sampling removes noisy frames, while object-level soft-sampling handles noisy labels.

II-B Uncertainty Estimation in 3D Object Detection

Uncertainty can serve as a valuable metric for quantifying both data and model noise within deep neural networks (DNNs) [17, 18, 19, 20, 21]. Uncertainty estimation methods typically address two main sources: epistemic uncertainty (EU) and aleatoric uncertainty (AU). EU is represented by a posterior distribution over model parameters [17, 18, 20], providing insights into the model's uncertainty; AU is represented by a distribution over model outputs [19, 21], reflecting intrinsic data stochasticity. Notably, AU varies with the quality of input data, which makes it suitable for quantifying the noise level of input data. Within the context of 3D detection tasks, several methodologies have integrated AU due to its ability to enhance detection performance [22, 23, 24, 25]. Meyer et al. [22] employ a mixture of Laplace distributions to fit the variances for each predefined regression variable, including box center positions, sizes, and orientation. Feng et al. [24] model AU using multivariate Gaussian distributions, with independent variables representing three distinct sets, i.e., RoI positions, bounding box positions, and orientation. However, few methods have leveraged the AU estimated from 3D detection results for the evaluation of data noise. Besides, existing approaches represent uncertainties using non-uniform variables, which complicates further utilization. Therefore, this study proposes a uniform corner-based representation for bounding boxes with uncertainties, which eases the quality evaluation of the predicted pseudo-labels.

III System Setup

In a standard two-stage detector like PointRCNN [26], the first stage roughly detects objects across a frame and the second stage refines localization. Directly applying PointRCNN to sim-to-real tasks leads to a 60% decrease in Average Precision (AP) at an IoU threshold of 0.7 and a 20% decrease at an IoU threshold of 0.5 (see CARLA3D → KITTI in Table I), which suggests that the object detection and classification ability is largely retained while localization precision is substantially lost. To enhance sim-to-real domain adaptation, this paper focuses on improving the domain adaptation of the second-stage localization network instead of adopting a complex two-stage UDA design, hence the name Complex-to-Simple (CTS).

The complete CTS framework diagram is shown in Fig 2. The CTS framework leverages simulated data from the source domain to develop detection capabilities, followed by model refinement through mean teacher-based domain adaptation in real traffic scenarios of the target domain. The mean teacher scheme involves two branches: the student and teacher models. They share identical architectures and are both initialized with parameters from the source-domain training. However, they undergo different update mechanisms:

Student Model: The student model utilizes augmented RoI points and features as input, supervised with pseudo-labels $\hat{y}_{t}$ on the target domain or ground truth labels $y_{s}$ in the source domain. It is worth noting that the generated pseudo-labels can serve as supervision for the 1st-stage network as well, thus enabling domain adaptation for the 1st-stage network. Thus, the total loss of this network includes: 1. first-stage RoI regression loss $l_{reg1}$; 2. first-stage RoI classification loss $l_{cls1}$; 3. second-stage regression Smooth-L1 loss $l_{reg2}$; 4. second-stage classification loss $l_{cls2}$; 5. second-stage AU-NLL loss $l_{nll}$ specified in Sec IV-B.

Teacher Model: The teacher model handles raw (non-augmented) data and receives no gradient updates. Instead of standard backpropagation, it updates its weights using the exponential moving average (EMA) of the student weights:

$$\theta^{tea}_{t} = \beta \times \theta^{tea}_{t-1} + (1-\beta) \times \theta^{stu}_{t} \tag{1}$$

where $\theta^{stu}_{t}$ is the student weight, $\beta$ is the EMA decay that controls the updating ratio, and $t$ stands for the $t$-th iteration.
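The update in Eq. 1 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `ema_update`, the dictionary-of-floats weight layout, and the decay value used below are hypothetical and are not taken from the CTS code.

```python
def ema_update(teacher, student, beta):
    """One EMA step (Eq. 1): theta_tea <- beta * theta_tea + (1 - beta) * theta_stu.

    `teacher` and `student` map parameter names to floats (a toy stand-in
    for real network weights). The teacher is never touched by gradients;
    it only follows the student through this moving average.
    """
    return {
        name: beta * teacher[name] + (1.0 - beta) * student[name]
        for name in teacher
    }


# Toy usage with a hypothetical decay of 0.9.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
teacher = ema_update(teacher, student, beta=0.9)
```

A decay close to 1 makes the teacher change slowly, which smooths out noisy student updates at the cost of slower adaptation.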

IV Proposed Methods

IV-A Enhancement of Pseudo-Label Quality

IV-A1 Anchor Head (AH)

The second-stage model typically predicts size residuals $\Delta_{whl}$ between proposals from the first stage and final bounding boxes, denoted as $\hat{B}$. This approach avoids regressing the size of bounding boxes entirely from scratch. However, a challenge arises when the first-stage model, trained with biased supervision from source domain labels, exhibits inaccuracy in estimating proposal sizes. Unreliable proposal box sizes can lead to size errors accumulating in the second stage, degrading final bounding box refinement accuracy and the effectiveness of pseudo-labels. Inspired by anchor-based detectors [26], we introduce a fixed-size anchor box $w_{an}, h_{an}, l_{an}$ to replace the proposal, termed the anchor head (AH). With AH, the second-stage network no longer refines proposals, but rather operates on a globally fixed-size 3D anchor. By employing AH in both the source and target domain training, AH ensures consistency in the second-stage network's behavior across domains, thereby facilitating domain adaptation and improving the quality of pseudo-labels.

IV-A2 RoI Random Scaling (RRS) and Augmentation

To enhance the diversity of the object features learned from the simulated data, we introduce RoI Random Scaling (RRS) and Augmentation. In our setup, the second-stage model utilizes localized points (RoI points) and corresponding RoI features from the first-stage model as inputs. Specifically, only the points undergo augmentation, while their features remain unchanged. Let $\widetilde{X} \in \mathbb{R}^{3 \times N}$ denote the decentralized points within a RoI box of dimensions $l, w, h$, and let $q_{l}, q_{w}, q_{h}$ represent random scaling factors. The scaled RoI sizes are derived by multiplying the original dimensions by the scaling factors, resulting in $q_{l} l, q_{w} w, q_{h} h$. Furthermore, to enhance the second-stage model's robustness, we apply augmentations that involve random rotation, flipping, and translation within specified ranges, as described in [27].
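A minimal sketch of the scaling step is given below. Several details are assumptions made for illustration only: the function name `roi_random_scaling`, the sampling range `(0.8, 1.2)`, the choice of a uniform distribution, and the convention that the points are expressed in the RoI's local frame with x, y, z aligned to l, w, h and are scaled by the same factors as the box. The text above specifies only that the box dimensions are multiplied by the random factors.

```python
import random


def roi_random_scaling(points, box_lwh, scale_range=(0.8, 1.2), rng=random):
    """Scale RoI-centered points and the RoI box by random per-axis factors.

    points:   list of (x, y, z) tuples, already centered on the RoI and
              expressed in its local frame (assumption: x~l, y~w, z~h).
    box_lwh:  (l, w, h) of the RoI box.
    Returns the scaled points, the scaled box (q_l*l, q_w*w, q_h*h) and
    the sampled factors (q_l, q_w, q_h).
    """
    q = tuple(rng.uniform(*scale_range) for _ in range(3))
    scaled_points = [(q[0] * x, q[1] * y, q[2] * z) for x, y, z in points]
    scaled_box = tuple(qi * d for qi, d in zip(q, box_lwh))
    return scaled_points, scaled_box, q


# Toy usage: RoI features are left untouched, only geometry is scaled.
pts, box, q = roi_random_scaling([(1.0, 0.5, 0.2)], (4.0, 1.8, 1.5))
```

Scaling the points and the box together keeps points that were inside the RoI inside the scaled RoI, so the augmented sample remains geometrically consistent.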

IV-B 3D Detection with Aleatoric Uncertainty

As noted in [17], Deep Neural Networks (DNNs) are capable of predicting aleatoric uncertainty effectively. Specifically, in the case where the regression $y$ follows a Gaussian distribution with parameters $(\mu, \sigma^{2})$, the following loss function $\mathcal{L}_{nll}$ can be employed for optimization:

$$\mathcal{L}_{nll} = \frac{(y - f_{\mu}(\mathbf{x}, \theta))^{2}}{2 f_{\sigma^{2}}(\mathbf{x}, \theta)} + \frac{1}{2} \log(f_{\sigma^{2}}(\mathbf{x}, \theta)) \tag{2}$$

where $\theta$ is the model parameter, $f_{\mu}$ and $f_{\sigma^{2}}$ represent sub-networks for predicting the mean and the variance.
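To make the behavior of Eq. 2 concrete, the following sketch evaluates the loss for scalar inputs. The function name `gaussian_nll` is hypothetical, and the predicted mean and variance are passed in directly instead of coming from the sub-networks $f_{\mu}$ and $f_{\sigma^{2}}$.

```python
import math


def gaussian_nll(y, mu, var):
    """Eq. 2 for a single scalar target: squared error scaled by the
    predicted variance, plus a log-variance penalty."""
    return (y - mu) ** 2 / (2.0 * var) + 0.5 * math.log(var)


# With a fixed error of 1.0, the loss is smallest when the predicted
# variance matches the squared error (var = 1.0 here).
for var in (0.5, 1.0, 2.0):
    print(var, gaussian_nll(y=1.0, mu=0.0, var=var))
```

The first term lets the network down-weight samples it cannot fit by predicting a large variance, while the logarithmic term prevents it from inflating the variance everywhere. This trade-off is what ties the predicted variance to the noise level of the input.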

When training the regression part of the detector, since the predicted bounding box $y$ is usually encoded with 7 values, i.e., $\mathbf{y}_{b} = \{\mu_{bx}, \mu_{by}, \mu_{bz}, \mu_{bh}, \mu_{bw}, \mu_{bl}, \mu_{b\alpha}\}$ (called box format, BF), the matched variance values are encoded primarily as $\boldsymbol{\sigma}^{2}_{b} = \{\sigma_{bx}^{2}, \sigma_{by}^{2}, \sigma_{bz}^{2}, \sigma_{bh}^{2}, \sigma_{bw}^{2}, \sigma_{bl}^{2}, \sigma_{b\alpha}^{2}\}$, of which each element corresponds to the uncertainty of an element in the bounding box representation. Nevertheless, the BF bounding box regression variables, specifically the centroid position, extents (length, width, height), and orientation, differ in numerical magnitude. These disparities also lead to varying magnitudes of variances across the variables. Naively applying reduction methods (such as maximum or average) to these variances may overlook uncertainties arising from specific components, particularly the orientation, due to its significantly smaller magnitude.

Inspired by the corner loss methodology [28], we introduce a corner-based uncertainty measurement by encoding the bounding box equally with its 8 corner points, as illustrated in Fig. 3. To be specific, during the training process, we first perform the corner transformation on both the model-predicted BF box and the corresponding ground truth:

$$\begin{bmatrix} \mu_{cx}^{i} \\ \mu_{cy}^{i} \\ \mu_{cz}^{i} \end{bmatrix} = R_{z}(\mu_{b\alpha}) \times \begin{bmatrix} \pm\frac{\mu_{bw}}{2} \\ \pm\frac{\mu_{bh}}{2} \\ \pm\frac{\mu_{bl}}{2} \end{bmatrix} + \begin{bmatrix} \mu_{bx} \\ \mu_{by} \\ \mu_{bz} \end{bmatrix} \tag{3}$$

where $R_{z}(\mu_{b\alpha})$ represents the rotation matrix corresponding to the yaw angle $\mu_{b\alpha}$, and $\{p_{c}^{i} = (\mu_{cx}^{i}, \mu_{cy}^{i}, \mu_{cz}^{i})\}_{i=1}^{8}$ denotes the positions of the 8 corners of the transformed CF-encoded box. To simplify the regression, we assume that the coordinates of each corner follow a distinct Gaussian distribution whose three components share the same variance, denoted as:

$$\mathbf{y}_{c}^{i} = \begin{bmatrix} y_{cx}^{i} \\ y_{cy}^{i} \\ y_{cz}^{i} \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_{cx}^{i} \\ \mu_{cy}^{i} \\ \mu_{cz}^{i} \end{bmatrix}, (\sigma_{c}^{i})^{2} \mathbf{I} \right), \quad i = 1 \ldots 8 \tag{4}$$

where $\mathbf{I}$ is the identity matrix. Consequently, we predict 8 (rather than 24) independent variances $(\sigma_{c}^{i})^{2}$ for a CF-encoded box, and the overall NLL loss $\mathcal{L}_{nll}$ and aleatoric uncertainty $u_{box}$ can be easily reduced with:

$$\begin{aligned} \mathcal{L}_{nll}^{i} &= \frac{\left( \overline{\mathbf{y}_{c}^{i}} - \overline{\hat{\mathbf{y}}_{c}^{i}} \right)^{2}}{2 (\sigma_{c}^{i})^{2}} + \frac{1}{2} \log (\sigma_{c}^{i})^{2} \\ \mathcal{L}_{nll} &= \frac{1}{8} \sum_{i=1}^{8} \mathcal{L}_{nll}^{i} \end{aligned} \tag{5}$$

$$u_{box} = \frac{1}{8} \sum_{i=1}^{8} (\sigma_{c}^{i})^{2} \tag{6}$$

As a result, all components contribute equally to the loss and the final uncertainty metric.
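The corner transformation (Eq. 3) and the reductions in Eqs. 5 and 6 can be sketched as follows. The function names are hypothetical, the offset ordering $(w, h, l)$ simply follows Eq. 3 as written, and the numerator of Eq. 5 is interpreted here as the squared Euclidean distance between a ground-truth corner and the predicted corner, which is an assumption about the notation.

```python
import math
from itertools import product


def box_to_corners(x, y, z, h, w, l, alpha):
    """Eq. 3: rotate the 8 half-extent offsets about the z axis by the
    yaw angle alpha, then translate by the box center."""
    c, s = math.cos(alpha), math.sin(alpha)
    corners = []
    for sx, sy, sz in product((-1.0, 1.0), repeat=3):
        ox, oy, oz = sx * w / 2.0, sy * h / 2.0, sz * l / 2.0
        corners.append((c * ox - s * oy + x, s * ox + c * oy + y, oz + z))
    return corners


def corner_nll_and_uncertainty(gt_corners, pred_corners, corner_vars):
    """Eqs. 5 and 6: per-corner NLL averaged over the corners, and the
    box uncertainty as the mean of the predicted corner variances."""
    losses = []
    for gt, pred, var in zip(gt_corners, pred_corners, corner_vars):
        sq_err = sum((g - p) ** 2 for g, p in zip(gt, pred))
        losses.append(sq_err / (2.0 * var) + 0.5 * math.log(var))
    n = len(corner_vars)
    return sum(losses) / n, sum(corner_vars) / n


# Toy usage: a perfect prediction with unit variances.
gt = box_to_corners(0.0, 0.0, 0.0, h=2.0, w=4.0, l=6.0, alpha=0.0)
loss, u_box = corner_nll_and_uncertainty(gt, gt, [1.0] * 8)
```

Because a yaw error displaces the corners by an amount measured in the same units as a position or size error, orientation noise shows up in the corner variances at a comparable scale, which is the motivation for the corner format given above.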

Figure 3: An illustration of two coding schemes of bounding boxes with uncertainties. (a) Encoded with BF (box format), parameterized by $(x, y, z)$, $h$, $w$, $l$ and $\alpha$; (b) Encoded with CF (corner format), with corners numbered 1 to 8, where the red areas stand for the potential ranges, that is, the aleatoric uncertainty.

IV-C Noise-aware Mean Teacher

Aligning transformations on both student-model inputs and teacher-model outputs facilitates the acquisition of domain-invariant representations, thereby aiding adaptation to the target domain using pseudo-labels. However, noisy pseudo-labels can lead to error accumulation. To address this challenge, we leverage the aleatoric uncertainties predicted by the model to annotate data in the target domain and mitigate the impact of noisy data during mean teacher domain adaptation with the following sampling strategies:

IV-C1 Object-Level Soft Sampling

During each iteration, the final second-stage regression loss Lreg2subscript𝐿𝑟𝑒𝑔2L_{reg2}italic_L start_POSTSUBSCRIPT italic_r italic_e italic_g 2 end_POSTSUBSCRIPT is computed using the supervision provided by pseudo-labels assigned to individual objects. Rather than solely depending on these pseudo-labels, the loss is weighted by the inverse of their uncertainty u𝑢uitalic_u, denoted as:

𝐰label={1uu𝐮^tea}subscript𝐰𝑙𝑎𝑏𝑒𝑙conditional-set1𝑢for-all𝑢subscript^𝐮𝑡𝑒𝑎\displaystyle\mathbf{w}_{label}=\{\frac{1}{u}\mid\forall u\in\hat{\mathbf{u}}_% {tea}\}bold_w start_POSTSUBSCRIPT italic_l italic_a italic_b italic_e italic_l end_POSTSUBSCRIPT = { divide start_ARG 1 end_ARG start_ARG italic_u end_ARG ∣ ∀ italic_u ∈ over^ start_ARG bold_u end_ARG start_POSTSUBSCRIPT italic_t italic_e italic_a end_POSTSUBSCRIPT } l2=𝐰label𝐥2subscript𝑙2direct-productsubscript𝐰𝑙𝑎𝑏𝑒𝑙subscript𝐥2\displaystyle l_{2}=\mathbf{w}_{label}\odot\mathbf{l}_{2}italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = bold_w start_POSTSUBSCRIPT italic_l italic_a italic_b italic_e italic_l end_POSTSUBSCRIPT ⊙ bold_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT (7)

where $\mathbf{l}_{2}$ is the second-stage loss produced per object in the whole point cloud frame, and $\odot$ is the element-wise product. Consequently, objects whose pseudo-labels have higher uncertainty are softly filtered out, mitigating the adverse effects of noisy objects.
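As a small illustration of Eq. (7), the sketch below weights each object's second-stage loss by the inverse of its teacher-predicted uncertainty. The function name `soft_sampled_loss` and the `eps` guard against division by zero are hypothetical additions; whether the weights are normalized or clipped in practice is not stated in the text.

```python
def soft_sampled_loss(per_object_loss, teacher_uncertainty, eps=1e-6):
    """Weight each per-object loss by 1/u, following Eq. (7).

    eps is a hypothetical safeguard against division by zero.
    """
    if len(per_object_loss) != len(teacher_uncertainty):
        raise ValueError("need exactly one uncertainty per object")
    weights = [1.0 / max(u, eps) for u in teacher_uncertainty]
    # Element-wise product of the weights and the per-object losses
    return [w * loss for w, loss in zip(weights, per_object_loss)]


losses = [0.5, 0.5, 0.5]
uncertainties = [0.5, 1.0, 2.0]
print(soft_sampled_loss(losses, uncertainties))  # -> [1.0, 0.5, 0.25]
```

With equal raw losses, the object with the largest uncertainty contributes the least, which is the soft filtering effect described above.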

IV-C2 Frame-Level Sampling

Instead of using all target data, the sampling process selects a subset based on frame-level uncertainty. Low-noise target frames are sampled to train the model, enhancing its ability to detect objects in the target domain. By integrating curriculum learning strategies[29], the model refines its pseudo-labels and becomes more confident in its uncertainty estimates after several training epochs. This iterative process gradually includes more frames until, eventually, all target data are sampled. Algorithm 1 details the frame-level sampling procedure.

Algorithm 1 Noise-aware Frame-Level Sampling
Require: $\mathcal{T}$: Unlabeled Target Domain Dataset
Require: $\mathcal{T}_{sub}$: Target Sub-dataset after Sampling
Require: $N_{t}$: Number of samples in $\mathcal{T}$
Require: $N_{sub}$: Amount of data to be selected
Ensure: $\mathcal{D}$: Noise-aware Model
1: while $N_{sub} < N_{t}$ do
2:     $U_{frame} \leftarrow \{\}$, $\mathcal{T}_{sub} \leftarrow \{\}$
3:     for each frame $x^{t}$ in $\mathcal{T}$ do
4:         $\hat{\mathbf{y}}^{t}, \hat{\mathbf{u}}^{t} \leftarrow$ inference $\mathcal{D}$ for $x^{t}$
5:         $\hat{u}^{t} \leftarrow$ mean of $\hat{\mathbf{u}}^{t}$ for all valid objects in $x^{t}$
6:         $U_{frame} \leftarrow$ append $\hat{u}^{t}$ to $U_{frame}$
7:     end for
8:     for $i$ in $\{1, \ldots, N_{sub}\}$ do
9:         $j \leftarrow$ argmin of $U_{frame}$
10:         $\mathcal{T}_{sub} \leftarrow$ append the $j$-th element $x_{j}^{t}$ in $\mathcal{T}$ to $\mathcal{T}_{sub}$
11:         pop the $j$-th element $\hat{u}_{j}^{t}$ from $U_{frame}$
12:     end for
13:     $\mathcal{D} \leftarrow$ fine-tune $\mathcal{D}$ with $\mathcal{T}_{sub}$
14:     $N_{sub}$ += $N_{sub}$
15: end while
16: return $\mathcal{D}$
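The loop in Algorithm 1 can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' implementation: `infer` and `fine_tune` are hypothetical callbacks standing in for the detector, the subset size follows a list of fractions (the default mirrors the 30%, 50%, 70%, and 100% schedule given in the implementation details) rather than the `$N_{sub}$ += $N_{sub}$` update, and frames without valid objects are given infinite uncertainty so that they are selected last.

```python
def select_low_noise_frames(frame_uncertainty, n_sub):
    """Return the indices of the n_sub frames with the lowest uncertainty.

    Sorting once is equivalent to repeatedly taking the argmin and
    removing it, while keeping indices aligned with the dataset.
    """
    order = sorted(range(len(frame_uncertainty)),
                   key=lambda i: frame_uncertainty[i])
    return order[:n_sub]


def noise_aware_training(frames, infer, fine_tune,
                         fractions=(0.3, 0.5, 0.7, 1.0)):
    """Curriculum loop: re-estimate uncertainty, keep clean frames, fine-tune.

    infer(frame) returns a list of per-object uncertainties and
    fine_tune(subset) updates the model; both are hypothetical callbacks.
    """
    for fraction in fractions:
        frame_u = []
        for frame in frames:
            object_u = infer(frame)
            # Assumption: a frame without valid objects is ranked last
            mean_u = sum(object_u) / len(object_u) if object_u else float("inf")
            frame_u.append(mean_u)
        n_sub = max(1, round(fraction * len(frames)))
        subset = [frames[i] for i in select_low_noise_frames(frame_u, n_sub)]
        fine_tune(subset)


# Toy usage: frame k has uncertainty k / 10 + 0.1, so low indices are "clean"
sizes = []
noise_aware_training(list(range(10)),
                     infer=lambda f: [f / 10 + 0.1],
                     fine_tune=lambda subset: sizes.append(len(subset)))
print(sizes)  # -> [3, 5, 7, 10]
```

Recomputing the uncertainties at the start of every round matters: after each fine-tuning step the model's estimates change, so the ranking of frames is refreshed before the subset grows.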

V Experiments

| Task | Method | $AP_{BEV}@0.7$ Easy | $AP_{BEV}@0.7$ Moderate | $AP_{BEV}@0.7$ Hard | $AP_{3D}@0.7$ Easy | $AP_{3D}@0.7$ Moderate | $AP_{3D}@0.7$ Hard |
|---|---|---|---|---|---|---|---|
| CARLA3D → Lyft | Source Only | 66.70 | 54.35 | 51.76 | 18.82 | 13.85 | 13.64 |
| | SN[13] | 66.92 | 53.31 | 50.52 | 23.05 | 16.79 | 15.99 |
| | MLC-Net[6] | 77.95 | 64.46 | 62.13 | 53.97 | 40.04 | 37.47 |
| | ST3D++[8] | 75.57 | 61.68 | 57.49 | 51.02 | 37.24 | 35.41 |
| | Ours | 81.66 | 67.86 | 65.17 | 61.93 | 45.87 | 43.87 |
| | Oracle | 90.92 | 83.97 | 81.70 | 80.06 | 66.05 | 64.01 |
| CARLA3D → KITTI | Source Only | 27.45 | 20.55 | 17.51 | 5.67 | 4.06 | 3.23 |
| | SN[13] | 31.21 | 30.23 | 28.18 | 9.37 | 9.15 | 7.63 |
| | MLC-Net[6] | 70.45 | 56.66 | 49.41 | 43.02 | 32.68 | 27.39 |
| | ST3D++[8] | 64.50 | 54.91 | 49.75 | 34.34 | 27.22 | 23.99 |
| | Ours | 78.92 | 64.17 | 57.37 | 58.41 | 45.28 | 39.61 |
| | Oracle | 93.18 | 83.26 | 80.20 | 86.02 | 71.70 | 66.86 |
| CARLA3D → TinySUScape | Source Only | 18.02 | 16.69 | N/A | 4.59 | 3.83 | N/A |
| | SN[13] | 27.45 | 14.96 | N/A | 1.42 | 1.36 | N/A |
| | MLC-Net[6] | 19.64 | 18.81 | N/A | 8.27 | 7.59 | N/A |
| | ST3D++[8] | 40.86 | 38.17 | N/A | 26.09 | 23.86 | N/A |
| | Ours | 42.45 | 38.62 | N/A | 31.47 | 28.02 | N/A |
TABLE I: Comparison results of three different sim-to-real domain adaptation tasks. We report $AP_{BEV}$ and $AP_{3D}$ of the car category at IoU = 0.7 for different difficulty levels. As TinySUScape[12] does not provide labels with the occlusion level, Hard is marked as Not Available (N/A).

V-A Experimental Setup

V-A1 Datasets

We conduct supervised training in a simulated source domain, namely CARLA3D, acquired within the CARLA simulator [9]. All samples are taken from eight built-in scenarios in CARLA to ensure data diversity. The ego-vehicle is positioned randomly, collecting about 100 samples per scenario, each comprising eight frames at 2 Hz. Out of the eight frames per sample, five are randomly chosen for the training set, yielding 3,990 frames with a total of 25,192 objects. Further details of the CARLA3D dataset are outlined in Table II. The target domains are KITTI [10], Lyft [11], and TinySUScape, which was used in [12]. During the training phase, only the samples from these datasets are used, without labels; during the testing phase, the samples are used together with their corresponding labels. A summary of these datasets is presented in Table III.
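The split sizes quoted above can be cross-checked against the per-scenario sample counts (the Times column of Table II). The short calculation below is only a consistency check; the assumption that the three frames per sample not chosen for training form the test set is inferred from the train/test sizes in Table III rather than stated explicitly.

```python
# Samples collected per CARLA scenario (Times column of Table II)
samples_per_town = {
    "Town01": 100, "Town02": 100, "Town03": 100, "Town04": 99,
    "Town05": 100, "Town06": 100, "Town07": 100, "Town10": 99,
}
FRAMES_PER_SAMPLE = 8        # each sample comprises eight frames
TRAIN_FRAMES_PER_SAMPLE = 5  # five of the eight go to the training set

total_samples = sum(samples_per_town.values())
total_frames = total_samples * FRAMES_PER_SAMPLE
train_frames = total_samples * TRAIN_FRAMES_PER_SAMPLE
# Assumption: the remaining frames of every sample are held out for testing
test_frames = total_frames - train_frames

print(total_samples, total_frames, train_frames, test_frames)
# -> 798 6384 3990 2394
```

The result matches the 6,384 total frames in Table II and the 3,990 / 2,394 train/test split in Table III.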

V-A2 Evaluation Metric

In our 3D object detection evaluation, following [13], we use the official KITTI evaluation metric from [10] for the Car category. We report two average precision (AP) metrics: $AP_{BEV}$, based on bird's-eye view IoUs, and $AP_{3D}$, based on 3D IoUs.

V-A3 Implementation Details

Our proposed method is implemented based on OpenPCDet[27], using PointRCNN[30] as our baseline detector. All experiments were conducted on an Ubuntu Linux server equipped with NVIDIA TITAN V GPUs (12 GiB each). The proposed model is first trained on CARLA3D for 50 epochs, with the learning rate, weight decay, and momentum set to 0.005, 0.0001, and 0.9, respectively. For the anchor head configuration, the anchor dimensions are globally set to $l_{an}=3.9$, $h_{an}=1.6$, and $w_{an}=1.56$. These values are the average dimensions of all labeled car objects in the KITTI dataset, which we deem a reasonable reference size. RoI augmentation consists of random scaling by a factor between $0.7$ and $1.3$, translation by up to $\pm 0.5$ m, rotation by an angle between $-\frac{\pi}{4}$ and $\frac{\pi}{4}$, and flipping with a probability of 50%. For mean teacher domain adaptation, the model that achieves the highest accuracy in the source domain training phase is selected, and both the teacher and student models are initialized from it. The Exponential Moving Average (EMA) factor ($\beta$) is set to 0.999, and training lasts for 30 epochs on the Lyft dataset and 50 epochs on the KITTI and TinySUScape datasets. To ensure stability, we train the student model by alternating between source (with ground-truth labels) and target (with pseudo-labels) domain data.
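Two of the settings above lend themselves to a short sketch: drawing RoI augmentation parameters from the stated ranges, and the EMA update of the teacher weights with $\beta = 0.999$. The uniform sampling distribution, the independent per-axis translation, the flat-list representation of model weights, and the function names are assumptions made for illustration only.

```python
import math
import random


def sample_roi_augmentation(rng):
    """Draw one set of RoI augmentation parameters from the stated ranges.

    Assumption: every parameter is sampled uniformly and independently.
    """
    return {
        "scale": rng.uniform(0.7, 1.3),
        "translation": tuple(rng.uniform(-0.5, 0.5) for _ in range(3)),  # m
        "rotation": rng.uniform(-math.pi / 4, math.pi / 4),  # radians
        "flip": rng.random() < 0.5,
    }


def ema_update(teacher, student, beta=0.999):
    """Mean teacher update: teacher <- beta * teacher + (1 - beta) * student.

    Weights are plain lists of floats here to keep the sketch self-contained.
    """
    return [beta * t + (1.0 - beta) * s for t, s in zip(teacher, student)]


rng = random.Random(0)
print(sample_roi_augmentation(rng))
print(ema_update([1.0, 0.0], [0.0, 1.0]))
```

With $\beta = 0.999$ the teacher moves only 0.1% of the way toward the student at each step, which is what keeps its pseudo-labels stable while the student trains on noisy targets.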
Regarding noise-aware training settings, the uncertainty pool is refreshed at the 1st, 6th, 16th, and 21st epochs for the Lyft dataset and at the 1st, 11th, 21st, and 31st epochs for the KITTI and TinySUScape datasets. In each of these epochs, sub-datasets are resampled at percentages of 30%, 50%, 70%, and 100% of the total dataset size for subsequent training iterations.
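The refresh schedule can be written down as a simple lookup. The sketch below assumes that the four refresh epochs pair one-to-one, in order, with the four sampling percentages, and that a fraction stays in effect until the next refresh; the names `REFRESH_SCHEDULE` and `sampled_fraction` are hypothetical.

```python
# Refresh epoch (1-based) -> fraction of target frames sampled from then on.
# Assumption: epochs and percentages are paired in the order they are listed.
REFRESH_SCHEDULE = {
    "Lyft": {1: 0.3, 6: 0.5, 16: 0.7, 21: 1.0},
    "KITTI": {1: 0.3, 11: 0.5, 21: 0.7, 31: 1.0},
    "TinySUScape": {1: 0.3, 11: 0.5, 21: 0.7, 31: 1.0},
}


def sampled_fraction(dataset, epoch):
    """Return the fraction of target frames in use at the given epoch."""
    schedule = REFRESH_SCHEDULE[dataset]
    active = [start for start in schedule if start <= epoch]
    if not active:
        raise ValueError("epoch must be at least 1")
    # The most recent refresh determines the current subset size
    return schedule[max(active)]


print(sampled_fraction("Lyft", 5), sampled_fraction("Lyft", 6))  # -> 0.3 0.5
```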

V-B Main Results

Our CTS framework was compared with the following methods: 1. SN[13]: a domain adaptation method that has been considered effective on various datasets; 2. MLC-Net[6]: a domain adaptation method that is also based on the mean teacher and is therefore similar to ours in that part; 3. ST3D++[8]: a recent self-training based method that achieved state-of-the-art performance in real-to-real (e.g., nuScenes[31] → KITTI[10]) domain adaptation tasks.

In addition, we provide two reference bounds: 1. Source Only: the model is trained in a supervised manner on the source domain only and is applied directly to the target domain without any domain adaptation method, which serves as a lower bound; 2. Oracle: a fully supervised model trained on the target (real) domain with ground-truth labels, which serves as an upper bound.

| Scenario | Frames | Easy | Moderate | Hard | Times |
|---|---|---|---|---|---|
| Town01 | 800 | 309 | 798 | 1572 | 100 |
| Town02 | 800 | 577 | 898 | 1983 | 100 |
| Town03 | 800 | 581 | 1574 | 3471 | 100 |
| Town04 | 792 | 555 | 3167 | 5978 | 99 |
| Town05 | 800 | 695 | 1727 | 3855 | 100 |
| Town06 | 800 | 229 | 445 | 2495 | 100 |
| Town07 | 800 | 251 | 758 | 1967 | 100 |
| Town10 | 792 | 823 | 1648 | 2998 | 99 |
| Total | 6384 | 4020 | 11015 | 24319 | 798 |

TABLE II: Overview of the CARLA3D dataset. Frames is the number of point cloud frames sampled in the scenario; Easy, Moderate, and Hard are the numbers of objects at each difficulty level in the scenario; Times is the number of samples collected in the scenario.
| Dataset | Size (Train/Test) | LiDAR Beams | Points Per Frame |
|---|---|---|---|
| CARLA3D | 3990 / 2394 | $1\times 64$ | 286.2K |
| KITTI[10] | 3712 / 3769 | $1\times 64$ | 118.7K |
| Lyft[11] | 12017 / 2891 | $1\times 40$ or $64$ | 72.3K |
| TinySUScape[12] | 2579 / 965 | $1\times 128$ | 230.4K |

TABLE III: A summary of datasets. The Size(Train/Test) refers to the number of samples used in training and testing.
AH Aug2 MT NLL FL-NA OL-NA $mAP_{3D}$
15.44
✓ 34.63
✓ ✓ 43.51
✓ ✓ CF 43.83
✓ ✓ ✓ 45.67
✓ ✓ ✓ CF 45.91
✓ ✓ ✓ CF ✓ 46.47
✓ ✓ ✓ CF ✓ 48.67
✓ ✓ ✓ BF ✓ ✓ 49.37
✓ ✓ ✓ CF ✓ ✓ 50.56
TABLE IV: Ablation study results on CARLA3D → Lyft. AH: anchor head scheme proposed in Sec IV-A1; Aug2: second-stage augmentation in Sec IV-A2; MT: mean teacher based domain adaptation; NLL: usage of NLL loss for aleatoric uncertainty; CF and BF refer to corner-format and box-format encoding respectively in Sec IV-B; FL-NA and OL-NA: frame-level and object-level noise-aware sampling strategies respectively in Sec IV-C. The $mAP_{3D}$ metric is obtained by averaging over the three difficulty levels.
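As a quick check of how $mAP_{3D}$ relates to the per-difficulty numbers, the snippet below averages the Easy, Moderate, and Hard $AP_{3D}$ values reported for CARLA3D → Lyft in Table I. The helper name `mean_ap` is hypothetical; the input values are copied from Table I.

```python
# AP_3D (Easy, Moderate, Hard) on CARLA3D -> Lyft, copied from Table I
ap3d_lyft = {
    "Source Only": (18.82, 13.85, 13.64),
    "Ours": (61.93, 45.87, 43.87),
}


def mean_ap(values):
    """Average AP over the difficulty levels, rounded to two decimals."""
    return round(sum(values) / len(values), 2)


for method, values in ap3d_lyft.items():
    print(method, mean_ap(values))
# -> Source Only 15.44
# -> Ours 50.56
```

These averages reproduce the first and last rows of Table IV, which correspond to the plain source-only baseline and the full method, respectively.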

The results obtained using different UDA methods are summarized in Table I. Our CTS method surpasses all others in sim-to-real detection tasks. Specifically, compared to the Source Only model, our approach improves $AP_{BEV}$ by approximately $13\%-51\%$ and $AP_{3D}$ by around $24\%-53\%$ in absolute terms, depending on the task and difficulty level. However, due to the significant domain shift between the simulator and reality, our CTS method still exhibits a noticeable gap compared to the supervised Oracle. In contrast, the SN method, which generally performs well across real-world domains, struggles in sim-to-real cross-domain tasks and can even degrade performance, as in the CARLA3D → TinySUScape scenario.

Figure 4: An illustration of the car size distributions of the Lyft[11] and CARLA3D datasets under different processing methods, i.e., SN[13] and Random Scaling.

V-C Ablation Study

To further demonstrate the effectiveness of the individual components in our proposed method, we conducted extensive ablation experiments on the CARLA3D → Lyft task.

V-C1 Benefits of Anchor Head

Incorporating the anchor head (AH) into the second-stage detector effectively reduces regression complexity while enhancing cross-domain robustness. As shown in Table IV, compared to the original setup, the AH scheme yields over $19\%$ improvement, highlighting its effectiveness in cross-domain tasks even with a simple anchor size replacement.

V-C2 Benefits of RRS and Second-stage Augmentation

Compared to SN's approach [13], our RoI Random Scaling (RRS) method encourages the sizes of processed objects to follow a unimodal distribution similar to real-world data, rather than only aligning with the target size statistics, which still leaves a multi-modal distribution, as illustrated in Figure 4. Furthermore, integrating RRS into our second-stage augmentation (Aug2) resulted in a performance improvement of approximately $9\%$, as demonstrated in Table IV. These augmentation techniques enhance data diversity at the object level, enabling the model to learn from more varied examples.

V-C3 Benefits of Corner-Format AU

In contrast to BF, CF encoding uniformly distributes the localization uncertainty of the object across each corner component without requiring additional operations. Table IV demonstrates that using BF and CF representations for noise-aware sampling improves performance by $3.7\%$ and $4.9\%$, respectively. This suggests that CF is more effective in identifying reliable pseudo-labels. Employing the CF encoding scheme, we investigate the aleatoric uncertainties (AUs) associated with predicted objects, considering their Intersection over Union (IoU) with ground truths and their ego-to-object distance, as depicted in Figure 5. We observe that AU values decrease as IoU increases and increase with greater ego-to-object distance. Furthermore, Figure 6 shows examples where sparse and corrupted point clouds lead to elevated AU. These findings indicate that predicted AUs are effective for evaluating pseudo-label noise and useful as a reliability metric for pseudo-labels.

V-C4 Benefits of Noise Awareness in Mean Teacher

As mentioned in Sec IV-C, two noise-aware sampling strategies are used to minimize the adverse impact of noisy pseudo-labels generated during mean teacher domain adaptation. With both the frame-level noise-aware (FL-NA) and object-level noise-aware (OL-NA) strategies, performance improves by $4.65\%$.

Additionally, using the NLL loss function alone has been shown to bring improvement [24]. Table IV also indicates a minor increase from 43.51% to 43.83% in source-only training with NLL. However, while adding the NLL loss and extra uncertainty layers yields only a $0.3\%$ improvement, employing both FL-NA and OL-NA yields a further significant improvement of $4.3\%$. This demonstrates that the main performance gain arises from the noise-aware sampling strategies rather than from merely replacing the loss function.

Figure 5: An illustration of the correlation between AU value and IoU/ego-to-object distance for the target dataset. Blue points denote the AU values of detected objects; the red line represents the means of the AU values.
(a) Easy, AU $=0.007$; (b) Moderate, AU $=0.018$; (c) Moderate, AU $=0.016$; (d) Hard, AU $=0.060$
Figure 6: Examples of different levels of difficulties in 3D object detection. The blue boxes represent the ground truth; the green boxes represent the predicted results. The points in different colors at the box corners represent the 8 AU value components, whose mean is the final AU value of the entire object.

V-D Limitations

Although our proposed model shows enhanced adaptation performance within the target domain via multiple schemes, sim-to-real UDA still lags behind real-to-real methods due to limitations inherent in simulators. The restricted vehicle assets in simulators like CARLA fail to represent the diverse range of real-world vehicles. Additionally, simulators struggle to replicate complex real-world scenarios, including dynamic traffic patterns and diverse urban landscapes (e.g., different weather conditions), thus limiting their effectiveness in providing realistic training data for domain adaptation.

VI Conclusion

This paper has presented a CTS framework for unsupervised domain adaptation (UDA) in 3D object detection from simulation to real-world domains. The proposed RoI random scaling and augmentation enrich the diversity of simulation data, and the fixed-size anchor head mitigates object size deviation across domains; together they enhance the quality of pseudo-labels. The proposed aleatoric uncertainty (AU) estimation, based on a uniform corner-format representation of bounding boxes, integrates awareness of pseudo-label noise into the mean teacher domain adaptation process, achieving high-quality pseudo-label sampling. Experimental results on the CARLA3D, KITTI, Lyft, and TinySUScape datasets demonstrate significant improvements over existing methods in various sim-to-real UDA tasks, including a 5%-17% improvement in $AP_{3D}$ and a 2%-10% improvement in $AP_{BEV}$. Our future work will focus on extending our method to encompass both sim-to-real and real-to-real UDA scenarios.

References

  • [1] Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by backpropagation,” in International Conference on Machine Learning.   PMLR, 2015, pp. 1180–1189.
  • [2] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural networks,” The journal of machine learning research, vol. 17, no. 1, pp. 2096–2030, 2016.
  • [3] M. Long, Y. Cao, J. Wang, and M. Jordan, “Learning transferable features with deep adaptation networks,” in International Conference on Machine Learning.   PMLR, 2015, pp. 97–105.
  • [4] M. Chen, S. Zhao, H. Liu, and D. Cai, “Adversarial-learned loss for domain adaptation,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 04, 2020, pp. 3521–3528.
  • [5] A. Tarvainen and H. Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” Advances in neural information processing systems, vol. 30, 2017.
  • [6] Z. Luo, Z. Cai, C. Zhou, G. Zhang, H. Zhao, S. Yi, S. Lu, H. Li, S. Zhang, and Z. Liu, “Unsupervised domain adaptive 3d detection with multi-level consistency,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 8866–8875.
  • [7] J. Yang, S. Shi, Z. Wang, H. Li, and X. Qi, “St3d: Self-training for unsupervised domain adaptation on 3d object detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 10 368–10 378.
  • [8] J. Yang, S. Shi, Z. Wang, H. Li, and X. Qi, “St3d++: Denoised self-training for unsupervised domain adaptation on 3d object detection,” IEEE transactions on pattern analysis and machine intelligence, vol. 45, no. 5, pp. 6354–6371, 2022.
  • [9] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “CARLA: An open urban driving simulator,” in Conference on Robot Learning.   PMLR, 2017, pp. 1–16.
  • [10] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition.   IEEE, 2012, pp. 3354–3361.
  • [11] R. Kesten, M. Usman, J. Houston, T. Pandya, K. Nadhamuni, A. Ferreira, M. Yuan, B. Low, A. Jain, P. Ondruska, S. Omari, S. Shah, A. Kulkarni, A. Kazakova, C. Tao, L. Platinsky, W. Jiang, and V. Shet, “Lyft level 5 perception dataset 2020,” 2019.
  • [12] G. Ding, M. Zhang, E. Li, and Q. Hao, “Jst: Joint self-training for unsupervised domain adaptation on 2d&3d object detection,” in 2022 International Conference on Robotics and Automation (ICRA).   IEEE, 2022, pp. 477–483.
  • [13] Y. Wang, X. Chen, Y. You, L. E. Li, B. Hariharan, M. Campbell, K. Q. Weinberger, and W.-L. Chao, “Train in germany, test in the usa: Making 3d object detectors generalize,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11 713–11 723.
  • [14] D. Hegde, V. Sindagi, V. Kilic, A. B. Cooper, M. Foster, and V. Patel, “Uncertainty-aware mean teacher for source-free unsupervised domain adaptive 3d object detection,” arXiv preprint arXiv:2109.14651, 2021.
  • [15] J. Deng, W. Li, Y. Chen, and L. Duan, “Unbiased Mean Teacher for Cross-Domain Object Detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4091–4101.
  • [16] W. Zhang, W. Li, and D. Xu, “SRDAN: Scale-aware and range-aware domain adaptation network for cross-dataset 3D object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6769–6779.
  • [17] A. Kendall and Y. Gal, “What uncertainties do we need in bayesian deep learning for computer vision?” Advances in neural information processing systems, vol. 30, 2017.
  • [18] Y. Gal and Z. Ghahramani, “Dropout as a bayesian approximation: Representing model uncertainty in deep learning,” in International Conference on Machine Learning.   PMLR, 2016, pp. 1050–1059.
  • [19] A. Kendall, Y. Gal, and R. Cipolla, “Multi-task learning using uncertainty to weigh losses for scene geometry and semantics,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7482–7491.
  • [20] N. Tagasovska and D. Lopez-Paz, “Single-model uncertainties for deep learning,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [21] D. J. MacKay, “A practical Bayesian framework for backpropagation networks,” Neural computation, vol. 4, no. 3, pp. 448–472, 1992.
  • [22] G. P. Meyer, A. Laddha, E. Kee, C. Vallespi-Gonzalez, and C. K. Wellington, “Lasernet: An efficient probabilistic 3d object detector for autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 12 677–12 686.
  • [23] G. P. Meyer and N. Thakurdesai, “Learning an uncertainty-aware object detector for autonomous driving,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2020, pp. 10 521–10 527.
  • [24] D. Feng, L. Rosenbaum, F. Timm, and K. Dietmayer, “Leveraging heteroscedastic aleatoric uncertainties for robust real-time lidar 3d object detection,” in 2019 IEEE Intelligent Vehicles Symposium (IV).   IEEE, 2019, pp. 1280–1287.
  • [25] L. Ding, D. Li, B. Liu, W. Lan, B. Bai, Q. Hao, W. Cao, and K. Pei, “Capture uncertainties in deep neural networks for safe operation of autonomous driving vehicles,” in 2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom).   IEEE, 2021, pp. 826–835.
  • [26] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, “Pointpillars: Fast encoders for object detection from point clouds,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 12 697–12 705.
  • [27] O. D. Team, “OpenPCDet: An Open-source Toolbox for 3D Object Detection from Point Clouds,” 2020. [Online]. Available: https://github.com/open-mmlab/OpenPCDet
  • [28] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, “Multi-view 3d object detection network for autonomous driving,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1907–1915.
  • [29] X. Wang, Y. Chen, and W. Zhu, “A survey on curriculum learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, pp. 4555–4576, 2021.
  • [30] S. Shi, X. Wang, and H. Li, “Pointrcnn: 3d object proposal generation and detection from point cloud,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 770–779.
  • [31] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “Nuscenes: A multimodal dataset for autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11 621–11 631.