License: arXiv.org perpetual non-exclusive license
arXiv:2303.12787v3 [cs.CV] 17 Dec 2023

EPro-PnP: Generalized End-to-End Probabilistic
Perspective-n-Points for Monocular
Object Pose Estimation

Hansheng Chen, Wei Tian, Pichao Wang, Fan Wang, Lu Xiong, and Hao Li

H. Chen is with the Department of Computer Science, Stanford University, Stanford, CA 94305 USA. Work done previously at Tongji University, Shanghai 201804, China. E-mail: [email protected]
W. Tian and L. Xiong are with the School of Automotive Studies, Tongji University, Shanghai 201804, China. E-mail: {tian_wei, xiong_lu}@tongji.edu.cn
P. Wang is with Amazon.com Inc, Seattle, WA 98109 USA. Work done previously at Alibaba Group (U.S.), Bellevue, WA 98004 USA. E-mail: [email protected]
F. Wang is with Alibaba Group (U.S.), Sunnyvale, CA 94085 USA. E-mail: [email protected]
H. Li is with Artificial Intelligence Innovation and Incubation Institute, Fudan University, Shanghai 200433, China. Work done previously at Alibaba Group, Hangzhou 311121, China. E-mail: [email protected]
(Corresponding author: Wei Tian.)

Abstract

Locating 3D objects from a single RGB image via Perspective-n-Point (PnP) is a long-standing problem in computer vision. Driven by end-to-end deep learning, recent studies suggest interpreting PnP as a differentiable layer, allowing for partial learning of 2D-3D point correspondences by backpropagating the gradients of pose loss. Yet, learning the entire correspondences from scratch is highly challenging, particularly for ambiguous pose solutions, where the globally optimal pose is theoretically non-differentiable w.r.t. the points. In this paper, we propose the EPro-PnP, a probabilistic PnP layer for general end-to-end pose estimation, which outputs a distribution of pose with differentiable probability density on the SE(3) manifold. The 2D-3D coordinates and corresponding weights are treated as intermediate variables learned by minimizing the KL divergence between the predicted and target pose distributions. The underlying principle generalizes previous approaches and resembles the attention mechanism. EPro-PnP can enhance existing correspondence networks, closing the gap between PnP-based methods and the task-specific leaders on the LineMOD 6DoF pose estimation benchmark. Furthermore, EPro-PnP helps to explore new possibilities of network design, as we demonstrate a novel deformable correspondence network with state-of-the-art pose accuracy on the nuScenes 3D object detection benchmark. Our code is available at https://github.com/tjiiv-cprg/EPro-PnP-v2.

Index Terms:
Pose estimation, imaging geometry, probabilistic deep learning, 3D vision, autonomous vehicles

1 Introduction

Estimating the pose (i.e., position and orientation) of 3D objects from a single RGB image is an important problem in computer vision. This field is often subdivided into specific tasks, e.g., 6DoF pose estimation for robot manipulation and 3D object detection for autonomous driving. Although they share the same fundamentals of pose estimation, the different nature of the data leads to a biased choice of methods. Top performers [1, 2, 3] on the 3D object detection benchmarks [4, 5] fall into the category of direct 4DoF pose prediction, leveraging the advances in end-to-end deep learning. On the other hand, the 6DoF pose estimation benchmark [6] is largely dominated by geometry-based methods [7, 8], which exploit the provided 3D object models and achieve a stable generalization performance. However, it is quite challenging to bring together the best of both worlds, i.e., training a geometric model to learn the object pose in an end-to-end manner.

Figure 1: An overview of the proposed framework. The predicted 2D-3D correspondences formulate a PnP problem. Instead of solving the optimal pose, EPro-PnP outputs a pose distribution, allowing the gradients of the KL loss w.r.t. the probability density to be backpropagated to train the correspondence network.

There have been recent proposals for an end-to-end framework based on the Perspective-n-Point (PnP) approach [9, 10, 11, 12]. The PnP algorithm itself solves the pose from a set of 3D points in object space and their corresponding 2D projections in image space, leaving the problem of constructing these correspondences. Vanilla correspondence learning [13, 14, 15, 16, 17, 8, 18, 19, 20] leverages the geometric prior to build surrogate loss functions, forcing the network to learn a set of pre-defined correspondences. End-to-end correspondence learning [9, 10, 11, 12] interprets the PnP solver as a differentiable layer and employs a pose-driven loss function, so that the gradient of the pose error can be backpropagated to the 2D-3D correspondences.

However, existing work on differentiable PnP learns only a portion of the correspondences (either 2D coordinates [12], 3D coordinates [9, 10] or corresponding weights [11]), assuming other components are given a priori. This raises an important question: why not learn the entire set of points and weights altogether in an end-to-end manner? Our intuition is: under such relaxed settings, the PnP problem could better describe pose ambiguity [21, 22], in the cases of symmetric objects [17] or uncertain observations. However, in the presence of ambiguity, the PnP problem has multiple local minima. Existing methods try to differentiate a point estimate of the pose (a single local minimum), which is unstable in general, while the global optimum is neither easy to find nor differentiable.

To overcome the above limitations, we propose a generalized end-to-end probabilistic PnP (EPro-PnP) module that enables learning the weighted 2D-3D point correspondences entirely from scratch. The main idea is straightforward: a point estimate of pose is non-differentiable, but the probability density of pose is apparently differentiable, just like categorical classification scores. As shown in Figure 1, we interpret the output of PnP as a probabilistic distribution parameterized by the learnable 2D-3D correspondences. During training, the Kullback-Leibler (KL) divergence between the predicted and target pose distributions is minimized as the loss function, which can be efficiently calculated using the Adaptive Multiple Importance Sampling [23] algorithm.

As a general approach, EPro-PnP inherently unifies existing correspondence learning techniques (Section 3.1). Moreover, just like the attention mechanism [24], the corresponding weights can be trained to automatically focus on important point pairs, allowing the networks to be designed with inspiration from attention-related work [25, 26, 27].

To summarize, our main contributions are as follows:

  • We propose the EPro-PnP, a probabilistic PnP layer for general end-to-end pose estimation with learnable 2D-3D correspondences, which can cope with pose ambiguity.

  • We demonstrate that EPro-PnP can easily reach top-tier performance for 6DoF pose estimation by simply inserting it into the CDPN [18] framework.

  • We demonstrate the flexibility of EPro-PnP by proposing deformable correspondence learning for accurate 3D object detection, where the entire 2D-3D correspondences are learned from scratch.

This extended paper presents new experiments with improved results and rigorous ablation studies. For 6DoF pose estimation on LineMOD, feeding 2D box size to the model has improved uncertainty handling, boosting pose accuracy to outperform RePOSE [7]. New ablation studies reveal each loss’s contribution and show that EPro-PnP can achieve competitive performance even without 3D models (B2 in Table II). For 3D object detection on nuScenes, EPro-PnP with an enhanced network now leads the field of single-frame image-based detectors, and the ablation studies highlight the importance of the Monte Carlo pose loss in handling ambiguous poses. Furthermore, we have also expanded our discussion on the derivative regularization loss.

2 Related Work

2.1 Geometry-Based Object Pose Estimation

In general, geometry-based methods exploit the points, edges or other types of representation that are subject to the projection constraints under the perspective camera. Then, the pose can be solved by optimization. A large body of work utilizes point representation, which can be categorized into sparse keypoints and dense correspondences. BB8 [15] and RTM3D [19] locate the corners of the 3D bounding box as keypoints, while PVNet [13] defines the keypoints by farthest point sampling and Deep MANTA [20] by handcrafted templates. On the other hand, dense correspondence methods [17, 18, 8, 28, 16] predict pixel-wise 3D coordinates within a cropped 2D region. Most existing geometry-based methods follow a two-stage strategy, where the intermediate representations (i.e., 2D-3D correspondences) are learned with a surrogate loss function, which is sub-optimal compared to end-to-end learning.

2.2 End-to-End Correspondence Learning

To mitigate the limitation of surrogate correspondence learning, end-to-end approaches have been proposed to backpropagate the gradient from pose to intermediate representation. Using implicit differentiation w.r.t. the optimal pose or its approximations, Brachmann and Rother [10] propose a dense correspondence network where 3D points are learnable, BPnP [12] predicts 2D keypoint locations, and BlindPnP [11] learns the corresponding weight matrix given a set of unordered 2D/3D points. The above methods are all coupled with surrogate regularization loss, otherwise convergence is not guaranteed due to numerical instability [10] and the non-differentiable nature of the optimal pose. Under the probabilistic framework, these methods can be regarded as a Laplace approximation approach (Section 3.1).

Beyond point correspondence, RePOSE [7] proposes a feature-metric correspondence network trained by backpropagating through the PnP solver (e.g., Levenberg-Marquardt). This approach is insufficient under pose ambiguity, although it can be leveraged as a local regularization technique in our framework (Section 3.4).

2.3 Probabilistic Deep Learning

Probabilistic methods account for uncertainty in the model and the data, known respectively as epistemic and aleatoric uncertainty [29]. The latter involves interpreting the prediction as learnable probabilistic distributions. Discrete categorical distribution via Softmax has been widely adopted as a smooth approximation of one-hot $\operatorname{arg\,max}$ for end-to-end classification. This inspired works such as DSAC [9], a smooth RANSAC with a finite hypothesis pool. Meanwhile, tractable parametric distributions (e.g., normal distribution) are often used in predicting continuous variables [30, 31, 29, 32, 33, 28], and mixture distributions can be employed to further capture ambiguity [34, 35, 36], e.g., ambiguous 6DoF pose [37]. In this paper, we make a distinct contribution: backpropagating a complicated continuous distribution derived from a nested optimization layer (the PnP layer) and approximated by importance sampling, essentially making it a continuous counterpart of Softmax.

3 Generalized End-to-End Probabilistic PnP

3.1 Overview

Given an object proposal, our goal is to predict a set $X = \{x^{\text{3D}}_i, x^{\text{2D}}_i, w^{\text{2D}}_i\}_{i=1}^{N}$ of $N$ corresponding points, with 3D object coordinates $x^{\text{3D}}_i \in \mathbb{R}^3$, 2D image coordinates $x^{\text{2D}}_i \in \mathbb{R}^2$, and 2D weights $w^{\text{2D}}_i \in \mathbb{R}^2_+$, from which a weighted PnP problem can be formulated to estimate the object pose relative to the camera.

The essence of a PnP layer is searching for an optimal pose $y$ (expanded as rotation matrix $R$ and translation vector $t$) that minimizes the cumulative squared weighted reprojection error:

$$\operatorname*{arg\,min}_{y} \frac{1}{2} \sum_{i=1}^{N} \Big\| \underbrace{w_i^{\text{2D}} \circ \big(\pi(R x_i^{\text{3D}} + t) - x_i^{\text{2D}}\big)}_{f_i(y) \in \mathbb{R}^2} \Big\|^2, \tag{1}$$

where $\pi(\cdot)$ is the projection function with camera intrinsics involved, $\circ$ stands for element-wise product, and $f_i(y)$ compactly denotes the weighted reprojection error.
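As an illustration of Eq. (1), the residual $f_i(y)$ and the cumulative cost can be evaluated in a few lines. This is a minimal NumPy sketch, assuming a pinhole camera with intrinsic matrix `K` for $\pi(\cdot)$; the function names and the toy data are hypothetical and are not taken from the released code.

```python
import numpy as np

def project(points_cam, K):
    """Pinhole projection pi(.): camera-frame points (N, 3) -> pixel coordinates (N, 2)."""
    uvw = points_cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def weighted_reproj_residuals(R, t, x3d, x2d, w2d, K):
    """f_i(y) = w_i^2D * (pi(R x_i^3D + t) - x_i^2D), element-wise, shape (N, 2)."""
    return w2d * (project(x3d @ R.T + t, K) - x2d)

def pnp_cost(R, t, x3d, x2d, w2d, K):
    """Objective of Eq. (1): half the sum of squared weighted residuals."""
    f = weighted_reproj_residuals(R, t, x3d, x2d, w2d, K)
    return 0.5 * np.sum(f ** 2)

# Toy data (assumed values): 8 object points observed at a known pose.
rng = np.random.default_rng(0)
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
x3d = rng.uniform(-0.5, 0.5, size=(8, 3))
R_gt, t_gt = np.eye(3), np.array([0.1, -0.2, 4.0])
x2d = project(x3d @ R_gt.T + t_gt, K)
w2d = np.ones((8, 2))

cost_gt = pnp_cost(R_gt, t_gt, x3d, x2d, w2d, K)
cost_off = pnp_cost(R_gt, t_gt + np.array([0.05, 0.0, 0.0]), x3d, x2d, w2d, K)
```

The cost vanishes at the pose that generated the 2D points and grows once the translation is perturbed, which is the behavior a PnP solver exploits.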

Eq. (1) formulates a non-linear least squares problem that may have non-unique solutions, i.e., pose ambiguity [21, 22]. Previous work [10, 12, 11] only backpropagates through a local solution $y^\ast$, which is inherently unstable and non-differentiable. To construct a differentiable alternative for end-to-end learning, we model the PnP output as a distribution of pose, which guarantees differentiable probability density. The cumulative error is considered to be the negative logarithm of the likelihood function $p(X|y)$ defined as:

$$p(X|y) = \exp -\frac{1}{2} \sum_{i=1}^{N} \|f_i(y)\|^2. \tag{2}$$

With an additional prior pose distribution $p(y)$, we can derive the posterior pose $p(y|X)$ via Bayes' theorem. Using a uniform prior in the domain $Y$, the posterior density is simplified to the normalized likelihood:

$$p(y|X) = \frac{\exp -\frac{1}{2} \sum_{i=1}^{N} \|f_i(y)\|^2}{\int_Y \exp -\frac{1}{2} \sum_{i=1}^{N} \|f_i(y)\|^2 \,\mathrm{d}y}. \tag{3}$$

Eq. (3) can be interpreted as a continuous counterpart of categorical Softmax.
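The Softmax analogy becomes concrete once the pose space is discretized. The following is a minimal sketch, assuming a hypothetical 1D yaw-only pose and a made-up bimodal cost standing in for $\frac{1}{2}\sum_i \|f_i(y)\|^2$: on a uniform grid, the density of Eq. (3) is a Softmax over the negative costs divided by the grid spacing.

```python
import numpy as np

def softmax(logits):
    """Numerically stable Softmax over a 1D array."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Uniform grid over yaw in [-pi, pi); the grid resolution is an arbitrary choice.
y = np.linspace(-np.pi, np.pi, 2000, endpoint=False)
dy = y[1] - y[0]

# Hypothetical ambiguous cost with two equally good poses (y = 0 and y = -pi).
energy = 8.0 * (1.0 - np.cos(2.0 * y))

# Discretized Eq. (3): Softmax gives probability mass per cell; dividing by dy gives density.
density = softmax(-energy) / dy
```

Both modes receive equal density, so the ambiguity is represented rather than resolved arbitrarily, and the density integrates to one by construction.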

3.1.1 KL Loss Function

During training, given a target pose distribution with probability density $t(y)$, the KL divergence $D_{\text{KL}}(t(y) \,\|\, p(y|X))$ is minimized as training loss. Intuitively, pose ambiguity can be captured by the multiple modes of $p(y|X)$, and convergence is ensured such that wrong modes are suppressed by the loss function. Substituting Eq. (3), the KL divergence loss can be re-written as follows:

$$\begin{aligned}
\mathcal{L}_{\text{KL}} &= \int_Y t(y) \big(\log t(y) - \log p(y|X)\big) \,\mathrm{d}y \\
&= -\int_Y t(y) \log \frac{p(X|y)}{\int_Y p(X|y) \,\mathrm{d}y} \,\mathrm{d}y + \mathit{const} \\
&= -\int_Y t(y) \log p(X|y) \,\mathrm{d}y + \log \int_Y p(X|y) \,\mathrm{d}y + \mathit{const}.
\end{aligned} \tag{4}$$

In practice, we drop the constant relevant to the target distribution so that it is effectively a cross-entropy loss. In addition, we empirically find it effective to set a narrow (Dirac-like) target distribution centered at the ground truth $y_{\text{gt}}$, yielding the simplified loss (after substituting Eq. (2)):

$$\mathcal{L}_{\text{KL}} = \underbrace{\frac{1}{2} \sum_{i=1}^{N} \|f_i(y_{\text{gt}})\|^2}_{\mathcal{L}_{\text{tgt}}\text{ (reproj. at target pose)}} + \underbrace{\log \int_Y \exp -\frac{1}{2} \sum_{i=1}^{N} \|f_i(y)\|^2 \,\mathrm{d}y}_{\mathcal{L}_{\text{pred}}\text{ (reproj. at predicted pose)}}. \tag{5}$$

The only remaining problem is the integration in the second term, which is elaborated in Section 3.2.
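For reference, the step from Eq. (4) to Eq. (5) can be spelled out term by term. The sketch below idealizes the narrow target as an exact Dirac delta, $t(y) = \delta(y - y_{\text{gt}})$, which is our reading of "Dirac-like"; the sifting property then collapses the first integral to a single evaluation, and Eq. (2) is substituted in both terms.

```latex
\begin{aligned}
-\int_Y \delta(y - y_{\text{gt}}) \log p(X|y) \,\mathrm{d}y
  &= -\log p(X|y_{\text{gt}})
   = \frac{1}{2} \sum_{i=1}^{N} \|f_i(y_{\text{gt}})\|^2
   = \mathcal{L}_{\text{tgt}}, \\
\log \int_Y p(X|y) \,\mathrm{d}y
  &= \log \int_Y \exp -\frac{1}{2} \sum_{i=1}^{N} \|f_i(y)\|^2 \,\mathrm{d}y
   = \mathcal{L}_{\text{pred}}.
\end{aligned}
```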

Figure 2: Learning a discrete classifier vs. learning the continuous pose distribution. A discriminative loss function (left) should raise the unnormalized probability of the correct prediction and penalize that of the incorrect ones. A one-sided loss (right) will degrade the distribution if the model is not well regularized.

3.1.2 Comparison to Reprojection-Based Method

The two terms in Eq. (5) are concerned with the reprojection errors at target and predicted pose respectively. The former is often used as a surrogate loss in previous work [12, 10, 28]. However, the first term alone cannot handle learning all 2D-3D points without imposing strict regularization, as the minimization could simply collapse all the 2D-3D points. The second term originates from the normalization factor in Eq. (3), and is crucial to a discriminative loss function, as shown in Figure 2.
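The collapse argument can be checked on a toy problem with a closed-form integral. This is a hedged sketch, not the paper's setup: it assumes a hypothetical scalar pose $y$ on the real line, a single shared weight `w`, and residuals $f_i(y) = w\,(y - m_i)$, where the values `votes` ($m_i$) are made up. Under these assumptions $\mathcal{L}_{\text{pred}}$ is the log of a Gaussian integral.

```python
import numpy as np

def toy_losses(w, votes, y_gt):
    """Closed-form (L_tgt, L_pred) for the 1D toy residuals f_i(y) = w * (y - votes[i])."""
    n = votes.size
    scatter = np.sum((votes - votes.mean()) ** 2)
    l_tgt = 0.5 * w ** 2 * np.sum((y_gt - votes) ** 2)
    # log of the integral of exp(-0.5 * w^2 * sum_i (y - votes[i])^2) over the real line
    l_pred = -0.5 * w ** 2 * scatter + 0.5 * np.log(2.0 * np.pi / (n * w ** 2))
    return l_tgt, l_pred

votes = np.array([0.9, 1.1, 1.3])   # hypothetical per-point pose evidence
y_gt = 1.0
weights = np.array([1e-3, 1e-1, 1.0, 10.0, 100.0])

l_tgt, l_pred = toy_losses(weights, votes, y_gt)
l_kl = l_tgt + l_pred
best_w_tgt_only = weights[np.argmin(l_tgt)]   # the smallest weight wins: collapse
best_w_full = weights[np.argmin(l_kl)]        # an intermediate weight wins
```

In this toy, minimizing the first term alone always prefers the smallest weight, whereas the second term grows without bound as the weight shrinks, so the full loss settles on a finite, non-degenerate weight.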

3.1.3 Comparison to Implicit Differentiation Method

Existing work on end-to-end PnP [12, 11] derives a single solution of a particular solver $y^\ast = \mathit{PnP}(X)$ via the implicit function theorem [38], assuming $\nabla_y \frac{1}{2} \sum_{i=1}^{N} \|f_i(y)\|^2 \big|_{y=y^\ast} = 0$. In the probabilistic framework, this is essentially the Laplace method that approximates the posterior by $\mathcal{N}(y^\ast, \Sigma_{y^\ast})$, where both $y^\ast$ and $\Sigma_{y^\ast}$ can be estimated by the PnP solver with analytical derivatives [28]. If $\Sigma_{y^\ast}$ is simplified to be isotropic, the approximated KL divergence can be simplified into the L2 loss $\|y^\ast - y_{\text{gt}}\|^2$ used in [11]. However, the Laplace approximation is inaccurate for non-normal posteriors with ambiguity, and therefore does not guarantee global convergence. Besides, implicit differentiation itself may be prone to numerical instability [10].
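The reduction to the L2 loss can be made explicit under simplifying assumptions. Treat the pose as a $d$-dimensional Euclidean vector, take a Dirac-like target at $y_{\text{gt}}$, and let $\Sigma_{y^\ast} = \sigma^2 I$; the symbols $\sigma$ and $d$ are introduced here for illustration only. The cross-entropy against the Laplace approximation is then:

```latex
-\log \mathcal{N}\big(y_{\text{gt}};\, y^\ast, \sigma^2 I\big)
  = \frac{1}{2\sigma^2} \|y^\ast - y_{\text{gt}}\|^2 + \frac{d}{2} \log\big(2\pi\sigma^2\big).
```

With $\sigma$ held fixed, the second term is constant and the loss is the L2 pose error up to a scale factor.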

3.2 Monte Carlo Pose Loss

In this section, we introduce a GPU-friendly efficient Monte Carlo approach to the integration in the proposed loss function, based on the Adaptive Multiple Importance Sampling (AMIS) algorithm [23].

Considering $q(y)$ to be the probability density function of a proposal distribution that approximates the shape of the integrand $\exp -\frac{1}{2} \sum_{i=1}^{N} \|f_i(y)\|^2$, and $y_j$ to be one of the $K$ samples drawn from $q(y)$, the estimation of the second term $\mathcal{L}_{\text{pred}}$ in Eq. (5) is thus:

$$\mathcal{L}_{\text{pred}} \approx \log \frac{1}{K} \sum_{j=1}^{K} \underbrace{\frac{\exp -\frac{1}{2} \sum_{i=1}^{N} \|f_i(y_j)\|^2}{q(y_j)}}_{v_j\text{ (importance weight)}}, \tag{6}$$

where $v_j$ compactly denotes the importance weight at $y_j$. Eq. (6) gives the vanilla importance sampling, where the choice of proposal $q(y)$ strongly affects the numerical stability. The AMIS algorithm is a better alternative as it iteratively adapts the proposal to the integrand.
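Eq. (6) can be sketched numerically as follows. This is a toy illustration, assuming a hypothetical 1D integrand with a known integral and a fixed Gaussian proposal; the helper names are ours. The weights are handled in the log domain to avoid underflow.

```python
import numpy as np

def log_mean_exp(log_v):
    """Stable log of the mean of exp(log_v)."""
    peak = np.max(log_v)
    return peak + np.log(np.mean(np.exp(log_v - peak)))

def importance_log_integral(neg_energy, sample_q, log_q, num_samples, rng):
    """Eq. (6): log (1/K) sum_j exp(neg_energy(y_j)) / q(y_j), with y_j drawn from q."""
    y = sample_q(rng, num_samples)
    log_v = neg_energy(y) - log_q(y)   # log importance weights
    return log_mean_exp(log_v)

rng = np.random.default_rng(0)
s = 0.3                                 # width of the toy integrand (assumed)
q_std = 0.5                             # proposal slightly wider than the integrand
neg_energy = lambda y: -0.5 * (y / s) ** 2
sample_q = lambda rng, k: rng.normal(0.0, q_std, size=k)
log_q = lambda y: -0.5 * (y / q_std) ** 2 - np.log(q_std * np.sqrt(2.0 * np.pi))

estimate = importance_log_integral(neg_energy, sample_q, log_q, 4096, rng)
exact = np.log(s * np.sqrt(2.0 * np.pi))   # closed form of the Gaussian integral
```

A proposal narrower than the integrand would make the weights heavy-tailed and the estimate unreliable, which is the sensitivity that motivates an adaptive scheme.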

In brief, AMIS utilizes the sampled importance weights from past iterations to estimate the new proposal. Then, all previous samples are re-weighted as if they were homogeneously sampled from a mixture of all proposals used so far [23]. The initial proposal can be determined by the mode and covariance of the predicted pose distribution (see supplementary for details). Pseudo-code is given below.

Input: $X = \{x_i^{\text{3D}}, x_i^{\text{2D}}, w_i^{\text{2D}}\}_{i=1}^{N}$
Output: $\mathcal{L}_{\text{pred}}$
$y^\ast, \Sigma_{y^\ast} \leftarrow \mathit{PnP}(X)$   // Laplace approximation
Fit $q_1(y)$ to $y^\ast, \Sigma_{y^\ast}$   // initial proposal
for $1 \le t \le T$ do
    Generate $K'$ samples $y^t_{j=1 \cdots K'}$ from $q_t(y)$
    for $1 \le j \le K'$ do
        $P_j^t \leftarrow p(X|y_j^t)$   // evaluate integrand
    for $1 \le \tau \le t$ and $1 \le j \le K'$ do
        $Q_j^\tau \leftarrow \frac{1}{t} \sum_{m=1}^{t} q_m(y_j^\tau)$   // evaluate proposal mix
        $v_j^\tau \leftarrow P_j^\tau / Q_j^\tau$   // importance weight
    if $t < T$ then
        Estimate $q_{t+1}(y)$ from all weighted samples $\{y_j^\tau, v_j^\tau \mid 1 \le \tau \le t, 1 \le j \le K'\}$
$\mathcal{L}_{\text{pred}} \leftarrow \log \frac{1}{T K'} \sum_{t=1}^{T} \sum_{j=1}^{K'} v_j^t$

In this paper, we empirically set the AMIS iteration count $T$ to 4, and the number of samples per iteration $K'$ to 128 for 6DoF pose and 32 for 4DoF pose (1D yaw-only orientation). These hyperparameters can be adjusted to balance computation and accuracy.
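The AMIS loop above can be illustrated with a minimal 1D sketch. It is not the implementation used in this paper: it assumes a Gaussian proposal refit by weighted moment matching (the actual proposals are described in Section 3.2.1), and the names `gauss_pdf`, `amis_log_normalizer`, and `log_p` are hypothetical.

```python
import numpy as np

def gauss_pdf(y, mu, sigma):
    """Density of a 1D Gaussian, used here as a stand-in proposal (assumption)."""
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def amis_log_normalizer(log_p, mu0, sigma0, T=4, K=128, seed=0):
    """Toy 1D AMIS estimate of log of the integral of exp(log_p(y))."""
    rng = np.random.default_rng(seed)
    mus, sigmas = [mu0], [sigma0]      # parameters of q_1, ..., q_t
    samples = []                       # K samples per iteration
    for t in range(1, T + 1):
        samples.append(rng.normal(mus[-1], sigmas[-1], size=K))
        y = np.concatenate(samples)    # all samples drawn so far
        P = np.exp(log_p(y))           # unnormalized target density
        # Proposal mix: average of all proposals so far, evaluated at every sample
        Q = np.mean([gauss_pdf(y, m, s) for m, s in zip(mus, sigmas)], axis=0)
        v = P / Q                      # importance weights
        if t < T:
            # Refit the next proposal from all weighted samples
            mu = np.sum(v * y) / np.sum(v)
            var = np.sum(v * (y - mu) ** 2) / np.sum(v)
            mus.append(mu)
            sigmas.append(np.sqrt(var))
    return np.log(np.mean(v))          # log of (1 / (T K)) * sum of all weights
```

For an unnormalized Gaussian target, the returned value can be compared against the closed-form log-normalizer to check the estimator.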

3.2.1 Choice of Proposal Distribution

We use separate proposal distributions for position and orientation, as the orientation space is non-Euclidean. For position, we adopt the 3DoF multivariate t-distribution. For 1D yaw-only orientation, we use a mixture of von Mises and uniform distributions. For 3D orientation represented by a unit quaternion, the angular central Gaussian distribution [39] is adopted.
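As a rough illustration, the three proposal families can be sampled with standard constructions. The sketch below uses NumPy; the function names, the degrees of freedom, and the uniform mixing fraction are assumptions made for the example, not values taken from this paper.

```python
import numpy as np

def sample_multivariate_t(mu, cov, dof, size, rng):
    """Multivariate t: Gaussian draw divided by sqrt(chi-square / dof)."""
    z = rng.multivariate_normal(np.zeros(len(mu)), cov, size=size)
    u = rng.chisquare(dof, size=size)
    return mu + z / np.sqrt(u / dof)[:, None]

def sample_acg(cov, size, rng):
    """Angular central Gaussian: zero-mean Gaussian draw projected to the unit sphere."""
    z = rng.multivariate_normal(np.zeros(cov.shape[0]), cov, size=size)
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def sample_yaw_mixture(mu, kappa, uniform_frac, size, rng):
    """Mixture of von Mises and uniform on [-pi, pi); uniform_frac is an assumed weight."""
    yaw = rng.vonmises(mu, kappa, size=size)
    use_uniform = rng.random(size) < uniform_frac
    yaw[use_uniform] = rng.uniform(-np.pi, np.pi, size=use_uniform.sum())
    return yaw
```

The uniform component keeps some probability mass away from the von Mises mode, which is one plausible way to cover yaw ambiguities.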

3.3 Backpropagation

Although backpropagation can be simply implemented with automatic differentiation packages, here we analyze the gradients of the loss function for an intuitive understanding of the learning process. In general, the gradient of the loss function defined in Eq. (5) is:

$$\nabla\mathcal{L}_{\text{KL}}=\nabla\frac{1}{2}\sum_{i=1}^{N}\left\|f_i(y_{\text{gt}})\right\|^2-\mathop{\mathbb{E}}_{y\sim p(y|X)}\nabla\frac{1}{2}\sum_{i=1}^{N}\left\|f_i(y)\right\|^2, \tag{7}$$

where the first term is the gradient of the reprojection errors at the target pose, and the second term is the expected gradient of the reprojection errors over the predicted pose distribution, which is approximated by backpropagating through the importance weights in the Monte Carlo pose loss.
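The expectation term can be derived in two lines. The sketch below assumes that Eq. (5) has the form $\mathcal{L}_{\text{KL}}=E(y_{\text{gt}})+\log\int\exp(-E(y))\,\mathrm{d}y$ and that $p(y|X)\propto\exp(-E(y))$, where $E(y)=\frac{1}{2}\sum_{i=1}^{N}\|f_i(y)\|^2$ is shorthand introduced only for this derivation, and $\nabla$ is taken w.r.t. the network outputs:

```latex
\begin{aligned}
\nabla \log \int \exp\bigl(-E(y)\bigr)\,\mathrm{d}y
&= \frac{\int -\nabla E(y)\,\exp\bigl(-E(y)\bigr)\,\mathrm{d}y}
        {\int \exp\bigl(-E(y')\bigr)\,\mathrm{d}y'} \\
&= -\int p(y|X)\,\nabla E(y)\,\mathrm{d}y
 = -\mathop{\mathbb{E}}_{y\sim p(y|X)} \nabla E(y).
\end{aligned}
```

Adding $\nabla E(y_{\text{gt}})$ from the first term of the loss recovers Eq. (7).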

3.3.1 Balancing Uncertainty and Discrimination

Consider the negative gradient w.r.t. the corresponding weights $w_i^{\text{2D}}$:

$$-\nabla_{w_i^{\text{2D}}}\mathcal{L}_{\text{KL}}=w_i^{\text{2D}}\circ\left(-r_i^{\circ 2}(y_{\text{gt}})+\mathop{\mathbb{E}}_{y\sim p(y|X)}r_i^{\circ 2}(y)\right), \tag{8}$$

where $r_i(y)=\pi(Rx_i^{\text{3D}}+t)-x_i^{\text{2D}}$ (unweighted reprojection error), and $(\cdot)^{\circ 2}$ stands for element-wise square. The first bracketed term $-r_i^{\circ 2}(y_{\text{gt}})$ with negative sign indicates that correspondences with large reprojection error (hence high uncertainty) shall be weighted less. The second term $\mathop{\mathbb{E}}_{y\sim p(y|X)}r_i^{\circ 2}(y)$ is relevant to the variance of reprojection error over the predicted pose. The positive sign indicates that correspondences sensitive to pose variation should be weighted more, because they provide stronger pose discrimination. The final gradient is thus a balance between the uncertainty and discrimination, as shown in Figure 3. Existing work [28, 13] on learning uncertainty-aware correspondences only considers the former, hence lacking the discriminative ability.
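The balance in Eq. (8) can be made concrete with a small Monte Carlo sketch. It assumes the expectation is approximated with self-normalized importance weights over sampled poses; the function name and array layout are hypothetical.

```python
import numpy as np

def neg_weight_gradient(w2d, r_gt, r_samples, v):
    """Monte Carlo estimate of the right-hand side of Eq. (8).

    w2d:       (N, 2) correspondence weights
    r_gt:      (N, 2) unweighted reprojection errors at the ground-truth pose
    r_samples: (S, N, 2) unweighted reprojection errors at S sampled poses
    v:         (S,) importance weights of the sampled poses
    """
    v_norm = v / v.sum()  # self-normalized importance weights (assumption)
    # Expected element-wise squared error over the predicted pose distribution
    expected_sq = np.einsum("s,snc->nc", v_norm, r_samples ** 2)
    # Uncertainty term (negative) plus discrimination term (positive)
    return w2d * (-(r_gt ** 2) + expected_sq)
```

A correspondence with small error at the ground truth but large error across sampled poses receives a positive update, while one with large error at the ground truth is pushed down.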

Figure 3: The learned corresponding weight can be factorized into inverse uncertainty and discrimination. Typically, inverse uncertainty roughly resembles the foreground mask, while discrimination emphasizes the 3D extremities of the object.

3.4 Limitations and Derivative Regularization Loss

In practice, we observe that the KL divergence loss has two limitations:

  • While the KL divergence is a good metric for the probabilistic distribution, existing evaluation protocols are all based on the point estimate of pose $y^\ast$. Therefore, for inference it is still required to locate a mode $y^\ast$ of the posterior $p(y|X)$ by solving the PnP problem in Eq. (1), which could be sub-optimal if trained solely with the KL loss.

  • The 2D-3D correspondences are underdetermined if we only impose the KL loss when training the network. Learning these entangled elements could be difficult if the network architecture is not designed carefully with preferable inductive bias.

The above limitations can be mitigated by an additional regularization loss on $y^\ast$ that backpropagates through the Gauss-Newton (GN) least squares solver or its variants [7]. We call it the derivative regularization loss, since GN is a derivative-based optimizer, and the loss therefore acts on the derivatives of the log-density $\log p(y|X)$ to direct the GN increment $\Delta y$ towards the true pose $y_{\text{gt}}$.

To employ the regularization during training, a detached solution $y^\ast$ is obtained first. Then, at $y^\ast$, a final GN increment is evaluated (which ideally equals 0 if $y^\ast$ has already converged to the local optimum):

$$\Delta y=-{\underbrace{(J^{\text{T}}J)}_{H\text{ (approx.)}}}^{-1}\underbrace{J^{\text{T}}F(y^\ast)}_{g}, \tag{9}$$

where $F(y^\ast)=\left[f_1^{\text{T}}(y^\ast),f_2^{\text{T}}(y^\ast),\cdots,f_N^{\text{T}}(y^\ast)\right]^{\text{T}}$ is the flattened vector of weighted reprojection errors of all points, $J=\partial F(y)/\partial y^{\text{T}}\big|_{y=y^\ast}$ is the Jacobian matrix, $J^{\text{T}}F(y)$ equals the gradient $g$ of the negative log-likelihood (NLL) w.r.t. object pose, i.e., $\partial\frac{1}{2}\sum_{i=1}^{N}\left\|f_i(y)\right\|^2/\partial y$, and $J^{\text{T}}J$ is an approximation of the Hessian matrix $H=\partial g/\partial y^{\text{T}}$. We therefore design the regularization loss as follows:

$$\mathcal{L}_{\text{reg}}=l(y^\ast+\Delta y,\,y_{\text{gt}}), \tag{10}$$

where $l(\cdot,\cdot)$ is a distance metric for pose. We adopt smooth L1 for position and cosine similarity for orientation (see supplementary materials for details). Note that the gradient is only backpropagated through $\Delta y$, which is analytically differentiable w.r.t. the 2D-3D correspondences.
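A minimal NumPy sketch of Eq. (9) and the position part of Eq. (10) is given below. It assumes a dense Jacobian with full column rank and a smooth L1 transition point of 1.0; the orientation term is omitted because its exact form is given only in the supplementary materials. All function names are hypothetical.

```python
import numpy as np

def gauss_newton_increment(J, F):
    """Eq. (9): dy = -(J^T J)^{-1} J^T F, solved without forming an explicit inverse."""
    return -np.linalg.solve(J.T @ J, J.T @ F)

def smooth_l1(x, beta=1.0):
    """Smooth L1 penalty summed over elements; beta = 1.0 is an assumed value."""
    a = np.abs(x)
    return np.where(a < beta, 0.5 * a ** 2 / beta, a - 0.5 * beta).sum()

def position_reg_loss(t_star, dt, t_gt):
    """Position part of Eq. (10); t_star is treated as a constant (detached)."""
    return smooth_l1(t_star + dt - t_gt)
```

In an automatic differentiation framework, `t_star` would be detached so that gradients flow only through the increment, matching the note above.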

This loss not only addresses the first limitation by moving $y^\ast$ towards $y_{\text{gt}}$, but also partially disentangles the 2D-3D correspondences. To analyze the effect of the loss on the correspondences, we consider a local approximation of Eq. (10), assuming equal weights for position and orientation:

$$\begin{aligned}\mathcal{L}_{\text{reg}}&\approx\left\|y^\ast+\Delta y-y_{\text{gt}}\right\|^2\\&=\left\|y^\ast-\underbrace{(J^{\text{T}}J)^{-1}J^{\text{T}}}_{J^+}F(y^\ast)-y_{\text{gt}}\right\|^2.\end{aligned} \tag{11}$$

Note that $(J^{\text{T}}J)^{-1}J^{\text{T}}$ is also the pseudo-inverse of the matrix $J$, which can be denoted by $J^+$ for brevity. Then, taking the first-order approximation $F(y^\ast)=F(y_{\text{gt}})+J\left(y^\ast-y_{\text{gt}}\right)$, the loss can be approximated as:

$$\begin{aligned}\mathcal{L}_{\text{reg}}&\approx\left\|y^\ast-J^+\left(F(y_{\text{gt}})+J\left(y^\ast-y_{\text{gt}}\right)\right)-y_{\text{gt}}\right\|^2\\&=\left\|J^+F(y_{\text{gt}})\right\|^2.\end{aligned} \tag{12}$$

This indicates that the derivative regularization loss is analogous to the reprojection-based surrogate loss $\left\|F(y_{\text{gt}})\right\|^2$ (Section 3.1.2). Although the extra weighting matrix $J^+$ makes the individual elements in the reprojection vector $F(y_{\text{gt}})$ underdetermined, over multiple samples and mini-batches there remains a tendency of independently minimizing each of the elements, i.e., minimizing the reprojection error of each correspondence. Thus, it helps to overcome the potential training difficulties associated with the KL loss.
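The simplification leading to Eq. (12) relies on $J^+J=I$, which holds when $J$ has full column rank (an assumption already implicit in inverting $J^{\text{T}}J$). Spelled out, the vector inside the norm reduces as follows:

```latex
\begin{aligned}
y^\ast - J^+\bigl(F(y_{\text{gt}}) + J(y^\ast - y_{\text{gt}})\bigr) - y_{\text{gt}}
&= (y^\ast - y_{\text{gt}}) - J^+F(y_{\text{gt}})
   - \underbrace{J^+J}_{I}\,(y^\ast - y_{\text{gt}}) \\
&= -J^+F(y_{\text{gt}}).
\end{aligned}
```

The dependence on $y^\ast$ therefore cancels to first order, leaving only the reprojection errors at the true pose.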

The regularization loss can also serve as an independent objective for training pose estimators, akin to RePOSE [7]. However, since we observe that this objective alone is not effective in addressing pose ambiguity, it is treated as a secondary regularization in this study.

4 Implementation Details

4.1 Dynamic KL Loss Weight

Following [28], we compute a dynamic loss weight for $\mathcal{L}_{\text{KL}}$ so that the magnitude of its gradients is consistent regardless of the entropy of the distribution. This is implemented by computing the exponential moving average (EMA) of the 1-norm of the sum of weights $\left\|\sum_{i=1}^{N}w_i^{\text{2D}}\right\|_1$, and using the reciprocal of the EMA value as the dynamic loss weight for $\mathcal{L}_{\text{KL}}$. Intuitively, this cancels out the effect of the magnitude of $w_i^{\text{2D}}$ on the loss gradients w.r.t. $x_i^{\text{2D}}$ and $x_i^{\text{3D}}$.
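A minimal sketch of this bookkeeping is shown below. The class name, the momentum value, and the per-object (rather than per-batch) update are assumptions made for illustration.

```python
import numpy as np

class DynamicLossWeight:
    """Tracks an EMA of the 1-norm of the summed weights and returns its reciprocal."""

    def __init__(self, momentum=0.99):  # momentum is an assumed value
        self.momentum = momentum
        self.ema = None

    def update(self, w2d):
        """w2d: (N, 2) correspondence weights of one object."""
        norm = np.abs(w2d.sum(axis=0)).sum()  # 1-norm of the sum of weights
        if self.ema is None:
            self.ema = norm                    # initialize with the first observation
        else:
            self.ema = self.momentum * self.ema + (1.0 - self.momentum) * norm
        return 1.0 / self.ema                  # dynamic weight for the KL loss
```

The returned value would multiply the KL loss at each training step, and it should be treated as a constant when backpropagating.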

4.2 Adaptive Huber Kernel

For the PnP formulation in Eq. (1), the plain L2 reprojection errors $\left\|f_i(y)\right\|^2$ are sensitive to outliers, which limits the model’s expressiveness in representing multi-modal distributions that characterize ambiguity. Therefore, we robustify the reprojection errors using the Huber kernel $\rho(\cdot)$, yielding an alternative formulation:

$$\operatorname*{arg\,min}_{y}\frac{1}{2}\sum_{i=1}^{N}\rho\left(\left\|f_i(y)\right\|^2\right). \tag{13}$$

The Huber kernel with threshold $\delta$ is defined as:

$$\rho(s)=\begin{cases}s,&s\leq\delta^2,\\\delta(2\sqrt{s}-\delta),&s>\delta^2.\end{cases} \tag{14}$$
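Eq. (14) translates directly into code. The sketch below is a plain NumPy version; the function name is hypothetical, and the input `s` is the squared norm of a weighted reprojection error.

```python
import numpy as np

def huber(s, delta):
    """Huber kernel of Eq. (14), applied to squared errors s >= 0."""
    s = np.asarray(s, dtype=float)
    # Quadratic region for s <= delta^2, linear in sqrt(s) beyond it
    return np.where(s <= delta ** 2, s, delta * (2.0 * np.sqrt(s) - delta))
```

The two branches agree at $s=\delta^2$, so the kernel is continuous, and large errors grow only linearly in $\sqrt{s}$.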

To robustify the weighted reprojection errors of various scales, we adopt an adaptive threshold $\delta$ defined as a function of the weights $w_i^{\text{2D}}$ and 2D coordinates $x_i^{\text{2D}}$:

$$\delta=\delta_{\text{rel}}\frac{\left\|\bar{w}^{\text{2D}}\right\|_1}{2}\left(\frac{1}{N-1}\sum_{i=1}^{N}\left\|x_i^{\text{2D}}-\bar{x}^{\text{2D}}\right\|^2\right)^{\frac{1}{2}} \tag{15}$$

with the relative threshold $\delta_{\text{rel}}$ as a hyperparameter, and the mean vectors $\bar{w}^{\text{2D}}=\frac{1}{N}\sum_{i=1}^{N}w_i^{\text{2D}}$, $\bar{x}^{\text{2D}}=\frac{1}{N}\sum_{i=1}^{N}x_i^{\text{2D}}$.
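The adaptive threshold of Eq. (15) can be computed in a few lines. The following is a sketch with a hypothetical function name, assuming `w2d` and `x2d` are arrays of shape (N, 2).

```python
import numpy as np

def adaptive_huber_delta(w2d, x2d, delta_rel):
    """Eq. (15): threshold scaled by the mean weight and the 2D point spread."""
    n = x2d.shape[0]
    w_mean_l1 = np.abs(w2d.mean(axis=0)).sum()       # 1-norm of the mean weight vector
    x_centered = x2d - x2d.mean(axis=0)
    spread = np.sqrt((x_centered ** 2).sum() / (n - 1))
    return delta_rel * w_mean_l1 / 2.0 * spread
```

Because the threshold scales with both the average weight and the spread of the 2D points, the same `delta_rel` can be used for objects of different apparent sizes.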

Accordingly, the reprojection errors $F(y)$ and Jacobian matrix $J$ in Eq. (9) have to be rescaled (see supplementary materials).

4.3 Initialization

Since the LM solver only finds a local solution, initialization plays a decisive role in dealing with ambiguity. We implement a random sampling algorithm analogous to RANSAC, to search for the global optimum efficiently.

Given the $N$-point correspondence set $X=\left\{x_i^{\text{3D}},x_i^{\text{2D}},w_i^{\text{2D}}\right\}_{i=1}^{N}$, we generate $M$ subsets consisting of $n$ corresponding points each ($3\leq n<N$), by repeatedly sub-sampling $n$ indices without replacement from a multinomial distribution, whose probability mass function $p(i)$ is defined by the corresponding weights:

$$p(i)=\frac{\left\|w_i^{\text{2D}}\right\|_1}{\sum_{i=1}^{N}\left\|w_i^{\text{2D}}\right\|_1}. \tag{16}$$

From each subset, a pose hypothesis can be solved via the LM algorithm within very few iterations (e.g. 3 iterations). This is implemented as a batch operation on GPU, and is rather efficient for small subsets. We take the hypothesis with maximum log-likelihood $\log p(X|y)$ as the initial point, starting from which subsequent LM iterations are computed on the full set $X$.
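The subset sampling step of Eq. (16) can be sketched as follows. This is a CPU illustration with a hypothetical function name, not the batched GPU implementation mentioned above; it assumes at least $n$ correspondences have non-zero weight.

```python
import numpy as np

def sample_subsets(w2d, n, M, rng):
    """Draw M index subsets of size n, without replacement within each subset."""
    p = np.abs(w2d).sum(axis=1)   # 1-norm of each 2-element weight vector
    p = p / p.sum()               # probability mass function of Eq. (16)
    return np.stack([
        rng.choice(len(p), size=n, replace=False, p=p) for _ in range(M)
    ])
```

Each row of the returned array indexes one subset, from which a pose hypothesis would be solved and then scored by its log-likelihood on the full set.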

4.3.1 Training Mode Initialization

During training, the LM PnP solver is utilized for estimating the location and concentration of the initial proposal distribution in the AMIS algorithm. The location is very important to the stability of Monte Carlo training. If the LM solver fails to find the global optimum, and the location of the local optimum is far from the true pose $y_{\text{gt}}$, the balance between the two oppositely signed terms in Eq. (5) may be broken, leading to exploding gradients in the worst-case scenario. To avoid this problem, we adopt a simple initialization trick: we compare the log-likelihood $\log p(X|y)$ of the ground truth $y_{\text{gt}}$ and the selected hypothesis, and then keep the one with higher likelihood as the initial state of the LM solver.
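The selection rule amounts to a single comparison. In the sketch below, `residual_fn` is a hypothetical callable returning the flattened weighted reprojection errors $F(y)$, and the log-likelihood is taken as $-\frac{1}{2}\|F(y)\|^2$ up to an additive constant (the plain L2 case without the Huber kernel).

```python
import numpy as np

def log_likelihood(residuals):
    """log p(X|y) up to an additive constant, for plain L2 reprojection errors."""
    return -0.5 * np.sum(residuals ** 2)

def choose_initial_pose(y_hyp, y_gt, residual_fn):
    """Keep whichever of the selected hypothesis and the ground truth is more likely."""
    if log_likelihood(residual_fn(y_gt)) > log_likelihood(residual_fn(y_hyp)):
        return y_gt
    return y_hyp
```

Note that the ground truth is used only when the predicted correspondences themselves assign it a higher likelihood than the selected hypothesis.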

5 6DoF Pose Estimation based on CDPN

To demonstrate that EPro-PnP can be applied to off-the-shelf 2D-3D correspondence networks, experiments have been conducted on CDPN [18], a dense correspondence network for 6DoF pose estimation.

5.1 Network Architecture

The original CDPN feeds cropped image regions within the detected 2D boxes into the pose estimation network, to which two decoupled heads are appended for rotation and translation respectively. The rotation head is PnP-based while the translation head uses explicit center and depth regression. This paper discards the translation head to focus entirely on PnP, and modifies only the last layer of the rotation head for strict comparison to the baseline.

As shown in Fig. 4, apart from the standard 3D coordinate map, the network predicts a 2-channel weight map (originally it is a single-channel segmentation mask). We find it necessary to predict a global scale $w^{\text{2D}}_{\text{S}}$ separately, and apply it to the normalized weights ${w^{\text{2D}}_{\text{N}}}_i$ that satisfy $\sum_{i=1}^{N}{w^{\text{2D}}_{\text{N}}}_i=[1,1]^{\text{T}}$. Intuitively, the global scale controls the entropy of the pose distribution $p(y|X)$ as it scales the entire log-likelihood, while the normalized weights determine the relative importance of each correspondence. This helps to overcome the entangling effect of the KL loss mentioned in Section 3.4. Inspired by the attention mechanism [24], the normalized weights are activated via spatial Softmax, focusing on important regions in the image. The global scale is usually inversely proportional to the 2D size of the object due to the uncertainty in reprojection, and is hard-coded as such in this network.
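The factorization into a global scale and Softmax-normalized weights can be sketched as follows. The function name and the (2, H, W) layout are assumptions, and the global scale is passed in as an argument rather than derived from the object size.

```python
import numpy as np

def correspondence_weights(logits, global_scale):
    """Combine a spatial Softmax over a (2, H, W) logit map with a global scale."""
    flat = logits.reshape(2, -1)
    flat = flat - flat.max(axis=1, keepdims=True)      # for numerical stability
    e = np.exp(flat)
    normalized = e / e.sum(axis=1, keepdims=True)      # each channel sums to 1
    scale = np.asarray(global_scale, dtype=float).reshape(-1, 1)
    return (scale * normalized).reshape(logits.shape)  # each channel sums to its scale
```

With this split, changing the global scale alters the sharpness of the pose distribution without changing the relative importance of the points.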

The original CDPN imposes masked coordinate regression loss [18] to learn the dense correspondences, using the ground truth object 3D models to render the target masks and 3D coordinate maps. With EPro-PnP, however, this extra geometry supervision is optional, as we demonstrate that the entire network can be trained solely by the KL loss $\mathcal{L}_{\text{KL}}$ and/or the derivative regularization loss $\mathcal{L}_{\text{reg}}$. To reduce the Monte Carlo overhead, 512 points are randomly sampled from the 64×64 dense points to compute $\mathcal{L}_{\text{KL}}$.

Figure 4: The 6DoF pose estimation network modified from CDPN [18], with spatial Softmax and global weight scaling.

TABLE I: Results of the CDPN baseline. A0 and A1 are reproduced with the official code (https://git.io/JXZv6).

| ID | Method | ADD(-S) 0.02d | ADD(-S) 0.05d | ADD(-S) 0.1d | Mean |
| A0 | CDPN-Full [18] | 29.10 | 69.50 | 91.03 | 63.21 |
| A1 | CDPN w/o trans. head | 15.93 | 46.79 | 74.54 | 45.75 |
| A2 | A1 → Batch=32, LM solver | 21.17 | 55.00 | 79.96 | 52.04 |

5.2 Dataset and Metrics

As in CDPN, we use the LineMOD [6] 6DoF pose estimation dataset to conduct our experiments. The dataset consists of 13 sequences, each containing about 1.2K images annotated with 6DoF poses of a single object. Following [36], the images are split into the training and testing sets, with about 200 images per object for training. For data augmentation, we use the same synthetic data as in CDPN [18].

We use two common metrics for evaluation: ADD(-S) and $n^\circ,n\,\text{cm}$. The ADD measures whether the average deviation of the transformed model points is less than a certain fraction of the object’s diameter (e.g., ADD-0.1d). For symmetric objects, ADD-S computes the average distance to the closest model point. $n^\circ,n\,\text{cm}$ measures the accuracy of pose based on angular/positional error thresholds. All metrics are presented as percentages.
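For reference, both metrics can be sketched for a single sample as below. The function names are hypothetical, the ADD-S variant uses a brute-force nearest-point search, and translations are assumed to be expressed in centimeters for the second metric.

```python
import numpy as np

def add_pass(R_pred, t_pred, R_gt, t_gt, pts, diameter, frac=0.1, symmetric=False):
    """ADD (or ADD-S if symmetric) test for one sample; pts is an (P, 3) model point set."""
    p = pts @ R_pred.T + t_pred
    g = pts @ R_gt.T + t_gt
    if symmetric:
        # For each ground-truth point, distance to the closest predicted point
        d = np.linalg.norm(g[:, None, :] - p[None, :, :], axis=2).min(axis=1)
    else:
        d = np.linalg.norm(p - g, axis=1)
    return bool(d.mean() < frac * diameter)

def deg_cm_pass(R_pred, t_pred, R_gt, t_gt, deg, cm):
    """n deg, n cm test; assumes translations are given in centimeters."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    angle = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return bool(angle < deg and np.linalg.norm(t_pred - t_gt) < cm)
```

Averaging these boolean outcomes over the test set and multiplying by 100 gives the percentages reported in the tables.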

Although some objects in the dataset are nearly rotationally symmetric, we observe that our model has no trouble identifying their exact orientations. Therefore, the presented results should be regarded as close to the scenario without pose ambiguity.

5.3 Baseline

For strict comparison, general settings are kept the same as in CDPN [18] (with ResNet-34 [40] as backbone). As shown in Table I, the original CDPN-Full (A0) trains the network in 3 stages totaling 480 epochs using RMSprop. With the translation head removed, we only train the rotation head in a single stage of 160 epochs (A1), which greatly impacts the pose accuracy (45.75 vs. 63.21). Additionally, we improve the baseline by using the LM solver with Huber kernel at test time, and increase the batch size to 32 for less training wall time (A2). Instead of using the advanced initialization technique in Section 4.3, we adopt the simple EPnP [41] initialization without RANSAC.

5.4 Main Results and Discussions

As shown in Table II, we conduct ablation studies to reveal the contributions of the Monte Carlo KL loss $\mathcal{L}_{\text{KL}}$, the derivative regularization loss $\mathcal{L}_{\text{reg}}$, the original coordinate regression loss $\mathcal{L}_{\text{crd}}$ in CDPN [18], and initializing the model with pretrained weights from A1.

5.4.1 KL Loss vs. Coordinate Regression

Training the model from scratch with the KL loss alone (B0) significantly outperforms the baseline model (A2) trained with the coordinate regression loss (61.87 vs. 52.04), despite the lack of geometry supervision from the ground truth object 3D models.

5.4.2 KL Loss and Derivative Regularization

Both the KL loss (B0) and the derivative regularization loss (B1) perform well independently on this benchmark. Because pose ambiguity is not noticeable in the LineMOD dataset, the solver-based derivative regularization loss performs better than the KL loss (63.15 vs. 61.87). Nevertheless, the best possible pose accuracy without knowing the object geometry can be achieved when combining both loss functions (B2), even outperforming CDPN-Full (A0) by a clear margin (67.36 vs. 63.21).

5.4.3 With Knowledge of the Object 3D Models

On top of B2, one can further impose the coordinate regression loss $\mathcal{L}_{\text{crd}}$ (B4) with target 3D coordinates rendered from the object 3D models, further improving the pose accuracy. Yet a better approach to exploiting the 3D models is to pretrain the network in the traditional way (A1) and then finetune it with EPro-PnP (B5), yielding significantly better results (73.87). This training scheme partially benefits from more training epochs (2×160 in total). Furthermore, keeping the coordinate regression loss during finetuning (B6) slightly improves the score (73.95 vs. 73.87).

We also observe that both the derivative regularization loss (B2) and the coordinate regression loss (B3) improve the results of the bare KL loss setup (B0) to similar extents (67.36 vs. 67.74), as they are both disentangled objectives.

5.5 Comparison to Implicit Differentiation and Reprojection-Based Loss

As shown in Table III, when the coordinate regression loss is removed, i.e., object 3D models are unavailable, both implicit differentiation and reprojection loss fail to learn the pose properly. Yet EPro-PnP manages to learn the 3D coordinates and weights from scratch. This validates that EPro-PnP can be used as a general pose estimator without relying on a geometric prior.

TABLE II: Results on EPro-PnP-enhanced CDPN. $\mathcal{L}_{\text{crd}}$ refers to the masked coordinate regression loss in the original CDPN [18]; here the loss is imposed only on $x^{\text{3D}}$, not $w^{\text{2D}}$. Init. refers to initializing the model with pretrained weights from A1.

| ID | $\mathcal{L}_{\text{KL}}$ | $\mathcal{L}_{\text{reg}}$ | $\mathcal{L}_{\text{crd}}$ | Init. | ADD(-S) 0.02d | ADD(-S) 0.05d | ADD(-S) 0.1d | Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| B0 | ✓ | | | | 28.48 | 67.20 | 89.93 | 61.87 |
| B1 | | ✓ | | | 25.86 | 70.90 | 92.68 | 63.15 |
| B2 | ✓ | ✓ | | | 34.08 | 74.16 | 93.85 | 67.36 |
| B3 | ✓ | | ✓ | | 34.40 | 75.00 | 93.83 | 67.74 |
| B4 | ✓ | ✓ | ✓ | | 36.22 | 75.97 | 94.64 | 68.94 |
| B5 | ✓ | ✓ | | ✓ | 43.34 | 82.13 | 96.14 | 73.87 |
| B6 | ✓ | ✓ | ✓ | ✓ | 43.77 | 81.73 | 96.36 | 73.95 |

TABLE III: Comparison among loss functions by experiments conducted on the same dense correspondence network. For implicit differentiation, we minimize the distance metric of pose in Eq. (10) instead of the reprojection-metric pose loss in BPnP [12].

| Main loss | $\mathcal{L}_{\text{crd}}$ | 2° | 2 cm | 2°, 2 cm | ADD(-S) 0.1d |
| --- | --- | --- | --- | --- | --- |
| Implicit diff. [12] | | divergence | | | |
| Reprojection [28] | | 0.32 | 42.30 | 0.16 | 14.56 |
| KL div. (ours) | | 58.28 | 91.17 | 55.71 | 89.93 |
| Implicit diff. [12] | ✓ | 56.13 | 91.13 | 53.33 | 88.74 |
| Reprojection [28] | ✓ | 62.79 | 92.91 | 60.65 | 92.04 |
| KL div. (ours) | ✓ | 69.95 | 94.97 | 68.38 | 93.83 |

TABLE IV: Comparison to the state-of-the-art geometric methods. BPnP [12] is not included as it adopts a different train/test split. *Although GDRNet [43] only reports the performance in its ablation section, it is still a fair comparison to our method, since both use the same baseline (CDPN).

| Method | Type | ADD(-S) 0.02d | ADD(-S) 0.05d | ADD(-S) 0.1d |
| --- | --- | --- | --- | --- |
| CDPN [18] | PnP + Explicit depth | - | - | 89.86 |
| HybridPose [14] | Hybrid constraints | - | - | 91.3 |
| GDRNet* [43] | PnP + Explicit depth | 35.6 | 76.0 | 93.6 |
| DPOD [8] | PnP + Explicit refiner | - | - | 95.15 |
| PVNet-RePOSE [7] | PnP + Implicit refiner | - | - | 96.1 |
| PVNet-RNNPose [42] | PnP + Implicit refiner | 50.39 | 85.56 | 97.37 |
| Ours | PnP | 43.77 | 81.73 | 96.36 |

5.6 Comparison to the State of the Art

As shown in Table IV, although we base EPro-PnP on the older baseline CDPN [18], the results are better than some of the more advanced methods, e.g., the pose refiner RePOSE [7] that adds extra overhead to the PnP-based initial estimator PVNet [13]. Among all these entries, EPro-PnP is the most straightforward as it simply solves the PnP problem itself, without a refinement network [7, 8, 42], explicit depth prediction [18, 43], or multiple representations [14].

Moreover, removing the translation head (depth prediction) from the original CDPN-Full results in far fewer parameters in our model (from 113M to 27M), and the overall inference speed is more than twice as fast as CDPN-Full (including dataloading, measured at a batch size of 32), even though we introduce the iterative LM solver. Furthermore, faster inference is possible if the number of points $N = 64 \times 64$ is reduced to an optimal level.

5.7 Visualizations

As illustrated in Figure 5, the weight maps predicted by the model trained with the KL loss (B0) tend to be more focused on important parts of the objects (e.g., the head and handle of the watering can), while those with the derivative regularization loss (B1) are more evenly spread out. Combining the two loss functions (B2) leads to more reasonable weighting, and more details in the object geometry (represented by $x^{\text{3D}}$). With additional geometry pretraining and supervision (B6), the model outputs sharper correspondence maps, which contribute to higher pose accuracy and lower entropy of the probabilistic pose.

Figure 5: Visualizations of the inferred orientation distributions, weight maps, and coordinate maps on LineMOD test set.

6 3D Object Detection based on Deformable Correspondence Network

To demonstrate that EPro-PnP can learn the entire set of 2D-3D correspondences $\{x^{\text{3D}}_i, x^{\text{2D}}_i, w^{\text{2D}}_i\}_{i=1}^{N}$ from scratch, and the possibility of designing novel correspondence networks capable of handling pose ambiguity, we propose a novel deformable correspondence network for 3D object detection. The network owes its name to the Deformable DETR [27], a work that inspired our model architecture.

6.1 Network Architecture

As shown in Figure 6, the deformable correspondence network is an extension of the FCOS3D [44] framework. The original FCOS3D is a one-stage detector that directly regresses the center offset, depth, and yaw orientation of multiple objects for 4DoF pose estimation. In our adaptation, the outputs of the multi-level FCOS head [45] are modified to generate object queries instead of directly predicting the pose. Inspired by Deformable DETR [27], the appearance and position of a query are disentangled into the object embedding vector and the reference point. Moreover, to better distinguish objects of different classes, we learn a set of class embedding vectors, one of which is selected according to the object label and added to the object embedding vector (not shown in Figure 6 for brevity).

With the object queries, a multi-head deformable attention layer [27] is adopted to sample the key-value pairs from the interpolated dense feature map. The values are projected into point-wise features (point feat) and, at the same time, aggregated into the object-level features (obj feat).

The point features are passed into a subnet that predicts the 3D points and corresponding weights (normalized by Softmax). Following MonoRUn [28], the 3D points are set in the normalized object coordinate (NOC) space to handle categorical objects of various sizes.

The object features are responsible for predicting the object-level properties: (a) the 3D score (i.e., 3D localization confidence), (b) the global weight scale, (c) the 3D box size for recovering the absolute scale of the 3D points, and (d) other optional properties (velocity, attribute) required by the nuScenes benchmark [4].
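To make the roles of the point-level and object-level outputs concrete, the sketch below shows one plausible way to combine them: Softmax-normalized weights are multiplied by the global weight scale, and NOC points are rescaled by the predicted box size. Both conventions (a multiplicative scale and a per-axis element-wise product), as well as all function names, are assumptions made for illustration and may differ from the released code.

```python
import math

def softmax(logits):
    """Numerically stable Softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_weights(weight_logits, global_scale):
    """Softmax over the N points of one object, multiplied by the object-level scale."""
    return [global_scale * w for w in softmax(weight_logits)]

def noc_to_object_frame(noc_points, box_size):
    """Rescale normalized object coordinates by the predicted 3D box size, per axis."""
    return [[c * s for c, s in zip(p, box_size)] for p in noc_points]
```

Under these assumptions, the Softmax decides how the total weight is distributed among the points, while the global scale controls the overall concentration of the pose distribution.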

6.1.1 Implementation Details

We adopt the same detector architecture as in FCOS3D [44], with ResNet-101-DCN [46] as backbone. The deformable correspondence head predicts $N = 128$ pairs of 2D-3D points. The network is trained for 12 epochs by the AdamW [47] optimizer, with a batch size of 12 images across 2 GPUs on the nuScenes dataset [4].

Figure 6: The deformable correspondence network based on the FCOS3D [44] detector. Note that the sampled point-wise features are shared by the point-level subnet and the deformable attention layer that aggregates the features for object-level predictions.

6.2 Loss Functions

6.2.1 Correspondence Loss

The deformable 2D-3D correspondences can be learned solely with the KL divergence loss $\mathcal{L}_{\text{KL}}$, or in conjunction with the regularization loss $\mathcal{L}_{\text{reg}}$.

6.2.2 Auxiliary Correspondence Loss (Optional)

Inspired by the dense correspondence network MonoRUn [28], we regularize the dense features by appending a small auxiliary network that predicts the multi-head dense 3D coordinates and weights corresponding to densely-sampled 2D points within the ground truth (using RoI Align [48]). This allows us to employ the uncertainty-aware reprojection loss $\mathcal{L}_{\text{proj}}$ [28] without directly regularizing the deformable correspondences. Furthermore, we can convert the LiDAR scan of objects into sparse 3D object coordinate maps, so that the classical coordinate regression loss $\mathcal{L}_{\text{crd}}$ can be imposed on the auxiliary branch as well. Both of the loss functions are implemented as the NLL of Gaussian mixtures to deal with ambiguity (see supplementary for details).

6.2.3 Other Loss Functions

Loss functions on the FCOS head include:

  • Basic detector loss, including focal loss [49] for classification and cross entropy loss for centerness.

  • A smooth L1 loss for regressing the 2D reference points, with the target defined as the center of the visible region of the objects.

  • A GIoU loss [50] for auxiliary 2D box regression, following the 2D auxiliary supervision in M2BEV [51].
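For reference, the GIoU term used for the auxiliary 2D box regression can be sketched as follows for axis-aligned boxes. The `(x1, y1, x2, y2)` box format and the function names are assumptions for this example, and the boxes are assumed to have positive area.

```python
def giou(box_a, box_b):
    """Generalized IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (hull - union) / hull

def giou_loss(box_a, box_b):
    """Loss in [0, 2]: zero for identical boxes, larger for distant boxes."""
    return 1.0 - giou(box_a, box_b)
```

Unlike plain IoU, the enclosing-box penalty keeps the loss informative even when the predicted and target boxes do not overlap.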

Loss functions for the object-level predictions include:

  • A cross entropy loss for the 3D score.

  • A smooth L1 loss for regressing the 3D box size.

  • A smooth L1 loss for regressing the velocity and a cross entropy loss for attribute classification.

Additionally, inspired by DD3D [2], we further exploit the available LiDAR data to build an auxiliary depth supervision. By projecting the LiDAR points to the camera frame, we extract the point-wise features from the interpolated dense feature map, which are then fed into a small 2-layer MLP to predict the scene depth. As with the auxiliary correspondence loss functions in Section 6.2.2, the depth loss is implemented as the NLL of Gaussian mixtures, which allows modeling discontinuities around sharp edges [52].
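The idea of an NLL over a Gaussian mixture can be illustrated for a scalar target such as depth. The sketch below is a generic 1D formulation; the parameterization (explicit mixture weights, means, and standard deviations) and the function name are assumptions, and the actual losses are specified in the paper's supplementary material.

```python
import math

def gaussian_mixture_nll(target, weights, means, sigmas):
    """Negative log-likelihood of a scalar target under a 1D Gaussian mixture.

    weights must be positive and sum to one; sigmas must be positive.
    """
    log_terms = [
        math.log(w) - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        - 0.5 * ((target - m) / s) ** 2
        for w, m, s in zip(weights, means, sigmas)
    ]
    # Log-sum-exp over the components for numerical stability.
    peak = max(log_terms)
    return -(peak + math.log(sum(math.exp(v - peak) for v in log_terms)))
```

With two components placed at a foreground and a background depth, a target lying on either surface receives a low loss, whereas a target halfway between them is heavily penalized. This is how a mixture can represent a depth discontinuity instead of averaging across it.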

6.3 Dataset and Metrics

We evaluate the deformable correspondence network on the nuScenes 3D object detection benchmark [4], which provides large-scale data collected in 1000 scenes. Each scene contains 40 keyframes, annotated with a total of 1.4M 3D bounding boxes from 10 categories. Each keyframe includes 6 RGB images collected from surrounding cameras. The data is split into 700/150/150 scenes for training/validation/testing. The official benchmark evaluates the average precision with true positives judged by 2D center error on the ground plane. The mAP metric is computed by averaging over the thresholds of 0.5, 1, 2, 4 meters. In addition, there are 5 true positive metrics: Average Translation Error (ATE), Average Scale Error (ASE), Average Orientation Error (AOE), Average Velocity Error (AVE) and Average Attribute Error (AAE). Finally, there is a nuScenes detection score (NDS) computed as a weighted average of the above metrics.
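The text above describes the NDS only as a weighted average. In the nuScenes benchmark definition [4], mAP receives a weight of 5 and each of the five true positive errors is clipped to 1 and converted to a score, with the sum divided by 10. A minimal sketch follows; the function name is an assumption for this example.

```python
def nuscenes_detection_score(mean_ap, tp_errors):
    """NDS = (5 * mAP + sum(1 - min(1, err))) / 10 over the five TP errors.

    tp_errors holds (mATE, mASE, mAOE, mAVE, mAAE).
    """
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors]
    return (5.0 * mean_ap + sum(tp_scores)) / 10.0

# Row "Ours, R101" of Table VI: mAP 0.409 and the five mean TP errors.
nds = nuscenes_detection_score(0.409, [0.559, 0.239, 0.325, 1.090, 0.115])
print(round(nds, 3))  # 0.481
```

The result reproduces the NDS reported for that row. Note that an mAVE above 1 contributes nothing to the score, which is why single-frame methods are not penalized further for large velocity errors.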

6.4 Main Results and Discussions

6.4.1 Comparison Among Correspondence Loss Functions

As shown in Table V, the model trained with KL loss alone (C0) is significantly stronger than the model trained with the derivative regularization loss alone (C1) in all the metrics of concern, especially the orientation error (0.332 vs. 0.607). This is due to the presence of orientation ambiguity in the nuScenes dataset. Even if all the auxiliary loss functions (C2) are applied, the derivative regularization loss still fails to reach comparable performance to the Monte Carlo KL loss. Adding up all the loss functions (C3), the results can be boosted even further.

TABLE V: Experiments on the nuScenes validation set.

| ID | $\mathcal{L}_{\text{KL}}$ | $\mathcal{L}_{\text{reg}}$ | Aux. $\mathcal{L}_{\text{crd}}$ | Aux. $\mathcal{L}_{\text{proj}}$ | NDS↑ | mAP↑ | mATE↓ | mAOE↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| C0 | ✓ | | | | 0.447 | 0.380 | 0.656 | 0.332 |
| C1 | | ✓ | | | 0.408 | 0.363 | 0.683 | 0.607 |
| C2 | | ✓ | ✓ | ✓ | 0.429 | 0.363 | 0.691 | 0.397 |
| C3 | ✓ | ✓ | ✓ | ✓ | 0.463 | 0.392 | 0.626 | 0.282 |

TABLE VI: Comparison to the state-of-the-art single-frame image-based 3D object detectors on the nuScenes test set. Methods with extra pretraining other than ImageNet backbone are not included for comparison. § indicates test-time flip augmentation (TTA). † indicates model ensemble.

| Method | Backbone | NDS↑ | mAP↑ | mATE↓ | mASE↓ | mAOE↓ | mAVE↓ | mAAE↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FCOS3D §† [44] | R101 | 0.428 | 0.358 | 0.690 | 0.249 | 0.452 | 1.434 | 0.124 |
| PGD § [1] | R101 | 0.448 | 0.386 | 0.626 | 0.245 | 0.451 | 1.509 | 0.127 |
| PETR [53] | R101 | 0.455 | 0.391 | 0.647 | 0.251 | 0.433 | 0.933 | 0.143 |
| BEVFormer [54] | R101 | 0.462 | 0.409 | 0.650 | 0.261 | 0.439 | 0.925 | 0.147 |
| PolarFormer [55] | R101 | 0.470 | 0.415 | 0.657 | 0.263 | 0.405 | 0.911 | 0.139 |
| PETR [53] | Swin-B | 0.483 | 0.445 | 0.627 | 0.249 | 0.449 | 0.927 | 0.141 |
| Ours | R101 | 0.481 | 0.409 | 0.559 | 0.239 | 0.325 | 1.090 | 0.115 |
| Ours § | R101 | 0.490 | 0.423 | 0.547 | 0.236 | 0.302 | 1.071 | 0.123 |

6.4.2 Comparison to the State of the Art

Results on the nuScenes test set [4] are shown in Table VI. At the time of submitting the manuscript (Jan 2023), EPro-PnP is the No. 1 single-frame monocular 3D object detector without extra data, according to the official nuScenes detection leaderboard. Among the models using ResNet-101 as backbones, EPro-PnP outperforms PolarFormer [55] by a clear margin (NDS 0.481 vs. 0.470), despite basing the deformable correspondence network on the older FCOS detector. With test-time flip augmentation (following FCOS3D [44]), our model even outperforms PETR [53] with the bulky Swin-B [56] backbone (NDS 0.490 vs. 0.483).

Since EPro-PnP is targeted at improving pose accuracy, it is not surprising to see that our model obtains exceptional results regarding the mATE and mAOE metrics, outperforming PolarFormer by a wide margin (mATE 0.559 vs. 0.657, mAOE 0.325 vs. 0.405).

It is worth noting that EPro-PnP is currently the only method among the entries in Table VI that utilizes geometric pose reasoning, which is not a popular choice because previous non-end-to-end geometric methods usually fall behind when trained on large-scale real-life data.

6.5 Visualizations

An example of the monocular detection result is shown in Figure 7. We observe that the red 2D points (indicating greater $w^{\text{2D}}$ in the X axis) are usually spread right over the objects, which mainly determines the orientation, while the green 2D points (indicating greater $w^{\text{2D}}$ in the Y axis) are off the top and bottom of the objects, which determines the position (mainly the depth). It seems that the network learns to associate object depth with the height of the object’s projection, since the height is invariant to the 1D orientation in the ground plane.

Figure 8 shows that the flexibility of EPro-PnP allows predicting multimodal distributions with strong expressive power, successfully capturing the orientation ambiguity without discrete multi-bin classification [44, 57] or complicated mixture model [37].

6.6 Inference Time

The average inference time per frame (comprising a batch of 6 surrounding 1600×672 images, without TTA) is shown in Table VII, measured on RTX 3090 GPU and Core i9-10920X CPU. On average, the batch PnP solver takes 26 ms (PyTorch v1.8.1) or 46 ms (PyTorch v1.10.1) to process 655.3 objects per frame, before non-maximum suppression (NMS).

TABLE VII: Inference time (sec) of the deformable correspondence network on nuScenes [4]. The PnP solver (including initialization) works faster (26 ms) with PyTorch v1.8.1, for which the code was originally developed, while the full model works faster (304 ms) with PyTorch v1.10.1.

| PyTorch | Backbone & FPN | Heads: FCOS | Heads: Deform | PnP | Total |
| --- | --- | --- | --- | --- | --- |
| v1.8.1+cu111 | 0.194 | 0.074 | 0.029 | 0.026 | 0.328 |
| v1.10.1+cu113 | 0.173 | 0.056 | 0.025 | 0.046 | 0.304 |

Figure 7: Inferred results on nuScenes validation set. On the top-left are the predicted 3D bounding boxes. On the bottom-left are the 2D points $x^{\text{2D}}$ colored by the XY corresponding weights $w^{\text{2D}}$ (red indicates $w^{\text{2D}}_{\text{X}} > w^{\text{2D}}_{\text{Y}}$, green indicates $w^{\text{2D}}_{\text{X}} < w^{\text{2D}}_{\text{Y}}$). On the right side are the inferred bounding boxes (red), marginal position density (blue), marginal orientation density (grey line), and ground truth bounding boxes (green) in bird’s eye view.
Figure 8: Examples of ambiguous orientations in the nuScenes dataset and the predicted orientation distribution (conditioned on the optimal position). The ambiguity originates from either uncertain observations (e.g., distant pedestrian, night time motorcycle), or symmetric objects.

7 Limitations

Training the network with the Monte Carlo pose loss is inevitably slower than the baseline. With a batch size of 32 on a GTX 1080 Ti GPU, training the CDPN (without translation head) takes 143 seconds per epoch with the original coordinate regression loss, and 241 seconds per epoch with the Monte Carlo pose loss, which is about 70% longer. However, the training time can be controlled by adjusting the number of Monte Carlo samples or the number of 2D-3D corresponding points.

Although the underlying principles are theoretically generalizable to other learning models with nested optimization layers, known as declarative networks [38], the Monte Carlo pose loss would become impractical as the dimensionality grows.

While EPro-PnP seems to be a universal approach to end-to-end geometric pose estimation, it should be noted that the design of the 2D-3D correspondence network still plays a major role in the model. For example, simply removing the 2D box size from Figure 4 would result in a notable decrease in pose accuracy. Future work may explore the feature-metric correspondence in [7, 42, 58] as a more expressive alternative to plain Euclidean reprojection error.

8 Conclusion

This paper proposes the EPro-PnP, which translates the non-differentiable PnP operation into a differentiable probabilistic layer, empowering end-to-end 2D-3D correspondence learning of unprecedented flexibility. The connections to previous work [28, 12, 11, 10, 7] have been thoroughly discussed with theoretical and experimental proofs, revealing the contributions of the Monte Carlo KL loss and the derivative regularization loss. For application, EPro-PnP can be simply integrated into existing PnP-based networks, or inspire novel solutions such as the deformable correspondence network.

Acknowledgments

This project was supported by the National Natural Science Foundation of China [No. 52002285], the Shanghai Science and Technology Commission [No. 21ZR1467400], the original research project of Tongji University [No. 22120220593], and the National Key R&D Program of China [No. 2021YFB2501104]. Part of the work was done when H. Chen was interning at Alibaba Group, supported by the Alibaba Research Intern Program.

References

  • [1] T. Wang, X. Zhu, J. Pang, and D. Lin, “Probabilistic and geometric depth: Detecting objects in perspective,” in Conference on Robot Learning (CoRL), 2021.
  • [2] D. Park, R. Ambrus, V. Guizilini, J. Li, and A. Gaidon, “Is pseudo-lidar needed for monocular 3d object detection?” in ICCV, 2021.
  • [3] Y. Wang, V. Guizilini, T. Zhang, Y. Wang, H. Zhao, and J. Solomon, “Detr3d: 3d object detection from multi-view images via 3d-to-2d queries,” in Conference on Robot Learning (CoRL), 2021.
  • [4] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” in CVPR, 2020.
  • [5] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in CVPR, 2012.
  • [6] S. Hinterstoisser, S. Holzer, C. Cagniart, S. Ilic, K. Konolige, N. Navab, and V. Lepetit, “Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes,” in ICCV, 2011.
  • [7] S. Iwase, X. Liu, R. Khirodkar, R. Yokota, and K. M. Kitani, “Repose: Fast 6d object pose refinement via deep texture rendering,” in ICCV, 2021.
  • [8] S. Zakharov, I. Shugurov, and S. Ilic, “Dpod: 6d pose object detector and refiner,” in ICCV, 2019.
  • [9] E. Brachmann, A. Krull, S. Nowozin, J. Shotton, F. Michel, S. Gumhold, and C. Rother, “Dsac - differentiable ransac for camera localization,” in CVPR, 2017.
  • [10] E. Brachmann and C. Rother, “Learning less is more - 6d camera localization via 3d surface regression,” in CVPR, 2018.
  • [11] D. Campbell, L. Liu, and S. Gould, “Solving the blind perspective-n-point problem end-to-end with robust differentiable geometric optimization,” in ECCV, 2020.
  • [12] B. Chen, A. Parra, J. Cao, N. Li, and T.-J. Chin, “End-to-end learnable geometric vision by backpropagating pnp optimization,” in CVPR, 2020.
  • [13] S. Peng, Y. Liu, Q. Huang, X. Zhou, and H. Bao, “Pvnet: Pixel-wise voting network for 6dof pose estimation,” in CVPR, 2019.
  • [14] C. Song, J. Song, and Q. Huang, “Hybridpose: 6d object pose estimation under hybrid representations,” in CVPR, 2020.
  • [15] M. Rad and V. Lepetit, “BB8: A scalable, accurate, robust to partial occlusion method for predicting the 3d poses of challenging objects without using depth,” in ICCV, 2017.
  • [16] H. Wang, S. Sridhar, J. Huang, J. Valentin, S. Song, and L. J. Guibas, “Normalized object coordinate space for category-level 6d object pose and size estimation,” in CVPR, 2019.
  • [17] K. Park, T. Patten, and M. Vincze, “Pix2pose: Pixel-wise coordinate regression of objects for 6d pose estimation,” in ICCV, 2019.
  • [18] Z. Li, G. Wang, and X. Ji, “Cdpn: Coordinates-based disentangled pose network for real-time rgb-based 6-dof object pose estimation,” in ICCV, 2019.
  • [19] P. Li, H. Zhao, P. Liu, and F. Cao, “Rtm3d: Real-time monocular 3d detection from object keypoints for autonomous driving,” in ECCV, 2020.
  • [20] F. Chabot, M. Chaouch, J. Rabarisoa, C. Teulière, and T. Chateau, “Deep manta: A coarse-to-fine many-task network for joint 2d and 3d vehicle analysis from monocular image,” in CVPR, 2017.
  • [21] F. Manhardt, D. M. Arroyo, C. Rupprecht, B. Busam, N. Navab, and F. Tombari, “Explaining the ambiguity of object detection and 6d pose from visual data,” in ICCV, 2019.
  • [22] G. Schweighofer and A. Pinz, “Robust pose estimation from a planar target,” IEEE TPAMI, vol. 28, no. 12, pp. 2024–2030, 2006.
  • [23] J.-M. Cornuet, J.-M. Marin, A. Mira, and C. P. Robert, “Adaptive multiple importance sampling,” Scandinavian Journal of Statistics, vol. 39, no. 4, pp. 798–812, 2012.
  • [24] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in NIPS, 2017.
  • [25] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in ECCV, 2020.
  • [26] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in CVPR, 2018.
  • [27] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable detr: Deformable transformers for end-to-end object detection,” in ICLR, 2021.
  • [28] H. Chen, Y. Huang, W. Tian, Z. Gao, and L. Xiong, “Monorun: Monocular 3d object detection by reconstruction and uncertainty propagation,” in CVPR, 2021.
  • [29] A. Kendall and Y. Gal, “What uncertainties do we need in bayesian deep learning for computer vision?” in NIPS, 2017.
  • [30] Y. He, C. Zhu, J. Wang, M. Savvides, and X. Zhang, “Bounding box regression with uncertainty for accurate object detection,” in CVPR, 2019.
  • [31] S. Wu, C. Rupprecht, and A. Vedaldi, “Unsupervised learning of probably symmetric deformable 3d objects from images in the wild,” in CVPR, 2020.
  • [32] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in ICLR, 2014.
  • [33] I. Gilitschenski, R. Sahoo, W. Schwarting, A. Amini, S. Karaman, and D. Rus, “Deep orientation uncertainty learning based on a bingham loss,” in ICLR, 2020.
  • [34] O. Makansi, E. Ilg, O. Cicek, and T. Brox, “Overcoming limitations of mixture density networks: A sampling and fitting framework for multimodal future prediction,” in CVPR, 2019.
  • [35] C. M. Bishop, “Mixture density networks,” 1994.
  • [36] E. Brachmann, F. Michel, A. Krull, M. Y. Yang, S. Gumhold, and C. Rother, “Uncertainty-driven 6d pose estimation of objects and scenes from a single rgb image,” in CVPR, 2016.
  • [37] M. Bui, T. Birdal, H. Deng, S. Albarqouni, L. Guibas, S. Ilic, and N. Navab, “6d camera relocalization in ambiguous scenes via continuous multimodal inference,” in ECCV, 2020.
  • [38] S. Gould, R. Hartley, and D. J. Campbell, “Deep declarative networks,” IEEE TPAMI, 2021.
  • [39] D. E. Tyler, “Statistical analysis for the angular central gaussian distribution on the sphere,” Biometrika, vol. 74, no. 3, pp. 579–589, 1987. [Online]. Available: http://www.jstor.org/stable/2336697
  • [40] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
  • [41] V. Lepetit, F. Moreno-Noguer, and P. Fua, “Epnp: An accurate o(n) solution to the pnp problem,” International Journal Of Computer Vision, vol. 81, pp. 155–166, 2009.
  • [42] Y. Xu, K.-Y. Lin, G. Zhang, X. Wang, and H. Li, “Rnnpose: Recurrent 6-dof object pose refinement with robust correspondence field estimation and pose optimization,” in CVPR, 2022.
  • [43] G. Wang, F. Manhardt, F. Tombari, and X. Ji, “Gdr-net: Geometry-guided direct regression network for monocular 6d object pose estimation,” in CVPR, 2021.
  • [44] T. Wang, X. Zhu, J. Pang, and D. Lin, “FCOS3D: Fully convolutional one-stage monocular 3d object detection,” in ICCV Workshops, 2021.
  • [45] Z. Tian, C. Shen, H. Chen, and T. He, “Fcos: Fully convolutional one-stage object detection,” in ICCV, 2019.
  • [46] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei, “Deformable convolutional networks,” in ICCV, 2017.
  • [47] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in ICLR, 2019.
  • [48] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in ICCV, 2017.
  • [49] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” IEEE TPAMI, vol. 42, no. 2, pp. 318–327, 2020.
  • [50] S. H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. D. Reid, and S. Savarese, “Generalized intersection over union: A metric and A loss for bounding box regression,” in CVPR, 2019.
  • [51] E. Xie, Z. Yu, D. Zhou, J. Philion, A. Anandkumar, S. Fidler, P. Luo, and J. M. Alvarez, “M2bev: Multi-camera joint 3d detection and segmentation with unified birds-eye view representation,” 2022.
  • [52] F. Tosi, Y. Liao, C. Schmitt, and A. Geiger, “Smd-nets: Stereo mixture density networks,” in CVPR, 2021.
  • [53] Y. Liu, T. Wang, X. Zhang, and J. Sun, “Petr: Position embedding transformation for multi-view 3d object detection,” in ECCV, 2022.
  • [54] Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Y. Qiao, and J. Dai, “Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers,” in ECCV, 2022.
  • [55] Y. Jiang, L. Zhang, Z. Miao, X. Zhu, J. Gao, W. Hu, and Y.-G. Jiang, “Polarformer: Multi-camera 3d object detection with polar transformers,” in AAAI, 2023.
  • [56] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in ICCV, 2021.
  • [57] A. Mousavian, D. Anguelov, J. Flynn, and J. Kosecka, “3d bounding box estimation using deep learning and geometry,” in CVPR, 2017.
  • [58] Q. Lian, P. Li, and X. Chen, “Monojsg: Joint semantic and geometric cost volume for monocular 3d object detection,” in CVPR, 2022, pp. 1060–1069.
  • [59] S. Agarwal, K. Mierle, and Others, “Ceres solver,” http://ceres-solver.org.
  • [60] B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon, “Bundle adjustment: A modern synthesis,” in International Workshop on Vision Algorithms: Theory and Practice, 2000.
  • [61] E. Bingham, J. P. Chen, M. Jankowiak, F. Obermeyer, N. Pradhan, T. Karaletsos, R. Singh, P. Szerlip, P. Horsfall, and N. D. Goodman, “Pyro: Deep Universal Probabilistic Programming,” Journal of Machine Learning Research, 2018.
  • [62] I. S. Dhillon and S. Sra, “Modeling data using directional distributions,” 2003.
Hansheng Chen received the B.E. and M.S.E. degrees in vehicle engineering from Tongji University, Shanghai, in 2020 and 2023, respectively. He is pursuing the PhD degree in the Computer Science Department of Stanford University. His current research interests lie in computer graphics and 3D vision, with a specific focus on 3D generation, reconstruction, editing, and neural rendering.
Wei Tian received the B.Sc. degree in mechatronics engineering from Tongji University, Shanghai, China, in 2010, and the M.Sc. degree in electrical engineering and information technology from KIT, Karlsruhe, Germany, in May 2013. From October 2013, he was with the Institute of Measurement and Control Systems at KIT, where he received the Ph.D. degree in January 2019. Since May 2020, he has been the leader of the comprehensive perception research group at the School of Automotive Studies, Tongji University. He is currently working on robust object detection and trajectory prediction.
Pichao Wang received the Ph.D. degree in computer science from the University of Wollongong, Wollongong, NSW, Australia. He is currently a Senior Research Scientist at Amazon Prime Video, USA. He has authored 90+ peer-reviewed papers, including those in highly regarded journals and conferences such as IJCV, IEEE TMM, CVPR, ICCV, ECCV, ICLR, AAAI, and ACM MM. He is the recipient of the CVPR 2022 Best Student Paper Award. He was named an AI 2000 Most Influential Scholar for 2012-2022 by AMiner, due to his contributions in the field of multimedia. He is also in the list of World’s Top 2% Scientists named by Stanford University. He served as an Area Chair of ICME 2021 and 2022. He also serves as an Associate Editor of the Journal of Computer Science and Technology (Tier 1, CCF B).
Fan Wang received the B.S. and M.S. degrees from the Department of Automation, Tsinghua University, Beijing, China, and the Ph.D. degree from the Department of Electrical Engineering, Stanford University, Stanford, CA, USA. She is currently working as a Senior Staff Algorithm Engineer with Alibaba Group. Her research interests include object tracking and recognition, 3D vision, and multi-sensor fusion.
Lu Xiong received the B.E., M.E., and Ph.D. degrees in vehicle engineering from the School of Automotive Studies, Tongji University, Shanghai, China, in 1999, 2002, and 2005, respectively. From November 2008 to 2009, he was a Postdoctoral Fellow with the Institute of Automobile Engineering and Vehicle Engines, University of Stuttgart, Stuttgart, Germany, with Dr. Jochen Wiedemann. He is currently a Professor with Tongji University and the Deputy Dean of the School of Automotive Studies. His research interests include perception, decision and planning, dynamics control and state estimation, and testing and evaluation of autonomous vehicles.
Hao Li received the Ph.D. degree from the Chinese Academy of Sciences. He is in charge of real-scene visual understanding technologies. He has published more than 20 papers and owns more than 20 licensed patents. His research interests include smart interpretation of remote sensing images, facial recognition-based clocking-in systems, new retail, smart campuses, deep learning model compression, facial recognition, person re-identification, and image search.

Appendix A Levenberg-Marquardt PnP Solver

For parallel processing on GPU, we have implemented a PyTorch-based batch Levenberg-Marquardt (LM) PnP solver. The implementation generally follows the Ceres solver [59]. Here, we discuss some important details that are related to the proposed Monte Carlo pose sampling and derivative regularization.

A.1 LM Step with Huber Kernel

Adding the Huber kernel influences every related aspect from the likelihood function to the LM iteration step and derivative regularization loss. Thanks to PyTorch’s automatic differentiation, the robustified Monte Carlo KL divergence loss does not require much special handling. For the LM solver, however, the residual $F(y)$ (concatenated weighted reprojection errors) and the Jacobian matrix $J$ have to be rescaled before computing the robustified LM step [60].

The rescaled residual block $\tilde{f}_i(y)$ and Jacobian block $\tilde{J}_i(y)$ of the $i$-th point pair are defined as:

$$\tilde{f}_i(y) = \sqrt{\rho'_i}\, f_i(y), \tag{17}$$
$$\tilde{J}_i(y) = \sqrt{\rho'_i}\, J_i(y), \tag{18}$$

where

$$\rho'_i = \begin{cases} 1, & \|f_i(y)\| \leq \delta, \\ \dfrac{\delta}{\|f_i(y)\|}, & \|f_i(y)\| > \delta, \end{cases} \tag{19}$$
$$J_i(y) = \frac{\partial f_i(y)}{\partial y^{\text{T}}}. \tag{20}$$

Following the implementation of Ceres solver [59], the robustified LM iteration step is:

$$\Delta y = -\left(\tilde{J}^{\text{T}}\tilde{J} + \lambda D^2\right)^{-1} \tilde{J}^{\text{T}}\tilde{F}, \tag{21}$$

where

$$\tilde{J} = \begin{bmatrix} \tilde{J}_1(y) \\ \vdots \\ \tilde{J}_N(y) \end{bmatrix}, \quad \tilde{F} = \begin{bmatrix} \tilde{f}_1(y) \\ \vdots \\ \tilde{f}_N(y) \end{bmatrix}, \tag{22}$$

$D$ is the square root of the diagonal of the matrix $\tilde{J}^{\text{T}}\tilde{J}$, and $\lambda$ is the reciprocal of the LM trust region radius [59].

Note that the rescaled residual and Jacobian affect the derivative regularization, as well as the covariance estimation in the next subsection.
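As a rough illustration of Eqs. (17)-(22), the following NumPy sketch computes one robustified LM step for a single (non-batched) problem. It is not the authors' PyTorch implementation: the function names, the array layout (`f` of shape `(N, 2)`, `J` of shape `(N, 2, d)`), and the small constant guarding against division by zero are assumptions made for this example.

```python
import numpy as np

def huber_rescale(f, J, delta):
    """Rescale residual blocks f (N, 2) and Jacobian blocks J (N, 2, d), Eqs. (17)-(19)."""
    norm = np.linalg.norm(f, axis=-1)  # ||f_i(y)||, shape (N,)
    # rho'_i = 1 inside the Huber threshold, delta / ||f_i|| outside (Eq. 19).
    rho_prime = np.where(norm <= delta, 1.0, delta / np.maximum(norm, 1e-12))
    s = np.sqrt(rho_prime)
    return s[:, None] * f, s[:, None, None] * J

def lm_step(f, J, delta, lam):
    """One robustified LM step (Eq. 21); lam is the reciprocal trust region radius."""
    f_t, J_t = huber_rescale(f, J, delta)
    F = f_t.reshape(-1)                 # stacked residual (Eq. 22)
    Jm = J_t.reshape(-1, J.shape[-1])   # stacked Jacobian (Eq. 22)
    JtJ = Jm.T @ Jm
    D2 = np.diag(np.diag(JtJ))          # D^2: diagonal of J^T J
    return -np.linalg.solve(JtJ + lam * D2, Jm.T @ F)
```

For a linear residual, a very large `delta`, and `lam = 0`, a single step recovers the ordinary least-squares solution, which is a convenient sanity check.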

A.1.1 Fast Inference Mode

We empirically observe that in a well-trained model, the LM trust region radius can be initialized with a very large value, effectively rendering the LM algorithm redundant. We therefore use the simple Gauss-Newton implementation for fast inference:

$$\Delta y = -\left(\tilde{J}^{\text{T}}\tilde{J} + \varepsilon I\right)^{-1} \tilde{J}^{\text{T}}\tilde{F}, \tag{23}$$

where $\varepsilon$ is a small value for numerical stability.

A.2 Covariance Estimation

During training, the concentration of the AMIS proposal is determined by the local estimation of the pose covariance matrix $\Sigma_{y^\ast}$, defined as:

$$\Sigma_{y^\ast} = \left.\left(\tilde{J}^{\text{T}}\tilde{J} + \varepsilon I\right)^{-1}\right|_{y=y^\ast}, \tag{24}$$

where $y^\ast$ is the LM solution that determines the location of the proposal distribution.
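The Gauss-Newton step of Eq. (23) and the covariance of Eq. (24) share the same damped normal matrix, so both can be obtained from one factorization. The sketch below shows this for already-rescaled inputs; the function name and the default `eps=1e-5` are assumptions for illustration, not values taken from the paper.

```python
import numpy as np

def gn_step_and_cov(J_t, F_t, eps=1e-5):
    """Gauss-Newton step (Eq. 23) and local pose covariance (Eq. 24).

    J_t: rescaled stacked Jacobian, shape (M, d); F_t: rescaled residual, shape (M,).
    """
    H = J_t.T @ J_t + eps * np.eye(J_t.shape[1])  # damped normal matrix
    step = -np.linalg.solve(H, J_t.T @ F_t)
    cov = np.linalg.inv(H)  # meaningful as Sigma_{y*} only when evaluated at y = y*
    return step, cov
```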

Appendix B Details on Monte Carlo Pose Sampling

B.1 Proposal Distribution for Position

For the proposal distribution of the translation vector $t \in \mathbb{R}^3$, we adopt the multivariate t-distribution, with the following probability density function (PDF):

$$q_{\text{T}}(t) = \frac{\Gamma\left(\frac{\nu+3}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)\sqrt{\nu^3 \pi^3 |\Sigma|}} \left(1 + \frac{1}{\nu}\|t-\mu\|_{\Sigma}^2\right)^{-\frac{\nu+3}{2}}, \tag{25}$$

where $\|t-\mu\|_{\Sigma}^2 = (t-\mu)^{\text{T}}\Sigma^{-1}(t-\mu)$, with the location $\mu$, the 3×3 positive definite scale matrix $\Sigma$, and the degrees of freedom $\nu$. Following [23], we set $\nu$ to 3. Compared to the multivariate normal distribution, the t-distribution has a heavier tail, which is ideal for robust sampling.

The multivariate t-distribution has been implemented in the Pyro [61] package.
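For readers without Pyro at hand, Eq. (25) can be sketched directly in NumPy. The code below is an illustrative stand-in rather than the Pyro implementation: the function names are hypothetical, and sampling uses the standard representation of a t-distributed vector as a Gaussian draw divided by the square root of an independent chi-square variable over $\nu$.

```python
import math
import numpy as np

def mvt_logpdf(t, mu, sigma, nu=3.0):
    """Log of Eq. (25); t may be a single vector (d,) or a batch (n, d)."""
    d = mu.shape[0]
    diff = t - mu
    maha = np.einsum('...i,ij,...j->...', diff, np.linalg.inv(sigma), diff)
    _, logdet = np.linalg.slogdet(sigma)
    log_norm = (math.lgamma((nu + d) / 2) - math.lgamma(nu / 2)
                - 0.5 * (d * math.log(nu * math.pi) + logdet))
    return log_norm - 0.5 * (nu + d) * np.log1p(maha / nu)

def mvt_sample(rng, mu, sigma, nu=3.0, size=1):
    """Draw `size` samples: Gaussian draw scaled by sqrt(nu / chi2_nu)."""
    d = mu.shape[0]
    z = rng.multivariate_normal(np.zeros(d), sigma, size=size)
    g = rng.chisquare(nu, size=size) / nu
    return mu + z / np.sqrt(g)[:, None]
```

Working in the log domain avoids underflow when the importance weights are later formed as ratios of densities.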

B.1.1 Initial Parameters

The initial location and scale are determined by the PnP solution and covariance matrix, i.e., $\mu \leftarrow t^\ast, \Sigma \leftarrow \Sigma_{t^\ast}$, where $\Sigma_{t^\ast}$ is the 3×3 submatrix of the full pose covariance $\Sigma_{p^\ast}$. Note that the actual covariance of the t-distribution is thus $\frac{\nu}{\nu-2}\Sigma_{t^\ast}$ (i.e., $3\Sigma_{t^\ast}$ for $\nu = 3$), which is intentionally scaled up for robust sampling in a wider range.

B.1.2 Parameter Estimation from Weighted Samples

To update the proposal, we let the location $\mu$ and scale $\Sigma$ be the first and second moment of the weighted samples (i.e., weighted mean and covariance), respectively.

B.2 Proposal Distribution for 1D Orientation

For the proposal distribution of the 1D yaw-only orientation $\theta$, we adopt a mixture of von Mises and uniform distribution. The von Mises is also known as the circular normal distribution, and its PDF is given by:

$$q_{\text{VM}}(\theta) = \frac{\exp\left(\kappa \cos(\theta - \mu)\right)}{2\pi I_0(\kappa)}, \tag{26}$$

where $\mu$ is the location parameter, $\kappa$ is the concentration parameter, and $I_0(\cdot)$ is the modified Bessel function with order zero. The mixture PDF is thus:

$$q_{\text{mix}}(\theta) = (1-\alpha)\, q_{\text{VM}}(\theta) + \alpha\, q_{\text{uniform}}(\theta), \tag{27}$$

with the uniform mixture weight $\alpha$. The uniform component is added in order to capture other potential modes under orientation ambiguity. We set $\alpha$ to a fixed value of $1/4$.

PyTorch has already implemented the von Mises distribution, but its random sample generation is rather slow. As an alternative we use the NumPy implementation for random sampling.
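A minimal NumPy sketch of the mixture in Eqs. (26)-(27) is given below. The function names are hypothetical, and the direct use of `np.exp` and `np.i0` is an assumption that holds only for moderate concentrations; for very large $\kappa$ both overflow and a log-domain evaluation would be needed.

```python
import numpy as np

def mix_pdf(theta, mu, kappa, alpha=0.25):
    """Mixture density of Eq. (27) on the circle [-pi, pi)."""
    vm = np.exp(kappa * np.cos(theta - mu)) / (2 * np.pi * np.i0(kappa))  # Eq. (26)
    return (1 - alpha) * vm + alpha / (2 * np.pi)

def mix_sample(rng, mu, kappa, alpha=0.25, size=1):
    """Draw each sample from the uniform component with probability alpha,
    otherwise from the von Mises component."""
    vm = rng.vonmises(mu, kappa, size=size)
    uni = rng.uniform(-np.pi, np.pi, size=size)
    return np.where(rng.random(size) < alpha, uni, vm)
```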

B.2.1 Initial Parameters

With the yaw angle $\theta^\ast$ and its variance $\sigma^2_{\theta^\ast}$ from the PnP solver, the parameters of the von Mises proposal are initialized by $\mu \leftarrow \theta^\ast, \kappa \leftarrow \frac{1}{3\sigma^2_{\theta^\ast}}$.

B.2.2 Parameter Estimation from Weighted Samples

For the location $\mu$, we simply adopt its maximum likelihood estimation, i.e., the circular mean of the weighted samples. For the concentration $\kappa$, we first compute an approximated estimation [62] by:

$$\hat{\kappa} = \frac{\bar{r}(2 - \bar{r}^2)}{1 - \bar{r}^2}, \tag{28}$$

where $\bar{r} = \left\lVert \sum_j v_j [\sin\theta_j, \cos\theta_j]^{\text{T}} / \sum_j v_j \right\rVert$ is the norm of the mean orientation vector, with the importance weight $v_j$ for the $j$-th sample $\theta_j$. Finally, the concentration is scaled down for robust sampling, such that $\kappa \leftarrow \hat{\kappa}/3$.
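The update described above amounts to a few lines of NumPy. In this sketch the function name and the clamp on $\bar{r}$ (which keeps the denominator of Eq. (28) away from zero when all samples coincide) are assumptions added for the example.

```python
import numpy as np

def fit_von_mises(theta, v):
    """Weighted circular mean and down-scaled concentration (Eq. 28, then kappa_hat / 3)."""
    w = v / v.sum()
    s = np.sum(w * np.sin(theta))
    c = np.sum(w * np.cos(theta))
    mu = np.arctan2(s, c)               # circular mean of the weighted samples
    r = min(np.hypot(s, c), 1 - 1e-6)   # norm of the mean orientation vector, clamped
    kappa_hat = r * (2 - r**2) / (1 - r**2)
    return mu, kappa_hat / 3
```

For two equally weighted samples at $0$ and $\pi/2$, this gives $\mu = \pi/4$ and $\bar{r} = \sqrt{2}/2$, hence $\hat{\kappa} = 3\sqrt{2}/2$ and $\kappa = \sqrt{2}/2$.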

B.3 Proposal Distribution for 3D Orientation

Regarding the quaternion-based parameterization of 3D orientation, which can be represented by a unit 4D vector $l$, we adopt the angular central Gaussian (ACG) distribution as the proposal. The support of the 4-dimensional ACG distribution is the unit hypersphere, and the PDF is given by:

$$q_{\text{ACG}}(l) = \frac{\left(l^{\text{T}}\Lambda^{-1}l\right)^{-2}}{S_4 |\Lambda|^{\frac{1}{2}}}, \tag{29}$$

where $S_4 = 2\pi^2$ is the 3D surface area of the 4D sphere, and $\Lambda$ is a 4×4 positive definite matrix.

The ACG density can be derived by integrating the zero-mean multivariate normal distribution $\mathcal{N}(0, \Lambda)$ along the radial direction from $0$ to $\infty$. Therefore, drawing samples from the ACG distribution is equivalent to sampling from $\mathcal{N}(0, \Lambda)$ and then normalizing the samples to unit radius.
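The sampling procedure and the density of Eq. (29) can be sketched as follows; the function names are hypothetical and the code handles a single proposal rather than a batch.

```python
import numpy as np

def acg_sample(rng, lam, size=1):
    """Sample N(0, Lambda) in R^4 and project onto the unit hypersphere."""
    x = rng.multivariate_normal(np.zeros(4), lam, size=size)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def acg_pdf(l, lam):
    """ACG density of Eq. (29) for unit 4D vectors l, shape (4,) or (n, 4)."""
    quad = np.einsum('...i,ij,...j->...', l, np.linalg.inv(lam), l)
    return quad ** -2 / (2 * np.pi**2 * np.sqrt(np.linalg.det(lam)))
```

With $\Lambda = I$ the density reduces to the constant $1/S_4$, i.e., the uniform distribution over the hypersphere.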

B.3.1 Initial Parameters

Consider $l^\ast$ to be the PnP solution and $\Sigma_{l^\ast}^{-1}$ to be the estimated 4×4 inverse covariance matrix. Note that $\Sigma_{l^\ast}^{-1}$ is only valid in the local tangent space with rank 3, satisfying ${l^\ast}^{\text{T}}\Sigma_{l^\ast}^{-1}l^\ast = 0$. The initial parameters are determined by:

$$\Lambda \leftarrow \hat{\Lambda} + \alpha |\hat{\Lambda}|^{\frac{1}{4}} I, \tag{30}$$

where $\hat{\Lambda} = \left(\Sigma_{l^\ast}^{-1} + I\right)^{-1}$, and $\alpha$ is a hyperparameter that controls the dispersion of the proposal for robust sampling. We set $\alpha$ to 0.001 in the experiments.
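To see why this initialization centers the proposal on the PnP solution, note that the rank-3 matrix $\Sigma_{l^\ast}^{-1}$ annihilates $l^\ast$, so $\hat{\Lambda}$ keeps a unit eigenvalue along $l^\ast$ and shrinks the tangent directions. The sketch below implements Eq. (30); the function name and argument name `info` (for $\Sigma_{l^\ast}^{-1}$) are assumptions for this example.

```python
import numpy as np

def acg_init(info, alpha=1e-3):
    """Initial ACG parameter (Eq. 30) from the 4x4 inverse covariance `info`."""
    lam_hat = np.linalg.inv(info + np.eye(4))
    # Isotropic term scaled by det^(1/4), so the dispersion is relative to lam_hat.
    return lam_hat + alpha * np.linalg.det(lam_hat) ** 0.25 * np.eye(4)
```

For example, with `info = 100 * (I - l l^T)` for a unit quaternion `l`, the dominant eigenvector of the result is `l` (up to sign), which matches the antipodal symmetry of the ACG density.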

B.3.2 Parameter Estimation from Weighted Samples

Based on the samples $l_j$ and weights $v_j$, the maximum likelihood estimation $\hat{\Lambda}$ is the solution to the following equation:

$$\hat{\Lambda} = \frac{4}{\sum_j v_j} \sum_j \frac{v_j l_j l_j^{\text{T}}}{l_j^{\text{T}} \hat{\Lambda}^{-1} l_j}. \tag{31}$$

The solution to Eq. (31) can be computed by fixed-point iteration [39]. The final parameters of the updated proposal are determined in the same way as in Eq. (30).
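A plain fixed-point iteration for Eq. (31) might look like the sketch below. The identity starting point, the fixed iteration count, and the trace normalization after each update are assumptions of this example; the normalization is harmless because the ACG density is invariant to rescaling $\Lambda$, and Eq. (30) scales its isotropic term with $|\hat{\Lambda}|^{1/4}$ accordingly.

```python
import numpy as np

def acg_fit(l, v, n_iter=50):
    """Weighted fixed-point iteration for Eq. (31).

    l: unit 4D samples, shape (n, 4); v: importance weights, shape (n,).
    """
    w = v / v.sum()
    lam = np.eye(4)
    for _ in range(n_iter):
        quad = np.einsum('ni,ij,nj->n', l, np.linalg.inv(lam), l)
        lam = 4 * np.einsum('n,ni,nj->ij', w / quad, l, l)
        lam *= 4 / np.trace(lam)  # fix the arbitrary scale of Lambda
    return lam
```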

Appendix C Details on Derivative Regularization Loss

As stated in the main paper, the derivative regularization loss $\mathcal{L}_{\text{reg}}$ consists of the position loss $\mathcal{L}_{\text{pos}}$ and the orientation loss $\mathcal{L}_{\text{orient}}$.

For $\mathcal{L}_{\text{pos}}$, we adopt the smooth L1 loss based on the Euclidean distance $d_t = \|t^\ast + \Delta t - t_{\text{gt}}\|$, given by:

$$\mathcal{L}_{\text{pos}} = \begin{cases} \dfrac{d_t^2}{2\beta}, & d_t \leq \beta, \\ d_t - 0.5\beta, & d_t > \beta, \end{cases} \tag{32}$$

with the hyperparameter $\beta$.

For $\mathcal{L}_{\text{orient}}$, we adopt the cosine similarity loss based on the angular distance $d_\theta$. For 1D orientation parameterized by the angle $\theta$, $d_\theta = \theta^\ast + \Delta\theta - \theta_{\text{gt}}$. For 3D orientation parameterized by the quaternion vector $l$, $d_\theta = 2\arccos\left((l^\ast + \Delta l)^{\text{T}} l_{\text{gt}}\right)$. The loss function is therefore defined as:

$$\mathcal{L}_{\text{orient}} = 1 - \cos d_\theta. \tag{33}$$

For 3D orientation, after the substitution and using the identity $\cos(2\arccos x) = 2x^2 - 1$, the loss function can be simplified to:

$$\mathcal{L}_{\text{orient}} = 2 - 2\left((l^\ast + \Delta l)^{\text{T}} l_{\text{gt}}\right)^2. \tag{34}$$

For the specific settings of the hyperparameter β𝛽\betaitalic_β and loss weights, please refer to the experiment configuration code.
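The losses of Eqs. (32)-(34) are simple enough to write out directly. The NumPy sketch below is illustrative only: the function names are hypothetical, no value of $\beta$ is implied, and the released code would express these with differentiable PyTorch operations.

```python
import numpy as np

def pos_loss(t_star, dt, t_gt, beta):
    """Smooth L1 loss on the Euclidean distance d_t (Eq. 32)."""
    d = np.linalg.norm(t_star + dt - t_gt)
    return d**2 / (2 * beta) if d <= beta else d - 0.5 * beta

def orient_loss_1d(theta_star, dtheta, theta_gt):
    """Cosine similarity loss for the yaw angle (Eq. 33)."""
    return 1 - np.cos(theta_star + dtheta - theta_gt)

def orient_loss_3d(l_star, dl, l_gt):
    """Simplified quaternion form (Eq. 34); invariant to the sign of either quaternion."""
    return 2 - 2 * float((l_star + dl) @ l_gt) ** 2
```

For a rotation by angle $\varphi$ about a fixed axis, the 3D form evaluates to $1 - \cos\varphi$, in agreement with Eq. (33).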

Appendix D Details on the Deformable Correspondence Network

D.1 Network Architecture

The detailed network architecture of the deformable correspondence network is shown in Figure 9. Following deformable DETR [27], this paper adopts the multi-head deformable sampling. Let $n_{\text{head}}$ be the number of heads and $n_{\text{hpts}}$ be the number of points per head; a total of $N = n_{\text{head}} n_{\text{hpts}}$ points are then sampled for each object. The sampling locations relative to the reference point are predicted from the object embedding by a single layer of linear transformation. We set $n_{\text{head}}$ to 8, which yields $256 / n_{\text{head}} = 32$ channels for the point features.

The point-level branch on the left side of Figure 9 is responsible for predicting the 3D points $x^{\text{3D}}_i$ and corresponding weights $w^{\text{2D}}_i$. The sampled point features are first enhanced by the object-level context, by adding the reshaped head-wise object embedding to the point features. Then, the features of the $N$ points are processed by the self-attention layer, for which the 2D points are transformed into positional encoding. The attention layer is followed by standard layers of normalization, skip connection, and feedforward network (FFN).

Regarding the object-level branch on the right side of Figure 9, a multi-head attention layer is employed to aggregate the sampled point features. Unlike the original deformable attention layer [27] that predicts the attention weights by linear projection of the object embedding, we adopt the full Q-K dot-product attention with positional encoding. After being processed by the subsequent layers, the object-level features are finally transformed into the object-level predictions, consisting of the 3D localization score, weight scale, 3D bounding box size, and other optional properties (velocity and attribute). Note that the attention layer is not a necessary component for object-level predictions, but rather a byproduct of the deformable point samples, whose features can be leveraged with little computation overhead.

D.2 Loss Functions for Object-Level Predictions

As in FCOS3D [44], we adopt smooth L1 regression loss for 3D box size and velocity, and cross-entropy classification loss for attribute. Additionally, a binary cross-entropy loss is imposed upon the 3D localization score, with the target $c_{\text{tgt}}$ defined as a score function of the position error:

$$\begin{aligned} c_{\text{tgt}} &= \mathit{Score}\left(\|t^\ast_{\text{XZ}} - {t_{\text{XZ}}}_{\text{gt}}\|\right) \\ &= \max\left(0, \min\left(1, -a \log\|t^\ast_{\text{XZ}} - {t_{\text{XZ}}}_{\text{gt}}\| + b\right)\right), \end{aligned} \tag{35}$$

where $t^\ast_{\text{XZ}}$ denotes the XZ components of the PnP solution, ${t_{\text{XZ}}}_{\text{gt}}$ denotes the XZ components of the true pose, and $a, b$ are the linear coefficients. The predicted 3D localization score $c_{\text{pred}}$ shall reflect the positional uncertainty of an object, as a faster alternative to evaluating the uncertainty via the Monte Carlo method during inference (Section D.5). The final detection score is defined as the product of the predicted 3D score and the classification score from the base detector.
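The score target of Eq. (35) is a clamped linear function of the log position error. The sketch below uses only the standard library; the function name, the handling of a zero error (mapped to the maximum score, assuming $a > 0$), and the coefficient values in the usage note are assumptions, since the paper does not state $a$ and $b$ here.

```python
import math

def score_target(t_xz_pred, t_xz_gt, a, b):
    """Localization score target c_tgt (Eq. 35) from XZ position error."""
    err = math.hypot(t_xz_pred[0] - t_xz_gt[0], t_xz_pred[1] - t_xz_gt[1])
    if err == 0.0:
        return 1.0  # log(0) is undefined; the score saturates at 1 for a > 0
    return max(0.0, min(1.0, -a * math.log(err) + b))
```

With hypothetical coefficients `a = 0.5, b = 0.5`, an error of one unit gives a target of 0.5, while very large errors are clamped to 0 and very small errors to 1.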

D.3 Auxiliary Loss Functions

D.3.1 Auxiliary Correspondence Loss

To regularize the dense features, we append an auxiliary branch that predicts the multi-head dense 3D coordinates and corresponding weights, as shown in Figure 10. Leveraging the ground truth of object 2D boxes, the features within the box regions are densely sampled via RoI Align [48], and transformed into the 3D coordinates $x^{\text{3D}}$ and weights $w^{\text{2D}}$ via an independent linear layer. Besides, the attention weights $\phi$ are obtained via Q-K dot-product and normalized along the $n_{\text{head}}$ dimension and across the overlapping regions of multiple RoIs via Softmax.

During training, we impose the reprojection-based auxiliary loss for the multi-head dense predictions, formulated as the negative log-likelihood (NLL) of the Gaussian mixture model [35]. Following [28], the reprojection error is further robustified by the Huber kernel $\rho(\cdot)$. The loss function for each sampled point is defined as:

$$\mathcal{L}_{\text{proj}} = -\log \sum_{\text{RoI}} \sum_{k=1}^{n_{\text{head}}} \phi_k \left|\mathop{\mathrm{diag}} w_k^{\text{2D}}\right| \exp\left(-\frac{1}{2} \rho\left(\|f_k(y_{\text{gt}})\|^2\right)\right), \tag{36}$$

where $k$ is the head index, and $f_k(y_{\text{gt}})$ is the weighted reprojection error of the $k$-th head at the true pose $y_{\text{gt}}$. In the above equation, the diagonal matrix $\mathop{\mathrm{diag}} w_k^{\text{2D}}$ is interpreted as the inverse square root of the covariance matrix of the normal distribution, i.e., $\mathop{\mathrm{diag}} w_k^{\text{2D}} = \Sigma^{-\frac{1}{2}}$, and the head attention weight $\phi_k$ is interpreted as the mixture component weight. $\sum_{\text{RoI}}$ is a special operation that takes the overlapping regions of multiple RoIs into account, formulating a mixture of multiple heads and multiple RoIs (see code for details).
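As a rough illustration of Eq. 36, the following NumPy sketch evaluates the loss for one sampled point that belongs to a single RoI, so the $\sum_{\text{RoI}}$ operation reduces to a plain sum over heads. Everything here is an assumption made for illustration: the function names, the array layout, the default threshold `delta`, and the specific form of the Huber kernel on a squared norm ($\rho(s) = s$ for $s \le \delta^2$, and $\delta(2\sqrt{s} - \delta)$ otherwise). The mixture sum is evaluated in the log domain for numerical stability.

```python
import numpy as np


def huber_kernel(s, delta):
    """Assumed Huber kernel acting on a squared norm s (quadratic inside, linear outside)."""
    s = np.asarray(s, dtype=float)
    return np.where(s <= delta**2, s, delta * (2.0 * np.sqrt(s) - delta))


def proj_nll(phi, w2d, r_gt, delta=1.0):
    """Single-RoI version of Eq. 36 for one sampled point.

    phi:  (n_head,)   mixture weights, assumed to sum to 1 (Softmax output)
    w2d:  (n_head, 2) positive per-axis weights w_k^2D
    r_gt: (n_head, 2) unweighted reprojection errors at the ground-truth pose
    """
    f = w2d * r_gt                                # weighted reprojection error f_k(y_gt)
    sq_norm = np.sum(f**2, axis=-1)               # ||f_k(y_gt)||^2
    log_det = np.sum(np.log(w2d), axis=-1)        # log |diag w_k^2D|
    log_terms = np.log(phi) + log_det - 0.5 * huber_kernel(sq_norm, delta)
    m = np.max(log_terms)                         # log-sum-exp trick
    return -(m + np.log(np.sum(np.exp(log_terms - m))))


if __name__ == "__main__":
    phi = np.array([0.7, 0.3])
    w2d = np.array([[2.0, 2.0], [0.5, 0.5]])
    r_gt = np.array([[0.1, -0.2], [3.0, 4.0]])
    print(proj_nll(phi, w2d, r_gt))
```

Larger weights sharpen the corresponding Gaussian component: they lower the loss when the reprojection error is small (through the determinant term) and raise it when the error is large, which is what lets the weights act as learned confidence.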

Another auxiliary loss is the coordinate regression loss that introduces the geometric knowledge. Following MonoRUn [28], we extract the sparse ground truth of 3D coordinates $x^{\text{3D}}_{\text{gt}}$ from the 3D LiDAR point cloud. The robustified Gaussian mixture NLL loss for each sampled point with available ground truth is defined as:

$$\mathcal{L}_{\text{crd}} = -\log \sum_{k=1}^{n_{\text{head}}} \phi_k w^2 \exp\left(-\frac{1}{2} \rho\left(\left\|w \left(x^{\text{3D}}_k - x^{\text{3D}}_{\text{gt}}\right)\right\|^2\right)\right), \tag{37}$$

where $w \in \mathbb{R}^{+}$ is a scalar weight parameter (bounded by the $\exp$ activation) to be optimized during training.

As with the KL loss in the main paper, dynamic loss weights [28] are employed to rescale these two auxiliary loss functions.

Figure 9: Detailed architecture of the deformable correspondence network.
Figure 10: Architecture of the auxiliary branch. This branch shares the same weights of Q, K projection with the deformable attention layer in the lower right of Figure 9.

D.3.2 Auxiliary Depth Loss

For each projected LiDAR point in the image, we extract the feature vector from the interpolated dense feature map, which is then fed into a small 2-layer MLP to predict the scene depth. The output depth is represented by a Gaussian mixture distribution encoded by $\left\{{\phi_{\text{D}}}_k, z_k\right\}_{k=1}^{n_{\text{D}}}$, where the mixture weights ${\phi_{\text{D}}}_k$ are normalized via Softmax. Given the ground truth depth $z_{\text{gt}}$ of this point, the loss function is defined as:

$$\mathcal{L}_{\text{D}} = -\log \sum_{k=1}^{n_{\text{D}}} {\phi_{\text{D}}}_k w_{\text{D}} \exp\left(-\frac{1}{2} \rho\left(w_{\text{D}}^2 \left(z_k - z_{\text{gt}}\right)^2\right)\right), \tag{38}$$

where $w_{\text{D}} \in \mathbb{R}^{+}$ is a weight parameter (bounded by the $\exp$ activation) to be optimized during training.
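The depth loss of Eq. 38 can be sketched for a single LiDAR point as follows. This is an illustrative NumPy version under stated assumptions, not the authors' code: the MLP outputs are replaced by plain arrays (`logits` for the unnormalized mixture weights and `z` for the component depths), the parameter `log_w_d` is a hypothetical name for the raw scalar that is passed through $\exp$ to obtain $w_{\text{D}}$, and the Huber kernel takes the assumed form $\rho(s) = s$ for $s \le \delta^2$ and $\delta(2\sqrt{s} - \delta)$ otherwise.

```python
import numpy as np


def softmax(logits):
    """Numerically stable Softmax over the mixture components."""
    e = np.exp(logits - np.max(logits))
    return e / np.sum(e)


def huber_kernel(s, delta):
    """Assumed Huber kernel acting on a squared error s."""
    s = np.asarray(s, dtype=float)
    return np.where(s <= delta**2, s, delta * (2.0 * np.sqrt(s) - delta))


def depth_nll(logits, z, z_gt, log_w_d, delta=1.0):
    """Eq. 38 for one projected LiDAR point.

    logits:  (n_D,) unnormalized mixture weights
    z:       (n_D,) depth of each mixture component
    z_gt:    scalar ground-truth depth
    log_w_d: scalar raw parameter; w_D = exp(log_w_d) is always positive
    """
    phi_d = softmax(logits)
    w_d = np.exp(log_w_d)
    s = w_d**2 * (z - z_gt) ** 2
    return -np.log(np.sum(phi_d * w_d * np.exp(-0.5 * huber_kernel(s, delta))))


if __name__ == "__main__":
    logits = np.array([2.0, 0.0, -1.0])
    z = np.array([10.2, 14.0, 30.0])
    print(depth_nll(logits, z, z_gt=10.0, log_w_d=0.0))
```

Because the loss is a mixture NLL, only one component needs to be close to $z_{\text{gt}}$ for the loss to be low, so the remaining components are free to represent alternative depth hypotheses.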

D.4 Training Strategy

During training, we randomly sample 48 positive object queries from the FCOS3D [44] detector for each image, which limits the batch size of the deformable correspondence network to control the computation overhead of the Monte Carlo pose loss.

D.5 Experiments on the Uncertainty of Object Pose

The entropy of the inferred pose distribution reflects the aleatoric uncertainty of the predicted pose. Previous work [28] reasons about the pose uncertainty by propagating the reprojection uncertainty learned from a surrogate loss through the PnP operation, but that uncertainty requires calibration and is not reliable enough. In our work, the pose uncertainty is learned end-to-end with the KL-divergence-based pose loss, which makes it fundamentally more reliable.

To quantitatively evaluate the reliability of the pose uncertainty in terms of measuring the localization confidence, a straightforward approach is to compute the 3D localization score $c_{\text{MC}}$ via Monte Carlo pose sampling, and compare the resulting mAP against the standard implementation with the 3D score $c_{\text{pred}}$ predicted from the object-level branch. With the PnP solution $t^{\ast}$, the sampled translation vector $t_j$, and its importance weight $v_j$, the Monte Carlo score is computed by:

$$c_{\text{MC}} = \frac{1}{\sum_j v_j} \sum_j v_j \mathit{Score}\left(\|t^{\ast}_{\text{XZ}} - {t_{\text{XZ}}}_j\|\right), \tag{39}$$

where the subscript $(\cdot)_{\text{XZ}}$ denotes taking the XZ components, and the function $\mathit{Score}(\cdot)$ is the same as in Eq. 35.
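Eq. 39 is a self-normalized importance-weighted average of the score function over the pose samples. The NumPy sketch below illustrates this under a few assumptions: translations are stored in (X, Y, Z) order so that the XZ components are indices 0 and 2, the coefficients `a` and `b` are placeholders, and the function names are hypothetical.

```python
import numpy as np


def score(err, a, b):
    """Eq. 35 applied element-wise; errors are floored to avoid log(0)."""
    err = np.maximum(np.asarray(err, dtype=float), 1e-12)
    return np.clip(-a * np.log(err) + b, 0.0, 1.0)


def monte_carlo_score(t_star, t_samples, v, a, b):
    """Eq. 39: importance-weighted mean of Score over sampled translations.

    t_star:    (3,)   translation of the PnP solution
    t_samples: (K, 3) sampled translation vectors t_j
    v:         (K,)   importance weights v_j (need not be normalized)
    """
    xz = [0, 2]  # assumed (X, Y, Z) ordering; the Y component is ignored
    err = np.linalg.norm(t_samples[:, xz] - t_star[xz], axis=-1)
    return np.sum(v * score(err, a, b)) / np.sum(v)


if __name__ == "__main__":
    t_star = np.array([0.0, 0.0, 10.0])
    t_samples = np.array([[1.0, 5.0, 10.0], [0.0, 0.0, 10.0]])
    v = np.array([1.0, 3.0])
    print(monte_carlo_score(t_star, t_samples, v, a=0.5, b=0.5))
```

A concentrated pose distribution places most of the importance weight on samples close to $t^{\ast}$ and yields a score near 1, whereas a dispersed distribution spreads weight onto distant samples and lowers the score.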

As shown in Table VIII, the mAP obtained via Monte Carlo scoring is on par with the standard implementation (0.393 vs. 0.392), indicating that the pose uncertainty is a reliable measure of the detection confidence.

TABLE VIII: Comparison between the scoring methods on the nuScenes validation set.
| Scoring method | NDS↑ | mAP↑ | mATE↓ | mAOE↓ |
|---|---|---|---|---|
| Standard | 0.463 | 0.392 | 0.626 | 0.282 |
| Monte Carlo | 0.463 | 0.393 | 0.623 | 0.286 |

Appendix E Notation

TABLE IX: A summary of frequently used notations.
| Notation | Description |
|---|---|
| $x^{\text{3D}}_i \in \mathbb{R}^3$ | Coordinate vector of the $i$-th 3D object point |
| $x^{\text{2D}}_i \in \mathbb{R}^2$ | Coordinate vector of the $i$-th 2D image point |
| $w^{\text{2D}}_i \in \mathbb{R}^2_+$ | Weight vector of the $i$-th 2D-3D point pair |
| $X$ | The set of weighted 2D-3D correspondences |
| $y$ | Object pose |
| $y_{\text{gt}}$ | Ground truth of object pose |
| $y^{\ast}$ | Object pose estimated by the PnP solver |
| $R$ | 3×3 rotation matrix representation of object orientation |
| $\theta$ | 1D yaw angle representation of object orientation |
| $l$ | Unit quaternion representation of object orientation |
| $t \in \mathbb{R}^3$ | Translation vector representation of object position |
| $\Sigma_{y^{\ast}}$ | Pose covariance estimated by the PnP solver |
| $J$ | Jacobian matrix |
| $\tilde{J}$ | Rescaled Jacobian matrix |
| $F$ | Concatenated vector of weighted reprojection errors of all points |
| $\tilde{F}$ | Concatenated vector of rescaled weighted reprojection errors of all points |
| $\pi(\cdot): \mathbb{R}^3 \rightarrow \mathbb{R}^2$ | Camera projection function |
| $f_i(y) \in \mathbb{R}^2$ | Weighted reprojection error of the $i$-th correspondence at pose $y$ |
| $r_i(y) \in \mathbb{R}^2$ | Unweighted reprojection error of the $i$-th correspondence at pose $y$ |
| $\rho(\cdot)$ | Huber kernel function |
| $\rho^{\prime}_i$ | The derivative of the Huber kernel function of the $i$-th correspondence |
| $\delta$ | The Huber threshold |
| $p(X \mid y)$ | Likelihood function of object pose |
| $p(y)$ | PDF of the prior pose distribution |
| $p(y \mid X)$ | PDF of the posterior pose distribution |
| $t(y)$ | PDF of the target pose distribution |
| $q(y), q_t(y)$ | PDF of the proposal pose distribution (of the $t$-th AMIS iteration) |
| $y_j, y_j^t$ | The $j$-th random pose sample (of the $t$-th AMIS iteration) |
| $v_j, v_j^t$ | Importance weight of the $j$-th pose sample (of the $t$-th AMIS iteration) |
| $i$ | Index of 2D-3D point pair |
| $j$ | Index of random pose sample |
| $t$ | Index of AMIS iteration |
| $N$ | Number of 2D-3D point pairs in total |
| $K$ | Number of pose samples in total |
| $T$ | Number of AMIS iterations |
| $K^{\prime}$ | Number of pose samples per AMIS iteration |
| $n_{\text{head}}$ | Number of heads in the deformable correspondence network |
| $n_{\text{hpts}}$ | Number of points per head in the deformable correspondence network |
| $\mathcal{L}_{\text{KL}}$ | KL divergence loss for object pose |
| $\mathcal{L}_{\text{tgt}}$ | The component of $\mathcal{L}_{\text{KL}}$ concerning the reprojection errors at target pose |
| $\mathcal{L}_{\text{pred}}$ | The component of $\mathcal{L}_{\text{KL}}$ concerning the reprojection errors over predicted pose |
| $\mathcal{L}_{\text{reg}}$ | Derivative regularization loss |