EPro-PnP: Generalized End-to-End Probabilistic Perspective-n-Points for Monocular Object Pose Estimation

Hansheng Chen, Wei Tian, Pichao Wang, Fan Wang, Lu Xiong, and Hao Li

H. Chen is with the Department of Computer Science, Stanford University, Stanford, CA 94305 USA. Work done previously at Tongji University, Shanghai 201804, China. E-mail: [email protected]

W. Tian and L. Xiong are with the School of Automotive Studies, Tongji University, Shanghai 201804, China. E-mail: {tian_wei, xiong_lu}@tongji.edu.cn

P. Wang is with Amazon.com Inc, Seattle, WA 98109 USA. Work done previously at Alibaba Group (U.S.), Bellevue, WA 98004 USA. E-mail: [email protected]

F. Wang is with Alibaba Group (U.S.), Sunnyvale, CA 94085 USA. E-mail: [email protected]

H. Li is with Artificial Intelligence Innovation and Incubation Institute, Fudan University, Shanghai 200433, China. Work done previously at Alibaba Group, Hangzhou 311121, China. E-mail: [email protected]

(Corresponding author: Wei Tian.)

Abstract

Locating 3D objects from a single RGB image via Perspective-n-Point (PnP) is a long-standing problem in computer vision. Driven by end-to-end deep learning, recent studies suggest interpreting PnP as a differentiable layer, allowing for partial learning of 2D-3D point correspondences by backpropagating the gradients of pose loss. Yet, learning the entire correspondences from scratch is highly challenging, particularly for ambiguous pose solutions, where the globally optimal pose is theoretically non-differentiable w.r.t. the points. In this paper, we propose EPro-PnP, a probabilistic PnP layer for general end-to-end pose estimation, which outputs a distribution of pose with differentiable probability density on the SE(3) manifold. The 2D-3D coordinates and corresponding weights are treated as intermediate variables learned by minimizing the KL divergence between the predicted and target pose distributions. The underlying principle generalizes previous approaches and resembles the attention mechanism. EPro-PnP can enhance existing correspondence networks, closing the gap between PnP-based methods and the task-specific leaders on the LineMOD 6DoF pose estimation benchmark. Furthermore, EPro-PnP helps to explore new possibilities of network design, as we demonstrate a novel deformable correspondence network with state-of-the-art pose accuracy on the nuScenes 3D object detection benchmark. Our code is available at https://github.com/tjiiv-cprg/EPro-PnP-v2.

Index Terms:

Pose estimation, imaging geometry, probabilistic deep learning, 3D vision, autonomous vehicles

1 Introduction

Estimating the pose (i.e., position and orientation) of 3D objects from a single RGB image is an important problem in computer vision. This field is often subdivided into specific tasks, e.g., 6DoF pose estimation for robot manipulation and 3D object detection for autonomous driving. Although they share the same fundamentals of pose estimation, the different nature of the data leads to a biased choice of methods. Top performers [1, 2, 3] on the 3D object detection benchmarks [4, 5] fall into the category of direct 4DoF pose prediction, leveraging the advances in end-to-end deep learning. On the other hand, the 6DoF pose estimation benchmark [6] is largely dominated by geometry-based methods [7, 8], which exploit the provided 3D object models and achieve a stable generalization performance. However, it is quite challenging to bring together the best of both worlds, i.e., training a geometric model to learn the object pose in an end-to-end manner.

Figure 1: An overview of the proposed framework. The predicted 2D-3D correspondences formulate a PnP problem. Instead of solving the optimal pose, EPro-PnP outputs a pose distribution, allowing the gradients of the KL loss w.r.t. the probability density to be backpropagated to train the correspondence network.

There have been recent proposals for an end-to-end framework based on the Perspective-n-Point (PnP) approach [9, 10, 11, 12]. The PnP algorithm itself solves the pose from a set of 3D points in object space and their corresponding 2D projections in image space, leaving the problem of constructing these correspondences. Vanilla correspondence learning [13, 14, 15, 16, 17, 8, 18, 19, 20] leverages the geometric prior to build surrogate loss functions, forcing the network to learn a set of pre-defined correspondences. End-to-end correspondence learning [9, 10, 11, 12] interprets the PnP solver as a differentiable layer and employs a pose-driven loss function, so that the gradient of the pose error can be backpropagated to the 2D-3D correspondences.

However, existing work on differentiable PnP learns only a portion of the correspondences (either 2D coordinates [12], 3D coordinates [9, 10] or corresponding weights [11]), assuming other components are given a priori. This raises an important question: why not learn the entire set of points and weights altogether in an end-to-end manner? Our intuition is that, under such relaxed settings, the PnP problem could better describe pose ambiguity [21, 22] in the cases of symmetric objects [17] or uncertain observations. However, in the presence of ambiguity, the PnP problem has multiple local minima. Existing methods try to differentiate a point estimate of the pose (a single local minimum), which is unstable in general, while the global optimum is neither easy to find nor differentiable.

To overcome the above limitations, we propose a generalized end-to-end probabilistic PnP (EPro-PnP) module that enables learning the weighted 2D-3D point correspondences entirely from scratch. The main idea is straightforward: a point estimate of pose is non-differentiable, but the probability density of pose is apparently differentiable, just like categorical classification scores. As shown in Figure 1, we interpret the output of PnP as a probabilistic distribution parameterized by the learnable 2D-3D correspondences. During training, the Kullback-Leibler (KL) divergence between the predicted and target pose distributions is minimized as the loss function, which can be efficiently calculated using the Adaptive Multiple Importance Sampling [23] algorithm.

As a general approach, EPro-PnP inherently unifies existing correspondence learning techniques (Section 3.1). Moreover, just like the attention mechanism [24], the corresponding weights can be trained to automatically focus on important point pairs, allowing the networks to be designed with inspiration from attention-related work [25, 26, 27].

To summarize, our main contributions are as follows:

• We propose the EPro-PnP, a probabilistic PnP layer for general end-to-end pose estimation with learnable 2D-3D correspondences, which can cope with pose ambiguity.

• We demonstrate that EPro-PnP can easily reach top-tier performance for 6DoF pose estimation by simply inserting it into the CDPN [18] framework.

• We demonstrate the flexibility of EPro-PnP by proposing deformable correspondence learning for accurate 3D object detection, where the entire 2D-3D correspondences are learned from scratch.

This extended paper presents new experiments with improved results and rigorous ablation studies. For 6DoF pose estimation on LineMOD, feeding 2D box size to the model has improved uncertainty handling, boosting pose accuracy to outperform RePOSE [7]. New ablation studies reveal each loss’s contribution and show that EPro-PnP can achieve competitive performance even without 3D models (B2 in Table II). For 3D object detection on nuScenes, EPro-PnP with an enhanced network now leads the field of single-frame image-based detectors, and the ablation studies highlight the importance of the Monte Carlo pose loss in handling ambiguous poses. Furthermore, we have also expanded our discussion on the derivative regularization loss.

2 Related Work

2.1 Geometry-Based Object Pose Estimation

In general, geometry-based methods exploit the points, edges or other types of representation that are subject to the projection constraints under the perspective camera. Then, the pose can be solved by optimization. A large body of work utilizes point representation, which can be categorized into sparse keypoints and dense correspondences. BB8 [15] and RTM3D [19] locate the corners of the 3D bounding box as keypoints, while PVNet [13] defines the keypoints by farthest point sampling and Deep MANTA [20] by handcrafted templates. On the other hand, dense correspondence methods [17, 18, 8, 28, 16] predict pixel-wise 3D coordinates within a cropped 2D region. Most existing geometry-based methods follow a two-stage strategy, where the intermediate representations (i.e., 2D-3D correspondences) are learned with a surrogate loss function, which is sub-optimal compared to end-to-end learning.

2.2 End-to-End Correspondence Learning

To mitigate the limitation of surrogate correspondence learning, end-to-end approaches have been proposed to backpropagate the gradient from pose to intermediate representation. Using implicit differentiation w.r.t. the optimal pose or its approximations, Brachmann and Rother [10] propose a dense correspondence network where 3D points are learnable, BPnP [12] predicts 2D keypoint locations, and BlindPnP [11] learns the corresponding weight matrix given a set of unordered 2D/3D points. The above methods are all coupled with a surrogate regularization loss; otherwise, convergence is not guaranteed due to numerical instability [10] and the non-differentiable nature of the optimal pose. Under the probabilistic framework, these methods can be regarded as a Laplace approximation approach (Section 3.1).

Beyond point correspondence, RePOSE [7] proposes a feature-metric correspondence network trained by backpropagating through the PnP solver (e.g., Levenberg-Marquardt). This approach alone is insufficient under pose ambiguity, although it can be leveraged as a local regularization technique in our framework (Section 3.4).

2.3 Probabilistic Deep Learning

Probabilistic methods account for uncertainty in the model and the data, known respectively as epistemic and aleatoric uncertainty [29]. The latter involves interpreting the prediction as learnable probabilistic distributions. Discrete categorical distribution via Softmax has been widely adopted as a smooth approximation of one-hot $\operatorname*{arg\,max}$ for end-to-end classification. This inspired works such as DSAC [9], a smooth RANSAC with a finite hypothesis pool. Meanwhile, tractable parametric distributions (e.g., normal distribution) are often used in predicting continuous variables [30, 31, 29, 32, 33, 28], and mixture distributions can be employed to further capture ambiguity [34, 35, 36], e.g., ambiguous 6DoF pose [37]. In this paper, we make a distinct contribution: backpropagating through a complicated continuous distribution derived from a nested optimization layer (the PnP layer) and approximated by importance sampling, which essentially makes it a continuous counterpart of Softmax.

3 Generalized End-to-End Probabilistic PnP

3.1 Overview

Given an object proposal, our goal is to predict a set $X=\left\{x^{\text{3D}}_{i},x^{\text{2D}}_{i},w^{\text{2D}}_{i}\right\}_{i=1}^{N}$ of $N$ corresponding points, with 3D object coordinates $x^{\text{3D}}_{i}\in\mathbb{R}^{3}$, 2D image coordinates $x^{\text{2D}}_{i}\in\mathbb{R}^{2}$, and 2D weights $w^{\text{2D}}_{i}\in\mathbb{R}^{2}_{+}$, from which a weighted PnP problem can be formulated to estimate the object pose relative to the camera.

The essence of a PnP layer is searching for an optimal pose $y$ (expanded as rotation matrix $R$ and translation vector $t$ ) that minimizes the cumulative squared weighted reprojection error:

\operatorname*{arg\,min}_{y}\frac{1}{2}\sum_{i=1}^{N}\left\|\underbrace{w_{i}^{\text{2D}}\circ\left(\pi(Rx_{i}^{\text{3D}}+t)-x_{i}^{\text{2D}}\right)}_{f_{i}(y)\in\mathbb{R}^{2}}\right\|^{2},

(1)

where $\pi(\cdot)$ is the projection function with camera intrinsics involved, $\circ$ stands for element-wise product, and $f_{i}(y)$ compactly denotes the weighted reprojection error.
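As an illustration, the weighted reprojection error $f_{i}(y)$ and the objective of Eq. (1) can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the released implementation: a pinhole camera with intrinsic matrix `K` is assumed for $\pi(\cdot)$, and the function names `project`, `weighted_residuals`, `pnp_cost` as well as the toy data are hypothetical.

```python
import numpy as np

def project(points_cam, K):
    """Assumed pinhole projection pi(.): camera-frame points (N, 3) -> pixels (N, 2)."""
    uvw = points_cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def weighted_residuals(R, t, x3d, x2d, w2d, K):
    """f_i(y) = w_i^2D * (pi(R x_i^3D + t) - x_i^2D), returned with shape (N, 2)."""
    return w2d * (project(x3d @ R.T + t, K) - x2d)

def pnp_cost(R, t, x3d, x2d, w2d, K):
    """Objective of Eq. (1): half the sum of squared weighted residuals."""
    return 0.5 * np.sum(weighted_residuals(R, t, x3d, x2d, w2d, K) ** 2)

# Toy data (hypothetical): 8 object points observed from a known pose.
rng = np.random.default_rng(0)
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
R_true, t_true = np.eye(3), np.array([0.1, -0.2, 5.0])
x3d = rng.uniform(-1.0, 1.0, size=(8, 3))
x2d = project(x3d @ R_true.T + t_true, K)
w2d = np.ones((8, 2))

cost_at_truth = pnp_cost(R_true, t_true, x3d, x2d, w2d, K)
cost_shifted = pnp_cost(R_true, t_true + np.array([0.05, 0.0, 0.0]), x3d, x2d, w2d, K)
```

With noise-free correspondences the cost vanishes at the true pose and grows as the pose is perturbed, which is the behavior a PnP solver exploits.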

Eq. (1) formulates a non-linear least squares problem that may have non-unique solutions, i.e., pose ambiguity [21, 22]. Previous work [10, 12, 11] only backpropagates through a local solution $y^{\ast}$ , which is inherently unstable and non-differentiable. To construct a differentiable alternative for end-to-end learning, we model the PnP output as a distribution of pose, which guarantees differentiable probability density. The cumulative error is considered to be the negative logarithm of the likelihood function $p(X|y)$ defined as:

p\left(X\middle|y\right)=\exp-\frac{1}{2}\sum_{i=1}^{N}\left\|f_{i}(y)\right\|^{2}.

(2)

With an additional prior pose distribution $p(y)$, we can derive the posterior pose density $p(y|X)$ via Bayes' theorem. Using a uniform prior in the domain $Y$, the posterior density is simplified to the normalized likelihood:

p(y|X)=\frac{\exp-\frac{1}{2}\sum_{i=1}^{N}\left\|f_{i}(y)\right\|^{2}}{\int_{Y}\exp-\frac{1}{2}\sum_{i=1}^{N}\left\|f_{i}(y)\right\|^{2}\mathop{}\!\mathrm{d}{y}}.

(3)

Eq. (3) can be interpreted as a continuous counterpart of categorical Softmax.
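The Softmax analogy can be made concrete on a discretized 1D pose (e.g., yaw). The sketch below is only an illustration under assumptions: the bimodal cost function is hypothetical and stands in for the cumulative reprojection error, and the integral in Eq. (3) is replaced by a Riemann sum on a uniform grid.

```python
import numpy as np

def grid_posterior(cost, dy):
    """Normalize exp(-cost) on a uniform grid so that the density integrates to 1."""
    unnorm = np.exp(-(cost - cost.min()))  # shifting the cost leaves the ratio unchanged
    return unnorm / (unnorm.sum() * dy)

y = np.linspace(-np.pi, np.pi, 2001)       # 1D pose grid
dy = y[1] - y[0]
cost = 8.0 * (y ** 2 - 1.0) ** 2           # hypothetical cost with minima at y = -1 and y = +1
density = grid_posterior(cost, dy)
```

Apart from the grid spacing `dy`, this is exactly a Softmax over the negated costs. Both minima of the cost appear as modes of the density, which is how pose ambiguity is represented.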

3.1.1 KL Loss Function

During training, given a target pose distribution with probability density $t(y)$ , the KL divergence $D_{\text{KL}}\mathopen{}\mathclose{{}\left(t(y)\|p(y|X)}\right)$ is minimized as training loss. Intuitively, pose ambiguity can be captured by the multiple modes of $p(y|X)$ , and convergence is ensured such that wrong modes are suppressed by the loss function. Substituting Eq. (3), the KL divergence loss can be re-written as follows:

\begin{aligned}
\mathcal{L}_{\text{KL}} &= \int_{Y}t(y)\left(\log{t(y)}-\log{p(y|X)}\right)\mathop{}\!\mathrm{d}{y} \\
&= -\int_{Y}t(y)\log{\frac{p(X|y)}{\int_{Y}p(X|y)\mathop{}\!\mathrm{d}{y}}}\mathop{}\!\mathrm{d}{y}+\mathit{const} \\
&= -\int_{Y}t(y)\log{p(X|y)}\mathop{}\!\mathrm{d}{y}+\log{\int_{Y}p(X|y)\mathop{}\!\mathrm{d}{y}}+\mathit{const}.
\end{aligned}

(4)

In practice, we drop the constant relevant to the target distribution so that it is effectively a cross-entropy loss. In addition, we empirically find it effective to set a narrow (Dirac-like) target distribution centered at the ground truth $y_{\text{gt}}$ , yielding the simplified loss (after substituting Eq. (2)):

\mathcal{L}_{\text{KL}}=\underbrace{\frac{1}{2}\sum_{i=1}^{N}\left\|f_{i}(y_{\text{gt}})\right\|^{2}}_{\mathcal{L}_{\text{tgt}}\text{ (reproj. at target pose)}}+\underbrace{\log\int_{Y}\exp-\frac{1}{2}\sum_{i=1}^{N}\left\|f_{i}(y)\right\|^{2}\mathop{}\!\mathrm{d}{y}}_{\mathcal{L}_{\text{pred}}\text{ (reproj. at predicted pose)}}.

(5)

The only remaining problem is the integration in the second term, which is elaborated in Section 3.2.

3.1.2 Comparison to Reprojection-Based Method

The two terms in Eq. (5) are concerned with the reprojection errors at target and predicted pose respectively. The former is often used as a surrogate loss in previous work [12, 10, 28]. However, the first term alone cannot handle learning all 2D-3D points without imposing strict regularization, as the minimization could simply collapse all the 2D-3D points. The second term originates from the normalization factor in Eq. (3), and is crucial to a discriminative loss function, as shown in Figure 2.

3.1.3 Comparison to Implicit Differentiation Method

Existing work on end-to-end PnP [12, 11] derives a single solution of a particular solver $y^{\ast}=\mathit{PnP}(X)$ via the implicit function theorem [38], assuming $\nabla_{y}\frac{1}{2}\sum_{i=1}^{N}\left\|f_{i}(y)\right\|^{2}\bigm|_{y=y^{\ast}}=0$. In the probabilistic framework, this is essentially the Laplace method that approximates the posterior by $\mathcal{N}(y^{\ast},\Sigma_{y^{\ast}})$, where both $y^{\ast}$ and $\Sigma_{y^{\ast}}$ can be estimated by the PnP solver with analytical derivatives [28]. If $\Sigma_{y^{\ast}}$ is simplified to be isotropic, the approximated KL divergence reduces to the L2 loss $\|y^{\ast}-y_{\text{gt}}\|^{2}$ used in [11]. However, the Laplace approximation is inaccurate for non-normal posteriors with ambiguity, and therefore does not guarantee global convergence. Besides, implicit differentiation itself may be prone to numerical instability [10].

3.2 Monte Carlo Pose Loss

In this section, we introduce a GPU-friendly efficient Monte Carlo approach to the integration in the proposed loss function, based on the Adaptive Multiple Importance Sampling (AMIS) algorithm [23].

Considering $q(y)$ to be the probability density function of a proposal distribution that approximates the shape of the integrand $\exp-\frac{1}{2}\sum_{i=1}^{N}\left\|f_{i}(y)\right\|^{2}$, and $y_{j}$ to be one of the $K$ samples drawn from $q(y)$, the estimation of the second term $\mathcal{L}_{\text{pred}}$ in Eq. (5) is thus:

\mathcal{L}_{\text{pred}}\approx\log\frac{1}{K}\sum_{j=1}^{K}\underbrace{\frac{\exp-\frac{1}{2}\sum_{i=1}^{N}\left\|f_{i}(y_{j})\right\|^{2}}{q(y_{j})}}_{v_{j}\text{ (importance weight)}},

(6)

where $v_{j}$ compactly denotes the importance weight at $y_{j}$ . Eq. (6) gives the vanilla importance sampling, where the choice of proposal $q(y)$ strongly affects the numerical stability. The AMIS algorithm is a better alternative as it iteratively adapts the proposal to the integrand.
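To illustrate Eq. (6), the following sketch estimates the log-integral by vanilla importance sampling for a 1D toy integrand whose integral is known in closed form. Everything here is an assumption made for illustration: the Gaussian-shaped integrand, the fixed Gaussian proposal, and the helper names `log_mean_exp` and `estimate_log_integral`. The computation is done in the log domain to avoid underflow of the importance weights.

```python
import numpy as np

def log_mean_exp(a):
    """Numerically stable log(mean(exp(a)))."""
    m = np.max(a)
    return m + np.log(np.mean(np.exp(a - m)))

def estimate_log_integral(neg_cost, sample_q, log_q, num_samples, rng):
    """Eq. (6): log (1/K) sum_j exp(-cost(y_j)) / q(y_j), with y_j drawn from q."""
    y = sample_q(num_samples, rng)
    log_v = neg_cost(y) - log_q(y)  # log importance weights
    return log_mean_exp(log_v)

# Toy integrand exp(-0.5 * ((y - mu) / sigma)^2), whose integral is sigma * sqrt(2 * pi).
mu, sigma = 1.0, 0.5
neg_cost = lambda y: -0.5 * ((y - mu) / sigma) ** 2

# Hypothetical fixed proposal: a wider, slightly off-center Gaussian.
q_mu, q_sigma = 0.8, 1.0
sample_q = lambda k, rng: rng.normal(q_mu, q_sigma, size=k)
log_q = lambda y: -0.5 * ((y - q_mu) / q_sigma) ** 2 - np.log(q_sigma * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)
estimate = estimate_log_integral(neg_cost, sample_q, log_q, 20000, rng)
exact = np.log(sigma * np.sqrt(2.0 * np.pi))
```

A proposal that covers the integrand poorly would inflate the variance of the weights, which is the motivation for adapting the proposal.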

In brief, AMIS utilizes the sampled importance weights from past iterations to estimate the new proposal. Then, all previous samples are re-weighted as being homogeneously sampled from a mixture of the overall sum of proposals [23]. The initial proposal can be determined by the mode and covariance of the predicted pose distribution (see supplementary for details). Pseudocode is given below.

Input: $X=\{x_{i}^{\text{3D}},x_{i}^{\text{2D}},w_{i}^{\text{2D}}\}_{i=1}^{N}$

Output: $\mathcal{L}_{\text{pred}}$

$y^{\ast},\Sigma_{y^{\ast}}\leftarrow\mathit{PnP}(X)$ // Laplace approximation
Fit $q_{1}(y)$ to $y^{\ast},\Sigma_{y^{\ast}}$ // initial proposal
for $1\leq t\leq T$ do
    Generate $K^{\prime}$ samples $y^{t}_{j=1\cdots K^{\prime}}$ from $q_{t}(y)$
    for $1\leq j\leq K^{\prime}$ do
        $P_{j}^{t}\leftarrow p(X|y_{j}^{t})$ // evaluate integrand
    for $1\leq\tau\leq t$ and $1\leq j\leq K^{\prime}$ do
        $Q_{j}^{\tau}\leftarrow\frac{1}{t}\sum_{m=1}^{t}q_{m}(y_{j}^{\tau})$ // evaluate proposal mix
        $v_{j}^{\tau}\leftarrow P_{j}^{\tau}/Q_{j}^{\tau}$ // importance weight
    if $t<T$ then
        Estimate $q_{t+1}(y)$ from all weighted samples $\{y_{j}^{\tau},v_{j}^{\tau}\,|\,1\leq\tau\leq t,1\leq j\leq K^{\prime}\}$
$\mathcal{L}_{\text{pred}}\leftarrow\log\frac{1}{TK^{\prime}}\sum_{t=1}^{T}\sum_{j=1}^{K^{\prime}}v_{j}^{t}$

In this paper, we empirically set the AMIS iteration count $T$ to 4, and the number of samples per iteration $K^{\prime}$ to 128 for 6DoF pose and 32 for 4DoF pose (1D yaw-only orientation). These hyperparameters can be adjusted to balance computation and accuracy.
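The AMIS loop can be sketched for a 1D pose as follows. This is a simplified illustration under stated assumptions rather than the procedure used in the paper: Gaussian proposals replace the distributions of Section 3.2.1, the next proposal is obtained by weighted moment matching, and the initial proposal is passed in by hand instead of coming from a PnP solver. The names `gauss_logpdf` and `amis_log_integral` are hypothetical.

```python
import numpy as np

def gauss_logpdf(y, mu, sigma):
    return -0.5 * ((y - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

def amis_log_integral(log_integrand, mu0, sigma0, T=4, K=128, rng=None):
    """Estimate log of the integral of exp(log_integrand(y)) over a 1D pose y."""
    rng = np.random.default_rng() if rng is None else rng
    mus, sigmas = [mu0], [sigma0]          # parameters of proposals q_1 ... q_t
    samples = np.empty(0)
    log_p = np.empty(0)
    for t in range(1, T + 1):
        new = rng.normal(mus[-1], sigmas[-1], size=K)            # draw K samples from q_t
        samples = np.concatenate([samples, new])
        log_p = np.concatenate([log_p, log_integrand(new)])      # evaluate integrand
        # Evaluate the equally weighted mixture of q_1 ... q_t at ALL samples so far.
        log_q_each = np.stack([gauss_logpdf(samples, m, s) for m, s in zip(mus, sigmas)])
        top = log_q_each.max(axis=0)
        log_q_mix = top + np.log(np.mean(np.exp(log_q_each - top), axis=0))
        log_v = log_p - log_q_mix                                 # log importance weights
        if t < T:
            # Assumed adaptation rule: weighted mean and standard deviation.
            v = np.exp(log_v - log_v.max())
            v /= v.sum()
            mu = np.sum(v * samples)
            sigma = np.sqrt(np.sum(v * (samples - mu) ** 2))
            mus.append(mu)
            sigmas.append(max(sigma, 1e-6))
    peak = log_v.max()
    return peak + np.log(np.mean(np.exp(log_v - peak)))           # log (1 / (T K)) sum v

# Toy check against a known integral: sigma * sqrt(2 * pi) with sigma = 0.5.
estimate = amis_log_integral(lambda y: -0.5 * ((y - 1.0) / 0.5) ** 2,
                             mu0=0.0, sigma0=2.0, rng=np.random.default_rng(0))
exact = np.log(0.5 * np.sqrt(2.0 * np.pi))
```

Re-evaluating the mixture density for all past samples at every iteration is what distinguishes AMIS from simply chaining independent importance sampling rounds.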

3.2.1 Choice of Proposal Distribution

We use separate proposal distributions for position and orientation, as the orientation space is non-Euclidean. For position, we adopt the 3DoF multivariate t-distribution. For 1D yaw-only orientation, we use a mixture of von Mises and uniform distribution. For 3D orientation represented by unit quaternion, the angular central Gaussian distribution [39] is adopted.

3.3 Backpropagation

Although backpropagation can be simply implemented with automatic differentiation packages, here we analyze the gradients of the loss function for an intuitive understanding of the learning process. In general, the gradient of the loss function defined in Eq. (5) is:

\nabla\mathcal{L}_{\text{KL}}=\nabla\frac{1}{2}\sum_{i=1}^{N}\left\|f_{i}(y_{\text{gt}})\right\|^{2}-\mathop{\mathbb{E}}_{y\sim p(y|X)}{\nabla\frac{1}{2}\sum_{i=1}^{N}\left\|f_{i}(y)\right\|^{2}},

(7)

where the first term is the gradient of reprojection errors at target pose, and the second term is the expected gradient of reprojection errors over predicted pose distribution, which is approximated by backpropagating the importance weights in the Monte Carlo pose loss.

3.3.1 Balancing Uncertainty and Discrimination

Consider the negative gradient w.r.t. the corresponding weights $w_{i}^{\text{2D}}$ :

-\nabla_{w_{i}^{\text{2D}}}\mathcal{L}_{\text{KL}}=w_{i}^{\text{2D}}\circ\left(-r_{i}^{\circ 2}(y_{\text{gt}})+\mathop{\mathbb{E}}_{y\sim p(y|X)}{r_{i}^{\circ 2}(y)}\right),

(8)

where $r_{i}(y)=\pi(Rx_{i}^{\text{3D}}+t)-x_{i}^{\text{2D}}$ (unweighted reprojection error), and $(\cdot)^{\circ 2}$ stands for element-wise square. The first bracketed term $-r_{i}^{\circ 2}(y_{\text{gt}})$ with negative sign indicates that correspondences with large reprojection error (hence high uncertainty) shall be weighted less. The second term $\mathop{\mathbb{E}}_{y\sim p(y|X)}{r_{i}^{\circ 2}(y)}$ is relevant to the variance of reprojection error over the predicted pose. The positive sign indicates that correspondences sensitive to pose variation should be weighted more, because they provide stronger pose discrimination. The final gradient is thus a balance between the uncertainty and discrimination, as shown in Figure 3. Existing work [28, 13] on learning uncertainty-aware correspondences only considers the former, hence lacking the discriminative ability.
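Eq. (8) can be checked numerically on a toy problem. In the sketch below, all modeling choices are assumptions for illustration: the pose is a scalar, the residuals are linear in the pose (a stand-in for the reprojection error), the weights are scalars, and the integral in Eq. (5) is replaced by a Riemann sum on a grid. The analytic expression is then compared against central finite differences of the loss.

```python
import numpy as np

# Hypothetical linear residual model r_i(y) = a_i * y - b_i for a scalar pose y.
a = np.array([1.0, 0.5, -0.8, 0.1])
b = np.array([0.9, 0.6, -0.7, 0.3])
w = np.array([1.0, 1.5, 0.7, 2.0])
y_gt = 1.0
grid = np.linspace(-5.0, 7.0, 4001)
dy = grid[1] - grid[0]

def residuals(y):
    """Unweighted residuals r_i(y); the output shape is y.shape + (N,)."""
    return np.multiply.outer(y, a) - b

def kl_loss(w):
    """Eq. (5) with the integral approximated by a Riemann sum over the grid."""
    cost_gt = 0.5 * np.sum((w * residuals(y_gt)) ** 2)
    cost = 0.5 * np.sum((w * residuals(grid)) ** 2, axis=-1)
    return cost_gt + np.log(np.sum(np.exp(-cost)) * dy)

def neg_grad_w(w):
    """Eq. (8): w * (-r_i^2(y_gt) + E_{y ~ p(y|X)}[r_i^2(y)])."""
    cost = 0.5 * np.sum((w * residuals(grid)) ** 2, axis=-1)
    p = np.exp(-(cost - cost.min()))
    p /= p.sum()                                  # posterior mass per grid cell
    expected_r2 = p @ residuals(grid) ** 2
    return w * (-residuals(y_gt) ** 2 + expected_r2)

# Central finite differences of the loss w.r.t. each weight.
eps = 1e-6
fd = np.array([(kl_loss(w + eps * e) - kl_loss(w - eps * e)) / (2.0 * eps)
               for e in np.eye(len(w))])
analytic = neg_grad_w(w)
```

A positive entry of `analytic` means the corresponding weight is pushed up, i.e., the discrimination term outweighs the uncertainty term for that correspondence.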

3.4 Limitations and Derivative Regularization Loss

In practice, we observe that the KL divergence loss has two limitations:

•

While the KL divergence is a good metric for the probabilistic distribution, existing evaluation protocols are all based on the point estimate of pose $y^{\ast}$ . Therefore, for inference it is still required to locate a mode $y^{\ast}$ of the posterior $p(y|X)$ by solving the PnP problem in Eq. (1), which could be sub-optimal if trained solely with the KL loss.
•

The 2D-3D correspondences are underdetermined if we only impose the KL loss when training the network. Learning these entangled elements could be difficult if the network architecture is not designed carefully with preferable inductive bias.

The above limitations can be mitigated by an additional regularization loss on $y^{\ast}$ that backpropagates through the Gauss-Newton (GN) least squares solver or its variants [7]. We call it the derivative regularization loss, since GN is a derivative-based optimizer, and the loss therefore acts on the derivatives of the log-density $\log{p(y|X)}$ to direct the GN increment $\Delta y$ towards the true pose $y_{\text{gt}}$ .

To employ the regularization during training, a detached solution $y^{\ast}$ is obtained first. Then, at $y^{\ast}$ , a final GN increment is evaluated (which ideally equals 0 if $y^{\ast}$ has already converged to the local optimum):

\Delta y=-{\underbrace{(J^{\text{T}}J)}_{H\text{ (approx.)}}}^{-1}\underbrace{J^{\text{T}}F(y^{\ast})}_{g},

(9)

where $F(y^{\ast})=\left[f_{1}^{\text{T}}(y^{\ast}),f_{2}^{\text{T}}(y^{\ast}),\cdots,f_{N}^{\text{T}}(y^{\ast})\right]^{\text{T}}$ is the flattened vector of weighted reprojection errors of all points, $J=\partial{F(y)}/\partial{y^{\text{T}}}\bigm|_{y=y^{\ast}}$ is the Jacobian matrix, $J^{\text{T}}F(y^{\ast})$ equals the gradient $g$ of the negative log-likelihood (NLL) w.r.t. object pose, i.e., $\partial{\frac{1}{2}\sum_{i=1}^{N}\left\|f_{i}(y)\right\|^{2}}/\partial{y}$ evaluated at $y^{\ast}$, and $J^{\text{T}}J$ is an approximation of the Hessian matrix $H=\partial{g}/\partial{y^{\text{T}}}$. We therefore design the regularization loss as follows:

\mathcal{L}_{\text{reg}}=l(y^{\ast}+\Delta y,y_{\text{gt}}),

(10)

where $l(\cdot,\cdot)$ is a distance metric for pose. We adopt smooth L1 for position and cosine similarity for orientation (see supplementary materials for details). Note that the gradient is only backpropagated through $\Delta y$ , which is analytically differentiable w.r.t. the 2D-3D correspondences.
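The Gauss-Newton increment of Eq. (9) can be sketched as below. The example is an assumption-laden illustration: the residual model is linear (so that a single GN step lands exactly on the least squares solution), the pose is treated as a plain 3-vector, and `gauss_newton_increment` is a hypothetical helper name. In the actual setting, $J$ and $F$ come from the weighted reprojection errors and $y^{\ast}$ is detached from the graph.

```python
import numpy as np

def gauss_newton_increment(J, F):
    """Eq. (9): dy = -(J^T J)^{-1} J^T F, solved without forming an explicit inverse."""
    return -np.linalg.solve(J.T @ J, J.T @ F)

# Hypothetical linear residual model F(y) = A y - b, for which J = A everywhere.
rng = np.random.default_rng(0)
A = rng.normal(size=(12, 3))
y_gt = np.array([0.3, -1.2, 2.0])
b = A @ y_gt

y_star = y_gt + np.array([0.05, -0.02, 0.01])   # slightly-off solution, treated as detached
delta_y = gauss_newton_increment(A, A @ y_star - b)
refined = y_star + delta_y                       # the quantity compared to y_gt in Eq. (10)
```

Because only `delta_y` depends on the correspondences, a loss on `refined` shapes the local derivatives of the log-density around $y^{\ast}$ rather than the solver iterations themselves.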

This loss not only addresses the first limitation by moving $y^{\ast}$ towards $y_{\text{gt}}$ , but also partially disentangles the 2D-3D correspondences. To analyze the effect of the loss on the correspondences, we consider a local approximation of Eq. (10), assuming equal weights for position and orientation:

\begin{aligned}
\mathcal{L}_{\text{reg}} &\approx \left\|y^{\ast}+\Delta y-y_{\text{gt}}\right\|^{2} \\
&= \left\|y^{\ast}-\underbrace{(J^{\text{T}}J)^{-1}J^{\text{T}}}_{J^{+}}F(y^{\ast})-y_{\text{gt}}\right\|^{2}.
\end{aligned}

(11)

Note that $(J^{\text{T}}J)^{-1}J^{\text{T}}$ is also the pseudoinverse of the matrix $J$, which can be denoted by $J^{+}$ for brevity. Then, taking the first-order approximation $F(y^{\ast})=F(y_{\text{gt}})+J\left(y^{\ast}-y_{\text{gt}}\right)$ and using $J^{+}J=I$, the loss can be approximated as:

\begin{aligned}
\mathcal{L}_{\text{reg}} &\approx \left\|y^{\ast}-J^{+}\left(F(y_{\text{gt}})+J\left(y^{\ast}-y_{\text{gt}}\right)\right)-y_{\text{gt}}\right\|^{2} \\
&= \left\|J^{+}F(y_{\text{gt}})\right\|^{2}.
\end{aligned}

(12)

This indicates that the derivative regularization loss is analogous to the reprojection-based surrogate loss $\mathopen{}\mathclose{{}\left\|F(y_{\text{gt}})}\right\|^{2}$ (Section 3.1.2). Although the extra weighting matrix $J^{+}$ makes the individual elements in the reprojection vector $F(y_{\text{gt}})$ underdetermined, over multiple samples and mini-batches there remains a tendency of independently minimizing each of the elements, i.e., minimizing the reprojection error of each correspondence. Thus, it helps to overcome the potential training difficulties associated with the KL loss.

The regularization loss can also serve as an independent objective for training pose estimators, akin to RePOSE [7]. However, since we observe that this objective alone is not effective in addressing pose ambiguity, it is treated as a secondary regularization in this study.

4 Implementation Details

4.1 Dynamic KL Loss Weight

Following [28], we compute a dynamic loss weight for $\mathcal{L}_{\text{KL}}$ so that the magnitude of its gradients is consistent regardless of the entropy of the distribution. This is implemented by computing the exponential moving average (EMA) of the 1-norm of the sum of weights $\mathopen{}\mathclose{{}\left\|\sum_{i=1}^{N}{w^{\text{2D}}_{i}}}\right\|_{1}$ , and using the reciprocal of the EMA value as the dynamic loss weight for $\mathcal{L}_{\text{KL}}$ . Intuitively, this cancels out the effect of the magnitude of $w^{\text{2D}}_{i}$ on the loss gradients w.r.t. $x^{\text{2D}}_{i}$ and $x^{\text{3D}}_{i}$ .
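A possible realization of this dynamic weight is sketched below. The class name `DynamicKLWeight`, the momentum value, and the choice to initialize the EMA with the first observation are all assumptions; the text only specifies that the reciprocal of the EMA of $\left\|\sum_{i}w^{\text{2D}}_{i}\right\|_{1}$ is used.

```python
import numpy as np

class DynamicKLWeight:
    """Tracks an EMA of ||sum_i w_i^2D||_1 and returns its reciprocal as the loss weight."""

    def __init__(self, momentum=0.99):  # assumed value, not taken from the paper
        self.momentum = momentum
        self.ema = None

    def update(self, w2d):
        """w2d: (N, 2) array of positive weights for one object."""
        norm = np.abs(w2d.sum(axis=0)).sum()   # 1-norm of the summed weight vector
        if self.ema is None:
            self.ema = norm                    # assumed initialization
        else:
            self.ema = self.momentum * self.ema + (1.0 - self.momentum) * norm
        return 1.0 / self.ema
```

In training, the returned scalar would multiply $\mathcal{L}_{\text{KL}}$ and should be treated as a constant, i.e., no gradient flows through it.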

4.2 Adaptive Huber Kernel

For the PnP formulation in Eq. (1), the plain L2 reprojection errors $\left\|f_{i}(y)\right\|^{2}$ are sensitive to outliers, which limits the model's expressiveness in representing the multi-modal distributions that characterize ambiguity. Therefore, we robustify the reprojection errors using the Huber kernel $\rho(\cdot)$, yielding an alternative formulation:

\operatorname*{arg\,min}_{y}\frac{1}{2}\sum_{i=1}^{N}\rho\left(\left\|f_{i}(y)\right\|^{2}\right).

(13)

The Huber kernel with threshold $\delta$ is defined as:

\rho(s)=\begin{dcases}s,&s\leq\delta^{2},\\ \delta(2\sqrt{s}-\delta),&s>\delta^{2}.\end{dcases}

(14)

To robustify the weighted reprojection errors of various scales, we adopt an adaptive threshold $\delta$ defined as a function of the weights ${w^{\text{2D}}_{i}}$ and 2D coordinates ${x^{\text{2D}}_{i}}$ :

\delta=\delta_{\text{rel}}\frac{\left\|\bar{w}^{\text{2D}}\right\|_{1}}{2}\left(\frac{1}{N-1}\sum_{i=1}^{N}{\left\|x^{\text{2D}}_{i}-\bar{x}^{\text{2D}}\right\|^{2}}\right)^{\frac{1}{2}}

(15)

with the relative threshold $\delta_{\text{rel}}$ as a hyperparameter, and the mean vectors $\bar{w}^{\text{2D}}=\frac{1}{N}\sum_{i=1}^{N}{w_{i}^{\text{2D}}},\,\bar{x}^{\text{2D}}=\frac{1}{N}\sum_{i=1}^{N}{x_{i}^{\text{2D}}}$.

Accordingly, the reprojection errors $F(y)$ and Jacobian matrix $J$ in Eq. (9) have to be rescaled (see supplementary).
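Eqs. (14) and (15) translate directly into a few lines of NumPy. The sketch below is illustrative only; the function names `huber` and `adaptive_delta` are hypothetical, and the rescaling of $F(y)$ and $J$ mentioned above is not covered.

```python
import numpy as np

def huber(s, delta):
    """Eq. (14), applied to squared error norms s = ||f_i(y)||^2."""
    s = np.asarray(s, dtype=float)
    return np.where(s <= delta ** 2, s, delta * (2.0 * np.sqrt(s) - delta))

def adaptive_delta(w2d, x2d, delta_rel):
    """Eq. (15): threshold scaled by the mean weight and the spread of the 2D points."""
    n = x2d.shape[0]
    w_bar = w2d.mean(axis=0)
    x_bar = x2d.mean(axis=0)
    spread = np.sqrt(np.sum((x2d - x_bar) ** 2) / (n - 1))
    return delta_rel * np.abs(w_bar).sum() / 2.0 * spread
```

The kernel is quadratic in the error norm below the threshold and linear above it, and the two branches agree at $s=\delta^{2}$, so large outliers contribute less to the log-likelihood.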

4.3 Initialization

Since the Levenberg-Marquardt (LM) solver only finds a local solution, initialization plays a decisive role in dealing with ambiguity. We implement a random sampling algorithm analogous to RANSAC to search for the global optimum efficiently.

Given the $N$-point correspondence set $X=\left\{x^{\text{3D}}_{i},x^{\text{2D}}_{i},w^{\text{2D}}_{i}\right\}_{i=1}^{N}$, we generate $M$ subsets consisting of $n$ corresponding points each ($3\leq n<N$), by repeatedly sub-sampling $n$ indices without replacement from a multinomial distribution, whose probability mass function $p(i)$ is defined by the corresponding weights:

p(i)=\frac{\left\|w^{\text{2D}}_{i}\right\|_{1}}{\sum_{k=1}^{N}{\left\|w^{\text{2D}}_{k}\right\|_{1}}}.

(16)

From each subset, a pose hypothesis can be solved via the LM algorithm within very few iterations (e.g. 3 iterations). This is implemented as a batch operation on GPU, and is rather efficient for small subsets. We take the hypothesis with maximum log-likelihood $\log{p(X|y)}$ as the initial point, starting from which subsequent LM iterations are computed on the full set $X$ .
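The weighted subset sampling of Eq. (16) can be sketched with NumPy's random generator. The helper name `sample_subsets` is hypothetical and the loop is written for clarity; a batched GPU implementation, as described above, would look different.

```python
import numpy as np

def sample_subsets(w2d, num_subsets, subset_size, rng):
    """Draw M index subsets of size n without replacement, with p(i) from Eq. (16)."""
    norms = np.abs(w2d).sum(axis=1)      # ||w_i^2D||_1 for each correspondence
    p = norms / norms.sum()
    return np.stack([
        rng.choice(len(p), size=subset_size, replace=False, p=p)
        for _ in range(num_subsets)
    ])
```

Each row of the returned array indexes one minimal problem. After solving all of them, the hypothesis with the largest $\log{p(X|y)}$ on the full set would be kept as the starting point.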

4.3.1 Training Mode Initialization

During training, the LM PnP solver is utilized for estimating the location and concentration of the initial proposal distribution in the AMIS algorithm. The location is very important to the stability of Monte Carlo training. If the LM solver fails to find the global optimum, and the location of the local optimum is far from the true pose $y_{\text{gt}}$, the balance between the two oppositely signed terms in Eq. (5) may be broken, leading to exploding gradients in the worst-case scenario. To avoid this problem, we adopt a simple initialization trick: we compare the log-likelihood $\log{p(X|y)}$ of the ground truth $y_{\text{gt}}$ and the selected hypothesis, and then keep the one with higher likelihood as the initial state of the LM solver.

5 6DoF Pose Estimation based on CDPN

To demonstrate that EPro-PnP can be applied to off-the-shelf 2D-3D correspondence networks, experiments have been conducted on CDPN [18], a dense correspondence network for 6DoF pose estimation.

5.1 Network Architecture

The original CDPN feeds cropped image regions within the detected 2D boxes into the pose estimation network, to which two decoupled heads are appended for rotation and translation respectively. The rotation head is PnP-based while the translation head uses explicit center and depth regression. This paper discards the translation head to focus entirely on PnP, and modifies only the last layer of the rotation head for strict comparison to the baseline.

As shown in Figure 4, apart from the standard 3D coordinate map, the network predicts a 2-channel weight map (originally a single-channel segmentation mask). We find it necessary to predict a global scale $w^{\text{2D}}_{\text{S}}$ separately, and apply it to the normalized weights ${w^{\text{2D}}_{\text{N}}}_{i}$ that satisfy $\sum_{i=1}^{N}{{w^{\text{2D}}_{\text{N}}}_{i}}=[1,1]^{\text{T}}$. Intuitively, the global scale controls the entropy of the pose distribution $p(y|X)$ as it scales the entire log-likelihood, while the normalized weights determine the relative importance of each correspondence. This helps to overcome the entangling effect of the KL loss mentioned in Section 3.4. Inspired by the attention mechanism [24], the normalized weights are activated via spatial Softmax, focusing on important regions in the image. The global scale is usually inversely proportional to the 2D size of the object due to the uncertainty in reprojection, and is hard-coded as such in this network.
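The decomposition into normalized weights and a global scale can be sketched as follows. This is only an illustration: the function name `correspondence_weights` and the array layout are assumptions, and the global scale is simply passed in as an argument.

```python
import numpy as np

def correspondence_weights(logits, global_scale):
    """logits: (2, H, W) raw weight map; global_scale: (2,) positive scale w_S^2D.

    Returns (H*W, 2) weights. The normalized part sums to [1, 1] over all pixels.
    """
    flat = logits.reshape(2, -1)
    flat = flat - flat.max(axis=1, keepdims=True)         # stabilize the exponentials
    normalized = np.exp(flat)
    normalized /= normalized.sum(axis=1, keepdims=True)   # spatial Softmax per channel
    return (normalized * np.asarray(global_scale)[:, None]).T
```

Because the Softmax output always sums to one per channel, the overall magnitude of the weights, and hence the sharpness of $p(y|X)$, is carried entirely by `global_scale`.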

The original CDPN imposes a masked coordinate regression loss [18] to learn the dense correspondences, using the ground truth object 3D models to render the target masks and 3D coordinate maps. With EPro-PnP, however, this extra geometry supervision is optional, as we demonstrate that the entire network can be trained solely by the KL loss $\mathcal{L}_{\text{KL}}$ and/or the derivative regularization loss $\mathcal{L}_{\text{reg}}$ . To reduce the Monte Carlo overhead, 512 points are randomly sampled from the 64×64 dense points to compute $\mathcal{L}_{\text{KL}}$ .
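The subsampling step can be sketched as below. Sampling uniformly without replacement and using integer grid positions as the 2D points are assumptions made for illustration; `subsample_correspondences` is a hypothetical helper.

```python
import numpy as np

def subsample_correspondences(x3d_map, w2d_map, num_samples=512, rng=None):
    """Pick a random subset of the dense H x W correspondences.

    x3d_map: (3, H, W) predicted 3D coordinates.
    w2d_map: (2, H, W) predicted weights.
    Returns x3d (S, 3), x2d (S, 2) as grid coordinates, and w2d (S, 2).
    """
    rng = np.random.default_rng() if rng is None else rng
    _, h, w = x3d_map.shape
    idx = rng.choice(h * w, size=num_samples, replace=False)  # assumed: no replacement
    ys, xs = np.divmod(idx, w)                                # flat index -> (row, column)
    x3d = x3d_map[:, ys, xs].T
    w2d = w2d_map[:, ys, xs].T
    x2d = np.stack([xs, ys], axis=1).astype(np.float64)
    return x3d, x2d, w2d

rng = np.random.default_rng(0)
x3d_map = rng.normal(size=(3, 64, 64))
w2d_map = rng.uniform(size=(2, 64, 64))
x3d_s, x2d_s, w2d_s = subsample_correspondences(x3d_map, w2d_map, 512, rng)
```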

TABLE I: Results of the CDPN baseline. A0 and A1 are reproduced with the official code (https://git.io/JXZv6).

ID	Method	ADD(-S) 0.02d	ADD(-S) 0.05d	ADD(-S) 0.1d	Mean
A0	CDPN-Full [18]	29.10	69.50	91.03	63.21
A1	CDPN w/o trans. head	15.93	46.79	74.54	45.75
A2	A1 → Batch=32, LM solver	21.17	55.00	79.96	52.04

5.2 Dataset and Metrics

As in CDPN, we use the LineMOD [6] 6DoF pose estimation dataset to conduct our experiments. The dataset consists of 13 sequences, each containing about 1.2K images annotated with 6DoF poses of a single object. Following [36], the images are split into the training and testing sets, with about 200 images per object for training. For data augmentation, we use the same synthetic data as in CDPN [18].

We use two common metrics for evaluation: ADD(-S) and $n\text{\textdegree},n\,\text{cm}$ . The ADD measures whether the average deviation of the transformed model points is less than a certain fraction of the object’s diameter (e.g., ADD-0.1d). For symmetric objects, ADD-S computes the average distance to the closest model point. $n\text{\textdegree},n\,\text{cm}$ measures the accuracy of pose based on angular/positional error thresholds. All metrics are presented as percentages.
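For concreteness, the metrics can be written out as a short NumPy sketch. The function names and the cube example are illustrative assumptions; the thresholding follows the description above.

```python
import numpy as np

def transform(R, t, pts):
    """Apply a rigid transform to (N, 3) model points."""
    return pts @ R.T + t

def add_error(R_est, t_est, R_gt, t_gt, pts):
    """ADD: mean distance between corresponding transformed model points."""
    diff = transform(R_est, t_est, pts) - transform(R_gt, t_gt, pts)
    return float(np.linalg.norm(diff, axis=1).mean())

def add_s_error(R_est, t_est, R_gt, t_gt, pts):
    """ADD-S: mean distance from each ground-truth point to its closest estimated point."""
    est = transform(R_est, t_est, pts)
    gt = transform(R_gt, t_gt, pts)
    pairwise = np.linalg.norm(gt[:, None, :] - est[None, :, :], axis=2)
    return float(pairwise.min(axis=1).mean())

def rot_trans_errors(R_est, t_est, R_gt, t_gt):
    """Geodesic rotation error in degrees and Euclidean translation error."""
    cos = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    angle_deg = float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
    return angle_deg, float(np.linalg.norm(t_est - t_gt))

def is_correct_add(err, diameter, frac=0.1):
    """A pose counts as correct if the error is below frac * object diameter."""
    return err < frac * diameter

# Symmetric toy object: the 8 corners of a cube, estimated pose rotated 90 degrees about Z.
corners = np.array([[x, y, z] for x in (-1.0, 1.0)
                    for y in (-1.0, 1.0) for z in (-1.0, 1.0)])
diameter = float(np.linalg.norm(corners[0] - corners[-1]))
R_gt, t_gt = np.eye(3), np.zeros(3)
R_z90 = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
add = add_error(R_z90, t_gt, R_gt, t_gt, corners)
add_s = add_s_error(R_z90, t_gt, R_gt, t_gt, corners)
angle_deg, trans = rot_trans_errors(R_z90, t_gt, R_gt, t_gt)
```

The cube example shows why the symmetric variant exists: a 90° rotation maps the corner set onto itself, so ADD-S is zero while ADD and the angular error are large.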

Although some objects in the dataset are nearly rotationally symmetric, we observe that our model has no trouble identifying their exact orientations. Therefore, the presented results should be close to the scenario without pose ambiguity.

5.3 Baseline

For strict comparison, general settings are kept the same as in CDPN [18] (with ResNet-34 [40] as backbone). As shown in Table I, the original CDPN-Full (A0) trains the network in 3 stages totaling 480 epochs using RMSprop. With the translation head removed, we only train the rotation head in a single stage of 160 epochs (A1), which greatly reduces the pose accuracy (45.75 vs. 63.21). Additionally, we improve the baseline by using the LM solver with Huber kernel at test time, and increase the batch size to 32 for shorter training wall time (A2). Instead of using the advanced initialization technique in Section 4.3, we adopt the simple EPnP [41] initialization without RANSAC.

5.4 Main Results and Discussions

As shown in Table II, we conduct ablation studies to reveal the contributions of the Monte Carlo KL loss $\mathcal{L}_{\text{KL}}$ , the derivative regularization loss $\mathcal{L}_{\text{reg}}$ , the original coordinate regression loss $\mathcal{L}_{\text{crd}}$ in CDPN [18], and initializing the model with pretrained weights from A1.

5.4.1 KL Loss vs. Coordinate Regression

Training the model from scratch with the KL loss alone (B0) significantly outperforms the baseline model (A2) trained with the coordinate regression loss (61.87 vs. 52.04), despite the lack of geometry supervision from the ground truth object 3D models.

5.4.2 KL Loss and Derivative Regularization

Both the KL loss (B0) and the derivative regularization loss (B1) perform well independently on this benchmark. Because pose ambiguity is not noticeable in the LineMOD dataset, the solver-based derivative regularization loss performs better than the KL loss (63.15 vs. 61.87). Nevertheless, the best pose accuracy without knowing the object geometry is achieved by combining both loss functions (B2), which even outperforms CDPN-Full (A0) by a clear margin (67.36 vs. 63.21).

5.4.3 With Knowledge of the Object 3D Models

On top of B2, one can further impose the coordinate regression loss $\mathcal{L}_{\text{crd}}$ (B4) with target 3D coordinates rendered from the object 3D models, further improving the pose accuracy. Yet a better approach to exploiting the 3D models is to pretrain the network in the traditional way (A1) and then finetune it with EPro-PnP (B5), yielding significantly better results (73.87). This training scheme partially benefits from more training epochs (2×160 in total). Furthermore, keeping the coordinate regression loss during finetuning (B6) slightly improves the score (73.95 vs. 73.87).

We also observe that both the derivative regularization loss (B2) and the coordinate regression loss (B3) improve the results of the bare KL loss setup (B0) to similar extents (67.36 vs. 67.74), as they are both disentangled objectives.

5.5 Comparison to Implicit Differentiation and Reprojection-Based Loss

As shown in Table III, when the coordinate regression loss is removed, i.e., object 3D models are unavailable, both implicit differentiation and reprojection loss fail to learn the pose properly. Yet EPro-PnP manages to learn the 3D coordinates and weights from scratch. This validates that EPro-PnP can be used as a general pose estimator without relying on geometric prior.

TABLE II: Results on EPro-PnP-enhanced CDPN. $\mathcal{L}_{\text{crd}}$ refers to the masked coordinate regression loss in the original CDPN [18]; here the loss is imposed only on $x^{\text{3D}}$ , not $w^{\text{2D}}$ . Init. refers to initializing the model with pretrained weights from A1.

ID	$\mathcal{L}_{\text{KL}}$	$\mathcal{L}_{\text{reg}}$	$\mathcal{L}_{\text{crd}}$	Init.	ADD(-S) 0.02d	ADD(-S) 0.05d	ADD(-S) 0.1d	Mean
B0	✓				28.48	67.20	89.93	61.87
B1		✓			25.86	70.90	92.68	63.15
B2	✓	✓			34.08	74.16	93.85	67.36
B3	✓		✓		34.40	75.00	93.83	67.74
B4	✓	✓	✓		36.22	75.97	94.64	68.94
B5	✓	✓		✓	43.34	82.13	96.14	73.87
B6	✓	✓	✓	✓	43.77	81.73	96.36	73.95

TABLE III: Comparison among loss functions by experiments conducted on the same dense correspondence network. For implicit differentiation, we minimize the distance metric of pose in Eq. (10) instead of the reprojection-metric pose loss in BPnP [12].

Main Loss	$\mathcal{L}_{\text{crd}}$	2°	2 cm	2°, 2 cm	ADD(-S) 0.1d
Implicit diff. [12]		divergence
Reprojection [28]		0.32	42.30	0.16	14.56
KL div. (ours)		58.28	91.17	55.71	89.93
Implicit diff. [12]	✓	56.13	91.13	53.33	88.74
Reprojection [28]	✓	62.79	92.91	60.65	92.04
KL div. (ours)	✓	69.95	94.97	68.38	93.83

TABLE IV: Comparison to the state-of-the-art geometric methods. BPnP [12] is not included as it adopts a different train/test split. *Although GDRNet [43] only reports the performance in its ablation section, it is still a fair comparison to our method, since both use the same baseline (CDPN).

Method	Type	ADD(-S) 0.02d	ADD(-S) 0.05d	ADD(-S) 0.1d
CDPN [18]	PnP + Explicit depth	-	-	89.86
HybridPose [14]	Hybrid constraints	-	-	91.3
GDRNet* [43]	PnP + Explicit depth	35.6	76.0	93.6
DPOD [8]	PnP + Explicit refiner	-	-	95.15
PVNet-RePOSE [7]	PnP + Implicit refiner	-	-	96.1
PVNet-RNNPose [42]	PnP + Implicit refiner	50.39	85.56	97.37
Ours	PnP	43.77	81.73	96.36

5.6 Comparison to the State of the Art

As shown in Table IV, although we base EPro-PnP on the older baseline CDPN [18], the results are better than some of the more advanced methods, e.g., the pose refiner RePOSE [7] that adds extra overhead to the PnP-based initial estimator PVNet [13]. Among all these entries, EPro-PnP is the most straightforward as it simply solves the PnP problem itself, without refinement network [7, 8, 42], explicit depth prediction [18, 43], or multiple representations [14].

Moreover, removing the translation head (depth prediction) from the original CDPN-Full results in far fewer parameters in our model (from 113M to 27M), and the overall inference speed is more than twice as fast as CDPN-Full (including data loading, measured at a batch size of 32), even though we introduce the iterative LM solver. Furthermore, faster inference is possible if the number of points $N=64\times 64$ is reduced to an optimal level.

5.7 Visualizations

As illustrated in Figure 5, the weight maps predicted by the model trained with the KL loss (B0) tend to be more focused on important parts of the objects (e.g., the head and handle of the watering can), while those with the derivative regularization loss (B1) are more evenly spread out. Combining the two loss functions (B2) leads to more reasonable weighting, and more details in the object geometry (represented by $x^{\text{3D}}$ ). With additional geometry pretraining and supervision (B6), the model outputs sharper correspondence maps, which contribute to higher pose accuracy and lower entropy of the probabilistic pose.

6 3D Object Detection based on Deformable Correspondence Network

To demonstrate that EPro-PnP can learn the entire set of 2D-3D correspondences $\{x^{\text{3D}}_{i},x^{\text{2D}}_{i},w^{\text{2D}}_{i}\}_{i=1}^{N}$ from scratch, and that it enables novel correspondence networks capable of handling pose ambiguity, we propose a deformable correspondence network for 3D object detection. The network owes its name to Deformable DETR [27], a work that inspired our model architecture.

6.1 Network Architecture

As shown in Figure 6, the deformable correspondence network is an extension of the FCOS3D [44] framework. The original FCOS3D is a one-stage detector that directly regresses the center offset, depth, and yaw orientation of multiple objects for 4DoF pose estimation. In our adaptation, the outputs of the multi-level FCOS head [45] are modified to generate object queries instead of directly predicting the pose. Inspired by Deformable DETR [27], the appearance and position of a query are disentangled into the object embedding vector and the reference point. Moreover, to better distinguish objects of different classes, we learn a set of class embedding vectors, one of which is selected according to the object label and aggregated into the object embedding vector via addition (not shown in Figure 6 for brevity).

With the object queries, a multi-head deformable attention layer [27] is adopted to sample the key-value pairs from the interpolated dense feature map. The values are projected into point-wise features (point feat) and, at the same time, aggregated into the object-level features (obj feat).

The point features are passed into a subnet that predicts the 3D points and corresponding weights (normalized by Softmax). Following MonoRUn [28], the 3D points are set in the normalized object coordinate (NOC) space to handle categorical objects of various sizes.

The object features are responsible for predicting the object-level properties: (a) the 3D score (i.e., 3D localization confidence), (b) the global weight scale, (c) the 3D box size for recovering the absolute scale of the 3D points, and (d) other optional properties (velocity, attribute) required by the nuScenes benchmark [4].

6.1.1 Implementation Details

We adopt the same detector architecture as in FCOS3D [44], with ResNet-101-DCN [46] as backbone. The deformable correspondence head predicts $N=128$ pairs of 2D-3D points. The network is trained for 12 epochs by the AdamW [47] optimizer, with a batch size of 12 images across 2 GPUs on the nuScenes dataset [4].

6.2 Loss Functions

6.2.1 Correspondence Loss

The deformable 2D-3D correspondences can be learned solely with the KL divergence loss $\mathcal{L}_{\text{KL}}$ , or in conjunction with the regularization loss $\mathcal{L}_{\text{reg}}$ .

6.2.2 Auxiliary Correspondence Loss (Optional)

Inspired by the dense correspondence network MonoRUn [28], we regularize the dense features by appending a small auxiliary network that predicts the multi-head dense 3D coordinates and weights corresponding to densely-sampled 2D points within the ground truth (using RoI Align [48]). This allows us to employ the uncertainty-aware reprojection loss $\mathcal{L}_{\text{proj}}$ [28] without directly regularizing the deformable correspondences. Furthermore, we can convert the LiDAR scan of objects into sparse 3D object coordinate maps, so that the classical coordinate regression loss $\mathcal{L}_{\text{crd}}$ can be imposed on the auxiliary branch as well. Both of the loss functions are implemented as the NLL of Gaussian mixtures to deal with ambiguity (see supplementary for details).

6.2.3 Other Loss Functions

Loss functions on the FCOS head include:

• Basic detector loss, including focal loss [49] for classification and cross entropy loss for centerness.
• A smooth L1 loss for regressing the 2D reference points, with the target defined as the center of the visible region of the objects.
• A GIoU loss [50] for auxiliary 2D box regression, following the 2D auxiliary supervision in M²BEV [51].

Loss functions for the object-level predictions include:

• A cross entropy loss for the 3D score.
• A smooth L1 loss for regressing the 3D box size.
• A smooth L1 loss for regressing the velocity and a cross entropy loss for attribute classification.

Additionally, inspired by DD3D [2], we further exploit the available LiDAR data to build an auxiliary depth supervision. By projecting the LiDAR points to the camera frame, we extract the point-wise features from the interpolated dense feature map, which are then fed into a small 2-layer MLP to predict the scene depth. Same as the auxiliary correspondence loss functions in Section 6.2.2, the depth loss is implemented as the NLL of Gaussian mixtures, which allows modeling discontinuities around sharp edges [52].
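As a rough illustration of why a mixture NLL helps with ambiguous or discontinuous targets, the following is a minimal 1D sketch. The exact parameterization used in the paper is given in its supplementary material and is not reproduced here; the isotropic 1D Gaussian components, the log-sigma parameterization, and the function name `gaussian_mixture_nll` are assumptions.

```python
import numpy as np

def gaussian_mixture_nll(target, logits, means, log_sigmas):
    """Negative log-likelihood of scalar targets under per-sample 1D Gaussian mixtures.

    target: (B,); logits, means, log_sigmas: (B, M) for M mixture components.
    """
    # Log mixture weights via a numerically stable log-softmax over components.
    log_pi = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)
    z = (target[:, None] - means) / np.exp(log_sigmas)
    log_comp = -0.5 * z ** 2 - log_sigmas - 0.5 * np.log(2.0 * np.pi)
    # Log-sum-exp over components gives the mixture log-density.
    log_prob = np.logaddexp.reduce(log_pi + log_comp, axis=1)
    return -log_prob

# A bimodal prediction (modes at -1 and +1) versus a unimodal one centered at 0.
target = np.array([1.0])
log_sigma = np.log(0.1)
nll_bimodal = gaussian_mixture_nll(
    target, np.zeros((1, 2)), np.array([[-1.0, 1.0]]), np.full((1, 2), log_sigma))
nll_unimodal = gaussian_mixture_nll(
    target, np.zeros((1, 1)), np.array([[0.0]]), np.full((1, 1), log_sigma))
```

When the target sits on one of two plausible modes, the mixture assigns it a much lower NLL than a single Gaussian placed between the modes, which is the behavior wanted near depth discontinuities and for ambiguous correspondences.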

6.3 Dataset and Metrics

We evaluate the deformable correspondence network on the nuScenes 3D object detection benchmark [4], which provides a large scale of data collected in 1000 scenes. Each scene contains 40 keyframes, annotated with a total of 1.4M 3D bounding boxes from 10 categories. Each keyframe includes 6 RGB images collected from surrounding cameras. The data is split into 700/150/150 scenes for training/validation/testing. The official benchmark evaluates the average precision with true positives judged by 2D center error on the ground plane. The mAP metric is computed by averaging over the thresholds of 0.5, 1, 2, 4 meters. Besides, there are 5 true positive metrics: Average Translation Error (ATE), Average Scale Error (ASE), Average Orientation Error (AOE), Average Velocity Error (AVE) and Average Attribute Error (AAE). Finally, there is a nuScenes detection score (NDS) computed as a weighted average of the above metrics.
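The weighting behind NDS is not spelled out above. In the official nuScenes definition, mAP receives a weight of 5 and each of the five true positive errors is clipped to 1 and converted to a score, with the sum divided by 10. The sketch below applies that definition; the function name is hypothetical, and the inputs are the rounded values of the "Ours" row (R101, without TTA) in Table VI.

```python
def nuscenes_detection_score(mean_ap, tp_errors):
    """NDS = (5 * mAP + sum over TP metrics of (1 - min(1, error))) / 10."""
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors]
    return (5.0 * mean_ap + sum(tp_scores)) / 10.0

# mATE, mASE, mAOE, mAVE, mAAE from Table VI ("Ours", R101).
nds = nuscenes_detection_score(0.409, [0.559, 0.239, 0.325, 1.090, 0.115])
```

With these rounded inputs the result agrees with the reported NDS of 0.481 to within rounding. Note that an error above 1 (here mAVE) contributes a score of zero rather than a negative value.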

6.4 Main Results and Discussions

6.4.1 Comparison Among Correspondence Loss Functions

As shown in Table V, the model trained with the KL loss alone (C0) is significantly stronger than the model trained with the derivative regularization loss alone (C1) in all the metrics of concern, especially the orientation error (0.332 vs. 0.607). This is due to the presence of orientation ambiguity in the nuScenes dataset. Even if all the auxiliary loss functions are applied (C2), the derivative regularization loss still fails to reach performance comparable to the Monte Carlo KL loss. Combining all the loss functions (C3) boosts the results even further.

TABLE V: Experiments on the nuScenes validation set.

ID	$\mathcal{L}_{\text{KL}}$	$\mathcal{L}_{\text{reg}}$	Aux. $\mathcal{L}_{\text{crd}}$	Aux. $\mathcal{L}_{\text{proj}}$	NDS↑	mAP↑	mATE↓	mAOE↓
C0	✓				0.447	0.380	0.656	0.332
C1		✓			0.408	0.363	0.683	0.607
C2		✓	✓	✓	0.429	0.363	0.691	0.397
C3	✓	✓	✓	✓	0.463	0.392	0.626	0.282

TABLE VI: Comparison to the state-of-the-art single-frame image-based 3D object detectors on the nuScenes test set. Methods with extra pretraining other than ImageNet backbone are not included for comparison. § indicates test-time flip augmentation (TTA). † indicates model ensemble.

Method	Backbone	NDS↑	mAP↑	mATE↓	mASE↓	mAOE↓	mAVE↓	mAAE↓
FCOS3D §† [44]	R101	0.428	0.358	0.690	0.249	0.452	1.434	0.124
PGD § [1]	R101	0.448	0.386	0.626	0.245	0.451	1.509	0.127
PETR [53]	R101	0.455	0.391	0.647	0.251	0.433	0.933	0.143
BEVFormer [54]	R101	0.462	0.409	0.650	0.261	0.439	0.925	0.147
PolarFormer [55]	R101	0.470	0.415	0.657	0.263	0.405	0.911	0.139
PETR [53]	Swin-B	0.483	0.445	0.627	0.249	0.449	0.927	0.141
Ours	R101	0.481	0.409	0.559	0.239	0.325	1.090	0.115
Ours §	R101	0.490	0.423	0.547	0.236	0.302	1.071	0.123

6.4.2 Comparison to the State of the Art

Results on the nuScenes test set [4] are shown in Table VI. At the time of submitting the manuscript (Jan 2023), EPro-PnP is the No. 1 single-frame monocular 3D object detector without extra data, according to the official nuScenes detection leaderboard. Among the models using ResNet-101 as the backbone, EPro-PnP outperforms PolarFormer [55] by a clear margin (NDS 0.481 vs. 0.470), despite basing the deformable correspondence network on the older FCOS detector. With test-time flip augmentation (following FCOS3D [44]), our model even outperforms PETR [53] with the bulky Swin-B [56] backbone (NDS 0.490 vs. 0.483).

Since EPro-PnP is targeted at improving pose accuracy, it is not surprising to see that our model obtains exceptional results regarding the mATE and mAOE metrics, outperforming PolarFormer by a wide margin (mATE 0.559 vs. 0.657, mAOE 0.325 vs. 0.405).

It is worth noting that EPro-PnP is currently the only method among the entries in Table VI that utilizes geometric pose reasoning, which is not a popular choice because previous non-end-to-end geometric methods usually fall behind when trained on large-scale real-life data.

6.5 Visualizations

An example of the monocular detection result is shown in Figure 7. We observe that the red 2D points (indicating greater $x^{\text{3D}}$ in the X axis) are usually spread right over the objects and mainly determine the orientation, while the green 2D points (indicating greater $x^{\text{3D}}$ in the Y axis) lie off the top and bottom of the objects and determine the position (mainly the depth). It seems that the network learns to associate object depth with the height of the object's projection, since the height is invariant to the 1D orientation in the ground plane.

Figure 8 shows that the flexibility of EPro-PnP allows predicting multimodal distributions with strong expressive power, successfully capturing the orientation ambiguity without discrete multi-bin classification [44, 57] or a complicated mixture model [37].

6.6 Inference Time

The average inference time per frame (comprising a batch of 6 surrounding 1600×672 images, without TTA) is shown in Table VII, measured on an RTX 3090 GPU and a Core i9-10920X CPU. On average, the batch PnP solver takes 26 ms/46 ms processing 655.3 objects per frame, before non-maximum suppression (NMS).

TABLE VII: Inference time (sec) of the deformable correspondence network on nuScenes [4]. The PnP solver (including initialization) works faster (26 ms) with PyTorch v1.8.1, for which the code was originally developed, while the full model works faster (304 ms) with PyTorch v1.10.1.

PyTorch	Backbone & FPN	Heads: FCOS	Heads: Deform	PnP	Total
v1.8.1+cu111	0.194	0.074	0.029	0.026	0.328
v1.10.1+cu113	0.173	0.056	0.025	0.046	0.304

7 Limitations

Training the network with the Monte Carlo pose loss is inevitably slower than the baseline. With a batch size of 32 on a GTX 1080 Ti GPU, training the CDPN (without translation head) takes 143 seconds per epoch with the original coordinate regression loss, and 241 seconds per epoch with the Monte Carlo pose loss, which is about 70% longer. However, the training time can be controlled by adjusting the number of Monte Carlo samples or the number of 2D-3D corresponding points.

Although the underlying principles are theoretically generalizable to other learning models with nested optimization layers, known as declarative networks [38], the Monte Carlo pose loss would become impractical as the dimensionality grows.

While EPro-PnP seems to be a universal approach to end-to-end geometric pose estimation, it should be noted that the design of the 2D-3D correspondence network still plays a major role in the model. For example, simply removing the 2D box size from Figure 4 would result in a notable decrease in pose accuracy. Future work may explore the feature-metric correspondence in [7, 42, 58] as a more expressive alternative to plain Euclidean reprojection error.

8 Conclusion

This paper proposes the EPro-PnP, which translates the non-differentiable PnP operation into a differentiable probabilistic layer, empowering end-to-end 2D-3D correspondence learning of unprecedented flexibility. The connections to previous work [28, 12, 11, 10, 7] have been thoroughly discussed with theoretical and experimental proofs, revealing the contributions of the Monte Carlo KL loss and the derivative regularization loss. For application, EPro-PnP can be simply integrated into existing PnP-based networks, or inspire novel solutions such as the deformable correspondence network.

Acknowledgments

This project was supported by the National Natural Science Foundation of China [No. 52002285], the Shanghai Science and Technology Commission [No. 21ZR1467400], the original research project of Tongji University [No. 22120220593], and the National Key R&D Program of China [No. 2021YFB2501104]. Part of the work was done when H. Chen was interning at Alibaba Group, supported by the Alibaba Research Intern Program.

References

[1] T. Wang, X. Zhu, J. Pang, and D. Lin, “Probabilistic and geometric depth: Detecting objects in perspective,” in Conference on Robot Learning (CoRL), 2021.
[2] D. Park, R. Ambrus, V. Guizilini, J. Li, and A. Gaidon, “Is pseudo-lidar needed for monocular 3d object detection?” in ICCV, 2021.
[3] Y. Wang, V. Guizilini, T. Zhang, Y. Wang, H. Zhao, and J. Solomon, “Detr3d: 3d object detection from multi-view images via 3d-to-2d queries,” in Conference on Robot Learning (CoRL), 2021.
[4] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” in CVPR, 2020.
[5] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in CVPR, 2012.
[6] S. Hinterstoisser, S. Holzer, C. Cagniart, S. Ilic, K. Konolige, N. Navab, and V. Lepetit, “Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes,” in ICCV, 2011.
[7] S. Iwase, X. Liu, R. Khirodkar, R. Yokota, and K. M. Kitani, “Repose: Fast 6d object pose refinement via deep texture rendering,” in ICCV, 2021.
[8] S. Zakharov, I. Shugurov, and S. Ilic, “Dpod: 6d pose object detector and refiner,” in ICCV, 2019.
[9] E. Brachmann, A. Krull, S. Nowozin, J. Shotton, F. Michel, S. Gumhold, and C. Rother, “Dsac - differentiable ransac for camera localization,” in CVPR, 2017.
[10] E. Brachmann and C. Rother, “Learning less is more - 6d camera localization via 3d surface regression,” in CVPR, 2018.
[11] D. Campbell, L. Liu, and S. Gould, “Solving the blind perspective-n-point problem end-to-end with robust differentiable geometric optimization,” in ECCV, 2020.
[12] B. Chen, A. Parra, J. Cao, N. Li, and T.-J. Chin, “End-to-end learnable geometric vision by backpropagating pnp optimization,” in CVPR, 2020.
[13] S. Peng, Y. Liu, Q. Huang, X. Zhou, and H. Bao, “Pvnet: Pixel-wise voting network for 6dof pose estimation,” in CVPR, 2019.
[14] C. Song, J. Song, and Q. Huang, “Hybridpose: 6d object pose estimation under hybrid representations,” in CVPR, 2020.
[15] M. Rad and V. Lepetit, “BB8: A scalable, accurate, robust to partial occlusion method for predicting the 3d poses of challenging objects without using depth,” in ICCV, 2017.
[16] H. Wang, S. Sridhar, J. Huang, J. Valentin, S. Song, and L. J. Guibas, “Normalized object coordinate space for category-level 6d object pose and size estimation,” in CVPR, 2019.
[17] K. Park, T. Patten, and M. Vincze, “Pix2pose: Pixel-wise coordinate regression of objects for 6d pose estimation,” in ICCV, 2019.
[18] Z. Li, G. Wang, and X. Ji, “Cdpn: Coordinates-based disentangled pose network for real-time rgb-based 6-dof object pose estimation,” in ICCV, 2019.
[19] P. Li, H. Zhao, P. Liu, and F. Cao, “Rtm3d: Real-time monocular 3d detection from object keypoints for autonomous driving,” in ECCV, 2020.
[20] F. Chabot, M. Chaouch, J. Rabarisoa, C. Teulière, and T. Chateau, “Deep manta: A coarse-to-fine many-task network for joint 2d and 3d vehicle analysis from monocular image,” in CVPR, 2017.
[21] F. Manhardt, D. M. Arroyo, C. Rupprecht, B. Busam, N. Navab, and F. Tombari, “Explaining the ambiguity of object detection and 6d pose from visual data,” in ICCV, 2019.
[22] G. Schweighofer and A. Pinz, “Robust pose estimation from a planar target,” IEEE TPAMI, vol. 28, no. 12, pp. 2024–2030, 2006.
[23] J.-M. Cornuet, J.-M. Marin, A. Mira, and C. P. Robert, “Adaptive multiple importance sampling,” Scandinavian Journal of Statistics, vol. 39, no. 4, pp. 798–812, 2012.
[24] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in NIPS, 2017.
[25] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in ECCV, 2020.
[26] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in CVPR, 2018.
[27] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable detr: Deformable transformers for end-to-end object detection,” in ICLR, 2021.
[28] H. Chen, Y. Huang, W. Tian, Z. Gao, and L. Xiong, “Monorun: Monocular 3d object detection by reconstruction and uncertainty propagation,” in CVPR, 2021.
[29] A. Kendall and Y. Gal, “What uncertainties do we need in bayesian deep learning for computer vision?” in NIPS, 2017.
[30] Y. He, C. Zhu, J. Wang, M. Savvides, and X. Zhang, “Bounding box regression with uncertainty for accurate object detection,” in CVPR, 2019.
[31] S. Wu, C. Rupprecht, and A. Vedaldi, “Unsupervised learning of probably symmetric deformable 3d objects from images in the wild,” in CVPR, 2020.
[32] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in ICLR, 2014.
[33] I. Gilitschenski, R. Sahoo, W. Schwarting, A. Amini, S. Karaman, and D. Rus, “Deep orientation uncertainty learning based on a bingham loss,” in ICLR, 2020.
[34] O. Makansi, E. Ilg, O. Cicek, and T. Brox, “Overcoming limitations of mixture density networks: A sampling and fitting framework for multimodal future prediction,” in CVPR, 2019.
[35] C. M. Bishop, “Mixture density networks,” 1994.
[36] E. Brachmann, F. Michel, A. Krull, M. Y. Yang, S. Gumhold, and C. Rother, “Uncertainty-driven 6d pose estimation of objects and scenes from a single rgb image,” in CVPR, 2016.
[37] M. Bui, T. Birdal, H. Deng, S. Albarqouni, L. Guibas, S. Ilic, and N. Navab, “6d camera relocalization in ambiguous scenes via continuous multimodal inference,” in ECCV, 2020.
[38] S. Gould, R. Hartley, and D. J. Campbell, “Deep declarative networks,” IEEE TPAMI, 2021.
[39] D. E. Tyler, “Statistical analysis for the angular central gaussian distribution on the sphere,” Biometrika, vol. 74, no. 3, pp. 579–589, 1987. [Online]. Available: http://www.jstor.org/stable/2336697
[40] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
[41] V. Lepetit, F. Moreno-Noguer, and P. Fua, “Epnp: An accurate o(n) solution to the pnp problem,” International Journal Of Computer Vision, vol. 81, pp. 155–166, 2009.
[42] Y. Xu, K.-Y. Lin, G. Zhang, X. Wang, and H. Li, “Rnnpose: Recurrent 6-dof object pose refinement with robust correspondence field estimation and pose optimization,” in CVPR, 2022.
[43] G. Wang, F. Manhardt, F. Tombari, and X. Ji, “Gdr-net: Geometry-guided direct regression network for monocular 6d object pose estimation,” in CVPR, 2021.
[44] T. Wang, X. Zhu, J. Pang, and D. Lin, “FCOS3D: Fully convolutional one-stage monocular 3d object detection,” in ICCV Workshops, 2021.
[45] Z. Tian, C. Shen, H. Chen, and T. He, “Fcos: Fully convolutional one-stage object detection,” in CVPR, 2019.
[46] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei, “Deformable convolutional networks,” in CVPR, 2017.
[47] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in ICLR, 2019.
[48] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in ICCV, 2017.
[49] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” IEEE TPAMI, vol. 42, no. 2, pp. 318–327, 2020.
[50] S. H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. D. Reid, and S. Savarese, “Generalized intersection over union: A metric and A loss for bounding box regression,” in CVPR, 2019.
[51] E. Xie, Z. Yu, D. Zhou, J. Philion, A. Anandkumar, S. Fidler, P. Luo, and J. M. Alvarez, “M2bev: Multi-camera joint 3d detection and segmentation with unified birds-eye view representation,” 2022.
[52] F. Tosi, Y. Liao, C. Schmitt, and A. Geiger, “Smd-nets: Stereo mixture density networks,” in CVPR, 2021.
[53] Y. Liu, T. Wang, X. Zhang, and J. Sun, “Petr: Position embedding transformation for multi-view 3d object detection,” in ECCV, 2022.
[54] Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Y. Qiao, and J. Dai, “Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers,” in ECCV, 2022.
[55] Y. Jiang, L. Zhang, Z. Miao, X. Zhu, J. Gao, W. Hu, and Y.-G. Jiang, “Polarformer: Multi-camera 3d object detection with polar transformers,” in AAAI, 2023.
[56] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in ICCV, 2021.
[57] A. Mousavian, D. Anguelov, J. Flynn, and J. Kosecka, “3d bounding box estimation using deep learning and geometry,” in CVPR, 2017.
[58] Q. Lian, P. Li, and X. Chen, “Monojsg: Joint semantic and geometric cost volume for monocular 3d object detection,” in CVPR, 2022, pp. 1060–1069.
[59] S. Agarwal, K. Mierle, and Others, “Ceres solver,” http://ceres-solver.org.
[60] B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon, “Bundle adjustment: A modern synthesis,” in International Workshop on Vision Algorithms: Theory and Practice, 2000.
[61] E. Bingham, J. P. Chen, M. Jankowiak, F. Obermeyer, N. Pradhan, T. Karaletsos, R. Singh, P. Szerlip, P. Horsfall, and N. D. Goodman, “Pyro: Deep Universal Probabilistic Programming,” Journal of Machine Learning Research, 2018.
[62] I. S. Dhillon and S. Sra, “Modeling data using directional distributions,” 2003.

Appendix A Levenberg-Marquardt PnP Solver

For parallel processing on GPU, we have implemented a PyTorch-based batch Levenberg-Marquardt (LM) PnP solver. The implementation generally follows the Ceres solver [59]. Here, we discuss some important details that are related to the proposed Monte Carlo pose sampling and derivative regularization.

A.1 LM Step with Huber Kernel

Adding the Huber kernel influences every related aspect from the likelihood function to the LM iteration step and derivative regularization loss. Thanks to PyTorch’s automatic differentiation, the robustified Monte Carlo KL divergence loss does not require much special handling. For the LM solver, however, the residual $F(y)$ (concatenated weighted reprojection errors) and the Jacobian matrix $J$ have to be rescaled before computing the robustified LM step [60].

The rescaled residual block $\tilde{f}_{i}(y)$ and Jacobian block $\tilde{J}_{i}(y)$ of the $i$ -th point pair are defined as:

\tilde{f}_{i}(y)=\sqrt{\rho^{\prime}_{i}}f_{i}(y),

(17)

\tilde{J}_{i}(y)=\sqrt{\rho^{\prime}_{i}}J_{i}(y),

(18)

where

\rho^{\prime}_{i}=\begin{dcases}1,&\|f_{i}(y)\|\leq\delta,\\ \frac{\delta}{\|f_{i}(y)\|},&\|f_{i}(y)\|>\delta,\end{dcases}

(19)

J_{i}(y)=\frac{\partial f_{i}(y)}{\partial y^{\text{T}}}.

(20)

Following the implementation of Ceres solver [59], the robustified LM iteration step is:

\Delta y=-\left(\tilde{J}^{\text{T}}\tilde{J}+\lambda D^{2}\right)^{-1}\tilde{J}^{\text{T}}\tilde{F},

(21)

where

\tilde{J}=\begin{bmatrix}\tilde{J}_{1}(y)\\ \vdots\\ \tilde{J}_{N}(y)\end{bmatrix},\tilde{F}=\begin{bmatrix}\tilde{f}_{1}(y)\\ \vdots\\ \tilde{f}_{N}(y)\end{bmatrix},

(22)

$D$ is the square root of the diagonal of the matrix $\tilde{J}^{\text{T}}\tilde{J}$ , and $\lambda$ is the reciprocal of the LM trust region radius [59].

Note that the rescaled residual and Jacobian affect the derivative regularization, as well as the covariance estimation in the next subsection.
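The rescaling in Eqs. (17)-(22) can be illustrated with a minimal NumPy sketch for a single (non-batched) problem. The function names, array shapes, and the small constant guarding the division are our assumptions for illustration and do not mirror the released PyTorch implementation.

```python
import numpy as np

def huber_weight(f, delta):
    """rho'_i of Eq. (19) for residual blocks f of shape (N, 2)."""
    norm = np.linalg.norm(f, axis=-1)
    # The 1e-12 guard only avoids a division by zero in the unused branch.
    return np.where(norm <= delta, 1.0, delta / np.maximum(norm, 1e-12))

def robust_lm_step(f, J, delta, lam):
    """One robustified LM step, Eq. (21).

    f:   (N, 2) weighted reprojection errors f_i(y)
    J:   (N, 2, d) Jacobian blocks J_i(y)
    lam: reciprocal of the trust region radius
    """
    s = np.sqrt(huber_weight(f, delta))                    # sqrt(rho'_i), (N,)
    F_t = (s[:, None] * f).reshape(-1)                     # Eq. (17), stacked to (2N,)
    J_t = (s[:, None, None] * J).reshape(-1, J.shape[-1])  # Eq. (18), stacked to (2N, d)
    JtJ = J_t.T @ J_t
    D2 = np.diag(np.diag(JtJ))                             # D^2 = diag(J^T J)
    return -np.linalg.solve(JtJ + lam * D2, J_t.T @ F_t)
```

With a very large $\delta$ and $\lambda=0$, this step reduces to the ordinary least-squares solution of the linearized problem, which is a convenient sanity check.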

A.1.1 Fast Inference Mode

We empirically observe that in a well-trained model, the LM trust region radius can be initialized with a very large value, effectively rendering the LM algorithm redundant. We therefore use the simple Gauss-Newton implementation for fast inference:

\Delta y=-\left(\tilde{J}^{\text{T}}\tilde{J}+\varepsilon I\right)^{-1}\tilde{J}^{\text{T}}\tilde{F},

(23)

where $\varepsilon$ is a small value for numerical stability.

A.2 Covariance Estimation

During training, the concentration of the AMIS proposal is determined by the local estimation of pose covariance matrix $\Sigma_{y^{\ast}}$ , defined as:

\Sigma_{y^{\ast}}=\left(\tilde{J}^{\text{T}}\tilde{J}+\varepsilon I\right)^{-1}\Big\rvert_{y=y^{\ast}},

(24)

where $y^{\ast}$ is the LM solution that determines the location of the proposal distribution.
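Eqs. (23) and (24) share the same damped normal matrix, as the following NumPy sketch makes explicit. It assumes the rescaled Jacobian and residual have already been stacked into a matrix and a vector; the default value of `eps` is a placeholder, not the setting used in our experiments.

```python
import numpy as np

def damped_normal_matrix(J_t, eps):
    """J^T J + eps * I for the stacked rescaled Jacobian J_t of shape (M, d)."""
    return J_t.T @ J_t + eps * np.eye(J_t.shape[1])

def gauss_newton_step(J_t, F_t, eps=1e-5):
    """Eq. (23): Gauss-Newton step used in the fast inference mode."""
    return -np.linalg.solve(damped_normal_matrix(J_t, eps), J_t.T @ F_t)

def pose_covariance(J_t, eps=1e-5):
    """Eq. (24): local pose covariance, with J_t evaluated at y = y*."""
    return np.linalg.inv(damped_normal_matrix(J_t, eps))
```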

Appendix B Details on Monte Carlo Pose Sampling

B.1 Proposal Distribution for Position

For the proposal distribution of the translation vector $t\in\mathbb{R}^{3}$ , we adopt the multivariate t-distribution, with the following probability density function (PDF):

q_{\text{T}}(t)=\frac{\Gamma\left(\frac{\nu+3}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)\sqrt{\nu^{3}\pi^{3}|\Sigma|}}\left(1+\frac{1}{\nu}\|t-\mu\|_{\Sigma}^{2}\right)^{-\frac{\nu+3}{2}},

(25)

where $\|t-\mu\|_{\Sigma}^{2}=(t-\mu)^{\text{T}}\Sigma^{-1}(t-\mu)$ , with the location $\mu$ , the 3×3 positive definite scale matrix $\Sigma$ , and the degrees of freedom $\nu$ . Following [23], we set $\nu$ to 3. Compared to the multivariate normal distribution, the t-distribution has a heavier tail, which is ideal for robust sampling.

The multivariate t-distribution has been implemented in the Pyro [61] package.
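For readers who prefer not to depend on Pyro, the density in Eq. (25) and the corresponding sampler can be sketched in plain NumPy as follows. This is an illustrative stand-in rather than the code path used in our implementation; the sampler relies on the standard construction of a t-distributed vector as a Gaussian vector divided by the square root of a scaled chi-square variable.

```python
import math
import numpy as np

def mvt_logpdf(t, mu, Sigma, nu=3.0):
    """Log of Eq. (25) for points t of shape (..., 3)."""
    diff = t - mu
    maha = np.einsum('...i,ij,...j->...', diff, np.linalg.inv(Sigma), diff)
    _, logdet = np.linalg.slogdet(Sigma)
    log_norm = (math.lgamma((nu + 3) / 2) - math.lgamma(nu / 2)
                - 0.5 * (3 * math.log(nu * math.pi) + logdet))
    return log_norm - 0.5 * (nu + 3) * np.log1p(maha / nu)

def mvt_sample(rng, mu, Sigma, nu=3.0, size=1):
    """Draw `size` samples: mu + z / sqrt(u / nu), z ~ N(0, Sigma), u ~ chi2(nu)."""
    L = np.linalg.cholesky(Sigma)
    z = rng.standard_normal((size, 3)) @ L.T
    u = rng.chisquare(nu, size=(size, 1))
    return mu + z / np.sqrt(u / nu)
```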

B.1.1 Initial Parameters

The initial location and scale are determined by the PnP solution and covariance matrix, i.e., $\mu\leftarrow t^{\ast},\Sigma\leftarrow\Sigma_{t^{\ast}}$, where $\Sigma_{t^{\ast}}$ is the 3×3 translation submatrix of the full pose covariance $\Sigma_{y^{\ast}}$. Note that the actual covariance of the t-distribution is thus $\frac{\nu}{\nu-2}\Sigma_{t^{\ast}}$ (i.e., $3\Sigma_{t^{\ast}}$ for $\nu=3$), which is intentionally scaled up for robust sampling in a wider range.

B.1.2 Parameter Estimation from Weighted Samples

To update the proposal, we let the location $\mu$ and scale $\Sigma$ be the first and second moment of the weighted samples (i.e., weighted mean and covariance), respectively.
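A minimal NumPy sketch of this moment-matching update is given below. The function name and the `(K, 3)` sample layout are assumptions made for illustration.

```python
import numpy as np

def weighted_moments(samples, weights):
    """Weighted mean and covariance of samples (K, 3) with importance weights (K,)."""
    w = weights / weights.sum()            # self-normalized importance weights
    mu = w @ samples                       # first moment
    diff = samples - mu
    Sigma = (w[:, None] * diff).T @ diff   # second central moment
    return mu, Sigma
```

With uniform weights, the result coincides with the ordinary (biased) sample mean and covariance.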

B.2 Proposal Distribution for 1D Orientation

For the proposal distribution of the 1D yaw-only orientation $\theta$ , we adopt a mixture of von Mises and uniform distribution. The von Mises is also known as the circular normal distribution, and its PDF is given by:

q_{\text{VM}}(\theta)=\frac{\exp(\kappa\cos(\theta-\mu))}{2\pi I_{0}(\kappa)},

(26)

where $\mu$ is the location parameter, $\kappa$ is the concentration parameter, and $I_{0}(\cdot)$ is the modified Bessel function with order zero. The mixture PDF is thus:

q_{\text{mix}}(\theta)=(1-\alpha)q_{\text{VM}}(\theta)+\alpha q_{\text{uniform}}(\theta),

(27)

with the uniform mixture weight $\alpha$ . The uniform component is added in order to capture other potential modes under orientation ambiguity. We set $\alpha$ to a fixed value of $1/4$ .

PyTorch has already implemented the von Mises distribution, but its random sample generation is rather slow. As an alternative we use the NumPy implementation for random sampling.
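The mixture in Eq. (27) can be evaluated and sampled with NumPy alone, as in the sketch below. The function names are illustrative, and the direct use of `np.i0` is only suitable for moderate $\kappa$, since the exponential overflows for very concentrated proposals.

```python
import numpy as np

def mix_pdf(theta, mu, kappa, alpha=0.25):
    """Eq. (27): von Mises density mixed with a uniform density on the circle."""
    vm = np.exp(kappa * np.cos(theta - mu)) / (2 * np.pi * np.i0(kappa))
    return (1 - alpha) * vm + alpha / (2 * np.pi)

def mix_sample(rng, mu, kappa, alpha=0.25, size=1):
    """Draw from the mixture by picking a component per sample."""
    vm = rng.vonmises(mu, kappa, size=size)
    uni = rng.uniform(-np.pi, np.pi, size=size)
    return np.where(rng.random(size) < alpha, uni, vm)
```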

B.2.1 Initial Parameters

With the yaw angle $\theta^{\ast}$ and its variance $\sigma^{2}_{\theta^{\ast}}$ from the PnP solver, the parameters of the von Mises proposal are initialized by $\mu\leftarrow\theta^{\ast},\kappa\leftarrow\frac{1}{3\sigma^{2}_{\theta^{\ast}}}$.

B.2.2 Parameter Estimation from Weighted Samples

For the location $\mu$ , we simply adopt its maximum likelihood estimation, i.e., the circular mean of the weighted samples. For the concentration $\kappa$ , we first compute an approximated estimation [62] by:

\hat{\kappa}=\frac{\bar{r}(2-\bar{r}^{2})}{1-\bar{r}^{2}},

(28)

where $\bar{r}=\left\lVert\sum_{j}v_{j}[\sin{\theta_{j}},\cos{\theta_{j}}]^{\text{T}}/\sum_{j}v_{j}\right\rVert$ is the norm of the mean orientation vector, with the importance weight $v_{j}$ for the $j$-th sample $\theta_{j}$. Finally, the concentration is scaled down for robust sampling, such that $\kappa\leftarrow\hat{\kappa}/3$.
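The update of the von Mises parameters from weighted samples can be sketched as follows. The clipping of $\bar{r}$ away from 1 is a safeguard we add here against a division by zero when all samples coincide; it is an assumption of this sketch and not part of Eq. (28).

```python
import numpy as np

def estimate_von_mises(theta, v):
    """Circular mean and scaled-down concentration from samples theta and weights v."""
    w = v / v.sum()
    s, c = w @ np.sin(theta), w @ np.cos(theta)
    mu = np.arctan2(s, c)                          # circular mean of the weighted samples
    r = min(np.hypot(s, c), 1.0 - 1e-6)            # norm of the mean orientation vector
    kappa_hat = r * (2.0 - r**2) / (1.0 - r**2)    # Eq. (28)
    return mu, kappa_hat / 3.0                     # scaled down for robust sampling
```

As a worked example, two equally weighted samples at $0$ and $\pi/2$ give $\bar{r}=\sqrt{0.5}$, hence $\hat{\kappa}=3\sqrt{0.5}$ and $\kappa=\sqrt{0.5}$.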

B.3 Proposal Distribution for 3D Orientation

Regarding the quaternion-based parameterization of 3D orientation, which can be represented by a unit 4D vector $l$ , we adopt the angular central Gaussian (ACG) distribution as the proposal. The support of the 4-dimensional ACG distribution is the unit hypersphere, and the PDF is given by:

q_{\text{ACG}}(l)=\frac{(l^{\text{T}}\Lambda^{-1}l)^{-2}}{S_{4}|\Lambda|^{% \frac{1}{2}}},

(29)

where $S_{4}=2\pi^{2}$ is the 3D surface area of the 4D sphere, and $\Lambda$ is a 4×4 positive definite matrix.

The ACG density can be derived by integrating the zero-mean multivariate normal distribution $\mathcal{N}(0,\Lambda)$ along the radial direction from $0$ to $\infty$. Therefore, drawing samples from the ACG distribution is equivalent to sampling from $\mathcal{N}(0,\Lambda)$ and then normalizing the samples to unit radius.
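This sample-then-normalize construction and the density in Eq. (29) can be sketched in NumPy as follows; the function names are illustrative assumptions.

```python
import numpy as np

def acg_logpdf(l, Lam):
    """Log of Eq. (29) for unit quaternions l of shape (..., 4)."""
    q = np.einsum('...i,ij,...j->...', l, np.linalg.inv(Lam), l)  # l^T Lam^-1 l
    _, logdet = np.linalg.slogdet(Lam)
    return -2.0 * np.log(q) - np.log(2.0 * np.pi**2) - 0.5 * logdet

def acg_sample(rng, Lam, size=1):
    """Sample N(0, Lam) and project onto the unit hypersphere."""
    x = rng.multivariate_normal(np.zeros(4), Lam, size=size)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)
```

For $\Lambda=I$, the density reduces to the uniform value $1/S_{4}$ over the hypersphere.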

B.3.1 Initial Parameters

Consider $l^{\ast}$ to be the PnP solution and $\Sigma_{l^{\ast}}^{-1}$ to be the estimated 4×4 inverse covariance matrix. Note that $\Sigma_{l^{\ast}}^{-1}$ is only valid in the local tangent space with rank 3, satisfying ${l^{\ast}}^{\text{T}}\Sigma_{l^{\ast}}^{-1}l^{\ast}=0$ . The initial parameters are determined by:

\Lambda\leftarrow\hat{\Lambda}+\alpha|\hat{\Lambda}|^{\frac{1}{4}}I,

(30)

where $\hat{\Lambda}=\left(\Sigma_{l^{\ast}}^{-1}+I\right)^{-1}$, and $\alpha$ is a hyperparameter that controls the dispersion of the proposal for robust sampling. We set $\alpha$ to 0.001 in the experiments.
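The effect of Eq. (30) can be checked numerically. Since $\Sigma_{l^{\ast}}^{-1}l^{\ast}=0$, the matrix $\hat{\Lambda}$ keeps a unit eigenvalue along $l^{\ast}$, while its eigenvalues in the tangent directions are smaller than one, so the proposal concentrates around $\pm l^{\ast}$. The sketch below uses a hypothetical diagonal inverse covariance purely for illustration.

```python
import numpy as np

def acg_init(inv_cov, alpha=0.001):
    """Eq. (30): initial ACG parameter from the rank-3 inverse covariance (4, 4)."""
    Lam_hat = np.linalg.inv(inv_cov + np.eye(4))
    return Lam_hat + alpha * np.linalg.det(Lam_hat) ** 0.25 * np.eye(4)

# Hypothetical example: l* = (1, 0, 0, 0) with an isotropic tangent-space precision.
l_star = np.array([1.0, 0.0, 0.0, 0.0])
inv_cov = np.diag([0.0, 100.0, 100.0, 100.0])   # satisfies l*^T inv_cov l* = 0
Lam = acg_init(inv_cov)
eigvals, eigvecs = np.linalg.eigh(Lam)
principal_axis = eigvecs[:, -1]                  # eigenvector of the largest eigenvalue
```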

B.3.2 Parameter Estimation from Weighted Samples

Based on the samples $l_{j}$ and weights $v_{j}$ , the maximum likelihood estimation $\hat{\Lambda}$ is the solution to the following equation:

\hat{\Lambda}=\frac{4}{\sum_{j}v_{j}}\sum_{j}\frac{v_{j}l_{j}l_{j}^{\text{T}}}{l_{j}^{\text{T}}\hat{\Lambda}^{-1}l_{j}}.

(31)

The solution to Eq. (31) can be computed by fixed-point iteration [39]. The final parameters of the updated proposal are determined in the same way as in Eq. (30).
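A minimal sketch of such a fixed-point iteration is given below. Because the ACG density is invariant to the scale of $\Lambda$, Eq. (31) determines $\hat{\Lambda}$ only up to a positive factor; normalizing the trace to 4 after each iteration, as well as the iteration count, are choices made for this sketch and are not taken from [39] or from our released code.

```python
import numpy as np

def acg_fixed_point_map(l, w, Lam):
    """Right-hand side of Eq. (31) for samples l (K, 4) and normalized weights w (K,)."""
    q = np.einsum('ji,ik,jk->j', l, np.linalg.inv(Lam), l)   # l_j^T Lam^-1 l_j
    return 4.0 * (l * (w / q)[:, None]).T @ l

def acg_mle(l, v, n_iter=200):
    """Iterate Eq. (31) from the identity until (approximately) stationary."""
    w = v / v.sum()
    Lam = np.eye(4)
    for _ in range(n_iter):
        Lam = acg_fixed_point_map(l, w, Lam)
        Lam *= 4.0 / np.trace(Lam)   # fix the arbitrary scale (sketch-level choice)
    return Lam
```

The trace normalization does not alter the solution set: taking the trace of $\hat{\Lambda}^{-1}$ times both sides of Eq. (31) shows that any matrix reproduced by the map up to scale is reproduced exactly.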

Appendix C Details on Derivative Regularization Loss

As stated in the main paper, the derivative regularization loss $\mathcal{L}_{\text{reg}}$ consists of the position loss $\mathcal{L}_{\text{pos}}$ and the orientation loss $\mathcal{L}_{\text{orient}}$ .

For $\mathcal{L}_{\text{pos}}$ , we adopt the smooth L1 loss based on the Euclidean distance $d_{t}=\|t^{\ast}+\Delta t-t_{\text{gt}}\|$ , given by:

\mathcal{L}_{\text{pos}}=\begin{dcases}\frac{d_{t}^{2}}{2\beta},&d_{t}\leq\beta,\\ d_{t}-0.5\beta,&d_{t}>\beta,\end{dcases}

(32)

with the hyperparameter $\beta$ .

For $\mathcal{L}_{\text{orient}}$, we adopt the cosine similarity loss based on the angular distance $d_{\theta}$. For 1D orientation parameterized by the angle $\theta$, $d_{\theta}=\theta^{\ast}+\Delta\theta-\theta_{\text{gt}}$. For 3D orientation parameterized by the quaternion vector $l$, $d_{\theta}=2\arccos\left((l^{\ast}+\Delta l)^{\text{T}}l_{\text{gt}}\right)$. The loss function is therefore defined as:

\mathcal{L}_{\text{orient}}=1-\cos{d_{\theta}}.

(33)

For 3D orientation, after the substitution, the loss function can be simplified to:

\mathcal{L}_{\text{orient}}=2-2\left((l^{\ast}+\Delta l)^{\text{T}}l_{\text{gt}}\right)^{2}.

(34)

For the specific settings of the hyperparameter $\beta$ and loss weights, please refer to the experiment configuration code.
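The simplification from Eq. (33) to Eq. (34) follows from the double-angle identity $\cos(2\arccos x)=2x^{2}-1$. The NumPy sketch below implements Eqs. (32) and (34) for a single sample; the function names are illustrative, and the quaternion argument is assumed to be the already updated vector $l^{\ast}+\Delta l$.

```python
import numpy as np

def smooth_l1(d, beta):
    """Eq. (32): smooth L1 loss on the Euclidean distance d >= 0."""
    return np.where(d <= beta, d**2 / (2.0 * beta), d - 0.5 * beta)

def orient_loss_quat(l_pred, l_gt):
    """Eq. (34): cosine similarity loss for quaternions, l_pred = l* + delta_l."""
    return 2.0 - 2.0 * (l_pred @ l_gt) ** 2
```

Note that Eq. (34) is unchanged when the sign of either quaternion is flipped, which is consistent with $l$ and $-l$ representing the same rotation.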

Appendix D Details on the Deformable Correspondence Network

D.1 Network Architecture

The detailed network architecture of the deformable correspondence network is shown in Figure 9. Following deformable DETR [27], we adopt multi-head deformable sampling. Let $n_{\text{head}}$ be the number of heads and $n_{\text{hpts}}$ be the number of points per head; a total of $N=n_{\text{head}}n_{\text{hpts}}$ points are then sampled for each object. The sampling locations relative to the reference point are predicted from the object embedding by a single layer of linear transformation. We set $n_{\text{head}}$ to 8, which yields $256/n_{\text{head}}=32$ channels for the point features.

The point-level branch on the left side of Figure 9 is responsible for predicting the 3D points $x^{\text{3D}}_{i}$ and corresponding weights $w^{\text{2D}}_{i}$ . The sampled point features are first enhanced by the object-level context, by adding the reshaped head-wise object embedding to the point features. Then, the features of the $N$ points are processed by the self-attention layer, for which the 2D points are transformed into positional encoding. The attention layer is followed by standard layers of normalization, skip connection, and feedforward network (FFN).

Regarding the object-level branch on the right side of Figure 9, a multi-head attention layer is employed to aggregate the sampled point features. Unlike the original deformable attention layer [27] that predicts the attention weights by linear projection of the object embedding, we adopt the full Q-K dot-product attention with positional encoding. After being processed by the subsequent layers, the object-level features are finally transformed into the object-level predictions, consisting of the 3D localization score, weight scale, 3D bounding box size, and other optional properties (velocity and attribute). Note that the attention layer is actually not a necessary component for object-level predictions, but rather a byproduct of the deformable point samples whose features can be leveraged with little computation overhead.

D.2 Loss Functions for Object-Level Predictions

As in FCOS3D [44], we adopt smooth L1 regression loss for 3D box size and velocity, and cross-entropy classification loss for attribute. Additionally, a binary cross-entropy loss is imposed upon the 3D localization score, with the target $c_{\text{tgt}}$ defined as a score function of the position error:

c_{\text{tgt}}=\mathit{Score}(\|t^{\ast}_{\text{XZ}}-{t_{\text{XZ}}}_{\text{gt}}\|)=\max(0,\min(1,-a\log{\|t^{\ast}_{\text{XZ}}-{t_{\text{XZ}}}_{\text{gt}}\|}+b)),

(35)

where $t^{\ast}_{\text{XZ}}$ denotes the XZ components of the PnP solution, ${t_{\text{XZ}}}_{\text{gt}}$ denotes the XZ components of the true pose, and $a,b$ are the linear coefficients. The predicted 3D localization score $c_{\text{pred}}$ shall reflect the positional uncertainty of an object, as a faster alternative to evaluating the uncertainty via the Monte Carlo method during inference (Section D.5). The final detection score is defined as the product of the predicted 3D score and the classification score from the base detector.
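The score function in Eq. (35) is a clipped linear function of the log position error, as the following sketch shows. The coefficient values used in the example are hypothetical placeholders, and the lower bound on the error is added only to keep the logarithm finite.

```python
import numpy as np

def score(err, a, b):
    """Eq. (35): clipped linear function of the log XZ position error."""
    err = np.maximum(err, 1e-12)          # guard against log(0)
    return np.clip(-a * np.log(err) + b, 0.0, 1.0)

# Hypothetical coefficients, for illustration only.
example_scores = score(np.array([0.0, 1.0, 100.0]), a=0.5, b=0.5)
```

With these placeholder coefficients, an error of 1 maps to a score of 0.5, while very small and very large errors saturate at 1 and 0, respectively.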

D.3 Auxiliary Loss Functions

D.3.1 Auxiliary Correspondence Loss

To regularize the dense features, we append an auxiliary branch that predicts the multi-head dense 3D coordinates and corresponding weights, as shown in Figure 10. Leveraging the ground truth of object 2D boxes, the features within the box regions are densely sampled via RoI Align [48], and transformed into the 3D coordinates $x^{\text{3D}}$ and weights $w^{\text{2D}}$ via an independent linear layer. Besides, the attention weights $\phi$ are obtained via Q-K dot-product and normalized along the $n_{\text{head}}$ dimension and across the overlapping regions of multiple RoIs via Softmax.

During training, we impose the reprojection-based auxiliary loss for the multi-head dense predictions, formulated as the negative log-likelihood (NLL) of the Gaussian mixture model [35]. Following [28], the reprojection error is further robustified by the Huber kernel $\rho(\cdot)$ . The loss function for each sampled point is defined as:

\mathcal{L}_{\text{proj}}=-\log\sum_{\text{RoI}}\sum_{k=1}^{n_{\text{head}}}\phi_{k}\left|\mathop{\mathrm{diag}}{w_{k}^{\text{2D}}}\right|\exp\left(-\frac{1}{2}\rho\left(\|f_{k}(y_{\text{gt}})\|^{2}\right)\right),

(36)

where $k$ is the head index, and $f_{k}(y_{\text{gt}})$ is the weighted reprojection error of the $k$-th head at the ground-truth pose $y_{\text{gt}}$. In the above equation, the diagonal matrix $\mathop{\mathrm{diag}}{w_{k}^{\text{2D}}}$ is interpreted as the inverse square root of the covariance matrix of the normal distribution, i.e., $\mathop{\mathrm{diag}}{w_{k}^{\text{2D}}}=\Sigma^{-\frac{1}{2}}$, and the head attention weight $\phi_{k}$ is interpreted as the mixture component weight. $\sum_{\text{RoI}}$ is a special operation that takes the overlapping region of multiple RoIs into account, formulating a mixture of multiple heads and multiple RoIs (see code for details).

Another auxiliary loss is the coordinate regression loss that introduces the geometric knowledge. Following MonoRUn [28], we extract the sparse ground truth of 3D coordinates $x^{\text{3D}}_{\text{gt}}$ from the 3D LiDAR point cloud. The robustified Gaussian mixture NLL loss for each sampled point with available ground truth is defined as:

\mathcal{L}_{\text{crd}}=-\log\sum_{k=1}^{n_{\text{head}}}\phi_{k}w^{2}\exp\left(-\frac{1}{2}\rho\left(\left\|w\left(x^{\text{3D}}_{k}-x^{\text{3D}}_{\text{gt}}\right)\right\|^{2}\right)\right),

(37)

where $w\in\mathbb{R}^{+}$ is a scalar weight parameter (bounded by $\exp$ activation) to be optimized during training.
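For a single sampled point, Eq. (37) can be evaluated in the log domain for numerical stability, as sketched below. The Huber kernel is written as a function of the squared norm so that its derivative agrees with Eq. (19); this particular form, the function names, and the array shapes are assumptions of the sketch.

```python
import numpy as np

def huber(s, delta):
    """Huber kernel on the squared norm s; its derivative w.r.t. s matches Eq. (19)."""
    return np.where(s <= delta**2, s, 2.0 * delta * np.sqrt(s) - delta**2)

def coord_nll(phi, x_pred, x_gt, w, delta):
    """Eq. (37) for one point.

    phi:    (n_head,) mixture weights, summing to 1
    x_pred: (n_head, 3) per-head 3D coordinates
    x_gt:   (3,) ground-truth 3D coordinate
    w:      positive scalar weight
    """
    s = np.sum((w * (x_pred - x_gt)) ** 2, axis=-1)
    log_terms = np.log(phi) + 2.0 * np.log(w) - 0.5 * huber(s, delta)
    m = log_terms.max()                                  # log-sum-exp trick
    return -(m + np.log(np.exp(log_terms - m).sum()))
```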

As with the KL loss in the main paper, dynamic loss weights [28] are employed to rescale these two auxiliary loss functions.

D.3.2 Auxiliary Depth Loss

For each projected LiDAR point in the image, we extract the feature vector from the interpolated dense feature map, which is then fed into a small 2-layer MLP to predict the scene depth. The output depth is represented by a Gaussian mixture distribution encoded by $\left\{{\phi_{\text{D}}}_{k},z_{k}\right\}_{k=1}^{n_{\text{D}}}$, where the mixture weights ${\phi_{\text{D}}}_{k}$ are normalized via Softmax. Given the ground truth depth $z_{\text{gt}}$ of this point, the loss function is defined as:

\mathcal{L}_{\text{D}}=-\log\sum_{k=1}^{n_{\text{D}}}{\phi_{\text{D}}}_{k}w_{\text{D}}\exp\left(-\frac{1}{2}\rho\left(w_{\text{D}}^{2}\left(z_{k}-z_{\text{gt}}\right)^{2}\right)\right),

(38)

where $w_{\text{D}}\in\mathbb{R}^{+}$ is a weight parameter (bounded by $\exp$ activation) to be optimized during training.

D.4 Training Strategy

During training, we randomly sample 48 positive object queries from the FCOS3D [44] detector for each image, which limits the batch size of the deformable correspondence network to control the computation overhead of the Monte Carlo pose loss.

D.5 Experiments on the Uncertainty of Object Pose

The entropy of the inferred pose distribution reflects the aleatoric uncertainty of the predicted pose. Previous work [28] reasons about the pose uncertainty by propagating the reprojection uncertainty learned from a surrogate loss through the PnP operation, but that uncertainty requires calibration and is not reliable enough. In our work, the pose uncertainty is learned end-to-end with the KL-divergence-based pose loss, so it is supervised directly at the pose level rather than derived from a surrogate objective.

To quantitatively evaluate the reliability of the pose uncertainty in terms of measuring the localization confidence, a straightforward approach is to compute the 3D localization score $c_{\text{MC}}$ via Monte Carlo pose sampling, and compare the resulting mAP against the standard implementation with 3D score $c_{\text{pred}}$ predicted from the object-level branch. With the PnP solution $t^{\ast}$ , the sampled translation vector $t_{j}$ , and its importance weight $v_{j}$ , the Monte Carlo score is computed by:

c_{\text{MC}}=\frac{1}{\sum_{j}v_{j}}\sum_{j}v_{j}\mathit{Score}\left(\|t^{\ast}_{\text{XZ}}-{t_{\text{XZ}}}_{j}\|\right),

(39)

where the subscript $(\cdot)_{\text{XZ}}$ denotes taking the XZ components, and the function $\mathit{Score}(\cdot)$ is the same as in Eq. (35).
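Eq. (39) is a self-normalized importance-weighted average of the score over the pose samples. A minimal sketch is given below, assuming that translations are stored as (X, Y, Z) so that the XZ components are at indices 0 and 2, and that the score function is passed in as a callable; both conventions are assumptions of this sketch.

```python
import numpy as np

def mc_score(t_star, t_samples, v, score_fn):
    """Eq. (39): importance-weighted mean score of the sampled translations.

    t_star:    (3,) PnP solution
    t_samples: (K, 3) sampled translation vectors t_j
    v:         (K,) importance weights v_j
    score_fn:  callable mapping XZ errors (K,) to scores in [0, 1]
    """
    xz = [0, 2]   # assumed (X, Y, Z) ordering
    err = np.linalg.norm(t_samples[:, xz] - t_star[xz], axis=-1)
    return float(v @ score_fn(err) / v.sum())
```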

As shown in Table VIII, the mAP obtained via Monte Carlo scoring is on par with the standard implementation (0.393 vs. 0.392), indicating that the pose uncertainty is a reliable measure of the detection confidence.

TABLE VIII: Comparison between the scoring methods on the nuScenes validation set.

Scoring method	NDS↑	mAP↑	mATE↓	mAOE↓
Standard	0.463	0.392	0.626	0.282
Monte Carlo	0.463	0.393	0.623	0.286

Appendix E Notation

TABLE IX: A summary of frequently used notations.

Notation		Description
$x^{\text{3D}}_{i}$	$\in\mathbb{R}^{3}$	Coordinate vector of the $i$ -th 3D object point
$x^{\text{2D}}_{i}$	$\in\mathbb{R}^{2}$	Coordinate vector of the $i$ -th 2D image point
$w^{\text{2D}}_{i}$	$\in\mathbb{R}^{2}_{+}$	Weight vector of the $i$ -th 2D-3D point pair
$X$		The set of weighted 2D-3D correspondences
$y$		Object pose
$y_{\text{gt}}$		Ground truth of object pose
$y^{\ast}$		Object pose estimated by the PnP solver
$R$		3×3 rotation matrix representation of object orientation
$\theta$		1D yaw angle representation of object orientation
$l$		Unit quaternion representation of object orientation
$t$	$\in\mathbb{R}^{3}$	Translation vector representation of object position
$\Sigma_{y^{\ast}}$		Pose covariance estimated by the PnP solver
$J$		Jacobian matrix
$\tilde{J}$		Rescaled Jacobian matrix
$F$		Concatenated vector of weighted reprojection errors of all points
$\tilde{F}$		Concatenated vector of rescaled weighted reprojection errors of all points
$\pi(\cdot)$	$:\mathbb{R}^{3}\rightarrow\mathbb{R}^{2}$	Camera projection function
$f_{i}(y)$	$\in\mathbb{R}^{2}$	Weighted reprojection error of the $i$ -th correspondence at pose $y$
$r_{i}(y)$	$\in\mathbb{R}^{2}$	Unweighted reprojection error of the $i$ -th correspondence at pose $y$
$\rho(\cdot)$		Huber kernel function
$\rho^{\prime}_{i}$		The derivative of the Huber kernel function of the $i$ -th correspondence
$\delta$		The Huber threshold
$p(X|y)$		Likelihood function of object pose
$p(y)$		PDF of the prior pose distribution
$p(y|X)$		PDF of the posterior pose distribution
$t(y)$		PDF of the target pose distribution
$q(y),q_{t}(y)$		PDF of the proposal pose distribution (of the $t$ -th AMIS iteration)
$y_{j},y_{j}^{t}$		The $j$ -th random pose sample (of the $t$ -th AMIS iteration)
$v_{j},v_{j}^{t}$		Importance weight of the $j$ -th pose sample (of the $t$ -th AMIS iteration)
$i$		Index of 2D-3D point pair
$j$		Index of random pose sample
$t$		Index of AMIS iteration
$N$		Number of 2D-3D point pairs in total
$K$		Number of pose samples in total
$T$		Number of AMIS iterations
$K^{\prime}$		Number of pose samples per AMIS iteration
$n_{\text{head}}$		Number of heads in the deformable correspondence network
$n_{\text{hpts}}$		Number of points per head in the deformable correspondence network
$\mathcal{L}_{\text{KL}}$		KL divergence loss for object pose
$\mathcal{L}_{\text{tgt}}$		The component of $\mathcal{L}_{\text{KL}}$ concerning the reprojection errors at target pose
$\mathcal{L}_{\text{pred}}$		The component of $\mathcal{L}_{\text{KL}}$ concerning the reprojection errors over predicted pose
$\mathcal{L}_{\text{reg}}$		Derivative regularization loss

\begin{aligned}\mathcal{L}_{\text{KL}}&=\int_{Y}t(y)\left(\log{t(y)}-\log{p(y|X)}\right)\mathrm{d}y\\ &=-\int_{Y}t(y)\log{\frac{p(X|y)}{\int_{Y}p(X|y)\,\mathrm{d}y}}\,\mathrm{d}y+\mathit{const}\\ &=-\int_{Y}t(y)\log{p(X|y)}\,\mathrm{d}y+\log{\int_{Y}p(X|y)\,\mathrm{d}y}+\mathit{const}.\end{aligned}

(4)
