OfCaM: Global Human Mesh Recovery via Optimization-free Camera Motion Scale Calibration

Fengyuan Yang, Kerui Gu, Ha Linh Nguyen, Angela Yao
National University of Singapore
{fyang, keruigu, hlinhn, ayao}@comp.nus.edu.sg

Abstract

Accurate camera motion estimation is critical to estimate human motion in the global space. A standard and widely used method for estimating camera motion is Simultaneous Localization and Map** (SLAM). However, SLAM only provides a trajectory up to an unknown scale factor. Different from previous attempts that optimize the scale factor, this paper presents Optimization-free Camera Motion Scale Calibration (OfCaM), a novel framework that utilizes prior knowledge from human mesh recovery (HMR) models to directly calibrate the unknown scale factor. Specifically, OfCaM leverages the absolute depth of human-background contact joints from HMR predictions as a calibration reference, enabling the precise recovery of SLAM camera trajectory scale in global space. With this correctly scaled camera motion and HMR’s local motion predictions, we achieve more accurate global human motion estimation. To compensate for scenes where we detect SLAM failure, we adopt a local-to-global motion map** to fuse with previously derived motion to enhance robustness. Simple yet powerful, our method sets a new standard for global human mesh estimation tasks, reducing global human motion error by 60% over the prior SOTA while also demanding orders of magnitude less inference time compared with optimization-based methods.

1 Introduction

Human pose and shape estimation (also called Human Mesh Recovery, HMR) in world coordinates is a key component of many vision applications [24, 6]. There are many successful (local) HMR methods [10, 12, 15, 4, 17, 3, 20], but they work primarily in camera coordinates. Only a few world-coordinate HMR methods, i.e., global HMR, have been developed [38, 16, 29, 28]. Most of these approaches learn a local-to-global map** directly from a sequence of 3D (local) meshes, as the mesh sequence themselves provides a strong cue. Yet in some cases, there is ambiguity when the background is ignored. Consider, for example, a person riding a skateboard vs. standing on the ground, both have a similar local motion ¹¹1This work uses “motion” to refer to a sequence of human poses or meshes, or camera extrinsics over time. but totally different global motions (see Fig. 1(a)).

An observed (local) human mesh sequence is composed of the human motion in the global space relative to the camera motion. As such, global human motion can be formulated in terms of the local motion and the camera motion. Given that the camera-coordinate HMR is a mature area of research [17, 12, 4, 3, 20], decoupling the camera motion from the global motion is logical and feasible alternate solution [37, 14]. A typical approach to estimate camera motion is SLAM [5, 31, 30]. SLAM relies on the identification and continuous tracking of static environmental reference points to establish a spatial map and compute the camera’s trajectory relative to these landmarks. One limitation of SLAM is that it only estimates camera motion up to an unknown scale factor. This is typically resolved in robotics applications by integrating additional sensors (e.g., Inertial Measurement Units) or using calibration tools (e.g., checkerboard) to establish a metric scale. However, these solutions are not directly applicable to arbitrary videos for human motion analysis.

Therefore, some recent global HMR methods [37, 14] attempt to solve for the SLAM scale factor through optimization. The optimization, based on a loss function that evaluates the consistency between 3D human meshes projection and 2D video evidence, alongside smoothness constraints, jointly solves for the scale, human mesh, and camera trajectory. However, the inherent entanglement of human and camera motion makes it very challenging, sometimes leading to scale estimations that are off by a factor of several times, exemplified by the discrepancies between SLAHMR [37]’s and the ground true trajectory in Fig. 1(b). Another drawback is that the optimizations is very time-consuming; processing a one-minute video takes several minutes or longer (See Fig. 1(c)).

In this paper, we take a simple yet effective strategy of calibrating the scale based on the depth of key reference points. After perceiving the absolute depth of the reference point, we can solve the unknown scale and recover the whole camera motion. This solution gives a new direction to explicitly solve the scale of the estimated camera trajectory in an optimization-free manner. Notably, this optimization-free camera motion calibration takes much less time, which can be up to two or three orders of magnitude, compared with optimization-based methods (See Fig. 1(c)).

However, it still remains challenging to obtain the accurate absolute depth of the reference points, which commonly lie in the static background. What we have is the distance between the human mesh and the camera provided by local HMR methods. Can we utilize the foreground depth information to effectively calibrate the unknown scale of SLAM predicted camera trajectory? A key insight of this work is to select reference points closest to human-background contacts (feet in most cases) and use the predicted joint depth from HMR models as the depth of this reference point to directly calibrate the camera motion scale. By combining the accurately scaled camera motion with HMR’s local human motion predictions, we can readily compute the global human motion precisely (see Fig. 1(b)).

Refer to caption — (a) Camera $\circ$ Global = Local motion.

SLAM works well when reference points from the static background can be tracked. A typical setting in which SLAM fails is when the moving foreground takes up a majority of the scene [1, 2], which may happen when humans are too close to the camera. To that end, we design a SLAM failure indicator and revert to a local-to-global human motion map** to compensate when it indicates the failure as the local-to-global map** depends less on the background information.

To conclude, we propose Optimization-free Camera Motion Scale Calibration (OfCaM) to estimate the global human motion, which is an adaptive combination of SLAM-based human motion with additional motion cues from the local-to-global map**. Our experimental results demonstrate significantly lower error in world coordinates compared to baseline and existing methods, especially a remarkable 60% improvement in global human trajectory. Furthermore, our work reveals a mutual enhancement relationship between HMR models and camera motion estimation. This finding has the potential to spark further research on the integration of camera estimation and HMR techniques.

We highlight our key contributions as follows:

•

We propose an efficient optimization-free method to calibrate the unknown scale of SLAM-based camera motion by perceiving the depth of key reference points, which is much faster than optimization-based methods.
•

We select the contact point of the human and the background, feet in most scenarios as the key reference point, which effectively retrieve the absolute depth from the local HMR model and recover the camera trajectory.
•

We propose an adaptive and generalizable global motion framework that utilizes the local-to-global prior, ensuring robustness for both optimal and suboptimal SLAM conditions.
•

OfCaM achieves significant advanced results in global human and camera motion compared with baseline and previous SOTA methods, demonstrating our effectiveness.

2 Related Works

2.1 World Coordinate Human Mesh Recovery

Image-based HMR methods [10, 33, 36, 22, 26, 19, 18, 25, 27, 3] traditionally focus on recovering human meshes within the camera’s coordinate system. Although major video-based HMR advancements [12, 4, 20] also operate within same camera space, the advent of video data has paved the way for HMR exploration in world coordinates [38, 16, 9, 37, 29, 28]. The transition to world coordinates introduces the distinct challenge of disentangling both dynamic camera motion and human motion, which is a relatively nascent research area. While most previous attempts (e.g., GLAMR [38], DnD [16], TRACE [29], and WHAM [28]) proposed to infer global motion from observable local behaviors (e.g., if a person looks like they are walking, it is assumed they are moving forwards globally), the inherent ambiguities of local-to-global map** present significant limitations. In contrast, our approach does not solely depend on these dataset-learned local-to-global priors but rather employs them as additional cues to enhance accuracy.

Recent efforts including SLAHMR [37] and PACE [14] recognize the utility of background information in determining camera motion with SLAM techniques [23, 30, 31]. However, these methods aim to address the ‘unknown scale’ problem in SLAM outputs by jointly optimizing scale, pose, and shape parameters—a procedure that is inherently ambiguous. Our method, by contrast, deviates from these intensive optimization strategies by calibrating the scale factor directly, utilizing the depth predictions from HMR models, and thereby giving an optimization-free solution.

2.2 Camera Calibration

Camera calibration is a fundamental procedure in robotics and computer vision that enables precise spatial measurement and scene reconstruction. Typically, this process relies on additional sensors, such as Inertial Measurement Units (IMUs) [39, 8], or reference markers like checkerboards [7] to define a known metric scale. However, these traditional calibration methods are not feasible for arbitrary human-centric videos within the HMR domain, due to the absence of external sensors and standardized calibration tools. In contrast, single-view metrology [40] suggests that objects with well-defined geometrical priors can themselves act as natural calibration references. Motivated by this concept, HMR models, with their inherent human geometric priors, have the potential to serve as surrogate calibration devices. In our work, we utilize the predicted absolute joint depths from the HMR model as the reference to accurately and efficiently calibrate the unknown scale factor.

3 Preliminaries

3.1 Human Motion in Camera Coordinates $\mathcal{M}_{c}$

The 3D human motion from a video $I={\{I_{t}\}}_{t=1}^{T}$ of $T$ frames can be represented in the camera space by a $T$ -length sequence of SMPL parameters $\mathcal{M}_{c}={\{\boldsymbol{\theta}_{t},\boldsymbol{\beta}_{t},\boldsymbol{% \psi}_{t},\boldsymbol{\tau}_{t}\}}_{t=1}^{T}$ . SMPL [21] is a widely used 3D statistical model of the human body. For a given frame at time $t$ , the SMPL model maps body pose $\boldsymbol{\theta}_{t}\in\mathbb{R}^{23\times 3}$ , shape parameters $\boldsymbol{\beta}_{t}\in\mathbb{R}^{10}$ , root orientation $\boldsymbol{\psi}_{t}\in\mathbb{R}^{3}$ , and root translation $\boldsymbol{\tau}_{t}\in\mathbb{R}^{3}$ to a 3D mesh of the human body $\mathbf{V}_{t}\!\in\!\mathbb{R}^{6890\times 3}$ in the camera space. The individual joints can be mapped from the SMPL parameters with the function $\mathbf{J}=\mathcal{J}(\boldsymbol{\theta}_{t},\boldsymbol{\beta}_{t},% \boldsymbol{\psi}_{t},\boldsymbol{\tau}_{t})\in\mathbb{R}^{24\times 3}$ .

Note that $\boldsymbol{\psi}_{t}$ and $\boldsymbol{\tau}_{t}$ are sometimes referred to as “global” orientation and translation parameters by the SMPL model. However, HMR models [10, 15, 12, 4, 13, 35] estimate these parameters with respect to the camera extrinsics $\mathcal{E}$ at frame $t$ , where $\mathcal{E}={\{{\boldsymbol{R}}_{t},{\boldsymbol{T}}_{t}\}}_{t=1}^{T}$ is the sequence of camera rotations ${\boldsymbol{R}}_{t}\in\mathbb{R}^{3\times 3}$ and translations ${\boldsymbol{T}}_{t}\in\mathbb{R}^{3}$ .

3.2 Human Motion in World Coordinates $\mathcal{M}_{w}$

Unlike $\mathcal{M}_{c}$ , the human motion in the world coordinates $\mathcal{M}_{w}={\{\boldsymbol{\theta}_{t},\boldsymbol{\beta}_{t},\boldsymbol{% \Phi}_{t},\boldsymbol{\Gamma}_{t}\}}_{t=1}^{T}$ is the motion within an absolute global space ²²2By convention, the world space is defined by the camera extrinsics parameters of the very first frame. and independent of camera extrinsics $\mathcal{E}_{t}$ . Recall the classical perspective projection, local root orientation and translation $\{\boldsymbol{\psi}_{t},\boldsymbol{\tau}_{t}\}$ is obtained by applying camera extrinsics $\mathcal{E}_{t}$ to the global orientation $\boldsymbol{\Phi}_{t}\in\mathbb{R}^{3}$ and global translation $\boldsymbol{\Gamma}_{t}\in\mathbb{R}^{3}$ in world coordinates:

\boldsymbol{\psi}_{t}=\boldsymbol{R}_{t}\boldsymbol{\Phi}_{t};\quad\boldsymbol% {\tau}_{t}=\boldsymbol{R}_{t}\boldsymbol{\Gamma}_{t}+\boldsymbol{T}_{t}.

(1)

Thus, to obtain the global human motion from the local motion estimated by HMR models, one can apply the inverse of the camera extrinsics to the local root orientation and translation:

\boldsymbol{\Phi}_{t}=\boldsymbol{R}_{t}^{T}\boldsymbol{\psi}_{t};\quad% \boldsymbol{\Gamma}_{t}=\boldsymbol{R}_{t}^{T}(\boldsymbol{\tau}_{t}-% \boldsymbol{T}_{t}).

(2)

This equation explains how we decouple the global motion from the local estimation by isolating and removing the camera motion, which serves as one of our key insights. Nevertheless, getting the correct camera extrinsics $\mathcal{E}$ can be difficult. SLAM is a widely used method to estimate camera motion [23, 30, 31], though it can only predict the camera extrinsics ${\{\boldsymbol{R}_{t},s\cdot\boldsymbol{T}_{t}\}}_{t=1}^{T}$ up to an unknown scale $s$ . Our work focus on how to calibrate the scale $s$ based on the recovered human mesh.

4 Method

Our pipeline is illustrated in Fig. 2. For global HMR, the inputs are typically captured by moving cameras featuring static background content and dynamic foreground content of humans. Prior approaches [38, 16, 29] infer global human motion $\mathcal{M}_{w}$ exclusively from foreground’s local motion $\mathcal{M}_{c}$ . In contrast, recent attempts [37, 14] jointly optimize global human motion $\mathcal{M}_{w}$ and the SLAM dereived camera motion $\mathcal{E}$ to fit with the 2D observation. Different from their complex optimization, we propose an optimization-free way to calibrate the scale by comparing the depth of some key reference points from the output of SLAM and HMR, where we select the human-background contact joints as the reference points. (Sec. 4.2). Furthermore, to resolve the cases of problematic SLAM output, we introduce the global human motion refinement via fusing local motion priors (Sec. 4.2).

4.1 Scale Calibration

SLAM predicts camera motions up to an arbitrary scale factor $s$ . Similar to the traditional camera calibration strategies (e.g., with checkerboard patterns), this work estimates $s$ based on the ratio of absolute versus relative distance to the camera for some reference point $p$ , i.e.,

s_{p}=d^{A}_{p}/d^{S}_{p},

(3)

where $d^{A}_{p}$ and $d^{S}_{p}$ denote the absolute real-world distance from camera to $p$ and the corresponding relative SLAM depth respectively.

Reference Joint Selection. Standard HMR models estimate the depth of human joints with respect to the camera, based on large-scale training data and human size priors encoded in the (e.g. SMPL) model. Estimating scale $s$ based on human joints then, requires the relative depth of the corresponding joints from SLAM. However, SLAM typically relies on tracking and matching static reference points within the scene to estimate camera motion. Many of the joints on an arbitrarily moving human body are outliers disregarded by SLAM. Therefore, we propose to use contact points from the human and the background, (i.e., the feet in most cases) as the reference point $p$ . Experimentally, the feet are verified as the most reliable references (see Sec. 5.2) for scale calibration.

Absolute Distance Derived from HMR Model. By definition, the camera serves as the origin of the camera space, so the reference joint’s absolute distance with respect to the camera is given by:

d^{A}_{p}=\left\|\mathbf{J}_{p}\right\|_{2},

(4)

where $\mathbf{J}_{p}$ is the 3D position of joint $p$ computed using the HMR model $\mathbf{J}=\mathcal{J}(\boldsymbol{\theta}_{t},\boldsymbol{\beta}_{t},% \boldsymbol{\psi}_{t},\boldsymbol{\tau}_{t})$ .

Relative Distance Derived from SLAM. SLAM methods estimate the camera motion by tracking and matching a set of keypoints or patches $K$ and maintaining depth maps relative to SLAM’s coordinate. For each keypoint $k\in K$ , located at the 2D position $\mathbf{x}_{k}\in\mathbb{R}^{2}$ in the image plane, there is an associated depth $z_{k}\in\mathbb{R}$ . As shown in Fig. 3(a), the relative distance $d^{S}_{p}$ of the reference joint $p$ with respect to the camera can be estimated based on the nearest corresponding SLAM keypoints $k^{*}$ :

d^{S}_{p}=z_{k^{*}}\quad\text{where}\ k^{*}=\operatorname*{arg\,min}_{k\in K}% \left\|\mathbf{x}_{k}-\pi\left(\mathbf{J}_{p}\right)\right\|_{2}.

(5)

Above, $\pi\left(\mathbf{J}_{p}\right)$ denotes the 3D contact joint $\mathbf{J}_{p}$ projected into the camera plane with projection function $\pi$ . For stability, we reject correspondences that are too far away, i.e., if the closest patch to the projected joint $\left\|\mathbf{x}_{k^{*}}-\pi\left(\mathbf{J}_{p}\right)\right\|_{2}$ exceeds some distance threshold $\delta$ .

Sequence Scale Factor. Taking the results of Eq. 4 and Eq. 5 into Eq. 3, we can obtain the scale factor for current single keyframe $t$ . To account for noise in finding the nearest tracked keypoint as well as the SLAM depth map, we take the median of the scale factors across all keyframes as the final scale factor $\bar{s}=\text{median}(\{s(t)\mid t\in I\})$ for the whole sequence $I$ , ensuring stability and robustness.

After calibrating the scale as $\bar{s}$ , the absolute camera extrinsics is ${\{\boldsymbol{R}_{t},\bar{s}\cdot\boldsymbol{T}_{t}\}}_{t=1}^{T}$ . By applying them to HMR’s predicted human motion $\mathcal{M}_{c}$ , we can finally get the SLAM-derived global motion:

\mathcal{M}^{S}_{w}={\{\boldsymbol{\theta}_{t},\ \boldsymbol{\beta}_{t},\ % \boldsymbol{\Phi}_{t}=\boldsymbol{R}_{t}^{T}\boldsymbol{\psi}_{t},\ % \boldsymbol{\Gamma}_{t}=\boldsymbol{R}_{t}^{T}(\boldsymbol{\tau}_{t}-\bar{s}% \cdot\boldsymbol{T}_{t})\}}_{t=1}^{T},

(6)

where $\boldsymbol{\theta}_{t},\boldsymbol{\beta}_{t}$ are from HMR’s prediction, and $\boldsymbol{\Phi}_{t},\boldsymbol{\Gamma}_{t}$ are calculated by camera-to-world translation (i.e., Eq. 2) from HMR’s prediciton $\boldsymbol{\psi}_{t},\boldsymbol{\tau}_{t}$ in camera space.

4.2 Local-to-global Motion Adjustment

As previously mentioned, SLAM can be prone to failure in challenging circumstances, such as when input images are dominated by the dynamic foreground human there are too few informative reference points for matching and tracking. In such scenarios, inaccurate camera motion estimation by SLAM can further impact the derived human motion in world coordinates by Eq. 2. To mitigate complete reliance on SLAM, we propose to use the local-to-global motion prior to compensating for scenarios where SLAM falls short.

Local-to-global Human Motion. The previous works [38, 16, 29] in global HMR are devoted to learning the global motion from local motion priors. In our work, we adopt a lightweight sub-module from GLAMR [38] as the local-to-global motion predictor. The input is a sequence of the body pose $\{\boldsymbol{\theta}_{t}\}$ and the output is the sequence of human’s global orientation $\{\tilde{\boldsymbol{\Phi}}_{t}\}$ and global translation $\{\tilde{\boldsymbol{\Gamma}}_{t}\}$ both relative to the first frame. Thus, we can get the global motion derived by local-to-global predictor:

\mathcal{M}^{L}_{w}={\{\boldsymbol{\theta}_{t},\ \boldsymbol{\beta}_{t},\ % \tilde{\boldsymbol{\Phi}}_{t},\ \tilde{\boldsymbol{\Gamma}}_{t}\}}_{t=1}^{T},

(7)

where $\mathcal{M}^{L}_{w}$ denotes the global motion in world space derived by local-to-global prediction.

SLAM Failure Indicator. When SLAM performs well, the scale factor $s$ over the whole sequence will exhibit a very small standard deviation. Therefore, we use standard deviation $\sigma(\{s(\cdot)\})$ as a SLAM failure indicator to identify the set of segments $\mathcal{S}=\{\hat{I}\subseteq I\mid\sigma(\{s(t)\mid t\in\hat{I}\})>\upsilon\}$ , each segement $\hat{I}$ exhibiting significant disagreement in the calculated scale factor. This segment-wise manner is necessary because SLAM may fail in certain segments rather than the entire sequence.

Segment-wise Adaptive Global Motion Fusion. Since when SLAM failed, the scale is no longer reliable but may retain potentially useful shape information, we opt to fuse the local-to-global motion $\mathcal{M}^{L}_{w}(\hat{I})$ with SLAM-derived motion $\mathcal{M}^{S}_{w}(\hat{I})$ as the final motion for failed segments $\hat{I}\in\mathcal{S}$ . Thus, we first align the SLAM-derived motion to the local-to-global motion by Umeyama’s method [32]: $U(\mathcal{M}^{S}_{w}(\hat{I}),\mathcal{M}^{L}_{w}(\hat{I}))={\{\boldsymbol{% \theta}_{t},\ \boldsymbol{\beta}_{t},\ {\boldsymbol{\Phi}}^{\prime}_{t},\ {% \boldsymbol{\Gamma}}^{\prime}_{t}\}}_{t\in\hat{I}}$ where $U(a,b)$ is Umeyama’s method to align points set $a$ to points set $b$ . The fused global motion is the weighted average of these two motions:

\mathcal{M}^{F}_{w}(\hat{I})={\{\boldsymbol{\theta}_{t},\ \boldsymbol{\beta}_{% t},\ \lambda\tilde{\boldsymbol{\Phi}}_{t}+(1-\lambda){\boldsymbol{\Phi}}^{% \prime}_{t},\ \lambda\tilde{\boldsymbol{\Gamma}}_{t}+(1-\lambda){\boldsymbol{% \Gamma}}^{\prime}_{t}\}}_{t\in\hat{I}},

(8)

where the weight $\lambda$ is calculated by the Softmax of the standard deviation $\sigma(\{s(t)\mid t\in\hat{I}\})$ , a higher outlier score means lower weight given to SLAM-derived motion during weighted fusion.

4.3 Final Human Motion and Camera Motion

Upon identifying the failure segments $\mathcal{S}$ , we selectively update these segments by integrating local-to-global motion as mentioned above, thus we achieved the final global human motion $\mathcal{M}_{w}$ :

\mathcal{M}_{w}(\hat{I})=\begin{cases}\mathcal{M}^{F}_{w}(\hat{I})&\text{if }% \hat{I}\in\mathcal{S},\\ \mathcal{M}^{S}_{w}(\hat{I})&\text{otherwise}.\end{cases}

(9)

Consequently, this rectified global human motion allows for an update to the camera motion by algebraically reformulating Eq. 2 to express the camera motion in terms of local and global human motion.

5 Experiments

5.1 Implementation Details, Dataset and Metrics

Our experiments ³³3The code will be released upon acceptance adopt DPVO [31] as the SLAM model for the camera motion stream and CLIFF [17] as the HMR model for the human motion stream. The distance threshold $\delta=400px$ for outlier rejection and the standard deviation threshold $\upsilon=2$ for SLAM failure segment identification.

Datasets. Following previous works [28], we evaluate the global human motion and camera motion on a subset of EMDB [11] (EMDB 2), which contains 25 sequences captured by the dynamic camera and provides ground truth global motion for both human and camera.

Metrics for Human Motion. Same with previous works [14, 37, 28], we evaluate human’s global motion error by: (1) WA-MPJPE which is the average Euclidean distance between the ground truth and the predicted joint positions (i.e., MPJPE) after aligning each segment for every 100 frames; (2) W-MPJPE which is the MPJPE error after only aligning the first two frames of each 100-frames segments with the ground truth. (3) RTE which is the human’s root translation error of the whole sequence after the rigid alignment. We also evaluate local mesh error by (4) PA-MPJPE which is the MPJPE error after Procrustes aligned with ground truth.

Metrics for Camera Motion. We follow SLAM convention and previous works [14] for camera motion evaluation, reporting (1) ATE which is the Average Translation Error after rigidly aligning the camera trajectories; (2) ATE-S which is Average Translation Error without Scale alignment, providing a more accurate reflection of inaccuracies in the captured scale of the scene.

5.2 Ablation Study and Analysis

Table 1: Ablation studies on the impact of our proposed scale calibration and local-to-global adjustment on the error of global human and camera motion. ‘L2G’ denotes local-to-global.

Ablation		Global Human Motion			Global Camera Motion
Scale	L2G	WA-MPJPE $\downarrow$	W-MPJPE $\downarrow$	RTE $\downarrow$	ATE $\downarrow$	ATE-S $\downarrow$
✗	✗	335.53	833.11	9.61	0.72	6.30
✗	✓	280.25	759.56	7.68	-	-
✓	✗	111.29	347.60	2.41	0.72	1.33
✓	✓	108.24	317.88	2.21	0.71	1.25

Table 2: Comparative analysis of scale calibration performance using different reference joints. Results indicate that human-background contact joints such as feet served as a better choice.

Reference Joint	Global Human Motion			Global Camera Motion
Reference Joint	WA-MPJPE $\downarrow$	W-MPJPE $\downarrow$	RTE $\downarrow$	ATE $\downarrow$	ATE-S $\downarrow$
Head	369.11	979.07	9.65	2.13	6.77
Pelvis	292.44	753.64	8.23	1.39	5.51
Feet	108.24	317.88	2.21	0.71	1.25

Scale Calibration. Tab. 1 shows the impact of fixing the SLAM scale to the initial scale of SLAM output (first row) vs. scaling the camera motion (third and fourth row). This comparison reveals that our scale calibration is effective on both human motion (left part) and camera motion (right part). Additionally, Fig 7 shows the SLAM output indeed facing the unknown scale problem in the first place but after our scale calibration, the camera trajectory becomes fitter with the ground truth.

Local-to-global Refinement. Comparing the third row and last row of Tab. 1, L2G can successfully refine the human motion estimation derived from failed SLAM outputs (left part), and improved human motion estimations concurrently yield more accurate camera motion (right part). This improvement is more pronounced when evaluated on challenging sequences with human occupancy exceeding 40% of the image area. As shown in Tab. 4, our L2G module achieves a 10% improvement in WA-MPJPE and 30% improvement in W-MPJPE.

Importance of Camera Motion. Tab. 1 also shows the performance when we bypass the camera motion stream and directly use the result of L2G as the global human motion (second row) vs. using scale-calibrated camera motion to decouple global human motion (third row). The better performance of the latter highlights that ambiguity in local-to-global motion prediction is inherent, camera motion is essential, and we can better decouple by an effective scale calibration.

Reference Joint Selection. Tab. 2 shows that performance drops for both human and camera motion when we use non-contact joints as the reference points, e.g. the head joint (first row) or root joint (second row). As the feet are consistently proximate to the ground surface, they are more stable and reliable reference points for scale calibration. Specifically, Fig 4 further demonstrate pelvis joints show larger scale error than foot joints since dynamic humans are hard to capture by SLAM. Furthermore, an inverse correlation exists between scale errors of left and right feet, with growth in left feet corresponding to a contraction in right. This scale error trend is also consistent with the contact feet as shown in the corresponding image frames, a foot will have less scale error when it contacts with the ground. This further verified our motivation for choosing the contact joint of the foreground and background as the reference point.

Table 3: SOTA comparison of local human motion and global human motion on EMDB2 dataset.

Models	EMDB2
Models	PA-MPJPE $\downarrow$	W-MPJPE $\downarrow$	WA-MPJPE $\downarrow$	RTE $\downarrow$
GLAMR [38]	56.0	726.6	280.8	16.7
TRACE [29]	58.0	1702.3	529.0	18.9
SLAHMR [37]	61.5	776.1	326.9	10.2
WHAM(w/ DPVO) [28]	41.9	354.8	135.6	6.0
OfCam (Ours)	53.7	317.9	108.2	2.2

5.3 Comparison with the State-of-the-art

Tab. 3 compares our approach with SOTA world coordinate human mesh recovery methods. Our Method demonstrates a significant enhancement on global human motion metrics over WHAM [28] (about 10% improvement in W-MPJPE, 20% improvement in WA-MPJPE, 60% improvement in RTE). This notable improvement, especially in RTE, which evaluates the entire motion trajectory, is attributed to our accurate camera motion scale calibration. Our calibration effectively and reliably decouples human motion from the camera motion.

As discussed in the Related Works section, the previous methods can divided into local-to-global methods and optimization-based camera motion methods. Here we compare our method with both to further demonstrate our strength.

Human Global Translation Ambiguity. The large global trajectory error of Local-to-global Methods (see RTE of TRACE and GLAMR in Tab. 3) demonstrates the difficulties of those Local-to-global Methods to handle long-distance trajectories. The deeper reason for this is the ambiguity when derive global translation only based on the local motion. As shown in the first example in Fig. 5, it’s hard to infer the global translation from a "standing" local pose when a man is skateboarding.

Time Complexity. We compare the running time between our method and scale-optimization methods (such as SLAHMR [37] and PACE [14]) in Fig. 1(c). Excluding the SLAM running time, SLAHMR takes over 200 minutes per 1000 frames, and PACE takes 8 minutes per 1000 frames for optimization. In contrast, our approach requires significantly less time (2.5 seconds per 1000 frames) as it is optimization-free. Furthermore, we achieve a better scale compared to SLAHMR, despite the latter’s long optimization process. This illustrates that our scale-calibration is not only time-efficient but also delivers strong performance.

Global Human Motion Visualization. We also compare our visualization result with other methods as shown in Fig. 6. The visualization clearly demonstrates that our method produces outcomes that are not only more natural-looking but also better aligned with the ground truth.

6 Conclusion & Limitations

This paper proposes OfCaM, which uses HMR’s absolute depth prediction as a tool to calibrate the unknown scale of SLAM. By utilizing human-background contacts as the calibration reference, OfCaM effectively and efficiently recovers the camera motion. With the accurately isolated camera motion, OfCaM enhances the decoupling of global human motion from video observations. Additionally, we leverage local-to-global priors to rectify instances where SLAM outputs may fail.

Currently, our work has two limitations. First is the body-pose accuracy (see PA-MPJPE error in Table 3). However, our framework is compatible with any HMR model so more advanced methods can be integrated. This is beyond our current scope of accurate recovery of human meshes in world coordinates rather than optimizing local pose metrics. Secondly, like previous work [28], our evaluation of global human and camera motion is limited to the EMDB dataset, as it is the only dataset specifically designed for the global human and camera motion task. Others either lack annotations for world frames (3DPW [34]) or have incomplete data and or code release (HCM [14]).

References

Bescos et al. [2018] Berta Bescos, José M Fácil, Javier Civera, and José Neira. Dynaslam: Tracking, map**, and inpainting in dynamic scenes. IEEE Robotics and Automation Letters, 3(4):4076–4083, 2018.
Cadena et al. [2016] Cesar Cadena, Luca Carlone, Henry Carrillo, Yasir Latif, Davide Scaramuzza, José Neira, Ian Reid, and John J Leonard. Past, present, and future of simultaneous localization and map**: Toward the robust-perception age. IEEE Transactions on robotics, 32(6):1309–1332, 2016.
Cho et al. [2023] Hanbyel Cho, Yooshin Cho, Jaesung Ahn, and Junmo Kim. Implicit 3d human mesh recovery using consistency with pose and shape from unseen-view. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 21148–21158, 2023.
Choi et al. [2021] Hongsuk Choi, Gyeongsik Moon, Ju Yong Chang, and Kyoung Mu Lee. Beyond static features for temporally consistent 3d human pose and shape from a video. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
Durrant-Whyte and Bailey [2006] Hugh Durrant-Whyte and Tim Bailey. Simultaneous localization and map**: part i. IEEE Robotics and Automation Letters, 13(2):99–110, 2006.
Han et al. [2022] D.-I. D. Han, Y. Bergs, and N. Moorhouse. Virtual reality consumer experience escapes: preparing for the metaverse. Virtual Reality, 26(4):1443–1458, 2022.
Hartley and Zisserman [2003] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
Heikkilä and Silvén [1997] Janne Heikkilä and Olli Silvén. A four-step camera calibration procedure with implicit image correction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1106–1112. IEEE, 1997.
Henning et al. [2022] Dorian F Henning, Tristan Laidlow, and Stefan Leutenegger. Bodyslam: Joint camera localisation, map**, and human motion tracking. In European Conference on Computer Vision (ECCV), pages 656–673. Springer, 2022.
Kanazawa et al. [2018] Angjoo Kanazawa, Michael J. Black, David W. Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
Kaufmann et al. [2023] Manuel Kaufmann, Jie Song, Chen Guo, Kaiyue Shen, Tianjian Jiang, Chengcheng Tang, Juan José Zárate, and Otmar Hilliges. EMDB: The Electromagnetic Database of Global 3D Human Pose and Shape in the Wild. In IEEE International Conference on Computer Vision (ICCV), 2023.
Kocabas et al. [2020] Muhammed Kocabas, Nikos Athanasiou, and Michael J. Black. Vibe: Video inference for human body pose and shape estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
Kocabas et al. [2021] Muhammed Kocabas, Chun-Hao P Huang, Otmar Hilliges, and Michael J Black. Pare: Part attention regressor for 3d human body estimation. In IEEE International Conference on Computer Vision (ICCV), pages 11127–11137, 2021.
Kocabas et al. [2023] Muhammed Kocabas, Ye Yuan, Pavlo Molchanov, Yunrong Guo, Michael J Black, Otmar Hilliges, Jan Kautz, and Umar Iqbal. Pace: Human and camera motion estimation from in-the-wild videos. International Conference on 3D Vision (3DV), 2023.
Kolotouros et al. [2019] Nikos Kolotouros, Georgios Pavlakos, Michael J Black, and Kostas Daniilidis. Learning to reconstruct 3d human pose and shape via model-fitting in the loop. In IEEE International Conference on Computer Vision (ICCV), 2019.
Li et al. [2022a] Jiefeng Li, Siyuan Bian, Chao Xu, Gang Liu, Gang Yu, and Cewu Lu. D &d: Learning human dynamics from dynamic camera. In European Conference on Computer Vision (ECCV), pages 479–496. Springer, 2022a.
Li et al. [2022b] Zhihao Li, Jianzhuang Liu, Zhensong Zhang, Songcen Xu, and Youliang Yan. Cliff: Carrying location information in full frames into human pose and shape estimation. In European Conference on Computer Vision (ECCV), pages 590–606. Springer, 2022b.
Lin et al. [2021a] Kevin Lin, Lijuan Wang, and Zicheng Liu. Mesh graphormer. In IEEE International Conference on Computer Vision (ICCV), pages 12939–12948, 2021a.
Lin et al. [2021b] Kevin Lin, Lijuan Wang, and Zicheng Liu. End-to-end human pose and mesh reconstruction with transformers. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1954–1963, 2021b.
Liu et al. [2022] Zhenguang Liu, Runyang Feng, Haoming Chen, Shuang Wu, Yixing Gao, Yunjun Gao, and Xiang Wang. Temporal feature alignment and mutual information maximization for video-based human pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 11006–11016, 2022.
Loper et al. [2015] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (TOG), 34(6):248:1–248:16, 2015.
Moon and Lee [2020] Gyeongsik Moon and Kyoung Mu Lee. I2l-meshnet: Image-to-lixel prediction network for accurate 3d human pose and mesh estimation from a single rgb image. In European Conference on Computer Vision (ECCV), pages 752–768. Springer, 2020.
Mur-Artal et al. [2015] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. Orb-slam: a versatile and accurate monocular slam system. IEEE transactions on robotics, 31(5):1147–1163, 2015.
Rauschnabel et al. [2022] P. A. Rauschnabel, R. Felix, C. Hinsch, H. Shahab, and F. Alt. What is xr? towards a framework for augmented and virtual reality. Computers in Human Behavior, 133:107289, 2022.
Rong et al. [2019] Yu Rong, Ziwei Liu, Cheng Li, Kaidi Cao, and Chen Change Loy. Delving deep into hybrid annotations for 3d human recovery in the wild. In IEEE International Conference on Computer Vision (ICCV), pages 5340–5348, 2019.
Saito et al. [2020] Shunsuke Saito, Tomas Simon, Jason Saragih, and Hanbyul Joo. Pifuhd: Multi-level pixel-aligned implicit function for high-resolution 3d human digitization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 84–93, 2020.
Sengupta et al. [2020] Akash Sengupta, Ignas Budvytis, and Roberto Cipolla. Synthetic training for accurate 3d human pose and shape estimation in the wild. arXiv preprint arXiv:2009.10013, 2020.
Shin et al. [2023] Soyong Shin, Juyong Kim, Eni Halilaj, and Michael J Black. Wham: Reconstructing world-grounded humans with accurate 3d motion. arXiv preprint arXiv:2312.07531, 2023.
Sun et al. [2023] Yu Sun, Qian Bao, Wu Liu, Tao Mei, and Michael J Black. Trace: 5d temporal regression of avatars with dynamic cameras in 3d environments. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8856–8866, 2023.
Teed and Deng [2021] Zachary Teed and Jia Deng. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras. Advances in Neural Information Processing Systems (NeurIPS), 34:16558–16569, 2021.
Teed et al. [2023] Zachary Teed, Lahav Lipson, and Jia Deng. Deep patch visual odometry. Advances in Neural Information Processing Systems (NeurIPS), 2023.
Umeyama [1991] Shinji Umeyama. Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 13(04):376–380, 1991.
Varol et al. [2018] G. Varol, D. Ceylan, B. Russell, J. Yang, E. Yumer, I. Laptev, and C. Schmid. Bodynet: Volumetric inference of 3d human body shapes. In European Conference on Computer Vision (ECCV), pages 20–38. Springer International Publishing, 2018.
Von Marcard et al. [2018] Timo Von Marcard, Roberto Henschel, Michael J Black, Bodo Rosenhahn, and Gerard Pons-Moll. Recovering accurate 3d human pose in the wild using imus and a moving camera. In European Conference on Computer Vision (ECCV), pages 601–617, 2018.
Xu et al. [2022] Yufei Xu, **g Zhang, Qiming Zhang, and Dacheng Tao. Vitpose: Simple vision transformer baselines for human pose estimation. Advances in Neural Information Processing Systems (NeurIPS), 35:38571–38584, 2022.
Yao et al. [2019] P. Yao, Z. Fang, F. Wu, Y. Feng, and J. Li. Densebody: Directly regressing dense 3d human pose and shape from a single color image. arXiv preprint arXiv:1903.10153, 2019.
Ye et al. [2023] Vickie Ye, Georgios Pavlakos, Jitendra Malik, and Angjoo Kanazawa. Decoupling human and camera motion from videos in the wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 21222–21232, 2023.
Yuan et al. [2022] Ye Yuan, Umar Iqbal, Pavlo Molchanov, Kris Kitani, and Jan Kautz. Glamr: Global occlusion-aware human mesh recovery with dynamic cameras. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 11038–11049, 2022.
Zhang [2000] Zhengyou Zhang. A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 22(11):1330–1334, 2000.
Zhu et al. [2020] Rui Zhu, Xingyi Yang, Yannick Hold-Geoffroy, Federico Perazzi, Jonathan Eisenmann, Kalyan Sunkavalli, and Manmohan Chandraker. Single view metrology in the wild. In European Conference on Computer Vision (ECCV), pages 316–333. Springer, 2020.

Appendix A Appendix / supplemental material

Table 4: Evaluation of local-to-global adjustment on challenging sequences where the human subject occupies a large portion of the image. Sequences were selected based on an average human occupancy exceeding 40% on EMDB2 dataset.

Challenging Sequences		World Human Motion			World Camera Motion
Scale	L2G	WA-MPJPE $\downarrow$	W-MPJPE $\downarrow$	RTE $\downarrow$	ATE $\downarrow$	ATE-S $\downarrow$
✓	✗	160.54	520.89	4.59	1.11	2.12
✓	✓	142.40	376.17	3.36	1.02	1.64