OfCaM: Global Human Mesh Recovery via Optimization-free Camera Motion Scale Calibration

Fengyuan Yang, Kerui Gu, Ha Linh Nguyen, Angela Yao
National University of Singapore
{fyang, keruigu, hlinhn, ayao}@comp.nus.edu.sg
Abstract

Accurate camera motion estimation is critical to estimate human motion in the global space. A standard and widely used method for estimating camera motion is Simultaneous Localization and Map** (SLAM). However, SLAM only provides a trajectory up to an unknown scale factor. Different from previous attempts that optimize the scale factor, this paper presents Optimization-free Camera Motion Scale Calibration (OfCaM), a novel framework that utilizes prior knowledge from human mesh recovery (HMR) models to directly calibrate the unknown scale factor. Specifically, OfCaM leverages the absolute depth of human-background contact joints from HMR predictions as a calibration reference, enabling the precise recovery of SLAM camera trajectory scale in global space. With this correctly scaled camera motion and HMR’s local motion predictions, we achieve more accurate global human motion estimation. To compensate for scenes where we detect SLAM failure, we adopt a local-to-global motion map** to fuse with previously derived motion to enhance robustness. Simple yet powerful, our method sets a new standard for global human mesh estimation tasks, reducing global human motion error by 60% over the prior SOTA while also demanding orders of magnitude less inference time compared with optimization-based methods.

1 Introduction

Human pose and shape estimation (also called Human Mesh Recovery, HMR) in world coordinates is a key component of many vision applications [24, 6]. There are many successful (local) HMR methods [10, 12, 15, 4, 17, 3, 20], but they work primarily in camera coordinates. Only a few world-coordinate HMR methods, i.e., global HMR, have been developed [38, 16, 29, 28]. Most of these approaches learn a local-to-global map** directly from a sequence of 3D (local) meshes, as the mesh sequence themselves provides a strong cue. Yet in some cases, there is ambiguity when the background is ignored. Consider, for example, a person riding a skateboard vs. standing on the ground, both have a similar local motion 111This work uses “motion” to refer to a sequence of human poses or meshes, or camera extrinsics over time. but totally different global motions (see Fig. 1(a)).

An observed (local) human mesh sequence is composed of the human motion in the global space relative to the camera motion. As such, global human motion can be formulated in terms of the local motion and the camera motion. Given that the camera-coordinate HMR is a mature area of research [17, 12, 4, 3, 20], decoupling the camera motion from the global motion is logical and feasible alternate solution [37, 14]. A typical approach to estimate camera motion is SLAM  [5, 31, 30]. SLAM relies on the identification and continuous tracking of static environmental reference points to establish a spatial map and compute the camera’s trajectory relative to these landmarks. One limitation of SLAM is that it only estimates camera motion up to an unknown scale factor. This is typically resolved in robotics applications by integrating additional sensors (e.g., Inertial Measurement Units) or using calibration tools (e.g., checkerboard) to establish a metric scale. However, these solutions are not directly applicable to arbitrary videos for human motion analysis.

Therefore, some recent global HMR methods [37, 14] attempt to solve for the SLAM scale factor through optimization. The optimization, based on a loss function that evaluates the consistency between 3D human meshes projection and 2D video evidence, alongside smoothness constraints, jointly solves for the scale, human mesh, and camera trajectory. However, the inherent entanglement of human and camera motion makes it very challenging, sometimes leading to scale estimations that are off by a factor of several times, exemplified by the discrepancies between SLAHMR [37]’s and the ground true trajectory in Fig. 1(b). Another drawback is that the optimizations is very time-consuming; processing a one-minute video takes several minutes or longer (See Fig. 1(c)).

In this paper, we take a simple yet effective strategy of calibrating the scale based on the depth of key reference points. After perceiving the absolute depth of the reference point, we can solve the unknown scale and recover the whole camera motion. This solution gives a new direction to explicitly solve the scale of the estimated camera trajectory in an optimization-free manner. Notably, this optimization-free camera motion calibration takes much less time, which can be up to two or three orders of magnitude, compared with optimization-based methods (See Fig. 1(c)).

However, it still remains challenging to obtain the accurate absolute depth of the reference points, which commonly lie in the static background. What we have is the distance between the human mesh and the camera provided by local HMR methods. Can we utilize the foreground depth information to effectively calibrate the unknown scale of SLAM predicted camera trajectory? A key insight of this work is to select reference points closest to human-background contacts (feet in most cases) and use the predicted joint depth from HMR models as the depth of this reference point to directly calibrate the camera motion scale. By combining the accurately scaled camera motion with HMR’s local human motion predictions, we can readily compute the global human motion precisely (see Fig. 1(b)).

Refer to caption
(a) Camera \circ Global = Local motion.
Refer to caption
(b) Human trajectories.
Refer to caption
(c) Inference time comparison.
Figure 1: (a) Video sequence as an entanglement of the camera and human motion in the world coordinate. (b) and (c) Regression-based methods like WHAM [28] are time-efficient but fail in ambiguous cases; Optimization-based methods like SLAHMR [37] struggle to optimize a good trajectory and are time-consuming; while ours can achieve accurate trajectory and optimization-free.

SLAM works well when reference points from the static background can be tracked. A typical setting in which SLAM fails is when the moving foreground takes up a majority of the scene [1, 2], which may happen when humans are too close to the camera. To that end, we design a SLAM failure indicator and revert to a local-to-global human motion map** to compensate when it indicates the failure as the local-to-global map** depends less on the background information.

To conclude, we propose Optimization-free Camera Motion Scale Calibration (OfCaM) to estimate the global human motion, which is an adaptive combination of SLAM-based human motion with additional motion cues from the local-to-global map**. Our experimental results demonstrate significantly lower error in world coordinates compared to baseline and existing methods, especially a remarkable 60% improvement in global human trajectory. Furthermore, our work reveals a mutual enhancement relationship between HMR models and camera motion estimation. This finding has the potential to spark further research on the integration of camera estimation and HMR techniques.

We highlight our key contributions as follows:

  • We propose an efficient optimization-free method to calibrate the unknown scale of SLAM-based camera motion by perceiving the depth of key reference points, which is much faster than optimization-based methods.

  • We select the contact point of the human and the background, feet in most scenarios as the key reference point, which effectively retrieve the absolute depth from the local HMR model and recover the camera trajectory.

  • We propose an adaptive and generalizable global motion framework that utilizes the local-to-global prior, ensuring robustness for both optimal and suboptimal SLAM conditions.

  • OfCaM achieves significant advanced results in global human and camera motion compared with baseline and previous SOTA methods, demonstrating our effectiveness.

2 Related Works

2.1 World Coordinate Human Mesh Recovery

Image-based HMR methods [10, 33, 36, 22, 26, 19, 18, 25, 27, 3] traditionally focus on recovering human meshes within the camera’s coordinate system. Although major video-based HMR advancements [12, 4, 20] also operate within same camera space, the advent of video data has paved the way for HMR exploration in world coordinates [38, 16, 9, 37, 29, 28]. The transition to world coordinates introduces the distinct challenge of disentangling both dynamic camera motion and human motion, which is a relatively nascent research area. While most previous attempts (e.g., GLAMR [38], DnD [16], TRACE [29], and WHAM [28]) proposed to infer global motion from observable local behaviors (e.g., if a person looks like they are walking, it is assumed they are moving forwards globally), the inherent ambiguities of local-to-global map** present significant limitations. In contrast, our approach does not solely depend on these dataset-learned local-to-global priors but rather employs them as additional cues to enhance accuracy.

Recent efforts including SLAHMR [37] and PACE [14] recognize the utility of background information in determining camera motion with SLAM techniques [23, 30, 31]. However, these methods aim to address the ‘unknown scale’ problem in SLAM outputs by jointly optimizing scale, pose, and shape parameters—a procedure that is inherently ambiguous. Our method, by contrast, deviates from these intensive optimization strategies by calibrating the scale factor directly, utilizing the depth predictions from HMR models, and thereby giving an optimization-free solution.

2.2 Camera Calibration

Camera calibration is a fundamental procedure in robotics and computer vision that enables precise spatial measurement and scene reconstruction. Typically, this process relies on additional sensors, such as Inertial Measurement Units (IMUs) [39, 8], or reference markers like checkerboards [7] to define a known metric scale. However, these traditional calibration methods are not feasible for arbitrary human-centric videos within the HMR domain, due to the absence of external sensors and standardized calibration tools. In contrast, single-view metrology [40] suggests that objects with well-defined geometrical priors can themselves act as natural calibration references. Motivated by this concept, HMR models, with their inherent human geometric priors, have the potential to serve as surrogate calibration devices. In our work, we utilize the predicted absolute joint depths from the HMR model as the reference to accurately and efficiently calibrate the unknown scale factor.

3 Preliminaries

3.1 Human Motion in Camera Coordinates csubscript𝑐\mathcal{M}_{c}caligraphic_M start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT

The 3D human motion from a video I={It}t=1T𝐼superscriptsubscriptsubscript𝐼𝑡𝑡1𝑇I={\{I_{t}\}}_{t=1}^{T}italic_I = { italic_I start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT of T𝑇Titalic_T frames can be represented in the camera space by a T𝑇Titalic_T-length sequence of SMPL parameters c={𝜽t,𝜷t,𝝍t,𝝉t}t=1Tsubscript𝑐superscriptsubscriptsubscript𝜽𝑡subscript𝜷𝑡subscript𝝍𝑡subscript𝝉𝑡𝑡1𝑇\mathcal{M}_{c}={\{\boldsymbol{\theta}_{t},\boldsymbol{\beta}_{t},\boldsymbol{% \psi}_{t},\boldsymbol{\tau}_{t}\}}_{t=1}^{T}caligraphic_M start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT = { bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_ψ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT. SMPL [21] is a widely used 3D statistical model of the human body. For a given frame at time t𝑡titalic_t, the SMPL model maps body pose 𝜽t23×3subscript𝜽𝑡superscript233\boldsymbol{\theta}_{t}\in\mathbb{R}^{23\times 3}bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 23 × 3 end_POSTSUPERSCRIPT, shape parameters 𝜷t10subscript𝜷𝑡superscript10\boldsymbol{\beta}_{t}\in\mathbb{R}^{10}bold_italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 10 end_POSTSUPERSCRIPT, root orientation 𝝍t3subscript𝝍𝑡superscript3\boldsymbol{\psi}_{t}\in\mathbb{R}^{3}bold_italic_ψ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT, and root translation 𝝉t3subscript𝝉𝑡superscript3\boldsymbol{\tau}_{t}\in\mathbb{R}^{3}bold_italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT to a 3D mesh of the human body 𝐕t6890×3subscript𝐕𝑡superscript68903\mathbf{V}_{t}\!\in\!\mathbb{R}^{6890\times 3}bold_V start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 6890 × 3 end_POSTSUPERSCRIPT in the camera space. The individual joints can be mapped from the SMPL parameters with the function 𝐉=𝒥(𝜽t,𝜷t,𝝍t,𝝉t)24×3𝐉𝒥subscript𝜽𝑡subscript𝜷𝑡subscript𝝍𝑡subscript𝝉𝑡superscript243\mathbf{J}=\mathcal{J}(\boldsymbol{\theta}_{t},\boldsymbol{\beta}_{t},% \boldsymbol{\psi}_{t},\boldsymbol{\tau}_{t})\in\mathbb{R}^{24\times 3}bold_J = caligraphic_J ( bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_ψ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ∈ blackboard_R start_POSTSUPERSCRIPT 24 × 3 end_POSTSUPERSCRIPT.

Note that 𝝍tsubscript𝝍𝑡\boldsymbol{\psi}_{t}bold_italic_ψ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and 𝝉tsubscript𝝉𝑡\boldsymbol{\tau}_{t}bold_italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT are sometimes referred to as “global” orientation and translation parameters by the SMPL model. However, HMR models [10, 15, 12, 4, 13, 35] estimate these parameters with respect to the camera extrinsics \mathcal{E}caligraphic_E at frame t𝑡titalic_t, where ={𝑹t,𝑻t}t=1Tsuperscriptsubscriptsubscript𝑹𝑡subscript𝑻𝑡𝑡1𝑇\mathcal{E}={\{{\boldsymbol{R}}_{t},{\boldsymbol{T}}_{t}\}}_{t=1}^{T}caligraphic_E = { bold_italic_R start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_T start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT is the sequence of camera rotations 𝑹t3×3subscript𝑹𝑡superscript33{\boldsymbol{R}}_{t}\in\mathbb{R}^{3\times 3}bold_italic_R start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 3 × 3 end_POSTSUPERSCRIPT and translations 𝑻t3subscript𝑻𝑡superscript3{\boldsymbol{T}}_{t}\in\mathbb{R}^{3}bold_italic_T start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT.

3.2 Human Motion in World Coordinates wsubscript𝑤\mathcal{M}_{w}caligraphic_M start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT

Unlike csubscript𝑐\mathcal{M}_{c}caligraphic_M start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT, the human motion in the world coordinates w={𝜽t,𝜷t,𝚽t,𝚪t}t=1Tsubscript𝑤superscriptsubscriptsubscript𝜽𝑡subscript𝜷𝑡subscript𝚽𝑡subscript𝚪𝑡𝑡1𝑇\mathcal{M}_{w}={\{\boldsymbol{\theta}_{t},\boldsymbol{\beta}_{t},\boldsymbol{% \Phi}_{t},\boldsymbol{\Gamma}_{t}\}}_{t=1}^{T}caligraphic_M start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT = { bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_Φ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_Γ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT is the motion within an absolute global space 222By convention, the world space is defined by the camera extrinsics parameters of the very first frame. and independent of camera extrinsics tsubscript𝑡\mathcal{E}_{t}caligraphic_E start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT. Recall the classical perspective projection, local root orientation and translation {𝝍t,𝝉t}subscript𝝍𝑡subscript𝝉𝑡\{\boldsymbol{\psi}_{t},\boldsymbol{\tau}_{t}\}{ bold_italic_ψ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT } is obtained by applying camera extrinsics tsubscript𝑡\mathcal{E}_{t}caligraphic_E start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT to the global orientation 𝚽t3subscript𝚽𝑡superscript3\boldsymbol{\Phi}_{t}\in\mathbb{R}^{3}bold_Φ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT and global translation 𝚪t3subscript𝚪𝑡superscript3\boldsymbol{\Gamma}_{t}\in\mathbb{R}^{3}bold_Γ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT in world coordinates:

𝝍t=𝑹t𝚽t;𝝉t=𝑹t𝚪t+𝑻t.formulae-sequencesubscript𝝍𝑡subscript𝑹𝑡subscript𝚽𝑡subscript𝝉𝑡subscript𝑹𝑡subscript𝚪𝑡subscript𝑻𝑡\boldsymbol{\psi}_{t}=\boldsymbol{R}_{t}\boldsymbol{\Phi}_{t};\quad\boldsymbol% {\tau}_{t}=\boldsymbol{R}_{t}\boldsymbol{\Gamma}_{t}+\boldsymbol{T}_{t}.bold_italic_ψ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = bold_italic_R start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT bold_Φ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ; bold_italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = bold_italic_R start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT bold_Γ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + bold_italic_T start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT . (1)

Thus, to obtain the global human motion from the local motion estimated by HMR models, one can apply the inverse of the camera extrinsics to the local root orientation and translation:

𝚽t=𝑹tT𝝍t;𝚪t=𝑹tT(𝝉t𝑻t).formulae-sequencesubscript𝚽𝑡superscriptsubscript𝑹𝑡𝑇subscript𝝍𝑡subscript𝚪𝑡superscriptsubscript𝑹𝑡𝑇subscript𝝉𝑡subscript𝑻𝑡\boldsymbol{\Phi}_{t}=\boldsymbol{R}_{t}^{T}\boldsymbol{\psi}_{t};\quad% \boldsymbol{\Gamma}_{t}=\boldsymbol{R}_{t}^{T}(\boldsymbol{\tau}_{t}-% \boldsymbol{T}_{t}).bold_Φ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = bold_italic_R start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT bold_italic_ψ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ; bold_Γ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = bold_italic_R start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ( bold_italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - bold_italic_T start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) . (2)

This equation explains how we decouple the global motion from the local estimation by isolating and removing the camera motion, which serves as one of our key insights. Nevertheless, getting the correct camera extrinsics \mathcal{E}caligraphic_E can be difficult. SLAM is a widely used method to estimate camera motion [23, 30, 31], though it can only predict the camera extrinsics {𝑹t,s𝑻t}t=1Tsuperscriptsubscriptsubscript𝑹𝑡𝑠subscript𝑻𝑡𝑡1𝑇{\{\boldsymbol{R}_{t},s\cdot\boldsymbol{T}_{t}\}}_{t=1}^{T}{ bold_italic_R start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_s ⋅ bold_italic_T start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT up to an unknown scale s𝑠sitalic_s. Our work focus on how to calibrate the scale s𝑠sitalic_s based on the recovered human mesh.

4 Method

Refer to caption
Figure 2: Our proposed framework operates in two distinct yet complementary streams: (1) Camera Motion Stream, which leverages contact joints’ depth from HMR prediction to calibrate the SLAM’s unknown scale factor; and (2) Human Motion Stream, which leverages a local-to-global motion prior to rectify inaccuracies derived from SLAM’s failure cases.

Our pipeline is illustrated in Fig. 2. For global HMR, the inputs are typically captured by moving cameras featuring static background content and dynamic foreground content of humans. Prior approaches [38, 16, 29] infer global human motion wsubscript𝑤\mathcal{M}_{w}caligraphic_M start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT exclusively from foreground’s local motion csubscript𝑐\mathcal{M}_{c}caligraphic_M start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT. In contrast, recent attempts [37, 14] jointly optimize global human motion wsubscript𝑤\mathcal{M}_{w}caligraphic_M start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT and the SLAM dereived camera motion \mathcal{E}caligraphic_E to fit with the 2D observation. Different from their complex optimization, we propose an optimization-free way to calibrate the scale by comparing the depth of some key reference points from the output of SLAM and HMR, where we select the human-background contact joints as the reference points. (Sec. 4.2). Furthermore, to resolve the cases of problematic SLAM output, we introduce the global human motion refinement via fusing local motion priors (Sec. 4.2).

4.1 Scale Calibration

SLAM predicts camera motions up to an arbitrary scale factor s𝑠sitalic_s. Similar to the traditional camera calibration strategies (e.g., with checkerboard patterns), this work estimates s𝑠sitalic_s based on the ratio of absolute versus relative distance to the camera for some reference point p𝑝pitalic_p, i.e.,

sp=dpA/dpS,subscript𝑠𝑝subscriptsuperscript𝑑𝐴𝑝subscriptsuperscript𝑑𝑆𝑝s_{p}=d^{A}_{p}/d^{S}_{p},italic_s start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT = italic_d start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT / italic_d start_POSTSUPERSCRIPT italic_S end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT , (3)

where dpAsubscriptsuperscript𝑑𝐴𝑝d^{A}_{p}italic_d start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT and dpSsubscriptsuperscript𝑑𝑆𝑝d^{S}_{p}italic_d start_POSTSUPERSCRIPT italic_S end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT denote the absolute real-world distance from camera to p𝑝pitalic_p and the corresponding relative SLAM depth respectively.

Reference Joint Selection. Standard HMR models estimate the depth of human joints with respect to the camera, based on large-scale training data and human size priors encoded in the (e.g. SMPL) model. Estimating scale s𝑠sitalic_s based on human joints then, requires the relative depth of the corresponding joints from SLAM. However, SLAM typically relies on tracking and matching static reference points within the scene to estimate camera motion. Many of the joints on an arbitrarily moving human body are outliers disregarded by SLAM. Therefore, we propose to use contact points from the human and the background, (i.e., the feet in most cases) as the reference point p𝑝pitalic_p. Experimentally, the feet are verified as the most reliable references (see Sec. 5.2) for scale calibration.

Absolute Distance Derived from HMR Model. By definition, the camera serves as the origin of the camera space, so the reference joint’s absolute distance with respect to the camera is given by:

dpA=𝐉p2,subscriptsuperscript𝑑𝐴𝑝subscriptnormsubscript𝐉𝑝2d^{A}_{p}=\left\|\mathbf{J}_{p}\right\|_{2},italic_d start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT = ∥ bold_J start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , (4)

where 𝐉psubscript𝐉𝑝\mathbf{J}_{p}bold_J start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT is the 3D position of joint p𝑝pitalic_p computed using the HMR model 𝐉=𝒥(𝜽t,𝜷t,𝝍t,𝝉t)𝐉𝒥subscript𝜽𝑡subscript𝜷𝑡subscript𝝍𝑡subscript𝝉𝑡\mathbf{J}=\mathcal{J}(\boldsymbol{\theta}_{t},\boldsymbol{\beta}_{t},% \boldsymbol{\psi}_{t},\boldsymbol{\tau}_{t})bold_J = caligraphic_J ( bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_ψ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ).

Refer to caption
(a)
Refer to caption
(b)
Figure 3: Details of our proposed method: (a) Retrieval of reference point depth from SLAM output. (b) Identification and compensation of failed SLAM motion segments using local-to-global prediction.

Relative Distance Derived from SLAM. SLAM methods estimate the camera motion by tracking and matching a set of keypoints or patches K𝐾Kitalic_K and maintaining depth maps relative to SLAM’s coordinate. For each keypoint kK𝑘𝐾k\in Kitalic_k ∈ italic_K, located at the 2D position 𝐱k2subscript𝐱𝑘superscript2\mathbf{x}_{k}\in\mathbb{R}^{2}bold_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT in the image plane, there is an associated depth zksubscript𝑧𝑘z_{k}\in\mathbb{R}italic_z start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ blackboard_R. As shown in Fig. 3(a), the relative distance dpSsubscriptsuperscript𝑑𝑆𝑝d^{S}_{p}italic_d start_POSTSUPERSCRIPT italic_S end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT of the reference joint p𝑝pitalic_p with respect to the camera can be estimated based on the nearest corresponding SLAM keypoints ksuperscript𝑘k^{*}italic_k start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT:

dpS=zkwherek=argminkK𝐱kπ(𝐉p)2.formulae-sequencesubscriptsuperscript𝑑𝑆𝑝subscript𝑧superscript𝑘wheresuperscript𝑘subscriptargmin𝑘𝐾subscriptnormsubscript𝐱𝑘𝜋subscript𝐉𝑝2d^{S}_{p}=z_{k^{*}}\quad\text{where}\ k^{*}=\operatorname*{arg\,min}_{k\in K}% \left\|\mathbf{x}_{k}-\pi\left(\mathbf{J}_{p}\right)\right\|_{2}.italic_d start_POSTSUPERSCRIPT italic_S end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT = italic_z start_POSTSUBSCRIPT italic_k start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT where italic_k start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = start_OPERATOR roman_arg roman_min end_OPERATOR start_POSTSUBSCRIPT italic_k ∈ italic_K end_POSTSUBSCRIPT ∥ bold_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT - italic_π ( bold_J start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ) ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . (5)

Above, π(𝐉p)𝜋subscript𝐉𝑝\pi\left(\mathbf{J}_{p}\right)italic_π ( bold_J start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ) denotes the 3D contact joint 𝐉psubscript𝐉𝑝\mathbf{J}_{p}bold_J start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT projected into the camera plane with projection function π𝜋\piitalic_π. For stability, we reject correspondences that are too far away, i.e., if the closest patch to the projected joint 𝐱kπ(𝐉p)2subscriptnormsubscript𝐱superscript𝑘𝜋subscript𝐉𝑝2\left\|\mathbf{x}_{k^{*}}-\pi\left(\mathbf{J}_{p}\right)\right\|_{2}∥ bold_x start_POSTSUBSCRIPT italic_k start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT - italic_π ( bold_J start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ) ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT exceeds some distance threshold δ𝛿\deltaitalic_δ.

Sequence Scale Factor. Taking the results of Eq. 4 and Eq. 5 into Eq. 3, we can obtain the scale factor for current single keyframe t𝑡titalic_t. To account for noise in finding the nearest tracked keypoint as well as the SLAM depth map, we take the median of the scale factors across all keyframes as the final scale factor s¯=median({s(t)tI})¯𝑠medianconditional-set𝑠𝑡𝑡𝐼\bar{s}=\text{median}(\{s(t)\mid t\in I\})over¯ start_ARG italic_s end_ARG = median ( { italic_s ( italic_t ) ∣ italic_t ∈ italic_I } ) for the whole sequence I𝐼Iitalic_I, ensuring stability and robustness.

After calibrating the scale as s¯¯𝑠\bar{s}over¯ start_ARG italic_s end_ARG, the absolute camera extrinsics is {𝑹t,s¯𝑻t}t=1Tsuperscriptsubscriptsubscript𝑹𝑡¯𝑠subscript𝑻𝑡𝑡1𝑇{\{\boldsymbol{R}_{t},\bar{s}\cdot\boldsymbol{T}_{t}\}}_{t=1}^{T}{ bold_italic_R start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , over¯ start_ARG italic_s end_ARG ⋅ bold_italic_T start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT. By applying them to HMR’s predicted human motion csubscript𝑐\mathcal{M}_{c}caligraphic_M start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT, we can finally get the SLAM-derived global motion:

wS={𝜽t,𝜷t,𝚽t=𝑹tT𝝍t,𝚪t=𝑹tT(𝝉ts¯𝑻t)}t=1T,\mathcal{M}^{S}_{w}={\{\boldsymbol{\theta}_{t},\ \boldsymbol{\beta}_{t},\ % \boldsymbol{\Phi}_{t}=\boldsymbol{R}_{t}^{T}\boldsymbol{\psi}_{t},\ % \boldsymbol{\Gamma}_{t}=\boldsymbol{R}_{t}^{T}(\boldsymbol{\tau}_{t}-\bar{s}% \cdot\boldsymbol{T}_{t})\}}_{t=1}^{T},caligraphic_M start_POSTSUPERSCRIPT italic_S end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT = { bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_Φ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = bold_italic_R start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT bold_italic_ψ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_Γ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = bold_italic_R start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ( bold_italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - over¯ start_ARG italic_s end_ARG ⋅ bold_italic_T start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) } start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT , (6)

where 𝜽t,𝜷tsubscript𝜽𝑡subscript𝜷𝑡\boldsymbol{\theta}_{t},\boldsymbol{\beta}_{t}bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT are from HMR’s prediction, and 𝚽t,𝚪tsubscript𝚽𝑡subscript𝚪𝑡\boldsymbol{\Phi}_{t},\boldsymbol{\Gamma}_{t}bold_Φ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_Γ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT are calculated by camera-to-world translation (i.e., Eq. 2) from HMR’s prediciton 𝝍t,𝝉tsubscript𝝍𝑡subscript𝝉𝑡\boldsymbol{\psi}_{t},\boldsymbol{\tau}_{t}bold_italic_ψ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT in camera space.

4.2 Local-to-global Motion Adjustment

As previously mentioned, SLAM can be prone to failure in challenging circumstances, such as when input images are dominated by the dynamic foreground human there are too few informative reference points for matching and tracking. In such scenarios, inaccurate camera motion estimation by SLAM can further impact the derived human motion in world coordinates by Eq. 2. To mitigate complete reliance on SLAM, we propose to use the local-to-global motion prior to compensating for scenarios where SLAM falls short.

Local-to-global Human Motion. The previous works [38, 16, 29] in global HMR are devoted to learning the global motion from local motion priors. In our work, we adopt a lightweight sub-module from GLAMR [38] as the local-to-global motion predictor. The input is a sequence of the body pose {𝜽t}subscript𝜽𝑡\{\boldsymbol{\theta}_{t}\}{ bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT } and the output is the sequence of human’s global orientation {𝚽~t}subscript~𝚽𝑡\{\tilde{\boldsymbol{\Phi}}_{t}\}{ over~ start_ARG bold_Φ end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT } and global translation {𝚪~t}subscript~𝚪𝑡\{\tilde{\boldsymbol{\Gamma}}_{t}\}{ over~ start_ARG bold_Γ end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT } both relative to the first frame. Thus, we can get the global motion derived by local-to-global predictor:

wL={𝜽t,𝜷t,𝚽~t,𝚪~t}t=1T,subscriptsuperscript𝐿𝑤superscriptsubscriptsubscript𝜽𝑡subscript𝜷𝑡subscript~𝚽𝑡subscript~𝚪𝑡𝑡1𝑇\mathcal{M}^{L}_{w}={\{\boldsymbol{\theta}_{t},\ \boldsymbol{\beta}_{t},\ % \tilde{\boldsymbol{\Phi}}_{t},\ \tilde{\boldsymbol{\Gamma}}_{t}\}}_{t=1}^{T},caligraphic_M start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT = { bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , over~ start_ARG bold_Φ end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , over~ start_ARG bold_Γ end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT , (7)

where wLsubscriptsuperscript𝐿𝑤\mathcal{M}^{L}_{w}caligraphic_M start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT denotes the global motion in world space derived by local-to-global prediction.

SLAM Failure Indicator. When SLAM performs well, the scale factor s𝑠sitalic_s over the whole sequence will exhibit a very small standard deviation. Therefore, we use standard deviation σ({s()})𝜎𝑠\sigma(\{s(\cdot)\})italic_σ ( { italic_s ( ⋅ ) } ) as a SLAM failure indicator to identify the set of segments 𝒮={I^Iσ({s(t)tI^})>υ}𝒮conditional-set^𝐼𝐼𝜎conditional-set𝑠𝑡𝑡^𝐼𝜐\mathcal{S}=\{\hat{I}\subseteq I\mid\sigma(\{s(t)\mid t\in\hat{I}\})>\upsilon\}caligraphic_S = { over^ start_ARG italic_I end_ARG ⊆ italic_I ∣ italic_σ ( { italic_s ( italic_t ) ∣ italic_t ∈ over^ start_ARG italic_I end_ARG } ) > italic_υ }, each segement I^^𝐼\hat{I}over^ start_ARG italic_I end_ARG exhibiting significant disagreement in the calculated scale factor. This segment-wise manner is necessary because SLAM may fail in certain segments rather than the entire sequence.

Segment-wise Adaptive Global Motion Fusion. Since when SLAM failed, the scale is no longer reliable but may retain potentially useful shape information, we opt to fuse the local-to-global motion wL(I^)subscriptsuperscript𝐿𝑤^𝐼\mathcal{M}^{L}_{w}(\hat{I})caligraphic_M start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT ( over^ start_ARG italic_I end_ARG ) with SLAM-derived motion wS(I^)subscriptsuperscript𝑆𝑤^𝐼\mathcal{M}^{S}_{w}(\hat{I})caligraphic_M start_POSTSUPERSCRIPT italic_S end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT ( over^ start_ARG italic_I end_ARG ) as the final motion for failed segments I^𝒮^𝐼𝒮\hat{I}\in\mathcal{S}over^ start_ARG italic_I end_ARG ∈ caligraphic_S. Thus, we first align the SLAM-derived motion to the local-to-global motion by Umeyama’s method [32]: U(wS(I^),wL(I^))={𝜽t,𝜷t,𝚽t,𝚪t}tI^𝑈subscriptsuperscript𝑆𝑤^𝐼subscriptsuperscript𝐿𝑤^𝐼subscriptsubscript𝜽𝑡subscript𝜷𝑡subscriptsuperscript𝚽𝑡subscriptsuperscript𝚪𝑡𝑡^𝐼U(\mathcal{M}^{S}_{w}(\hat{I}),\mathcal{M}^{L}_{w}(\hat{I}))={\{\boldsymbol{% \theta}_{t},\ \boldsymbol{\beta}_{t},\ {\boldsymbol{\Phi}}^{\prime}_{t},\ {% \boldsymbol{\Gamma}}^{\prime}_{t}\}}_{t\in\hat{I}}italic_U ( caligraphic_M start_POSTSUPERSCRIPT italic_S end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT ( over^ start_ARG italic_I end_ARG ) , caligraphic_M start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT ( over^ start_ARG italic_I end_ARG ) ) = { bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_Φ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_Γ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_t ∈ over^ start_ARG italic_I end_ARG end_POSTSUBSCRIPT where U(a,b)𝑈𝑎𝑏U(a,b)italic_U ( italic_a , italic_b ) is Umeyama’s method to align points set a𝑎aitalic_a to points set b𝑏bitalic_b. The fused global motion is the weighted average of these two motions:

wF(I^)={𝜽t,𝜷t,λ𝚽~t+(1λ)𝚽t,λ𝚪~t+(1λ)𝚪t}tI^,subscriptsuperscript𝐹𝑤^𝐼subscriptsubscript𝜽𝑡subscript𝜷𝑡𝜆subscript~𝚽𝑡1𝜆subscriptsuperscript𝚽𝑡𝜆subscript~𝚪𝑡1𝜆subscriptsuperscript𝚪𝑡𝑡^𝐼\mathcal{M}^{F}_{w}(\hat{I})={\{\boldsymbol{\theta}_{t},\ \boldsymbol{\beta}_{% t},\ \lambda\tilde{\boldsymbol{\Phi}}_{t}+(1-\lambda){\boldsymbol{\Phi}}^{% \prime}_{t},\ \lambda\tilde{\boldsymbol{\Gamma}}_{t}+(1-\lambda){\boldsymbol{% \Gamma}}^{\prime}_{t}\}}_{t\in\hat{I}},caligraphic_M start_POSTSUPERSCRIPT italic_F end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT ( over^ start_ARG italic_I end_ARG ) = { bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_λ over~ start_ARG bold_Φ end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + ( 1 - italic_λ ) bold_Φ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_λ over~ start_ARG bold_Γ end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + ( 1 - italic_λ ) bold_Γ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_t ∈ over^ start_ARG italic_I end_ARG end_POSTSUBSCRIPT , (8)

where the weight λ𝜆\lambdaitalic_λ is calculated by the Softmax of the standard deviation σ({s(t)tI^})𝜎conditional-set𝑠𝑡𝑡^𝐼\sigma(\{s(t)\mid t\in\hat{I}\})italic_σ ( { italic_s ( italic_t ) ∣ italic_t ∈ over^ start_ARG italic_I end_ARG } ) , a higher outlier score means lower weight given to SLAM-derived motion during weighted fusion.

4.3 Final Human Motion and Camera Motion

Upon identifying the failure segments 𝒮𝒮\mathcal{S}caligraphic_S, we selectively update these segments by integrating local-to-global motion as mentioned above, thus we achieved the final global human motion wsubscript𝑤\mathcal{M}_{w}caligraphic_M start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT:

w(I^)={wF(I^)if I^𝒮,wS(I^)otherwise.subscript𝑤^𝐼casessubscriptsuperscript𝐹𝑤^𝐼if ^𝐼𝒮subscriptsuperscript𝑆𝑤^𝐼otherwise\mathcal{M}_{w}(\hat{I})=\begin{cases}\mathcal{M}^{F}_{w}(\hat{I})&\text{if }% \hat{I}\in\mathcal{S},\\ \mathcal{M}^{S}_{w}(\hat{I})&\text{otherwise}.\end{cases}caligraphic_M start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT ( over^ start_ARG italic_I end_ARG ) = { start_ROW start_CELL caligraphic_M start_POSTSUPERSCRIPT italic_F end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT ( over^ start_ARG italic_I end_ARG ) end_CELL start_CELL if over^ start_ARG italic_I end_ARG ∈ caligraphic_S , end_CELL end_ROW start_ROW start_CELL caligraphic_M start_POSTSUPERSCRIPT italic_S end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT ( over^ start_ARG italic_I end_ARG ) end_CELL start_CELL otherwise . end_CELL end_ROW (9)

Consequently, this rectified global human motion allows for an update to the camera motion by algebraically reformulating Eq. 2 to express the camera motion in terms of local and global human motion.

5 Experiments

5.1 Implementation Details, Dataset and Metrics

Our experiments 333The code will be released upon acceptance adopt DPVO [31] as the SLAM model for the camera motion stream and CLIFF [17] as the HMR model for the human motion stream. The distance threshold δ=400px𝛿400𝑝𝑥\delta=400pxitalic_δ = 400 italic_p italic_x for outlier rejection and the standard deviation threshold υ=2𝜐2\upsilon=2italic_υ = 2 for SLAM failure segment identification.

Datasets. Following previous works [28], we evaluate the global human motion and camera motion on a subset of EMDB  [11] (EMDB 2), which contains 25 sequences captured by the dynamic camera and provides ground truth global motion for both human and camera.

Metrics for Human Motion. Same with previous works [14, 37, 28], we evaluate human’s global motion error by: (1) WA-MPJPE which is the average Euclidean distance between the ground truth and the predicted joint positions (i.e., MPJPE) after aligning each segment for every 100 frames; (2) W-MPJPE which is the MPJPE error after only aligning the first two frames of each 100-frames segments with the ground truth. (3) RTE which is the human’s root translation error of the whole sequence after the rigid alignment. We also evaluate local mesh error by (4) PA-MPJPE which is the MPJPE error after Procrustes aligned with ground truth.

Metrics for Camera Motion. We follow SLAM convention and previous works [14] for camera motion evaluation, reporting (1) ATE which is the Average Translation Error after rigidly aligning the camera trajectories; (2) ATE-S which is Average Translation Error without Scale alignment, providing a more accurate reflection of inaccuracies in the captured scale of the scene.

5.2 Ablation Study and Analysis

Table 1: Ablation studies on the impact of our proposed scale calibration and local-to-global adjustment on the error of global human and camera motion. ‘L2G’ denotes local-to-global.
Ablation Global Human Motion Global Camera Motion
Scale L2G WA-MPJPE\downarrow W-MPJPE\downarrow RTE\downarrow ATE\downarrow ATE-S\downarrow
335.53 833.11 9.61 0.72 6.30
280.25 759.56 7.68 - -
111.29 347.60 2.41 0.72 1.33
108.24 317.88 2.21 0.71 1.25
Table 2: Comparative analysis of scale calibration performance using different reference joints. Results indicate that human-background contact joints such as feet served as a better choice.
Reference Joint Global Human Motion Global Camera Motion
WA-MPJPE\downarrow W-MPJPE\downarrow RTE\downarrow ATE\downarrow ATE-S\downarrow
Head 369.11 979.07 9.65 2.13 6.77
Pelvis 292.44 753.64 8.23 1.39 5.51
Feet 108.24 317.88 2.21 0.71 1.25

Scale Calibration. Tab. 1 shows the impact of fixing the SLAM scale to the initial scale of SLAM output (first row) vs. scaling the camera motion (third and fourth row). This comparison reveals that our scale calibration is effective on both human motion (left part) and camera motion (right part). Additionally, Fig 7 shows the SLAM output indeed facing the unknown scale problem in the first place but after our scale calibration, the camera trajectory becomes fitter with the ground truth.

Local-to-global Refinement. Comparing the third row and last row of Tab. 1, L2G can successfully refine the human motion estimation derived from failed SLAM outputs (left part), and improved human motion estimations concurrently yield more accurate camera motion (right part). This improvement is more pronounced when evaluated on challenging sequences with human occupancy exceeding 40% of the image area. As shown in Tab. 4, our L2G module achieves a 10% improvement in WA-MPJPE and 30% improvement in W-MPJPE.

Importance of Camera Motion. Tab. 1 also shows the performance when we bypass the camera motion stream and directly use the result of L2G as the global human motion (second row) vs. using scale-calibrated camera motion to decouple global human motion (third row). The better performance of the latter highlights that ambiguity in local-to-global motion prediction is inherent, camera motion is essential, and we can better decouple by an effective scale calibration.

Reference Joint Selection. Tab. 2 shows that performance drops for both human and camera motion when we use non-contact joints as the reference points, e.g. the head joint (first row) or root joint (second row). As the feet are consistently proximate to the ground surface, they are more stable and reliable reference points for scale calibration. Specifically, Fig 4 further demonstrate pelvis joints show larger scale error than foot joints since dynamic humans are hard to capture by SLAM. Furthermore, an inverse correlation exists between scale errors of left and right feet, with growth in left feet corresponding to a contraction in right. This scale error trend is also consistent with the contact feet as shown in the corresponding image frames, a foot will have less scale error when it contacts with the ground. This further verified our motivation for choosing the contact joint of the foreground and background as the reference point.

Refer to caption
Figure 4: Scale error across some keyframes of the video. Results show a low scale error when the foot is in contact with the ground. Additionally, an inverse correlation between the left and right foot scale errors corresponds with the alternating pattern of footfalls during locomotion.
Table 3: SOTA comparison of local human motion and global human motion on EMDB2 dataset.
Models EMDB2
PA-MPJPE\downarrow W-MPJPE\downarrow WA-MPJPE\downarrow RTE\downarrow
GLAMR [38] 56.0 726.6 280.8 16.7
TRACE [29] 58.0 1702.3 529.0 18.9
SLAHMR [37] 61.5 776.1 326.9 10.2
WHAM(w/ DPVO) [28] 41.9 354.8 135.6 6.0
OfCam (Ours) 53.7 317.9 108.2 2.2

5.3 Comparison with the State-of-the-art

Tab. 3 compares our approach with SOTA world coordinate human mesh recovery methods. Our Method demonstrates a significant enhancement on global human motion metrics over WHAM [28] (about 10% improvement in W-MPJPE, 20% improvement in WA-MPJPE, 60% improvement in RTE). This notable improvement, especially in RTE, which evaluates the entire motion trajectory, is attributed to our accurate camera motion scale calibration. Our calibration effectively and reliably decouples human motion from the camera motion.

As discussed in the Related Works section, the previous methods can divided into local-to-global methods and optimization-based camera motion methods. Here we compare our method with both to further demonstrate our strength.

Human Global Translation Ambiguity. The large global trajectory error of Local-to-global Methods (see RTE of TRACE and GLAMR in Tab. 3) demonstrates the difficulties of those Local-to-global Methods to handle long-distance trajectories. The deeper reason for this is the ambiguity when derive global translation only based on the local motion. As shown in the first example in Fig. 5, it’s hard to infer the global translation from a "standing" local pose when a man is skateboarding.

Time Complexity. We compare the running time between our method and scale-optimization methods (such as SLAHMR [37] and PACE [14]) in Fig. 1(c). Excluding the SLAM running time, SLAHMR takes over 200 minutes per 1000 frames, and PACE takes 8 minutes per 1000 frames for optimization. In contrast, our approach requires significantly less time (2.5 seconds per 1000 frames) as it is optimization-free. Furthermore, we achieve a better scale compared to SLAHMR, despite the latter’s long optimization process. This illustrates that our scale-calibration is not only time-efficient but also delivers strong performance.

Global Human Motion Visualization. We also compare our visualization result with other methods as shown in Fig. 6. The visualization clearly demonstrates that our method produces outcomes that are not only more natural-looking but also better aligned with the ground truth.

Refer to caption
Figure 5: Comparison of global human trajectory estimation on EMDB. Overall, ours shows better alignment to ground truth data compared to WHAM, especially in high ambiguity local-to-global motion scenarios.
Refer to caption
Figure 6: Global human motion visualization comparing ours, WHAM, and the ground truth.

6 Conclusion & Limitations

This paper proposes OfCaM, which uses HMR’s absolute depth prediction as a tool to calibrate the unknown scale of SLAM. By utilizing human-background contacts as the calibration reference, OfCaM effectively and efficiently recovers the camera motion. With the accurately isolated camera motion, OfCaM enhances the decoupling of global human motion from video observations. Additionally, we leverage local-to-global priors to rectify instances where SLAM outputs may fail.

Currently, our work has two limitations. First is the body-pose accuracy (see PA-MPJPE error in Table 3). However, our framework is compatible with any HMR model so more advanced methods can be integrated. This is beyond our current scope of accurate recovery of human meshes in world coordinates rather than optimizing local pose metrics. Secondly, like previous work [28], our evaluation of global human and camera motion is limited to the EMDB dataset, as it is the only dataset specifically designed for the global human and camera motion task. Others either lack annotations for world frames (3DPW [34]) or have incomplete data and or code release (HCM [14]).

References

  • Bescos et al. [2018] Berta Bescos, José M Fácil, Javier Civera, and José Neira. Dynaslam: Tracking, map**, and inpainting in dynamic scenes. IEEE Robotics and Automation Letters, 3(4):4076–4083, 2018.
  • Cadena et al. [2016] Cesar Cadena, Luca Carlone, Henry Carrillo, Yasir Latif, Davide Scaramuzza, José Neira, Ian Reid, and John J Leonard. Past, present, and future of simultaneous localization and map**: Toward the robust-perception age. IEEE Transactions on robotics, 32(6):1309–1332, 2016.
  • Cho et al. [2023] Hanbyel Cho, Yooshin Cho, Jaesung Ahn, and Junmo Kim. Implicit 3d human mesh recovery using consistency with pose and shape from unseen-view. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 21148–21158, 2023.
  • Choi et al. [2021] Hongsuk Choi, Gyeongsik Moon, Ju Yong Chang, and Kyoung Mu Lee. Beyond static features for temporally consistent 3d human pose and shape from a video. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • Durrant-Whyte and Bailey [2006] Hugh Durrant-Whyte and Tim Bailey. Simultaneous localization and map**: part i. IEEE Robotics and Automation Letters, 13(2):99–110, 2006.
  • Han et al. [2022] D.-I. D. Han, Y. Bergs, and N. Moorhouse. Virtual reality consumer experience escapes: preparing for the metaverse. Virtual Reality, 26(4):1443–1458, 2022.
  • Hartley and Zisserman [2003] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
  • Heikkilä and Silvén [1997] Janne Heikkilä and Olli Silvén. A four-step camera calibration procedure with implicit image correction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1106–1112. IEEE, 1997.
  • Henning et al. [2022] Dorian F Henning, Tristan Laidlow, and Stefan Leutenegger. Bodyslam: Joint camera localisation, map**, and human motion tracking. In European Conference on Computer Vision (ECCV), pages 656–673. Springer, 2022.
  • Kanazawa et al. [2018] Angjoo Kanazawa, Michael J. Black, David W. Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • Kaufmann et al. [2023] Manuel Kaufmann, Jie Song, Chen Guo, Kaiyue Shen, Tianjian Jiang, Chengcheng Tang, Juan José Zárate, and Otmar Hilliges. EMDB: The Electromagnetic Database of Global 3D Human Pose and Shape in the Wild. In IEEE International Conference on Computer Vision (ICCV), 2023.
  • Kocabas et al. [2020] Muhammed Kocabas, Nikos Athanasiou, and Michael J. Black. Vibe: Video inference for human body pose and shape estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • Kocabas et al. [2021] Muhammed Kocabas, Chun-Hao P Huang, Otmar Hilliges, and Michael J Black. Pare: Part attention regressor for 3d human body estimation. In IEEE International Conference on Computer Vision (ICCV), pages 11127–11137, 2021.
  • Kocabas et al. [2023] Muhammed Kocabas, Ye Yuan, Pavlo Molchanov, Yunrong Guo, Michael J Black, Otmar Hilliges, Jan Kautz, and Umar Iqbal. Pace: Human and camera motion estimation from in-the-wild videos. International Conference on 3D Vision (3DV), 2023.
  • Kolotouros et al. [2019] Nikos Kolotouros, Georgios Pavlakos, Michael J Black, and Kostas Daniilidis. Learning to reconstruct 3d human pose and shape via model-fitting in the loop. In IEEE International Conference on Computer Vision (ICCV), 2019.
  • Li et al. [2022a] Jiefeng Li, Siyuan Bian, Chao Xu, Gang Liu, Gang Yu, and Cewu Lu. D &d: Learning human dynamics from dynamic camera. In European Conference on Computer Vision (ECCV), pages 479–496. Springer, 2022a.
  • Li et al. [2022b] Zhihao Li, Jianzhuang Liu, Zhensong Zhang, Songcen Xu, and Youliang Yan. Cliff: Carrying location information in full frames into human pose and shape estimation. In European Conference on Computer Vision (ECCV), pages 590–606. Springer, 2022b.
  • Lin et al. [2021a] Kevin Lin, Lijuan Wang, and Zicheng Liu. Mesh graphormer. In IEEE International Conference on Computer Vision (ICCV), pages 12939–12948, 2021a.
  • Lin et al. [2021b] Kevin Lin, Lijuan Wang, and Zicheng Liu. End-to-end human pose and mesh reconstruction with transformers. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1954–1963, 2021b.
  • Liu et al. [2022] Zhenguang Liu, Runyang Feng, Haoming Chen, Shuang Wu, Yixing Gao, Yunjun Gao, and Xiang Wang. Temporal feature alignment and mutual information maximization for video-based human pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 11006–11016, 2022.
  • Loper et al. [2015] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (TOG), 34(6):248:1–248:16, 2015.
  • Moon and Lee [2020] Gyeongsik Moon and Kyoung Mu Lee. I2l-meshnet: Image-to-lixel prediction network for accurate 3d human pose and mesh estimation from a single rgb image. In European Conference on Computer Vision (ECCV), pages 752–768. Springer, 2020.
  • Mur-Artal et al. [2015] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. Orb-slam: a versatile and accurate monocular slam system. IEEE transactions on robotics, 31(5):1147–1163, 2015.
  • Rauschnabel et al. [2022] P. A. Rauschnabel, R. Felix, C. Hinsch, H. Shahab, and F. Alt. What is xr? towards a framework for augmented and virtual reality. Computers in Human Behavior, 133:107289, 2022.
  • Rong et al. [2019] Yu Rong, Ziwei Liu, Cheng Li, Kaidi Cao, and Chen Change Loy. Delving deep into hybrid annotations for 3d human recovery in the wild. In IEEE International Conference on Computer Vision (ICCV), pages 5340–5348, 2019.
  • Saito et al. [2020] Shunsuke Saito, Tomas Simon, Jason Saragih, and Hanbyul Joo. Pifuhd: Multi-level pixel-aligned implicit function for high-resolution 3d human digitization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 84–93, 2020.
  • Sengupta et al. [2020] Akash Sengupta, Ignas Budvytis, and Roberto Cipolla. Synthetic training for accurate 3d human pose and shape estimation in the wild. arXiv preprint arXiv:2009.10013, 2020.
  • Shin et al. [2023] Soyong Shin, Juyong Kim, Eni Halilaj, and Michael J Black. Wham: Reconstructing world-grounded humans with accurate 3d motion. arXiv preprint arXiv:2312.07531, 2023.
  • Sun et al. [2023] Yu Sun, Qian Bao, Wu Liu, Tao Mei, and Michael J Black. Trace: 5d temporal regression of avatars with dynamic cameras in 3d environments. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8856–8866, 2023.
  • Teed and Deng [2021] Zachary Teed and Jia Deng. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras. Advances in Neural Information Processing Systems (NeurIPS), 34:16558–16569, 2021.
  • Teed et al. [2023] Zachary Teed, Lahav Lipson, and Jia Deng. Deep patch visual odometry. Advances in Neural Information Processing Systems (NeurIPS), 2023.
  • Umeyama [1991] Shinji Umeyama. Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 13(04):376–380, 1991.
  • Varol et al. [2018] G. Varol, D. Ceylan, B. Russell, J. Yang, E. Yumer, I. Laptev, and C. Schmid. Bodynet: Volumetric inference of 3d human body shapes. In European Conference on Computer Vision (ECCV), pages 20–38. Springer International Publishing, 2018.
  • Von Marcard et al. [2018] Timo Von Marcard, Roberto Henschel, Michael J Black, Bodo Rosenhahn, and Gerard Pons-Moll. Recovering accurate 3d human pose in the wild using imus and a moving camera. In European Conference on Computer Vision (ECCV), pages 601–617, 2018.
  • Xu et al. [2022] Yufei Xu, **g Zhang, Qiming Zhang, and Dacheng Tao. Vitpose: Simple vision transformer baselines for human pose estimation. Advances in Neural Information Processing Systems (NeurIPS), 35:38571–38584, 2022.
  • Yao et al. [2019] P. Yao, Z. Fang, F. Wu, Y. Feng, and J. Li. Densebody: Directly regressing dense 3d human pose and shape from a single color image. arXiv preprint arXiv:1903.10153, 2019.
  • Ye et al. [2023] Vickie Ye, Georgios Pavlakos, Jitendra Malik, and Angjoo Kanazawa. Decoupling human and camera motion from videos in the wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 21222–21232, 2023.
  • Yuan et al. [2022] Ye Yuan, Umar Iqbal, Pavlo Molchanov, Kris Kitani, and Jan Kautz. Glamr: Global occlusion-aware human mesh recovery with dynamic cameras. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 11038–11049, 2022.
  • Zhang [2000] Zhengyou Zhang. A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 22(11):1330–1334, 2000.
  • Zhu et al. [2020] Rui Zhu, Xingyi Yang, Yannick Hold-Geoffroy, Federico Perazzi, Jonathan Eisenmann, Kalyan Sunkavalli, and Manmohan Chandraker. Single view metrology in the wild. In European Conference on Computer Vision (ECCV), pages 316–333. Springer, 2020.

Appendix A Appendix / supplemental material

Table 4: Evaluation of local-to-global adjustment on challenging sequences where the human subject occupies a large portion of the image. Sequences were selected based on an average human occupancy exceeding 40% on EMDB2 dataset.
Challenging Sequences World Human Motion World Camera Motion
Scale L2G WA-MPJPE\downarrow W-MPJPE\downarrow RTE\downarrow ATE\downarrow ATE-S\downarrow
160.54 520.89 4.59 1.11 2.12
142.40 376.17 3.36 1.02 1.64
Refer to caption
Figure 7: Visulizaiton of camera trajectory before and after our scale calibration. As shown, the original SLAM output is up to an unknown scale. After our scale calibration, it becomes better aligned to ground truth data.