License: arXiv.org perpetual non-exclusive license
arXiv:2312.06386v1 [cs.CV] 11 Dec 2023

ManiPose: Manifold-Constrained Multi-Hypothesis 3D Human Pose Estimation

Cédric Rommel^{1} (corresponding author)
Victor Letzelter^{1,2}
Nermin Samet^{1}
Renaud Marlet^{1,5}
Matthieu Cord^{1,3}
Patrick Pérez^{1}
Eduardo Valle^{1,4}

^{1} Valeo.ai, Paris, France
^{2} Telecom Paris, Palaiseau, France
^{3} Sorbonne Université, Paris, France
^{4} Recod.ai Lab, School of Electrical and Computing Engineering, University of Campinas, Brazil
^{5} LIGM, Ecole des Ponts, Univ Gustave Eiffel, CNRS, Marne-la-Vallee, France

Abstract

Monocular 3D human pose estimation (3D-HPE) is an inherently ambiguous task, as a 2D pose in an image might originate from different possible 3D poses. Yet, most 3D-HPE methods rely on regression models, which assume a one-to-one mapping between inputs and outputs. In this work, we provide theoretical and empirical evidence that, because of this ambiguity, common regression models are bound to predict topologically inconsistent poses, and that traditional evaluation metrics, such as the MPJPE, P-MPJPE and PCK, are insufficient to assess this aspect. As a solution, we propose ManiPose, a novel manifold-constrained multi-hypothesis model capable of proposing multiple candidate 3D poses for each 2D input, together with their corresponding plausibility. Unlike previous multi-hypothesis approaches, our solution is completely supervised and does not rely on complex generative models, thus greatly facilitating its training and usage. Furthermore, by constraining our model to lie within the human pose manifold, we can guarantee the consistency of all hypothetical poses predicted with our approach, which was not possible in previous works. We illustrate the usefulness of ManiPose in a synthetic 1D-to-2D lifting setting and demonstrate on real-world datasets that it outperforms state-of-the-art models in pose consistency by a large margin, while still reaching competitive MPJPE performance.

1 Introduction

Figure 1: When faced with depth ambiguity, state-of-the-art regression methods [40] fail to predict poses with consistent segments’ length. In contrast, ManiPose predicts multiple consistent hypotheses and is thus capable of dealing with such ambiguities.

Monocular 3D human pose estimation (3D-HPE) is a significant and challenging learning problem that aims at predicting human poses in the 3D space given a single image or video. Most modern approaches to 3D-HPE split the problem into two steps: 2D-HPE is first used to predict keypoint positions in the 2D pixel space, followed by a 2D-to-3D lifting step, predominantly cast as a statistical regression problem. Because of depth ambiguity and occlusions, it is a well-known fact that this is an intrinsically ill-posed problem, given that multiple 3D poses can correspond to the same 2D pose observed in an image. Despite these difficulties, recent developments have spurred rapid transformations in the field, leading to substantial performance improvements in terms of mean per joint prediction error (MPJPE) and other derived metrics (e.g., P-MPJPE, PCK) [40, 41, 32, 35].

However, recent studies [37, 8, 30] have noted that poses predicted by state-of-the-art models fail to respect the basic invariances of human morphology, such as bilateral symmetry of left and right-side body parts, or the constant length of rigid segments connecting the joints of a given subject across an image sequence. In this work, we provide theoretical elements clarifying the cause of these issues. Namely, we formally prove that pose consistency and traditional performance metrics, such as MPJPE, are somewhat antagonistic and cannot be optimized simultaneously using a standard regression model. This is because (i) MPJPE ignores the topology of the space of human poses, and (ii) regression models rely on a one-to-one mapping between inputs and outputs, thus overlooking the inherently ambiguous nature of the 3D-HPE problem.

To reconcile MPJPE performance and pose consistency, and better accommodate the 3D-HPE task ambiguity, we propose ManiPose, a novel manifold-constrained multi-hypothesis human pose lifting method. Unlike previously proposed multi-hypothesis (MH) approaches to 3D-HPE, our solution relies on a completely supervised model instead of a generative one, thus alleviating training complexity. Furthermore, this structural novelty allows us to enforce pose consistency as a hard constraint in the network architecture, guaranteeing that all predicted poses lie on a pose manifold, which we estimate. As a third difference with previous approaches, ManiPose predicts not only a set of hypothetical poses, but also their corresponding likelihood conditional on the 2D input. Together, these two outputs allow ManiPose to estimate deterministically the conditional distribution of 3D poses given a 2D input in a single forward pass. All these ingredients allow ManiPose to reduce pose inconsistency by more than one order of magnitude compared with state-of-the-art methods, while maintaining competitive performance in terms of MPJPE.

Our contributions can be summarized as follows:

  • We prove that the standard regression setting inevitably leads to predicted poses outside the human pose manifold.

  • As a corollary, we also prove that regression models constrained to lie on the human pose manifold cannot beat unconstrained models in terms of the traditional performance metrics, thus calling to complement benchmarks with pose consistency measures.

  • We motivate our approach and illustrate our theoretical findings on simulated lifting experiments, where we demonstrate that only multi-hypothesis approaches can reconcile MPJPE performance and pose consistency.

  • We hence propose a novel fully-supervised multi-hypothesis method for 3D-HPE called ManiPose.

  • Finally, we demonstrate on two challenging datasets, Human 3.6M and MPI-INF-3DHP, that ManiPose outperforms state-of-the-art methods by a substantial margin in terms of pose consistency, while reaching comparable MPJPE performance.

2 Related work

Regression-based 2D-to-3D pose lifting. While 2D-to-3D human pose lifting was initially tackled at the frame level [24, 3], the field quickly adopted recurrent [9], convolutional [29] and graph neural networks [2, 43, 10, 39] to move towards video-level predictions. More recently, spatial-temporal transformers were proposed [32, 41], including MixSTE [40], which, arguably, is the state of the art. Our work also processes videos with a transformer architecture. A few previous works have proposed to constrain predicted poses to respect human symmetries [38, 4]. We follow the same idea, but in a multi-hypothesis setting and with a different constraint implementation.

Multi-hypothesis 3D-HPE. The intrinsic ambiguity of the 3D-HPE task led the community to investigate multi-hypothesis approaches (MH). Efforts included the use of Mixture Density Networks [19, 28, 1], variational autoencoders [33], normalizing flows [13, 37] and diffusion models [8]. Contrary to our approach, these methods rely on training a conditional generative model to sample 3D pose hypotheses given the 2D input. A notable exception is MHFormer [21], which, like us, proposes a deterministic MH approach. However, it treats hypotheses as intermediate representations and aggregates them into a single pose at the final layers of the network, thus falling back to a one-to-one mapping. We try to avoid this in this work for reasons detailed in the next sections. Moreover, none of the previous MH approaches constrain hypotheses to lie on the human pose manifold, thus not guaranteeing good pose consistency.

Multiple Choice Learning. Suited for ambiguous tasks, MCL [7] consists in optimizing a set of predictors using the oracle loss. Adapted for deep learning by Lee et al. [15, 16], it results in diverse predictors being specialized in subsets of the data distribution. It has proved its effectiveness in several computer vision tasks [31, 14, 26, 6, 23, 34], and was first applied to 2D-HPE in [31]. Our work is the first to revisit MCL for the 3D-HPE task, building upon recent novelties from Letzelter et al. [17].
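
As a rough illustration of the oracle loss mentioned above, the following Python sketch penalizes only the hypothesis closest to the target, which is what pushes different predictors to specialize. The squared-error base loss, the function names and the toy values are assumptions made for illustration, not the formulation used in [7, 15, 17].

```python
def squared_error(pred, target):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def oracle_loss(hypotheses, target):
    """Return (loss, winner): only the best hypothesis contributes to the loss."""
    losses = [squared_error(h, target) for h in hypotheses]
    winner = min(range(len(losses)), key=losses.__getitem__)
    return losses[winner], winner


# Two hypothetical predictions for the same ambiguous input (toy values).
hypotheses = [(0.6, 0.8), (0.6, -0.8)]
loss, winner = oracle_loss(hypotheses, target=(0.6, -0.7))
```

In an actual training loop, only the winning predictor would receive a gradient for this sample.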

3 Need for constrained multiple hypotheses

This section aims to illustrate and motivate our contributions by highlighting the shortcomings of traditional lifting-based 3D-HPE methods, without unnecessary complexities.

To this end, we propose to tackle a 1D-to-2D lifting task. As in human pose lifting, we take a root joint $J_0$ as reference, which is equivalent to fixing its position at $(0,0)$. Hence, for a joint $J_1$, the problem amounts to predicting the Cartesian coordinates $(x,y)$ defining its position, given its 1D projection $u=x$. To strip the problem to its essence, we consider only those two joints, and omit camera perspective. As in the case of human poses, we suppose that joints $J_0$ and $J_1$ are connected by a rigid segment of length $s=1$, assumed to be fixed for the whole dataset (as if there were a single subject in a 3D-HPE problem). As a consequence, we know that $J_1$ lies on a manifold: the circle of radius $s$ centered on $J_0=(0,0)$ (cf. Fig. 2 left).

Figure 2: Left: 1D-to-2D articulated pose lifting problem. Right: True mean-squared error minimizers under a multimodal distribution. A one-to-one mapping cannot both reach optimal mean error performance and stay on the pose manifold. Only a method predicting multiple poses at once can deliver an acceptable solution to the problem (green stars).
Figure 3: A) Unconstrained regression models can perform well when the distribution of 2D poses is unimodal, and its principal component is approximately parallel to the projection axis. B) Because of depth ambiguity, they fail to remain within the manifold when the principal component is perpendicular to the projection axis. C) What is more, both constrained and unconstrained regression models fail to deliver reasonable solutions when the distribution is bimodal. Only a multi-hypothesis method can both stay on the manifold and achieve good mean joint position error.

To illustrate our point empirically, we create three datasets of input-output pairs $\{(x_i,(x_i,y_i))\}_{i=1}^{N}$ with different angular distributions, as depicted in Fig. 3 (more details about the experimental setting can be found in the supplementary material, Sec. 10). In an easy scenario (Fig. 3 A), a simple 2-layer MLP, trained in a regression setting to predict $(x,y)$ given $x$, can solve the problem reasonably well. However, this simple solution fails in the two other scenarios (Fig. 3 B-C), leading to predictions that are outside the manifold. This happens because depth ambiguity makes the lifting problem heavily ill-posed in these cases: there are multiple 2D output positions in the training set corresponding to the same projected inputs.

A possible way to mitigate this issue is to inject prior knowledge about the manifold into the model, such as forcing its predictions to lie on the circle. This can be implemented by training the same MLP to predict the angle $\theta$ instead of the Cartesian coordinates $(x,y)$ directly, and then decoding the prediction $\hat{\theta}$ into $(\hat{x},\hat{y})=(\cos\hat{\theta},\sin\hat{\theta})$. Although this solution leads to predictions on the circle (cf. Fig. 3 B-C), it achieves worse MPJPE performance than the previous unconstrained MLP in the multimodal case (cf. Constrained MLP in Tab. 1).
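
A minimal Python sketch of this decoding step, assuming a unit segment length by default. The names `decode_angle` and `distance_to_circle` are illustrative and do not come from the paper.

```python
import math


def decode_angle(theta_hat, s=1.0):
    """Map a predicted angle to Cartesian coordinates on the circle of radius s."""
    return (s * math.cos(theta_hat), s * math.sin(theta_hat))


def distance_to_circle(point, s=1.0):
    """Absolute gap between the point's distance to the origin and the radius s."""
    return abs(math.hypot(*point) - s)


# Whatever angle the network outputs, the decoded point stays on the manifold.
decoded = decode_angle(2.3)
gap = distance_to_circle(decoded)
```

Because the constraint is built into the decoding, it holds for any network output, independently of how well the model is trained.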

Model              MPJPE ↓    Distance to circle ↓
Unconst. MLP       0.748      0.411
Constrained MLP    0.759      0.000
ManiPose           0.733      0.000

Table 1: Mean per joint prediction error (MPJPE) and mean distance to circle for each model on the test set of the bimodal scenario C. Best in bold. MPJPE is computed using Eq. (1) for ManiPose.

This lower performance can be explained by the choice of the evaluation metric, MPJPE, which assumes a Euclidean topology, thus leading to its minimum always falling inside the circle, hence outside the manifold. This is shown on the right side of Fig. 2, where we plot the true minimizer of the mean position error for an ambiguous input. We can see that the prediction made by the unconstrained MLP lies very close to the minimizer. These findings are made theoretically precise and generalized to the real 3D-HPE setting in the next section.
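
The effect can be checked numerically on a toy case. The sketch below assumes a single ambiguous input $x=0.6$ with two equally likely targets $(0.6, \pm 0.8)$ on the unit circle (values chosen for illustration, not taken from the paper's datasets) and uses the mean squared error, whose minimizer is the conditional mean.

```python
import math


def expected_sq_error(q, targets, weights):
    """Expected squared Euclidean error of prediction q under a discrete target distribution."""
    return sum(w * ((q[0] - x) ** 2 + (q[1] - y) ** 2)
               for (x, y), w in zip(targets, weights))


# Toy ambiguous input: the same projection x has two possible lifts.
x = 0.6
y = math.sqrt(1.0 - x ** 2)
targets = [(x, y), (x, -y)]
weights = [0.5, 0.5]

# Unconstrained MSE minimizer: the (weighted) mean of the targets.
mean = (sum(w * t[0] for t, w in zip(targets, weights)),
        sum(w * t[1] for t, w in zip(targets, weights)))
mean_radius = math.hypot(*mean)

# Best prediction restricted to the circle: the radial projection of the mean,
# since E||q - p||^2 = ||q - mean||^2 + constant.
on_circle = (mean[0] / mean_radius, mean[1] / mean_radius)

mse_mean = expected_sq_error(mean, targets, weights)
mse_circle = expected_sq_error(on_circle, targets, weights)
```

Here the unconstrained minimizer has radius 0.6, so it lies strictly inside the circle, and its expected squared error is lower than that of the best on-circle prediction.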

A first consequence of this result is that unconstrained regression models are bound to predict inconsistent poses, where segment lengths are not constant, as seen for state-of-the-art 3D-HPE models [37, 8, 30]. A second consequence is that, no matter how good a model constrained to the manifold is, there is always an unconstrained model leading to better MPJPE performance, as illustrated by the minimizer constrained to the circle in Fig. 2 (right).

These realizations raise the question: Is there a way to have the best of both constrained and unconstrained models, namely a good MPJPE performance, while still predicting consistent poses on the manifold? Given the multimodality of human pose distributions (cf. Fig. 7), the only way out is to adopt a multi-hypothesis framework, leaving behind the regression setting, which cannot accommodate the ambiguity of the lifting task. Ideally, we would want our model to predict both possible poses (marked with green stars in Fig. 2, right), with their corresponding likelihood.

To this end, we tested a simplified version of our new multi-hypothesis approach, named ManiPose. It consists here in replacing the last linear layer of the constrained MLP model with $K=2$ identical linear heads. Each head $k$ predicts a hypothesis for the angle $\hat{\theta}^k$ and a corresponding score $\gamma^k$ modeling the probability that the solution $\theta$ is closer to $\hat{\theta}^k$ than to the other hypotheses $\{\hat{\theta}^i\}_{i\neq k}$. This is described in detail in Sec. 5 in a general 3D-HPE setting.

As shown in Fig. 3, not only does this method predict hypotheses on the circle, it also yields predictions close to the MLP predictions when hypotheses $\hat{\theta}^k$ and scores $\gamma^k$ are aggregated:

$$\hat{x}_{\text{aggr}}=\sum_{k=1}^{K}\hat{x}^k\gamma^k\,,\quad \hat{y}_{\text{aggr}}=\sum_{k=1}^{K}\hat{y}^k\gamma^k\,, \tag{1}$$

where $(\hat{x}^k,\hat{y}^k)=(\cos\hat{\theta}^k,\sin\hat{\theta}^k)$, leading even to superior MPJPE performance here (Tab. 1). Further experiments can be found in the supplementary material.
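
The aggregation of Eq. (1) can be sketched in a few lines of Python. The sketch assumes that the scores are already normalized to sum to one (consistent with their interpretation as probabilities, but an assumption here), and the function name and toy hypotheses are illustrative.

```python
import math


def aggregate(thetas, gammas):
    """Score-weighted average of the decoded hypotheses, as in Eq. (1)."""
    if abs(sum(gammas) - 1.0) > 1e-9:
        raise ValueError("scores are assumed to sum to 1")
    x = sum(g * math.cos(t) for t, g in zip(thetas, gammas))
    y = sum(g * math.sin(t) for t, g in zip(thetas, gammas))
    return x, y


# Two symmetric hypotheses on the unit circle with equal scores (toy values).
theta = math.acos(0.6)
x_aggr, y_aggr = aggregate([theta, -theta], [0.5, 0.5])
```

Each hypothesis lies on the circle, whereas the aggregated point generally does not: it is only used to compute a single-pose metric such as MPJPE.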

4 Theoretical results: Regression and MPJPE are not enough

In this section, we generalize the intuitions from Sec. 3 to the 3D-HPE setting, proving our claims rigorously in the supplementary material (Sec. 9).

We start with a few definitions needed for our derivations:

Definition 4.1 (Human skeleton).

We define a human skeleton as an undirected connected graph $G=(V,E)$ with $J=|V|$ nodes, called joints, associated with different human body articulation points. We assume a predefined order of joints and denote $A=[A_{ij}]_{0\leq i,j<J}\in\{0,1\}^{J\times J}$ the adjacency matrix of $G$, defining joint connections.
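
To make Definition 4.1 concrete, the sketch below builds the adjacency matrix of a toy 5-joint skeleton and checks that the graph is connected. The edge list is hypothetical and much smaller than the skeletons used in real datasets.

```python
from collections import deque


def adjacency_matrix(num_joints, edges):
    """Symmetric 0/1 adjacency matrix of an undirected skeleton graph."""
    A = [[0] * num_joints for _ in range(num_joints)]
    for i, j in edges:
        A[i][j] = A[j][i] = 1
    return A


def is_connected(A):
    """Breadth-first search from the root joint (index 0)."""
    seen = {0}
    queue = deque([0])
    while queue:
        i = queue.popleft()
        for j, linked in enumerate(A[i]):
            if linked and j not in seen:
                seen.add(j)
                queue.append(j)
    return len(seen) == len(A)


# Hypothetical toy skeleton: a root with two 2-segment limbs.
edges = [(0, 1), (1, 2), (0, 3), (3, 4)]
A = adjacency_matrix(5, edges)
connected = is_connected(A)
```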

Definition 4.2 (Human pose and movement).

Let $G$ be a skeleton of $J$ joints. We attach to each joint $i$ a position $\mathrm{p}^G_i$ in $\mathbb{R}^3$ and call the vector $\mathrm{p}^G=[\mathrm{p}^G_0,\dots,\mathrm{p}^G_{J-1}]\in(\mathbb{R}^3)^J$ a human pose. Furthermore, given a series of increasing time steps $t_1<t_2<\dots<t_L\in\mathbb{R}$, we define a human movement $m^G$ as a sequence of poses of the same subject at those instants $m^G=[\mathrm{p}^G_{t_1},\dots,\mathrm{p}^G_{t_L}]\in(\mathbb{R}^3)^{J\times L}$.

Our theoretical results rely on a number of assumptions described hereafter. The first one states what reference frame is usually used for assessing 3D-HPE models:

Assumption 4.3 (Reference root joint).

For any skeleton $G$ and movement $m^G$ of length $L$, the joint of index $0$, called the root joint, is at the origin $\mathrm{p}^G_{t,0}=[0,0,0]$ at all times $t_1\leq t\leq t_L$. This is equivalent to measuring positions $\mathrm{p}^G_t$ in a reference frame attached to the root joint.

The second assumption concerns the rigidity of human body parts, and was verified for ground-truth 3D poses from the Human 3.6M dataset (cf. Fig. 5):

Assumption 4.4 (Rigid segments).

We assume that the Euclidean distance between adjacent joints is constant within a movement $m^G$: for any pair of instants $t$ and $t'$ and for any joints $i,j$ such that $A_{ij}=1$, we assume that

$$s_{t,i,j}=s_{t',i,j}=s_{i,j}\,, \tag{2}$$

where $s_{t,i,j}=\|\mathrm{p}^G_{t,i}-\mathrm{p}^G_{t,j}\|_2>0$.
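
Assumption 4.4 suggests a simple way to measure how far a movement is from being consistent: compute every segment length in every frame and look at its variation over time. The sketch below is one possible such check, with hypothetical function names and toy poses; it is not the consistency metric defined later in the paper.

```python
import math


def segment_lengths(pose, edges):
    """Lengths s_{t,i,j} of all segments (i, j) for a single pose."""
    return [math.dist(pose[i], pose[j]) for i, j in edges]


def max_length_deviation(movement, edges):
    """Largest variation over time of any segment length (0 for rigid segments)."""
    per_frame = [segment_lengths(pose, edges) for pose in movement]
    return max(max(lengths) - min(lengths) for lengths in zip(*per_frame))


edges = [(0, 1), (1, 2)]
rigid = [
    [(0, 0, 0), (1, 0, 0), (1, 1, 0)],
    [(0, 0, 0), (0, 1, 0), (0, 1, 1)],
]
shrunk = [
    [(0, 0, 0), (1, 0, 0), (1, 1, 0)],
    [(0, 0, 0), (0, 0.5, 0), (0, 0.5, 1)],  # first segment is halved
]
dev_rigid = max_length_deviation(rigid, edges)
dev_shrunk = max_length_deviation(shrunk, edges)
```

The first movement satisfies Eq. (2) and has zero deviation, while the second changes the length of one segment between frames.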

Lastly, we assume that the conditional distribution of poses is not reduced to a single point, i.e., we have a one-to-many problem. This was also empirically verified (cf. Fig. 7 in supplementary material).

Assumption 4.5 (Non-degenerate conditional distribution).

Given a joint distribution $\mathrm{P}(\mathrm{x}^G,\mathrm{p}^G)$ of 3D poses $\mathrm{p}^G\in(\mathbb{R}^3)^J$ and corresponding 2D inputs $\mathrm{x}^G\in(\mathbb{R}^2)^J$, we assume that the conditional distribution $\mathrm{P}(\mathrm{p}^G|\mathrm{x}^G)$ is non-degenerate, i.e., it is not a single Dirac distribution.

Note that this can be true even when $\mathrm{P}(\mathrm{x}^G,\mathrm{p}^G)$ is unimodal (e.g., Fig. 3 B).

A first consequence of our assumptions is that true human poses forming a movement lie on a smooth manifold:

Proposition 4.6 (Human pose manifold).

If Asm. 4.3 and 4.4 are verified, then all poses $\mathrm{p}^G_t$ forming a human movement $m^G$ lie on the same manifold $\mathcal{M}$ of dimension $2(J-1)$. The latter is homeomorphic to the direct product of 2D unit spheres $(S^2)^{J-1}$:

$$\forall t\in\{t_1,\dots,t_L\},\quad \mathrm{p}^G_t\in\mathcal{M}\cong(S^2)^{J-1}\,. \tag{3}$$

Note that when $J=2$ and poses have one fewer dimension, we fall back to the unit circle manifold $S^1$, as in Sec. 3.

With this new proposition in mind, we can properly define the notion of pose consistency:

Definition 4.7 (Inconsistent poses and movements).

We say a predicted movement $m^G$ is inconsistent if it violates Assumption 4.4.

We now have everything needed to state the main theoretical result of this work:

Proposition 4.8 (Inconsistency of MSE minimizer).

If the training poses distribution verifies Asm. 4.3-4.5, predicted poses minimizing the traditional mean-squared-error loss are inconsistent.

Prop. 4.8 has two main take-away messages: 1. MPJPE cannot fully assess 3D-HPE models, and must be complemented with pose consistency metrics; 2. When minimizing the mean joint error with an unconstrained model that assumes a one-to-one mapping, like a regression model, one cannot predict consistent poses. In the next section, we propose a solution to these problems.
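
The intuition behind Prop. 4.8 can be sketched with a short variance argument. This is an informal illustration for a single subject with fixed segment lengths, not the proof given in the supplementary material; the notation $d$ for the segment vector is introduced here for convenience.

```latex
% The MSE-optimal predictor is the conditional mean of the poses:
\hat{\mathrm{p}}^{*}(\mathrm{x}) = \mathbb{E}[\mathrm{p} \mid \mathrm{x}] .
% For a segment (i, j) with A_{ij} = 1, let d = p_j - p_i, so that
% \|d\|_2 = s_{i,j} almost surely (Asm. 4.4). Then
\begin{align*}
\big\| \mathbb{E}[d \mid \mathrm{x}] \big\|_2^2
  &= \mathbb{E}\big[ \|d\|_2^2 \mid \mathrm{x} \big]
   - \mathbb{E}\big[ \| d - \mathbb{E}[d \mid \mathrm{x}] \|_2^2 \mid \mathrm{x} \big] \\
  &= s_{i,j}^2 - \operatorname{tr} \operatorname{Cov}(d \mid \mathrm{x})
   \;\leq\; s_{i,j}^2 ,
\end{align*}
% with equality only if d is almost surely constant given x.
```

Since the root joint is fixed (Asm. 4.3) and the skeleton is connected, a non-degenerate conditional distribution (Asm. 4.5) implies that at least one segment vector is not constant given the input. For that segment, the predicted length is strictly shorter than $s_{i,j}$ and depends on the conditional spread for the input at hand, so it generally changes from frame to frame.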

5 Method

Figure 4: Overview of ManiPose architecture. The Rotations module predicts $K$ possible sequences of segments’ rotations together with their corresponding likelihood (scores), while the Segments module estimates the shared segments’ lengths. Predicted poses are hence constrained to an estimated pose manifold defined by the predicted segments’ lengths.

As in previous state-of-the-art 3D-HPE approaches, we adopt a 2-step estimation procedure, where we first estimate human keypoints in the pixel space $[\mathrm{x}^G_1,\dots,\mathrm{x}^G_L]\in(\mathbb{R}^2)^{J\times L}$ from a sequence of $L$ video frames, and then lift them to 3D joint positions $[\hat{\mathrm{p}}^G_1,\dots,\hat{\mathrm{p}}^G_L]\in(\mathbb{R}^3)^{J\times L}$. We focus on the second step (i.e., lifting) in the rest of the paper, assuming the availability of predicted keypoints $\mathrm{x}^G_i$. In the following, we omit the $G$ superscript from pose and keypoints notations for clarity.

5.1 Constraining predictions to the pose manifold

Rationale. While traditional human pose lifting methods directly predict 3D joint positions using a regression model, the human morphology does not allow joints to occupy the whole 3D space (cf. Prop. 4.6). If we knew the length of each segment connecting pairs of joints for a given subject, we could mimic the procedure adopted in Sec. 3, and guarantee that predicted poses lie on the correct pose manifold by only predicting body parts’ rotations with respect to a reference pose. Since we do not have access to ground-truth segment lengths in real use-cases, we propose to predict them, thus estimating the manifold.

Disentangled representations. We constrain model predictions to lie on an estimated manifold by predicting parametrized disentangled transformations of a reference pose $\mathrm{u}$, for which all segments have unit length. Namely, we propose to split the network into two parts (cf. Fig. 4):

  1. Segments module: which predicts segment lengths $s\in\mathbb{R}^{J-1}$, shared by all $L$ frames of the input sequence;

  2. Rotations module: which predicts relative rotation representations $r=[r_{1,0},\dots,r_{L,J-1}]\in(\mathbb{R}^d)^{J\times L}$ of each joint with respect to their parent joint at each time step.

Rotations representation. As proposed in [42], we represent rotations continuously using 6D embeddings (i.e., $d=6$). Unlike quaternions or axis-angles, these representations are continuous and hence easier for neural networks to learn, as demonstrated in [42].
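Algorithm 1 is not reproduced in this section. As an illustration only, the sketch below uses the Gram-Schmidt construction commonly associated with the 6D representation of [42]; the function name and the choice to stack the basis vectors as columns are assumptions, not details confirmed by the text.

```python
import numpy as np

def rotation_6d_to_matrix(r6):
    """Map 6D rotation embeddings (..., 6) to rotation matrices (..., 3, 3).

    Assumed Gram-Schmidt construction; not necessarily identical to Algorithm 1.
    """
    a1, a2 = r6[..., :3], r6[..., 3:]
    # First basis vector: normalize the first three components.
    b1 = a1 / np.linalg.norm(a1, axis=-1, keepdims=True)
    # Second basis vector: remove the component of a2 along b1, then normalize.
    a2_orth = a2 - np.sum(b1 * a2, axis=-1, keepdims=True) * b1
    b2 = a2_orth / np.linalg.norm(a2_orth, axis=-1, keepdims=True)
    # Third basis vector: the cross product yields a right-handed frame.
    b3 = np.cross(b1, b2)
    # Stack the basis vectors as columns (a convention choice).
    return np.stack([b1, b2, b3], axis=-1)
```

Any 6D vector whose two halves are non-zero and not collinear maps to a valid element of $\mathrm{SO}(3)$, so the network output needs no further constraint.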

Pose decoding. In order to deliver pose predictions in $(\mathbb{R}^{3})^{J\times L}$, the intermediate representations $(s,r)$ need to be decoded, just like $\hat{\theta}$ in Sec. 3. This is achieved in three steps:

  1. First, we scale the unit segments of the reference pose $\mathrm{u}\in(\mathbb{R}^{3})^{J}$ using $s$, forming a scaled reference pose $\mathrm{u}'$:

    $$\mathrm{u}'_{j}=s_{\tau(j)}+s_{j}(\mathrm{u}_{j}-\mathrm{u}_{\tau(j)})\,,\quad 0<j\leq J-1\,,\tag{4}$$

    where $\tau$ maps the index of a joint to its parent’s, if any.

  2. Then, for each time step $1\leq t\leq L$ and joint $0\leq j<J$, we convert the predicted rotation representations $r_{t,j}$ into rotation matrices $R_{t,j}\in\mathrm{SO}(3)$ (cf. Algorithm 1).

  3. Finally, we apply these rotation matrices $R_{t,j}$ at each time step $t$ to the scaled reference pose $\mathrm{u}'$ using the forward kinematics Algorithm 2.
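The decoding steps can be sketched as follows. Algorithm 2 is not shown in this section, so the forward kinematics below is a generic version written under stated assumptions: Eq. (4) is read as giving each scaled parent-to-child offset $s_j(\mathrm{u}_j-\mathrm{u}_{\tau(j)})$, joints are ordered so that a parent always precedes its children, the root sits at the origin, and each offset is rotated by the rotation accumulated down to its child joint. The name `decode_pose` and the array layouts are hypothetical.

```python
import numpy as np

def decode_pose(u, s, R, parents):
    """Decode segment lengths and relative rotations into joint positions.

    u:       (J, 3) reference pose whose segments all have unit length
    s:       (J-1,) segment lengths; s[j-1] is the segment ending at joint j
    R:       (L, J, 3, 3) relative rotation of each joint w.r.t. its parent
    parents: length-J sequence, parents[j] = tau(j), parents[0] = -1 (root);
             assumes parents[j] < j for every non-root joint
    returns: (L, J, 3) joint positions with the root at the origin
    """
    L, J = R.shape[:2]
    # Step 1 (assumed reading of Eq. (4)): scaled parent-to-child offsets.
    offsets = np.zeros((J, 3))
    for j in range(1, J):
        offsets[j] = s[j - 1] * (u[j] - u[parents[j]])
    # Step 3: accumulate rotations down the kinematic tree.
    G = np.empty((L, J, 3, 3))  # accumulated (global) rotation per joint
    p = np.zeros((L, J, 3))
    G[:, 0] = R[:, 0]
    for j in range(1, J):
        G[:, j] = G[:, parents[j]] @ R[:, j]
        p[:, j] = p[:, parents[j]] + G[:, j] @ offsets[j]
    return p
```

Whatever rotation convention is used, rotations preserve norms, so every decoded segment has length exactly $s_j$ in every frame. The loop over joints is sequential, since each joint needs its parent's result.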

5.2 Multiple Choice Learning

ManiPose architecture. As explained in Sec. 3, given the inherent depth ambiguity of the pose lifting task, the only way to reconcile pose consistency and MPJPE performance is to predict multiple hypotheses. We therefore leverage the Multiple Choice Learning (MCL) framework [7], drawing upon the resilient multiple choice learning method proposed by Letzelter et al. [17]. This variant of MCL allows us to estimate conditional distributions in regression tasks, enabling our model to predict a variety of plausible 3D poses for each given input. Namely, instead of estimating a single rotation $r_{t}\in(\mathbb{R}^{d})^{J}$ per time step, the rotations module in ManiPose predicts an intermediate representation $e_{t}\in(\mathbb{R}^{d'})^{J}$, which is fed to $K$ linear heads with weights $W^{k}_{r}$ and $W^{k}_{\gamma}$ used as follows.
Each head $k$ predicts a different rotation hypothesis $r^{k}_{t}\in(\mathbb{R}^{d})^{J}$ together with its corresponding likelihood $\gamma^{k}_{t}\in[0,1]$, by computing

$$r^{k}_{t}=W^{k}_{r}e_{t}\,,\quad\tilde{\gamma}^{k}_{t}=W^{k}_{\gamma}e_{t}\,,\quad 1\leq t\leq L\,,\tag{5}$$

followed by

$$\gamma^{k}_{t}=\sigma[\tilde{\gamma}_{t}]_{k}\,,\quad\text{with }\tilde{\gamma}_{t}=[\tilde{\gamma}_{t}^{1},\dots,\tilde{\gamma}_{t}^{K}]\in\mathbb{R}^{K},\tag{6}$$

where $\sigma$ is the softmax function.

The rotations predicted by each head are then all decoded together with the common predictions of the segments’ length $s$, just as before (cf. Fig. 4). This produces $K$ different hypothetical pose sequences $\hat{\mathrm{p}}^{k}=(\hat{\mathrm{p}}^{k}_{t})_{t=1}^{L}$, together with their corresponding likelihood sequences $\gamma^{k}=(\gamma^{k}_{t})_{t=1}^{L}$, called scores hereafter.
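A minimal NumPy sketch of the heads in Eqs. (5) and (6) follows. The text does not say how $W^{k}_{\gamma}$ reduces the per-joint representation $e_t$ to one scalar per head, so flattening $e_t$ over joints before the score head is an assumption made here, as are the function names and tensor layouts.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mcl_heads(e, W_r, W_gamma):
    """Apply K linear heads to the intermediate representation.

    e:       (L, J, d') intermediate representation e_t for each frame
    W_r:     (K, d, d') rotation head weights, applied to each joint
    W_gamma: (K, J * d') score head weights (assumed: e_t flattened over joints)
    returns: r (K, L, J, d) rotation hypotheses, gamma (L, K) scores
    """
    L = e.shape[0]
    r = np.einsum('kde,lje->kljd', W_r, e)      # Eq. (5), rotation part
    gamma_tilde = e.reshape(L, -1) @ W_gamma.T  # Eq. (5), raw scores
    gamma = softmax(gamma_tilde, axis=-1)       # Eq. (6), softmax over heads
    return r, gamma
```

Because the softmax runs over the $K$ heads, the scores of each frame are non-negative and sum to one, which is what allows them to be read as likelihoods.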

Loss function. As in [17], the ManiPose model is trained with a composite loss:

$$\mathcal{L}=\mathcal{L}_{\text{wta}}+\beta\mathcal{L}_{\text{score}}\,.\tag{7}$$

The first term, $\mathcal{L}_{\text{wta}}$, is the Winner-takes-all loss [16]

$$\mathcal{L}_{\text{wta}}(\hat{\mathrm{p}}(\mathrm{x}),\mathrm{p})=\frac{1}{L}\sum_{t=1}^{L}\min_{k\in\llbracket 1,K\rrbracket}\ell(\hat{\mathrm{p}}_{t}^{k}(\mathrm{x}),\mathrm{p}_{t})\tag{8}$$

where $\ell(\hat{\mathrm{p}}^{k}_{t}(\mathrm{x}),\mathrm{p}_{t})\triangleq\frac{1}{J}\sum_{j=0}^{J-1}\|\mathrm{p}_{t,j}-\hat{\mathrm{p}}^{k}_{t,j}(\mathrm{x})\|_{2}$, and $\hat{\mathrm{p}}_{t}^{k}(\mathrm{x})$ denotes the pose prediction at time $t$ using the $k^{\text{th}}$ head.

The second term, $\mathcal{L}_{\text{score}}$, is the scoring loss

$$\mathcal{L}_{\text{score}}(\hat{\mathrm{p}}(\mathrm{x}),\gamma(\mathrm{x}),\mathrm{p})=\frac{1}{L}\sum_{t=1}^{L}\mathcal{H}\big(\delta(\hat{\mathrm{p}}_{t},\mathrm{p}_{t}),\gamma_{t}(\mathrm{x})\big)\,,\tag{9}$$

where $\mathcal{H}(\cdot,\cdot)$ is the cross-entropy, $\hat{\mathrm{p}}_{t}=(\hat{\mathrm{p}}^{k}_{t})_{k=1}^{K}$, and

$$[\delta(\hat{\mathrm{p}}_{t},\mathrm{p}_{t})]_{k}\triangleq\mathbf{1}\Big[k\in\operatorname*{arg\,min}_{k'\in\llbracket 1,K\rrbracket}\;\ell\left(\hat{\mathrm{p}}^{k'}_{t},\mathrm{p}_{t}\right)\Big]\,,\tag{10}$$

is the indicator function of the winner pose hypothesis, which is the closest to the ground truth. Eq. (9) corresponds to the average cross-entropy between the target and predicted scores $\gamma_{t}(\mathrm{x})\in[0,1]^{K}$ at each time $t$.
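The composite loss of Eqs. (7) to (10) can be sketched in NumPy as below. This is an illustration under assumptions, not the training code: the cross-entropy is taken to be the categorical form $-\sum_k \delta_k \log \gamma_k$, ties in the arg min are broken by taking the first winner, and the small `eps` guarding the logarithm is an addition of this sketch.

```python
import numpy as np

def pose_error(p_hat, p):
    """ell in Eq. (8): mean joint distance, per head and per frame.

    p_hat: (K, L, J, 3) hypotheses; p: (L, J, 3) ground truth -> (K, L)
    """
    return np.linalg.norm(p_hat - p[None], axis=-1).mean(axis=-1)

def wta_loss(p_hat, p):
    """Eq. (8): error of the best hypothesis, averaged over frames."""
    return pose_error(p_hat, p).min(axis=0).mean()

def score_loss(p_hat, gamma, p, eps=1e-12):
    """Eq. (9): cross-entropy between the one-hot winner (Eq. (10)) and scores.

    gamma: (L, K) softmax scores. With a one-hot target, the cross-entropy
    reduces to minus the log-score of the winning head.
    """
    winners = pose_error(p_hat, p).argmin(axis=0)  # (L,), first winner on ties
    frames = np.arange(gamma.shape[0])
    return -np.log(gamma[frames, winners] + eps).mean()

def total_loss(p_hat, gamma, p, beta):
    """Eq. (7): composite loss."""
    return wta_loss(p_hat, p) + beta * score_loss(p_hat, gamma, p)
```

In an automatic-differentiation framework, the `min` in `wta_loss` passes gradients only to the winning head of each frame, which is what specializes the heads.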

These two losses complement each other. The Winner-takes-all loss only updates the best predicted hypothesis, specializing each head on a region of the data distribution [16]. Concurrently, the scoring loss allows the model to learn how likely each head is to win for a given input, thereby avoiding overconfidence [14, 34] of non-winner heads at inference time. Furthermore, as detailed in [17], the joint prediction of modes $\hat{\mathrm{p}}^{k}$ and likelihoods $\gamma^{k}(\mathrm{x})$ allows the derivation of the predicted conditional expectation of 3D poses given 2D inputs:

$$\mathbb{E}[\mathrm{p}|\mathrm{x}]\simeq\sum_{k=1}^{K}\gamma^{k}(\mathrm{x})\hat{\mathrm{p}}^{k}(\mathrm{x})\,.\tag{11}$$

6 Experiments

| Method | $L$ | $K$ | Orac. | MPJPE ↓ | MPSSE ↓ | MPSCE ↓ |
|---|---|---|---|---|---|---|
| Single-hypothesis methods: | | | | | | |
| ST-GCN [2] | 7 | 1 | N/A | 48.8 | 8.9 | 10.8 |
| VideoPose3D [29] | 243 | 1 | N/A | 46.8 | 6.5 | 7.8 |
| PoseFormer [41] | 81 | 1 | N/A | 44.3 | 4.3 | 7.2 |
| Anatomy3D [4] | 243 | 1 | N/A | 44.1 | 1.4 | 2.0 |
| MixSTE [40] | 243 | 1 | N/A | 40.9 | 8.8 | 9.9 |
| Multi-hypothesis methods: | | | | | | |
| Sharma et al. [33] | 1 | 10 | | 46.8 | 13.0 | 9.9 |
| Wehrbein et al. [37] | 1 | 200 | | 44.3 | 12.2 | 14.8 |
| DiffPose [8]* | 1 | 200 | | 43.3 | 14.9 | - |
| MHFormer [21] | 351 | 3 | N/A | 43.0 | 5.7 | 8.0 |
| ManiPose (Ours) | 243 | 5 | | 41.9 | 0.3 | 0.7 |

Table 2: Pose consistency evaluation of state-of-the-art methods on H3.6M. MPJPE performance and pose consistency are not correlated. $L$: sequence length. $K$: number of hypotheses. Orac.: Metric computed using oracle hypothesis. N/A: non-applicable. Bold: best; Underlined: second best. *: MPSSE values reported in [8]. Missing entries: methods with unavailable code.

| Method | $L$ | $K$ | Orac. | Dir. | Disc | Eat | Greet | Phone | Photo | Pose | Purch. | Sit | SitD. | Smoke | Wait | WalkD. | Walk | WalkT. | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Single-hypothesis methods: | | | | | | | | | | | | | | | | | | | |
| GraphSH [39] | 1 | 1 | N/A | 45.2 | 49.9 | 47.5 | 50.9 | 54.9 | 66.1 | 48.5 | 46.3 | 59.7 | 71.5 | 51.4 | 48.6 | 53.9 | 39.9 | 44.1 | 51.9 |
| MGCN [43] | 1 | 1 | N/A | 45.4 | 49.2 | 45.7 | 49.4 | 50.4 | 58.2 | 47.9 | 46.0 | 57.5 | 63.0 | 49.7 | 46.6 | 52.2 | 38.9 | 40.8 | 49.4 |
| ST-GCN [2] | 7 | 1 | N/A | 44.6 | 47.4 | 45.6 | 48.8 | 50.8 | 59.0 | 47.2 | 43.9 | 57.9 | 61.9 | 49.7 | 46.6 | 51.3 | 37.1 | 39.4 | 48.8 |
| VideoPose3D [29] | 243 | 1 | N/A | 45.2 | 46.7 | 43.3 | 45.6 | 48.1 | 55.1 | 44.6 | 44.3 | 57.3 | 65.8 | 47.1 | 44.0 | 49.0 | 32.8 | 33.9 | 46.8 |
| UGCN [35] | 96 | 1 | N/A | 41.3 | 43.9 | 44.0 | 42.2 | 48.0 | 57.1 | 42.2 | 43.2 | 57.3 | 61.3 | 47.0 | 43.5 | 47.0 | 32.6 | 31.8 | 45.6 |
| Liu et al. [22] | 243 | 1 | N/A | 41.8 | 44.8 | 41.1 | 44.9 | 47.4 | 54.1 | 43.4 | 42.2 | 56.2 | 63.6 | 45.3 | 43.5 | 45.3 | 31.3 | 32.2 | 45.1 |
| PoseFormer [41] | 81 | 1 | N/A | 41.5 | 44.8 | 39.8 | 42.5 | 46.5 | 51.6 | 42.1 | 42.0 | 53.3 | 60.7 | 45.5 | 43.3 | 46.1 | 31.8 | 32.2 | 44.3 |
| Anatomy3D [4] | 243 | 1 | N/A | 41.4 | 43.2 | 40.1 | 42.9 | 46.6 | 51.9 | 41.7 | 42.3 | 53.9 | 60.2 | 45.4 | 41.7 | 46.0 | 31.5 | 32.7 | 44.1 |
| MixSTE [40] | 243 | 1 | N/A | 37.6 | 40.9 | 37.3 | 39.7 | 42.3 | 49.9 | 40.1 | 39.8 | 51.7 | 55.0 | 42.1 | 39.8 | 41.0 | 27.9 | 27.9 | 40.9 |
| Multi-hypothesis methods: | | | | | | | | | | | | | | | | | | | |
| Li et al. [19] | 1 | 10 | | 62.0 | 69.7 | 64.3 | 73.6 | 75.1 | 84.8 | 68.7 | 75.0 | 81.2 | 104.3 | 70.2 | 72.0 | 75.0 | 67.0 | 69.0 | 73.9 |
| Li et al. [18] | 1 | 5 | | 43.8 | 48.6 | 49.1 | 49.8 | 57.6 | 61.5 | 45.9 | 48.3 | 62.0 | 73.4 | 54.8 | 50.6 | 56.0 | 43.4 | 45.5 | 52.7 |
| Oikarinen et al. [28] | 1 | 200 | | 40.0 | 43.2 | 41.0 | 43.4 | 50.0 | 53.6 | 40.1 | 41.4 | 52.6 | 67.3 | 48.1 | 44.2 | 44.9 | 39.5 | 40.2 | 46.2 |
| Sharma et al. [33] | 1 | 10 | | 37.8 | 43.2 | 43.0 | 44.3 | 51.1 | 57.0 | 39.7 | 43.0 | 56.3 | 64.0 | 48.1 | 45.4 | 50.4 | 37.9 | 39.9 | 46.8 |
| Wehrbein et al. [37] | 1 | 200 | | 38.5 | 42.5 | 39.9 | 41.7 | 46.5 | 51.6 | 39.9 | 40.8 | 49.5 | 56.8 | 45.3 | 46.4 | 46.8 | 37.8 | 40.4 | 44.3 |
| DiffPose [8] | 1 | 200 | | 38.1 | 43.1 | 35.3 | 43.1 | 46.6 | 48.2 | 39.0 | 37.6 | 51.9 | 59.3 | 41.7 | 47.6 | 45.4 | 37.4 | 36.0 | 43.3 |
| MHFormer [21] | 351 | 3 | N/A | 39.2 | 43.1 | 40.1 | 40.9 | 44.9 | 51.2 | 40.6 | 41.3 | 53.5 | 60.3 | 43.7 | 41.1 | 43.8 | 29.8 | 30.6 | 43.0 |
| ManiPose (Ours) | 243 | 5 | | 42.6 | 47.3 | 39.3 | 42.5 | 45.4 | 53.1 | 44.3 | 41.1 | 53.6 | 58.9 | 45.4 | 43.1 | 46.3 | 31.5 | 33.3 | 44.5 |
| ManiPose (Ours) | 243 | 5 | | 40.2 | 44.0 | 37.0 | 39.9 | 42.6 | 49.6 | 41.2 | 38.8 | 50.0 | 55.4 | 43.0 | 40.4 | 43.8 | 30.4 | 32.0 | 41.9 |

Table 3: Quantitative comparison with the state-of-the-art methods on Human3.6M under Protocol #1 (MPJPE in mm), using detected 2D poses. $L$: sequence length. $K$: number of hypotheses. Orac.: Metric computed using oracle hypothesis. N/A: non-applicable. Bold: best; Underlined: second best.

| Method | PCK ↑ | AUC ↑ | MPJPE ↓ | MPSSE ↓ | MPSCE ↓ |
|---|---|---|---|---|---|
| VideoPose3D [29] | 85.5 | 51.5 | 84.8 | 10.4 | 27.5 |
| PoseFormer [41] | 86.6 | 56.4 | 77.1 | 10.8 | 14.2 |
| MixSTE [40] | 94.4 | 66.5 | 54.9 | 17.3 | 21.6 |
| P-STMO [32] | 97.9 | 75.8 | 32.2 | 8.5 | 11.3 |
| ManiPose (Ours), Aggr. | 98.0 | 75.3 | 37.7 | 0.6 | 1.3 |
| ManiPose (Ours), Orac. | 98.4 | 77.0 | 34.6 | 0.6 | 1.3 |

Table 4: Quantitative comparison with the state-of-the-art on MPI-INF-3DHP using ground-truth 2D poses.

| Model | MR | MC | $K$ | # Params. | MPJPE ↓ | MPSSE ↓ | MPSCE ↓ |
|---|---|---|---|---|---|---|---|
| ManiPose (Ours) | | | 5 | 34.44 M | 41.9 | 0.3 | 0.7 |
| w/o MH | | | 1 | 34.42 M | 44.6 | 0.3 | 0.7 |
| w/o MC, w/ MR | | | 1 | 33.78 M | 42.3 | 5.7 | 7.3 |
| w/o MR (MixSTE) | | | 1 | 33.78 M | 40.9 | 8.8 | 9.9 |

Table 5: Ablation study: Single-hypothesis cannot optimize both MPJPE performance and consistency. ManiPose uses the same backbone as MixSTE. MR: using manifold regularization. MC: manifold-constrained. Bold: best. Underlined: second best.

Figure 5: MPSCE, MPSSE and MPJPE per segment/coordinate. ManiPose mostly helps to deal with the depth ambiguity ($z$ coordinate). Ground-truth (GT) poses are represented but not visible because they have perfect consistency.

6.1 Datasets

We evaluate our model on two 3D-HPE datasets: Human 3.6M [11] and MPI-INF-3DHP [25].

Human 3.6M contains 3.6 million images of 7 actors performing 15 different indoor actions. It is the most widely used dataset for 3D-HPE. Following previous works [40, 21, 41, 29], we train our models on subjects S1, S5, S6, S7, S8, and test on subjects S9 and S11, adopting a 17-joint skeleton (cf. Fig. 5). A pre-trained CPN [5] network is employed to compute the input 2D keypoints as in [29, 40].

MPI-INF-3DHP also adopts a 17-joint skeleton, but is smaller than Human 3.6M and contains both indoor and outdoor scenes. It is hence more challenging. We used ground-truth 2D keypoints for this dataset, as usually done [41, 4, 40].

6.2 Evaluation metrics

Traditional metrics. Performance is usually evaluated on these datasets using the mean per-joint position error (MPJPE) under different protocols. Under protocol #1, the root joint position is set as reference, and the predicted root position is translated to $0$. Under protocol #2, also denoted P-MPJPE, predictions are additionally Procrustes-corrected. Both are reported in mm.

For MPI-INF-3DHP, other thresholded metrics derived from MPJPE are also often reported, such as AUC and PCK (with a threshold at 150 mm), as explained in [25].
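As a rough illustration of these metrics, the sketch below computes a root-relative MPJPE (protocol #1) and a PCK at 150 mm. The function names, the choice of joint 0 as root, the inclusion of the root joint in the averages and the use of a non-strict threshold are assumptions of this sketch; the exact conventions are those of [25] and of the respective benchmarks. AUC and the Procrustes alignment of protocol #2 are omitted.

```python
import numpy as np

def root_relative(p, root=0):
    """Translate poses so that the root joint sits at the origin."""
    return p - p[..., root:root + 1, :]

def joint_errors(p_hat, p, root=0):
    """Per-frame, per-joint Euclidean errors after root alignment: (L, J)."""
    diff = root_relative(p_hat, root) - root_relative(p, root)
    return np.linalg.norm(diff, axis=-1)

def mpjpe(p_hat, p, root=0):
    """Mean per-joint position error, in the units of the inputs (mm here)."""
    return joint_errors(p_hat, p, root).mean()

def pck(p_hat, p, threshold=150.0, root=0):
    """Percentage of joints whose error is within the threshold."""
    return 100.0 * (joint_errors(p_hat, p, root) <= threshold).mean()
```

Both functions take predictions and ground truth of shape `(L, J, 3)` and are insensitive to a global translation of either input.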

Pose consistency. In addition to the traditional metrics, we also evaluate the consistency of predicted poses.

Given the results from Sec. 4, we assess the extent to which Asm. 4.4 holds by measuring the average standard deviation of segment lengths across time in predicted action sequences:

$$\text{MPSCE}\triangleq\frac{1}{J-1}\sum_{j=1}^{J-1}\sqrt{\frac{1}{L}\sum_{t=1}^{L}(s_{t,j,\tau(j)}-\bar{s}_{j,\tau(j)})^{2}}\tag{12}$$

$$s_{t,j,i}=\|\hat{\mathrm{p}}_{t,j}-\hat{\mathrm{p}}_{t,i}\|_{2},\quad\bar{s}_{j,i}=\frac{1}{L}\sum_{t=1}^{L}s_{t,j,i}\,,\tag{13}$$

where $\tau$ was defined in Sec. 5.1. We call this metric the Mean Per Segment Consistency Error (MPSCE), also reported in mm.
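Eqs. (12) and (13) translate almost directly into NumPy. The sketch below assumes a pose sequence of shape `(L, J, 3)` with joint 0 as root and a `parents` array encoding $\tau$; the function name is hypothetical.

```python
import numpy as np

def mpsce(p_hat, parents):
    """Mean Per Segment Consistency Error, Eqs. (12)-(13).

    p_hat:   (L, J, 3) predicted pose sequence
    parents: length-J sequence, parents[j] = tau(j); parents[0] is ignored
    """
    parents = np.asarray(parents)
    child = np.arange(1, p_hat.shape[1])
    # Eq. (13): length of each segment (j, tau(j)) at each frame, shape (L, J-1).
    seg = np.linalg.norm(p_hat[:, child] - p_hat[:, parents[child]], axis=-1)
    # Eq. (12): population standard deviation over time, averaged over segments.
    return seg.std(axis=0).mean()
```

A sequence decoded from fixed segment lengths has an MPSCE of zero up to floating-point error, whereas unconstrained regression outputs generally do not.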

Following [8, 30], we also assess the bilateral symmetry of predicted skeletons through the Mean Per Segment Symmetry Error (MPSSE) in mm:

$$\text{MPSSE}\triangleq\frac{1}{L\,|\mathcal{J}_{\text{left}}|}\sum_{t=1}^{L}\sum_{j\in\mathcal{J}_{\text{left}}}|s_{t,j,\tau(j)}-s_{t,j',\tau(j')}|\,,\tag{14}$$

$$\text{with }j'=\zeta(j)\,,\tag{15}$$

where $\mathcal{J}_{\text{left}}$ denotes the set of indices of left-side joints and $\zeta$ maps left-side joint indices to their right-side counterpart.
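A corresponding sketch of Eqs. (14) and (15) is given below. The lists `left` and `right` are assumed to be aligned so that `right[i]` is $\zeta(\texttt{left[i]})$; the actual joint indices depend on the skeleton definition and are not specified here.

```python
import numpy as np

def mpsse(p_hat, parents, left, right):
    """Mean Per Segment Symmetry Error, Eqs. (14)-(15).

    p_hat:   (L, J, 3) predicted pose sequence
    parents: length-J sequence, parents[j] = tau(j)
    left:    indices of left-side joints
    right:   matching right-side joints, right[i] = zeta(left[i])
    """
    parents = np.asarray(parents)
    left, right = np.asarray(left), np.asarray(right)

    def seg(j):
        # Length of the segment ending at each joint in j, shape (L, len(j)).
        return np.linalg.norm(p_hat[:, j] - p_hat[:, parents[j]], axis=-1)

    # Absolute left/right length difference, averaged over frames and joints.
    return np.abs(seg(left) - seg(right)).mean()
```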

Multi-hypothesis setting. One must decide how to use multiple hypotheses (MH) to compute the previous metrics.

The most widely used approach [18, 19, 28, 33, 37, 8] is the oracle evaluation, i.e., using the predicted hypothesis closest to the ground truth (Eq. (8) in the case of MPJPE). This metric makes sense for MH methods as it measures the distance between the target and the discrete set of predicted hypotheses. It is hence aligned with the existence of many possible outputs for a given input.

Hypotheses can also be aggregated into a final hypothesis, e.g., through unweighted or weighted averaging as in Eq. (11). Aggregation has the disadvantage of falling back to a one-to-one mapping scheme, which is precisely what we want to avoid when working in an MH setting.
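To make the drawback of aggregation concrete, the toy example below applies the weighted average of Eq. (11) frame by frame to two hypotheses of a single unit-length segment that are mirrored in depth. The setup and the function name are illustrative assumptions, not taken from the experiments.

```python
import numpy as np

def aggregate(p_hat, gamma):
    """Score-weighted average of hypotheses, applied per frame (cf. Eq. (11)).

    p_hat: (K, L, J, 3) hypotheses; gamma: (L, K) scores summing to 1 per frame
    returns: (L, J, 3) aggregated pose sequence
    """
    return np.einsum('lk,kljc->ljc', gamma, p_hat)

# Two equally likely hypotheses for one segment, mirrored along z (depth).
p_hat = np.array([
    [[[0.0, 0.0, 0.0], [0.6, 0.0, 0.8]]],
    [[[0.0, 0.0, 0.0], [0.6, 0.0, -0.8]]],
])  # shape (K=2, L=1, J=2, 3)
gamma = np.array([[0.5, 0.5]])

p_avg = aggregate(p_hat, gamma)
hyp_lengths = np.linalg.norm(p_hat[:, 0, 1] - p_hat[:, 0, 0], axis=-1)
avg_length = np.linalg.norm(p_avg[0, 1] - p_avg[0, 0])
```

Each hypothesis has a segment of length 1, while the aggregated segment has length 0.6: averaging manifold-consistent hypotheses can yield a pose that leaves the manifold.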

We hence report both oracle and aggregated metrics in our experiments, favoring oracle results.

6.3 Implementation details

The ManiPose method presented in Sec. 5 is compatible with any backbone. Here, we chose to build on the MixSTE [40] network for both the rotations and the segments modules (the latter at a reduced scale). The details about our architecture and training can be found in the supp. material (Sec. 11).

6.4 Comparison to the state-of-the-art

Human 3.6M. Comparisons with state-of-the-art regression and multi-hypothesis methods are presented in Tab. 2. As a general remark, MPJPE and consistency metrics are not positively correlated. As predicted by Sec. 4, our results show that MPJPE improvements achieved by MixSTE come at the cost of poorer consistency compared to previous models. In contrast, the only single-hypothesis constrained model, Anatomy3D [4], achieves good consistency at the expense of inferior MPJPE. As opposed to all previous methods, ManiPose reaches nearly perfect consistency without compromising the MPJPE performance.

Detailed MPJPE results can be found in Tab. 3. ManiPose reaches the second-best MPJPE performance on average and on most actions. Note that while ManiPose is deterministic, previous multi-hypothesis methods are generative and frame-based, except MHFormer. Tab. 3 shows that they require up to two orders of magnitude more hypotheses than ManiPose to reach competitive performance. Protocol #2 results can be found in the appendix (Sec. 12).

We present qualitative results in Figs. 1 and 6.

MPI-INF-3DHP. Similar results were obtained for this dataset (cf. Tab. 4). Not only does ManiPose reach consistency errors close to $0$, but it also achieves the best PCK and AUC performance. As for MPJPE, only [32] achieves slightly better performance, at the cost of large pose consistency errors.

6.5 Ablation study

Impact of components. We evaluate the impact of removing each component of ManiPose on the Human 3.6M performance (Tab. 5). The components tested are the multiple hypotheses (MH) and the manifold constraint (MC). We also compare MC to a more standard manifold regularization (MR), i.e., adding Eq. (12) to the loss. Note that without all these components, we fall back to MixSTE [40].

We see that MR helps to improve pose consistency, but not as much as MC. However, without multiple hypotheses, MC consistency improvements come at the cost of degraded MPJPE performance, as explained in Secs. 3 and 4.

Fine error analysis. We can see in Fig. 5 that, compared to MixSTE, ManiPose reaches substantially superior MPSSE and MPSCE consistency across all skeleton segments. Furthermore, note that larger MixSTE errors occur for the knee-foot and elbow-wrist segments, which are the most prone to depth ambiguity. This agrees with the coordinate-wise errors depicted in Fig. 5, showing that ManiPose improvements mostly translate into a reduction of MixSTE depth errors, which are twice as large as for other coordinates. Further ablation studies can be found in the supp. material.

Figure 6: Qualitative comparison between ManiPose and state-of-the-art regression method, MixSTE. The constrained multiple hypotheses help to deal with depth ambiguity and occlusions.

7 Conclusion

This work provided empirical and theoretical evidence challenging the common unconstrained single-hypothesis approaches to 3D-HPE. We proved that unconstrained single-hypothesis methods cannot deliver consistent poses and that existing evaluation metrics are insufficient to assess this aspect. We also showed that constraining or regularizing single-hypothesis models is not enough to optimize both joint position error and pose consistency, which are somewhat antagonistic. In response, we presented a new manifold-constrained multi-hypothesis human pose lifting method (ManiPose) and demonstrated its empirical superiority to the existing state-of-the-art on two challenging datasets. One limitation of our method is its reliance on the forward kinematics algorithm, which guarantees its manifold-consistency but requires non-parallelizable iterations across joints. This warrants further investigation.

References

  • Bishop [1994] Christopher M Bishop. Mixture density networks. 1994.
  • Cai et al. [2019] Yujun Cai, Liuhao Ge, Jun Liu, Jianfei Cai, Tat-Jen Cham, Junsong Yuan, and Nadia Magnenat Thalmann. Exploiting spatial-temporal relationships for 3d pose estimation via graph convolutional networks. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2272–2281, 2019.
  • Chen and Ramanan [2017] Ching-Hang Chen and Deva Ramanan. 3d human pose estimation = 2d pose estimation + matching. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7035–7043, 2017.
  • Chen et al. [2021] Tianlang Chen, Chen Fang, Xiaohui Shen, Yiheng Zhu, Zhili Chen, and Jiebo Luo. Anatomy-aware 3d human pose estimation with bone-based pose decomposition. IEEE Transactions on Circuits and Systems for Video Technology, 32(1):198–209, 2021.
  • Chen et al. [2018] Yilun Chen, Zhicheng Wang, Yuxiang Peng, Zhiqiang Zhang, Gang Yu, and Jian Sun. Cascaded pyramid network for multi-person pose estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7103–7112, 2018.
  • Firman et al. [2018] Michael Firman, Neill DF Campbell, Lourdes Agapito, and Gabriel J Brostow. Diversenet: When one right answer is not enough. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5598–5607, 2018.
  • Guzman-Rivera et al. [2012] Abner Guzman-Rivera, Dhruv Batra, and Pushmeet Kohli. Multiple choice learning: Learning to produce multiple structured outputs. Advances in neural information processing systems, 25, 2012.
  • Holmquist and Wandt [2023] Karl Holmquist and Bastian Wandt. Diffpose: Multi-hypothesis human pose estimation using diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15977–15987, 2023.
  • Hossain and Little [2018] Mir Rayat Imtiaz Hossain and James J Little. Exploiting temporal information for 3d human pose estimation. In Proceedings of the European conference on computer vision (ECCV), pages 68–84, 2018.
  • Hu et al. [2021] Wenbo Hu, Changgong Zhang, Fangneng Zhan, Lei Zhang, and Tien-Tsin Wong. Conditional directed graph convolution for 3d human pose estimation. In Proceedings of the 29th ACM International Conference on Multimedia, pages 602–611, 2021.
  • Ionescu et al. [2014] Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325–1339, 2014.
  • Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kolotouros et al. [2021] Nikos Kolotouros, Georgios Pavlakos, Dinesh Jayaraman, and Kostas Daniilidis. Probabilistic modeling for human mesh recovery. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11605–11614, 2021.
  • Lee et al. [2017] Kimin Lee, Changho Hwang, KyoungSoo Park, and **woo Shin. Confident multiple choice learning. In International Conference on Machine Learning, pages 2014–2023. PMLR, 2017.
  • Lee et al. [2015] Stefan Lee, Senthil Purushwalkam, Michael Cogswell, David Crandall, and Dhruv Batra. Why m heads are better than one: Training a diverse ensemble of deep networks. arXiv preprint arXiv:1511.06314, 2015.
  • Lee et al. [2016] Stefan Lee, Senthil Purushwalkam Shiva Prakash, Michael Cogswell, Viresh Ranjan, David Crandall, and Dhruv Batra. Stochastic multiple choice learning for training diverse deep ensembles. Advances in Neural Information Processing Systems, 29, 2016.
  • Letzelter et al. [2023] Victor Letzelter, Mathieu Fontaine, Mickaël Chen, Patrick Pérez, Slim Essid, and Gael Richard. Resilient multiple choice learning: A learned scoring scheme with application to audio scene analysis. In Advances in Neural Information Processing Systems, 2023.
  • Li and Lee [2019] Chen Li and Gim Hee Lee. Generating multiple hypotheses for 3d human pose estimation with mixture density network. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9887–9895, 2019.
  • Li and Lee [2020] Chen Li and Gim Hee Lee. Weakly supervised generative network for multiple 3d human pose hypotheses. In British Machine Vision Conference (BMVC), 2020.
  • Li et al. [2021] Jiefeng Li, Chao Xu, Zhicun Chen, Siyuan Bian, Lixin Yang, and Cewu Lu. Hybrik: A hybrid analytical-neural inverse kinematics solution for 3d human pose and shape estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3383–3393, 2021.
  • Li et al. [2022] Wenhao Li, Hong Liu, Hao Tang, Pichao Wang, and Luc Van Gool. Mhformer: Multi-hypothesis transformer for 3d human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13147–13156, 2022.
  • Liu et al. [2020] Ruixu Liu, Ju Shen, He Wang, Chen Chen, Sen-ching Cheung, and Vijayan Asari. Attention mechanism exploits temporal contexts: Real-time 3d human pose reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5064–5073, 2020.
  • Makansi et al. [2019] Osama Makansi, Eddy Ilg, Ozgun Cicek, and Thomas Brox. Overcoming limitations of mixture density networks: A sampling and fitting framework for multimodal future prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7144–7153, 2019.
  • Martinez et al. [2017] Julieta Martinez, Rayat Hossain, Javier Romero, and James J Little. A simple yet effective baseline for 3d human pose estimation. In Proceedings of the IEEE international conference on computer vision, pages 2640–2649, 2017.
  • Mehta et al. [2017] Dushyant Mehta, Helge Rhodin, Dan Casas, Pascal Fua, Oleksandr Sotnychenko, Weipeng Xu, and Christian Theobalt. Monocular 3d human pose estimation in the wild using improved cnn supervision. In 2017 international conference on 3D vision (3DV), pages 506–516. IEEE, 2017.
  • Mun et al. [2018] Jonghwan Mun, Kimin Lee, **woo Shin, and Bohyung Han. Learning to specialize with knowledge distillation for visual question answering. Advances in neural information processing systems, 31, 2018.
  • Murray et al. [2017] Richard M Murray, Zexiang Li, and S Shankar Sastry. A mathematical introduction to robotic manipulation. CRC press, 2017.
  • Oikarinen et al. [2021] Tuomas Oikarinen, Daniel Hannah, and Sohrob Kazerounian. Graphmdn: Leveraging graph structure and deep learning to solve inverse problems. In 2021 International Joint Conference on Neural Networks (IJCNN), pages 1–9. IEEE, 2021.
  • Pavllo et al. [2019] Dario Pavllo, Christoph Feichtenhofer, David Grangier, and Michael Auli. 3d human pose estimation in video with temporal convolutions and semi-supervised training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7753–7762, 2019.
  • Rommel et al. [2023] Cédric Rommel, Eduardo Valle, Mickaël Chen, Souhaiel Khalfaoui, Renaud Marlet, Matthieu Cord, and Patrick Pérez. DiffHPE: Robust, Coherent 3D Human Pose Lifting with Diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3220–3229, 2023.
  • Rupprecht et al. [2017] Christian Rupprecht, Iro Laina, Robert DiPietro, Maximilian Baust, Federico Tombari, Nassir Navab, and Gregory D Hager. Learning in an uncertain world: Representing ambiguity through multiple hypotheses. In Proceedings of the IEEE international conference on computer vision, pages 3591–3600, 2017.
  • Shan et al. [2022] Wenkang Shan, Zhenhua Liu, Xinfeng Zhang, Shanshe Wang, Siwei Ma, and Wen Gao. P-STMO: Pre-Trained Spatial Temporal Many-to-One Model for 3D Human Pose Estimation, 2022.
  • Sharma et al. [2019] Saurabh Sharma, Pavan Teja Varigonda, Prashast Bindal, Abhishek Sharma, and Arjun Jain. Monocular 3d human pose estimation by generation and ordinal ranking. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2325–2334, 2019.
  • Tian et al. [2019] Kai Tian, Yi Xu, Shuigeng Zhou, and Jihong Guan. Versatile multiple choice learning and its application to vision computing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6349–6357, 2019.
  • Wang et al. [2020] **gbo Wang, Sijie Yan, Yuanjun Xiong, and Dahua Lin. Motion guided 3d pose estimation from videos. In European Conference on Computer Vision, pages 764–780. Springer, 2020.
  • Waskom [2021] Michael L. Waskom. seaborn: statistical data visualization. Journal of Open Source Software, 6(60):3021, 2021.
  • Wehrbein et al. [2021] Tom Wehrbein, Marco Rudolph, Bodo Rosenhahn, and Bastian Wandt. Probabilistic monocular 3d human pose estimation with normalizing flows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11199–11208, 2021.
  • Xu et al. [2020] **gwei Xu, Zhenbo Yu, Bingbing Ni, Jiancheng Yang, Xiaokang Yang, and Wenjun Zhang. Deep kinematics analysis for monocular 3d human pose estimation. In Proceedings of the IEEE/CVF Conference on computer vision and Pattern recognition, pages 899–908, 2020.
  • Xu and Takano [2021] Tianhan Xu and Wataru Takano. Graph stacked hourglass networks for 3d human pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16105–16114, 2021.
  • Zhang et al. [2022] **lu Zhang, Zhigang Tu, Jianyu Yang, Yu** Chen, and Junsong Yuan. Mixste: Seq2seq mixed spatio-temporal encoder for 3d human pose estimation in video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13232–13242, 2022.
  • Zheng et al. [2021] Ce Zheng, Sijie Zhu, Matias Mendieta, Taojiannan Yang, Chen Chen, and Zhengming Ding. 3D Human Pose Estimation with Spatial and Temporal Transformers. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 11636–11645, Montreal, QC, Canada, 2021. IEEE.
  • Zhou et al. [2019] Yi Zhou, Connelly Barnes, **gwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representations in neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5745–5753, 2019.
  • Zou and Tang [2021] Zhiming Zou and Wei Tang. Modulated graph convolutional network for 3d human pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11477–11487, 2021.
ManiPose: Manifold-Constrained Multi-Hypothesis 3D Human Pose Estimation

Supplementary Material

This appendix is organized as follows:

  • Sec. 8 contains empirical verification of our assumptions,

  • Sec. 9 presents the proofs of our theoretical results, together with a few corollaries,

  • Sec. 10 provides further implementation details concerning the 1D-to-2D experiment, as well as an extension to the 2D-to-3D setting,

  • Sec. 11 contains implementation and training details concerning ManiPose, as well as compared baselines,

  • Sec. 12 presents further results of the Human 3.6M experiment,

  • and finally, Sec. 13 explains the provided experiment code.

8 Assumption verifications

We verified on Human 3.6M [11] ground-truth data that assumptions 4.4 and 4.5 hold for real poses in the training and test sets.

Segment rigidity. As shown in Figs. 5 and 10, ground-truth 3D poses have perfect MPSSE (14) and MPSCE (12) metrics, meaning that ground-truth skeletons are perfectly symmetric, with rigid segments. Assumption 4.4 is hence verified in real training and test data.
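
The rigidity check described above can be sketched on toy data. This is a minimal illustration, not the MPSSE or MPSCE metrics of the main text: it only measures how much each segment length varies over a sequence. The 3-joint chain, the `parents` map, and all function names are hypothetical.

```python
import math

def segment_lengths(pose, parents):
    """Euclidean length of each segment (joint j to parents[j]) of one pose."""
    return [math.dist(pose[j], pose[parents[j]]) for j in sorted(parents)]

def max_length_spread(sequence, parents):
    """Largest (max - min) length of any single segment over a sequence of poses."""
    per_frame = [segment_lengths(pose, parents) for pose in sequence]
    return max(max(col) - min(col) for col in zip(*per_frame))

# Hypothetical 3-joint chain: root 0 -> joint 1 -> joint 2.
parents = {1: 0, 2: 1}

def toy_pose(angle, stretch=1.0):
    """Chain bent by `angle` at joint 1; `stretch` scales the second segment."""
    p1 = (1.0, 0.0, 0.0)
    p2 = (1.0 + stretch * math.cos(angle), stretch * math.sin(angle), 0.0)
    return [(0.0, 0.0, 0.0), p1, p2]

# Rigid motion: segment lengths stay constant while the chain bends.
rigid_sequence = [toy_pose(0.1 * t) for t in range(50)]
# Non-rigid motion: the second segment grows over time.
elastic_sequence = [toy_pose(0.1 * t, stretch=1.0 + 0.01 * t) for t in range(50)]

rigid_spread = max_length_spread(rigid_sequence, parents)
elastic_spread = max_length_spread(elastic_sequence, parents)
print(f"rigid spread:   {rigid_spread:.2e}")
print(f"elastic spread: {elastic_spread:.2e}")
```

A spread close to zero for every segment is what rigid segments (assumption 4.4) would look like in such a check; the elastic sequence is included only as a contrast.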

Non-degenerate distributions. As shown in Fig. 7, the conditional distribution of ground-truth 3D poses given the 2D keypoint positions is clearly multimodal, and hence non-degenerate (not reduced to a single Dirac distribution). This validates assumption 4.5 and explains why multi-hypothesis techniques are necessary.

[Figure 7, four panels: (a) S9, Walking; (b) S1, Greeting; (c) S11, Directions; (d) S1, SittingDown]
Figure 7: Estimated joint distributions of ground-truth 2D inputs ($u$, $v$ pixel coordinates) together with 3D $z$-coordinates (depth) for different subjects and actions. The depth density conditional on inputs is clearly multimodal. Vertical red lines are examples of depth-ambiguous inputs. Distributions are estimated with a kernel density estimator from the Seaborn library [36].

9 Proofs and additional corollaries

This section contains the proofs of the theoretical results presented in Sec. 4, together with a few corollaries.

Proof of Proposition 4.6. Let $i$ be a joint connected to the root $\mathrm{p}_0$ (i.e., $A_{i0}=1$). From assumptions 4.3 and 4.4, we know that at any instant $t$, $\mathrm{p}^G_{t,i}$ lies on the sphere $S^2(0, s_{i,0})$ centered at $0$ with radius $s_{i,0}$ independent of time. Its position can hence be fully parametrized in spherical coordinates by two angles $(\theta_{t,i}, \phi_{t,i})$. Let $j$ be a joint connected to $i$.
As before, assumption 4.4 implies that at any instant $t$, $\mathrm{p}^G_{t,j}$ lies on the moving sphere $S^2(\mathrm{p}^G_{t,i}, s_{j,i})$ centered at $\mathrm{p}^G_{t,i}$ with radius $s_{j,i}$ independent of time. Hence, we can fully describe $\mathrm{p}^G_{t,j}$ with the position of its center, $\mathrm{p}^G_{t,i}$, and the spherical coordinates $(\theta_{t,j}, \phi_{t,j})$ of joint $j$ relative to the center of the sphere, i.e., joint $i$.
This means that there is a bijection between the possible positions attainable by $\mathrm{p}^G_{t,j}$ at any instant and the direct product of spheres $S^2(0, s_{i,0}) \times S^2(0, s_{j,i})$ (footnote 2: $S^2(0, s_{j,i})$ is homeomorphic to $S^2(\mathrm{p}^G_{t,i}; s_{j,i})$). This bijection is a homeomorphism as a composition of homeomorphisms: we can compute $\mathrm{p}^G_{t,j}$ from $(\theta_{t,i}, \phi_{t,i}, \theta_{t,j}, \phi_{t,j})$ following the forward kinematics algorithm [27] (cf. algo. 2), i.e., using a composition of rotations and translations.

Now let's assume for some arbitrary joint $k$ that $\mathrm{p}^G_{t,k}$ lies at all times on a space $\mathcal{M}_{2d}$ homeomorphic to a product of spheres of dimension $2d$. This means that $\mathrm{p}^G_{t,k}$ can be fully parametrized using $2d$ spherical angles $(\theta_1, \phi_1, \dots, \theta_d, \phi_d)$. Let $l$ be a joint connected to $k$ (typically one further step away from the root joint $\mathrm{p}_0$ and not already represented in $\mathcal{M}_{2d}$). As before, at any instant $t$, $\mathrm{p}^G_{t,l}$ needs to lie on the sphere centered on $\mathrm{p}^G_{t,k}$ of constant radius $s_{k,l}$.
Hence, we can fully describe $\mathrm{p}^G_{t,l}$ using the $2(d+1)$-tuple of angles obtained by concatenating its spherical coordinates relative to joint $k$, together with the $2d$-tuple describing $\mathrm{p}^G_{t,k}$, i.e., the center of the sphere. So $\mathrm{p}^G_{t,l}$ lies on a space $\mathcal{M}_{2(d+1)}$ homeomorphic to a product of spheres of dimension $2(d+1)$.

We can hence conclude by induction that at any instant $t$, $\mathrm{p}_t = [\mathrm{p}^G_{t,1}, \dots, \mathrm{p}^G_{t,J}]$ lies on the same subspace of $(\mathbb{R}^3)^J$, which is homeomorphic to a product of spheres centered at the origin:

$$\bigotimes_{i<j \,/\, A_{ij}=1} S^2(0, s_{i,j})\,. \tag{16}$$

Finally, the previous space is trivially homeomorphic to $(S^2)^{J-1}$ through the scaling $(1/s_{i,j})_{i<j \,/\, A_{ij}=1}$. $(S^2)^{J-1}$ is a manifold of dimension $2(J-1)$ as the direct product of $J-1$ manifolds of dimension $2$. $\blacksquare$
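
The parametrization used in this proof can be illustrated with a small sketch. It is a simplified stand-in for the forward kinematics algorithm cited above, assuming that each joint is placed from its parent using a fixed segment length and two spherical angles expressed in the global frame. The 4-joint skeleton, the lengths, and the angles are hypothetical values.

```python
import math

def spherical_to_offset(length, theta, phi):
    """Offset vector of norm `length` with polar angle theta and azimuth phi."""
    return (length * math.sin(theta) * math.cos(phi),
            length * math.sin(theta) * math.sin(phi),
            length * math.cos(theta))

def forward_kinematics(parents, lengths, angles):
    """Place every joint from per-segment lengths and spherical angles.

    parents[j] < j is the parent of joint j; the root (joint 0) sits at the origin.
    """
    positions = {0: (0.0, 0.0, 0.0)}
    for j in sorted(parents):
        px, py, pz = positions[parents[j]]
        dx, dy, dz = spherical_to_offset(lengths[j], *angles[j])
        positions[j] = (px + dx, py + dy, pz + dz)
    return positions

# Hypothetical skeleton with J = 4 joints, hence 2 * (J - 1) = 6 free angles.
parents = {1: 0, 2: 1, 3: 1}
lengths = {1: 0.5, 2: 0.3, 3: 0.4}
angles = {1: (0.3, 1.2), 2: (1.1, -0.4), 3: (2.0, 2.5)}

pose = forward_kinematics(parents, lengths, angles)
recovered = {j: math.dist(pose[j], pose[parents[j]]) for j in parents}
print(recovered)
```

Whatever angles are chosen, the recovered segment lengths match the prescribed ones, so every pose produced this way is consistent by construction.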

Proof of Proposition 4.8. Let $G$ be a skeleton with $J$ joints. We denote $\mathrm{x} \in (\mathbb{R}^2)^J$ a 2D pose and $\mathrm{p} \in (\mathbb{R}^3)^J$ its corresponding 3D pose. Let $\mathrm{P}(\mathrm{x}, \mathrm{p})$ be a joint distribution of poses in 2D and 3D. We define $\ell = (\ell_j)_{j=1}^{J-1}$ the function allowing us to compute the length of the segments of a pose $\mathrm{p}$:

$$\ell_j : \mathrm{p} \mapsto \|\mathrm{p}_j - \mathrm{p}_{\tau(j)}\|_2\,, \quad 0 < j \leq J-1\,, \tag{17}$$

where $\tau : \{1, \dots, J-1\} \to \{0, \dots, J-1\}$ maps joint indices to the index of their parent joint:

$$\tau(i) = j < i\,, \quad \text{s.t.}~A_{ij} = 1\,. \tag{18}$$

From assumption 4.4, we know that for any pose $\mathrm{p}$ from the training distribution,

$$\forall j\,, \quad \ell_j(\mathrm{p}) = s_{j,\tau(j)}\,. \tag{19}$$

Given $D = \{(\mathrm{x}_i, \mathrm{p}_i)\}_{i=1}^{N} \sim \mathrm{P}(\mathrm{x}, \mathrm{p})$, some i.i.d. evaluation data, the MSE of a model $f$ is defined as:

$$\text{MSE}(f; N) = \frac{1}{N} \sum_{i=1}^{N} \|\mathrm{p}_i - f(\mathrm{x}_i)\|_2^2\,, \tag{20}$$

and converges to

$$\text{MSE}^*(f) = \mathbb{E}_{\mathrm{x},\mathrm{p}}\big[\|\mathrm{p} - f(\mathrm{x})\|_2^2\big] \tag{21}$$

as the dataset size $N$ goes to infinity. We then define the oracle MSE minimizer as

$$f^* = \arg\min_f \text{MSE}^*(f)\,. \tag{22}$$

The quantity in (21) is known in statistics as the expected $L_2$-risk and it is a well-known fact that its minimizer is the conditional expectation:

$$f^*(\mathrm{x}) = \mathbb{E}[\mathrm{p} \mid \mathrm{x} = \mathrm{x}]\,. \tag{23}$$

Hence, since $\ell_j^2$ are strictly convex and $\mathrm{P}(\mathrm{p} \mid \mathrm{x})$ is non-degenerate according to assumption 4.5, we can conclude from Jensen's strict inequality that for all $j$,

$$\ell_j^2(f^*(\mathrm{x})) = \ell_j^2(\mathbb{E}[\mathrm{p} \mid \mathrm{x} = \mathrm{x}]) < \mathbb{E}[\ell_j^2(\mathrm{p}) \mid \mathrm{x} = \mathrm{x}] = s^2_{j,\tau(j)}\,, \tag{24}$$

where the last equality arises from the fact that $\ell_j^2(\mathrm{p})$ is not random according to (19). Hence, given that $\ell_j \geq 0$ and $s_{j,\tau(j)} > 0$, we can say that $\ell_j(f^*(\mathrm{x})) < s_{j,\tau(j)}$ for all joints $j$. We conclude that the model $f^*$ minimizing $\text{MSE}^*$ predicts poses that violate assumption 4.4 and are hence inconsistent. $\blacksquare$
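
The shrinkage of segment lengths can be checked numerically on a toy example. The sketch below assumes a single planar segment of length 1 attached to a root at the origin, observed only through its first coordinate, with two equally likely poses compatible with the observation. These values and names are illustrative assumptions, not taken from the experiments.

```python
import math

def weighted_mean(points, weights):
    """Coordinate-wise weighted mean of a list of points."""
    dim = len(points[0])
    return tuple(sum(w * p[k] for p, w in zip(points, weights)) for k in range(dim))

# Toy setting (assumption): unit-length segment, only the first coordinate is observed.
segment_length = 1.0
observed = 0.6
hidden = math.sqrt(segment_length**2 - observed**2)

# The two poses compatible with the observation, taken as equally likely.
modes = [(observed, hidden), (observed, -hidden)]
weights = [0.5, 0.5]

# Prediction of the oracle MSE minimizer: the conditional mean of the modes.
mean_pose = weighted_mean(modes, weights)
mean_length = math.hypot(*mean_pose)

# Expected squared length under the conditional distribution.
mean_sq_length = sum(w * math.hypot(*p) ** 2 for p, w in zip(modes, weights))

print(f"length of the conditional mean: {mean_length:.3f}")
print(f"expected squared length:        {mean_sq_length:.3f}")
```

In this setting the conditional mean lies strictly inside the circle of admissible poses, so its segment is shorter than the true segment length, which is the inequality stated in (24).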

As an immediate corollary of proposition 4.8, we may state the following result, which was empirically illustrated in many parts of our paper:

Corollary 9.1.

Given a fixed training distribution $\mathrm{P}(\mathrm{x}, \mathrm{p})$ respecting assumptions 4.3-4.5, for every 3D-HPE model $f$ predicting consistent poses, i.e., poses that respect assumption 4.4, there is an inconsistent model $f'$ with lower mean-squared error.

Proof. Let $f' \in \operatorname*{arg\,min}_{\tilde{f}} \text{MSE}^*(\tilde{f})$. According to proposition 4.8, $f'$ is inconsistent. Suppose that the consistent model $f$ is such that

$$\text{MSE}^*(f) \leq \text{MSE}^*(f')\,. \tag{25}$$

Since $\text{MSE}^*$ reaches its minimum at $f'$, we have $\text{MSE}^*(f) = \text{MSE}^*(f')$. Hence, $f \in \operatorname*{arg\,min}_{\tilde{f}} \text{MSE}^*(\tilde{f})$, which means that $f$ is also inconsistent according to proposition 4.8. This is impossible given that we assumed $f$ to be consistent. Thus, Eq. 25 is wrong and we conclude that

$$\text{MSE}^*(f) > \text{MSE}^*(f')\,. \tag{26}$$

$\blacksquare$
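
A toy computation makes this corollary concrete. The sketch below assumes a conditional distribution with two equally likely poses on the unit circle and compares the expected squared error of the conditional mean with the best error reachable by any prediction constrained to the circle. The grid of 3600 candidate angles and all names are illustrative assumptions.

```python
import math

def expected_sq_error(prediction, modes, weights):
    """Expected squared Euclidean error of a fixed prediction under a discrete distribution."""
    return sum(w * math.dist(prediction, p) ** 2 for p, w in zip(modes, weights))

# Two equally likely poses of a unit-length segment (toy assumption).
modes = [(0.6, 0.8), (0.6, -0.8)]
weights = [0.5, 0.5]

# Inconsistent prediction: the conditional mean, which has length 0.6 < 1.
mean_pose = (0.6, 0.0)
mse_mean = expected_sq_error(mean_pose, modes, weights)

# Consistent predictions: points on the unit circle, scanned on a fine grid.
angles = [2.0 * math.pi * k / 3600 for k in range(3600)]
mse_consistent = min(
    expected_sq_error((math.cos(a), math.sin(a)), modes, weights) for a in angles
)

print(f"MSE of the conditional mean:       {mse_mean:.3f}")
print(f"best MSE of a consistent estimate: {mse_consistent:.3f}")
```

Even the best prediction on the circle has a larger expected squared error than the conditional mean in this example, in line with (26).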

Note that Proposition 4.8 and Corollary 9.1 assume the use of the MSE loss, which is the most widely used loss in 3D-HPE. We can however extend them to the case where MPJPE serves as the optimization criterion, under an additional technical assumption:

Corollary 9.2.

The predicted poses minimizing the mean-per-joint-position-error loss are inconsistent if the training pose distribution $\mathrm{P}(\mathrm{x}, \mathrm{p})$ verifies Asm. 4.3-4.5 and if the standard deviation of the joint-wise residual norm is small compared to the joint-wise loss:

$$0 \leq j < J\,, \quad \frac{\sqrt{\mathbb{V}_{\mathrm{x},\mathrm{p}}\big[\|\mathrm{p}_j - f_j(\mathrm{x})\|_2\big]}}{\mathbb{E}_{\mathrm{x},\mathrm{p}}\big[\|\mathrm{p}_j - f_j(\mathrm{x})\|_2\big]} \simeq 0\,. \tag{27}$$
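
The ratio in (27) is the coefficient of variation of the residual norm of one joint. A minimal sketch of how it could be estimated from samples is given below; the function name and the toy targets and predictions are hypothetical, and the population standard deviation is used as an assumption.

```python
import math
import statistics

def residual_norm_cv(targets, predictions):
    """Std / mean of the residual norms ||p_j - f_j(x)||_2 for one joint j."""
    norms = [math.dist(p, q) for p, q in zip(targets, predictions)]
    return statistics.pstdev(norms) / statistics.fmean(norms)

# Toy case where the assumption holds: every residual has the same norm.
targets_a = [(0.6, 0.8), (0.6, -0.8), (0.6, 0.8), (0.6, -0.8)]
predictions_a = [(0.6, 0.0)] * 4
cv_small = residual_norm_cv(targets_a, predictions_a)

# Toy case where it does not: residual norms vary widely across samples.
targets_b = [(0.0, 0.1), (0.0, 1.0), (0.0, 2.0), (0.0, 4.0)]
predictions_b = [(0.0, 0.0)] * 4
cv_large = residual_norm_cv(targets_b, predictions_b)

print(f"coefficient of variation (uniform residuals): {cv_small:.3f}")
print(f"coefficient of variation (spread residuals):  {cv_large:.3f}")
```

The first case satisfies the small-variance condition; the second does not.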

Proof. From proposition 4.8 we know that the poses predicted by the minimizer $f^*$ of

$$\text{MSE}^*(f) = \mathbb{E}_{\mathrm{x},\mathrm{p}}\big[\|\mathrm{p} - f(\mathrm{x})\|_2^2\big] \tag{28}$$

are inconsistent. Let $f_j$ be the component of $f$ corresponding to the $j^{\text{th}}$ joint. We define the $j^{\text{th}}$ mean-per-joint-position-error component as:

$$\text{MPJPE}_j^*(f) \triangleq \mathbb{E}_{\mathrm{x},\mathrm{p}}\big[\|\mathrm{p}_j - f_j(\mathrm{x})\|_2\big]\,. \tag{29}$$

Under the small variance assumption, we have:

\begin{align}
&\frac{\mathbb{V}_{\mathrm{x},\mathrm{p}}\big[\|\mathrm{p}_j - f_j(\mathrm{x})\|_2\big]}{\mathbb{E}_{\mathrm{x},\mathrm{p}}\big[\|\mathrm{p}_j - f_j(\mathrm{x})\|_2\big]^2} \tag{30} \\
&\quad = \frac{\mathbb{E}_{\mathrm{x},\mathrm{p}}\big[\|\mathrm{p}_j - f_j(\mathrm{x})\|_2^2\big] - \mathbb{E}_{\mathrm{x},\mathrm{p}}\big[\|\mathrm{p}_j - f_j(\mathrm{x})\|_2\big]^2}{\mathbb{E}_{\mathrm{x},\mathrm{p}}\big[\|\mathrm{p}_j - f_j(\mathrm{x})\|_2\big]^2} \tag{31} \\
&\quad = \frac{\text{MSE}_j^*(f) - \text{MPJPE}_j^*(f)^2}{\text{MPJPE}_j^*(f)^2} \simeq 0\,, \tag{32}
\end{align}

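where $\text{MSE}_j^*(f) \triangleq \mathbb{E}_{\mathrm{x},\mathrm{p}}\big[\|\mathrm{p}_j - f_j(\mathrm{x})\|_2^2\big]$ denotes the $j^{\text{th}}$ component of $\text{MSE}^*$. Hence both criteria, MSE and MPJPE, are asymptotically equivalent and have the same minimizer $f^*$, which is inconsistent according to Proposition 4.8. $\blacksquare$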
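As a side note on the last step of the proof, the passage from (30) to (32) relies only on the usual decomposition of the variance. A short worked version is given below, where $e_j = \|\mathrm{p}_j - f_j(\mathrm{x})\|_2$ is a shorthand introduced here purely for readability and all expectations are taken over $(\mathrm{x}, \mathrm{p})$:

```latex
\begin{align*}
\mathbb{V}[e_j]
  &= \mathbb{E}[e_j^2] - \mathbb{E}[e_j]^2
   = \text{MSE}_j^*(f) - \text{MPJPE}_j^*(f)^2 \,, \\
\text{MSE}_j^*(f)
  &= \text{MPJPE}_j^*(f)^2 \left( 1 + \frac{\mathbb{V}[e_j]}{\mathbb{E}[e_j]^2} \right)
   \simeq \text{MPJPE}_j^*(f)^2 \,.
\end{align*}
```

In other words, under the small variance assumption (27), the per-joint squared error is approximately the square of the per-joint position error, and squaring is monotone on non-negative values.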

10 Further details of 1D-to-2D case study

10.1 Implementation details

Datasets. We created a dataset of input-output pairs $\{(x_i, (x_i, y_i))\}_{i=1}^{N}$, divided into 1,000 training examples, 1,000 validation examples and 1,000 test examples. Since the 2D position of $J_1$ is fully determined by the angle $\theta$ between the segment $(J_0, J_1)$ and the $x$-axis, the dataset is generated by first sampling $\theta$ from a von Mises mixture distribution, then converting it into Cartesian coordinates $(x_i, y_i)$ to form the outputs, and finally projecting them onto the $x$-axis to obtain the inputs.

Distribution scenarios. We considered three different distribution scenarios with different levels of difficulty:

  1. Easy scenario: a unimodal distribution centered at $\theta = \frac{2\pi}{5}$, where the axis of maximum 2D variance is approximately parallel to the $x$-axis (Figure 3 A).

  2. Difficult unimodal scenario: a unimodal distribution centered at $\theta = 0$, where the axis of maximum 2D variance is perpendicular to the $x$-axis (Figure 3 B).

  3. Difficult multimodal scenario: a bimodal distribution, with modes at $\theta_1 = \frac{\pi}{3}$ and $\theta_2 = -\frac{\pi}{3}$ and mixture weights $w_1 = \frac{2}{3}$ and $w_2 = \frac{1}{3}$, i.e., where the projections of the modes onto the $x$-axis are close to each other (Figure 3 C).

All von Mises components in all scenarios had concentrations equal to $20$.
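The generation procedure described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used for the experiments: the function names, the seed handling and the unit segment length are assumptions made for the example.

```python
import math
import random


def sample_von_mises_mixture(modes, weights, kappa, n, rng):
    """Draw n angles from a mixture of von Mises components sharing one concentration."""
    thetas = []
    for _ in range(n):
        # Pick a component according to the mixture weights, then sample from it.
        mu = rng.choices(modes, weights=weights, k=1)[0]
        thetas.append(rng.vonmisesvariate(mu, kappa))
    return thetas


def make_dataset(modes, weights, kappa=20.0, n=1000, seed=0):
    """Return pairs (x_i, (x_i, y_i)); the segment length is assumed to be 1."""
    rng = random.Random(seed)
    pairs = []
    for theta in sample_von_mises_mixture(modes, weights, kappa, n, rng):
        x, y = math.cos(theta), math.sin(theta)  # position of J_1 on the unit circle
        pairs.append((x, (x, y)))                # the input is the projection onto the x-axis
    return pairs


# The three scenarios: A (easy), B (difficult unimodal), C (difficult bimodal).
SCENARIOS = {
    "A": ([2 * math.pi / 5], [1.0]),
    "B": ([0.0], [1.0]),
    "C": ([math.pi / 3, -math.pi / 3], [2 / 3, 1 / 3]),
}

train = make_dataset(*SCENARIOS["C"], n=1000, seed=0)
```

In scenario C, both modes project near $x = \cos(\pi/3) = 0.5$, so the input alone cannot tell them apart; this is what makes the scenario difficult for a single-output regressor.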

Architectures and training. All three models were based on a multi-layer perceptron (MLP) with 2 hidden layers of $32$ neurons each, using tanh activation.

The constrained and unconstrained MLPs were trained using the mean-squared loss $\frac{1}{N}\sum_{i=1}^{N}\big((\hat{x}_i - x_i)^2 + (\hat{y}_i - y_i)^2\big)$. ManiPose was trained with loss (7), and had $K = 2$ heads. We trained all models with batches of $100$ examples for a maximum of $50$ epochs. We used the Adam optimizer [12], with default hyperparameters and no weight decay. Learning rates were searched for each model and distribution independently over a small grid: $[10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}]$ (cf. selected values in Tab. 6). They were scheduled during training using a plateau strategy of factor $0.5$, patience of $10$ epochs and threshold of $10^{-4}$.

| Distribution | A | B | C |
|---|---|---|---|
| Unconstr. MLP | $10^{-3}$ | $10^{-3}$ | $10^{-2}$ |
| Constrained MLP | $10^{-2}$ | $10^{-4}$ | $10^{-2}$ |
| ManiPose | $10^{-2}$ | $10^{-3}$ | $10^{-2}$ |
Table 6: Selected learning rates for 1D-to-2D synthetic experiment.

10.2 Extension to 2D-to-3D setup with more joints

In this section, we further extend the 2-joint 1D-to-2D lifting experiment from Sec. 3 to the case of 2D-to-3D lifting with three joints. The idea is to get one step closer to the 3D-HPE scenario.

As in Sec. 3, we suppose that joint $J_0$ is at the origin at all times, that $J_1$ is connected to $J_0$ through a rigid segment of length $s_0$ and that $J_2$ is connected to $J_1$ through a second rigid segment of length $s_1 < s_0$. We further assume that both $J_1$ and $J_2$ are allowed to rotate around two axes orthogonal to each other. Hence, $J_1$ is constrained to lie on a circle $S^1(0, s_0)$, while $J_2$ lies on a torus $T$ homeomorphic to $S^1(0, s_0) \times S^1(0, s_1)$. Without loss of generality, we set the radii $s_0 = 2$ and $s_1 = 1$ and assume them to be known.

Given this setup, we are interested in learning to predict the 3D pose $(J_1, J_2) = (x_1, y_1, z_1, x_2, y_2, z_2) \in \mathbb{R}^6$, given its 2D projection $(K_1, K_2) = (x_1, z_1, x_2, z_2) \in \mathbb{R}^4$. We create a dataset comprising 20,000 training, 2,000 validation, and 2,000 test examples, sampled using an arbitrary von Mises mixture of poloidal and toroidal angles $(\theta, \phi)$ in $T$. We set the modes of this mixture at $[(-\pi, 0), (0, \pi/4), (\frac{1}{2}, -\pi/4), (2\pi/3, \pi/2)]$, with concentrations of $[2, 4, 3, 10]$ and weights $[0.3, 0.4, 0.2, 0.1]$. Similarly to Fig. 3 C, this is a difficult multimodal distribution, which is depicted in Fig. 8.
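To make the geometry of this setup concrete, the following sketch maps a pair of angles to the joints $(J_1, J_2)$ and to the 2D input. The parametrization is an assumption chosen for illustration: the text does not specify which axes are used, so here $J_1$ rotates about the $z$-axis and $J_2$ about the tangent of the circle at $J_1$. The function and argument names are hypothetical.

```python
import math


def pose_from_angles(toroidal, poloidal, s0=2.0, s1=1.0):
    """Map two angles to the 3D joints (J1, J2) of a two-segment chain rooted at the origin."""
    # Assumed convention: J1 rotates about the z-axis, so it stays on a circle
    # of radius s0 in the xy-plane.
    j1 = (s0 * math.cos(toroidal), s0 * math.sin(toroidal), 0.0)
    # J2 rotates about the tangent of that circle at J1 (orthogonal to the z-axis),
    # so it sweeps a torus with major radius s0 and minor radius s1.
    radial = s0 + s1 * math.cos(poloidal)
    j2 = (radial * math.cos(toroidal), radial * math.sin(toroidal), s1 * math.sin(poloidal))
    return j1, j2


def project_to_2d(j1, j2):
    """Drop the y coordinate (depth) to obtain the input (x1, z1, x2, z2)."""
    return (j1[0], j1[2], j2[0], j2[2])


j1, j2 = pose_from_angles(toroidal=2 * math.pi / 3, poloidal=math.pi / 2)
# Both segment lengths are preserved by construction (s0 and s1, up to rounding).
print(math.dist((0.0, 0.0, 0.0), j1), math.dist(j1, j2))
```

Whatever the angles, the two segment lengths stay equal to $s_0$ and $s_1$, which is the consistency property that the constrained models enforce and that an unconstrained regressor may violate.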

Figure 8: Visualisation of the von Mises mixture distribution on the torus $T$, shown in two panels (a) and (b). The different colors (blue, green, red, purple) represent the modes of the sampled points. Only joint $J_2$ is represented here for clarity.

We train and evaluate the same baselines as in Sec. 3 in this new scenario, using a similar setup (cf. Sec. 10.1, Architectures and training). The corresponding Mean Per Segment Consistency Error (MPSCE) and Mean Per Joint Position Error (MPJPE) results are reported in Tab. 7.

| | MPJPE $\downarrow$ | MPSCE $\downarrow$ |
|---|---|---|
| Unconstr. MLP | 1.1468 | 0.2539 |
| Constrained MLP | 1.1593 | 0.0000 |
| ManiPose | 1.1337 | 0.0000 |
Table 7: Mean per joint position error (MPJPE) and mean per segment consistency error (MPSCE) in a 2D-to-3D scenario. ManiPose reaches perfect MPSCE consistency without degrading MPJPE performance.

We see that the same observations as in Sec. 3 also apply here: although the unconstrained MLP yields competitive MPJPE results, its predictions are not consistently aligned with the manifold, as indicated by its poor MPSCE performance. Again, ManiPose offers an effective balance between maintaining manifold consistency and achieving a low joint position error.

11 Further ManiPose implementation details

11.1 Architectural details

Our architecture is backbone-agnostic, as shown in Fig. 4. Hence, in order to have a fair comparison, we decided to implement it using the most powerful architecture available, i.e., MixSTE [40].

In practice, the rotations module follows the MixSTE architecture with $d_l = 8$ spatio-temporal transformer blocks of dimension $d_m = 512$ and a time receptive field of $L = 243$ frames for Human 3.6M experiments and $L = 43$ frames for MPI-INF-3DHP experiments. Contrary to MixSTE, this network outputs rotation embeddings of dimension $6$ for each joint and frame, instead of Cartesian coordinates of dimension $3$.

The segment module was also implemented with a MixSTE backbone, but a smaller one, of depth $d_l = 2$ and dimension $d_m = 128$.

The ablation study presented in Tab. 5 shows that the increase in the number of parameters between MixSTE and ManiPose is negligible.

11.2 Pose decoding details

The pose decoding block from Fig. 4 is described in Sec. 5.1 and is based on Algorithms 1 and 2. The whole procedure is illustrated in Fig. 9.

Figure 9: Pose decoder overview. First, the predicted segment length $s$ is used to scale a unit reference pose $\mathrm{u}$ into a scaled reference pose $\mathrm{u}'$ with the subject morphology. Then, predicted sequences of rotation representations $\{r_t\}_{t=1}^{L}$ are converted into sequences of rotation matrices $\{R_t\}_{t=1}^{L}$. Finally, these rotations are applied to the reference pose $\mathrm{u}'$ using the forward kinematics algorithm. This is repeated for each hypothesis $k$ in the case of ManiPose.
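| Joint | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Weight | 1 | 1 | 2.5 | 2.5 | 1 | 2.5 | 2.5 | 1 | 1 | 1 | 1.5 | 1.5 | 4 | 4 | 1.5 | 4 | 4 |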
Table 8: Joint-wise weights used in the Winner-takes-all loss Eq. 8 (as in [40]).
Algorithm 1 6D rotation representation conversion [42]
Require: Predicted 6D rotation representation $r \in \mathbb{R}^6$.
1: $x' \leftarrow [r_0, r_1, r_2]$,
2: $y' \leftarrow [r_3, r_4, r_5]$,
3: $x \leftarrow x' / \|x'\|_2$,
4: $z' \leftarrow x \wedge y'$,
5: $z \leftarrow z' / \|z'\|_2$,
6: $y \leftarrow z \wedge x$,
7: return $R = [x | y | z] \in \mathbb{R}^{3 \times 3}$.
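For readers who prefer code, Algorithm 1 can be written in a few lines of plain Python. This is only an illustrative sketch on nested lists (the helper names are hypothetical), not the batched implementation one would use inside a network.

```python
import math


def cross(a, b):
    """Cross product of two 3D vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def normalize(v):
    """Scale a 3D vector to unit Euclidean norm."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]


def rotation_from_6d(r):
    """Convert a 6D representation r into a 3x3 rotation matrix R = [x | y | z]."""
    x = normalize(r[0:3])            # first three entries, normalized
    z = normalize(cross(x, r[3:6]))  # orthogonal to x and to the last three entries
    y = cross(z, x)                  # unit norm already, since z and x are orthonormal
    # x, y and z are the columns of R; build the matrix row by row.
    return [[x[i], y[i], z[i]] for i in range(3)]


R = rotation_from_6d([1.0, 2.0, 0.5, -0.3, 0.7, 1.1])
```

By construction the columns of `R` are orthonormal and its determinant is 1. The sketch does not handle the degenerate case where the two 3D halves of `r` are parallel or one of them is zero, which would cause a division by zero.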
Algorithm 2 Forward Kinematics [27, 20]
Require: Scaled reference pose $\mathrm{u}' \in (\mathbb{R}^3)^J$, predicted rotation matrices $R_{t,j}$, $0 \leq j < J$.
1: $R'_{t,0} \leftarrow R_{t,0}$,
2: $\mathrm{p}_{t,0} \leftarrow \mathrm{u}'_0$,
3: for $j = 1, \dots, J-1$ do
4:     $R'_{t,j} \leftarrow R_{t,j} R'_{t,\tau(j)}$, $\triangleright$ Compose relative rotations
5:     $\mathrm{p}_{t,j} \leftarrow R'_{t,j}(\mathrm{u}'_j - \mathrm{u}'_{\tau(j)}) + \mathrm{p}_{t,\tau(j)}$,
6: end for
7: return $\mathrm{p}_t = [\mathrm{p}_{t,j}]_{0 \leq j < J}$
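Algorithm 2 can likewise be sketched in plain Python for a single frame. Two assumptions are made here and are not stated in this section: $\tau(j)$ is taken to be the index of the parent of joint $j$ (passed as a hypothetical `parents` list), and joints are ordered so that every parent comes before its children.

```python
import math


def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def matvec(a, v):
    """Product of a 3x3 matrix with a 3D vector."""
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]


def forward_kinematics(u, rotations, parents):
    """Pose one frame: u is the scaled reference pose, rotations[j] the matrix R_{t,j}."""
    num_joints = len(u)
    composed = [None] * num_joints  # R'_{t,j}
    pose = [None] * num_joints      # p_{t,j}
    composed[0] = rotations[0]
    pose[0] = list(u[0])
    for j in range(1, num_joints):
        p = parents[j]  # tau(j), assumed to be the parent of joint j
        composed[j] = matmul(rotations[j], composed[p])  # compose relative rotations
        bone = [u[j][c] - u[p][c] for c in range(3)]     # segment in the reference pose
        rotated = matvec(composed[j], bone)
        pose[j] = [rotated[c] + pose[p][c] for c in range(3)]
    return pose


# Toy chain of three joints along the z-axis; joint 1 is rotated by 90 degrees about x.
c, s = math.cos(math.pi / 2), math.sin(math.pi / 2)
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rot_x = [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]
u = [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.5]]
pose = forward_kinematics(u, [identity, rot_x, identity], parents=[0, 0, 1])
```

Because each segment of the reference pose is only rotated and then attached to its parent, the segment lengths of `pose` equal those of `u` regardless of the predicted rotations.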

11.3 Training details

Training tactics.

In order to have a fair comparison with MixSTE [40], we trained ManiPose using the same training tactics, such as pose flip augmentation both at training and test time. Moreover, the training loss (7) was complemented with two additional terms described in [40]:

  1. a TCloss term, initially introduced in [9];

  2. a velocity loss term, introduced in [29].

We also weighted the Winner-takes-all MPJPE loss (8) as in [40] (cf. weights in Tab. 8). The score loss weight, $\beta$, was set to $0.1$ according to our hyperparameter study (Sec. 12), while the TCloss and velocity loss terms had respective weights of $0.5$ and $2$ (values from [40]).
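As an illustration of how the joint-wise weights of Tab. 8 can enter a winner-takes-all selection, here is a small sketch for a single frame. It does not reproduce Eq. 8: the normalization (a plain mean over joints of the weighted distances) and the selection rule (lowest weighted error wins) are assumptions, and the function names are hypothetical.

```python
import math

# Joint-wise weights from Tab. 8, indexed by joint.
JOINT_WEIGHTS = [1, 1, 2.5, 2.5, 1, 2.5, 2.5, 1, 1, 1, 1.5, 1.5, 4, 4, 1.5, 4, 4]


def weighted_joint_error(pred, target, weights=JOINT_WEIGHTS):
    """Mean over joints of the weighted Euclidean distance between two poses."""
    dists = [w * math.dist(p, t) for w, p, t in zip(weights, pred, target)]
    return sum(dists) / len(dists)


def winner_takes_all(hypotheses, target, weights=JOINT_WEIGHTS):
    """Return (index, error) of the hypothesis with the lowest weighted error."""
    errors = [weighted_joint_error(h, target, weights) for h in hypotheses]
    best = min(range(len(errors)), key=errors.__getitem__)
    return best, errors[best]
```

With such a rule, only the best hypothesis contributes to the position term for a given target, while the larger weights make errors on the corresponding joints count more.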

Training settings. We trained our model for a maximum of $200$ epochs with the Adam optimizer [12], using default hyperparameters, a weight decay of $10^{-6}$ and an initial learning rate of $4 \times 10^{-5}$. The latter was reduced with a plateau scheduler of factor $0.5$, patience of $11$ epochs and threshold of $0.1$ mm. Batches contained $3$ sequences of $L = 243$ frames each for the Human 3.6M training, and $30$ sequences of $L = 43$ frames for MPI-INF-3DHP.

11.4 Baselines evaluation

All Human 3.6M evaluations of MPSSE and MPSCE listed in Tabs. 2 and 5 were performed using the official checkpoints of the baseline methods and their corresponding official evaluation scripts. Concerning MPI-INF-3DHP evaluations from Tab. 4, checkpoints were not available (except for P-STMO). Baseline models were hence retrained from scratch using the official MPI-INF-3DHP training scripts provided by the authors of each work, with the hyperparameters reported in their corresponding papers. We checked that we were able to reproduce the reported MPJPE results.

12 Further results on H36M dataset

| Method | $L$ | $K$ | Dir. | Disc | Eat | Greet | Phone | Photo | Pose | Purch. | Sit | SitD. | Smoke | Wait | WalkD. | Walk | WalkT. | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MGCN [43] | 1 | 1 | 35.7 | 38.6 | 36.3 | 40.5 | 39.2 | 44.5 | 37.0 | 35.4 | 46.4 | 51.2 | 40.5 | 35.6 | 41.7 | 30.7 | 33.9 | 39.1 |
| ST-GCN [2] | 1 | 1 | 35.7 | 37.8 | 36.9 | 40.7 | 39.6 | 45.2 | 37.4 | 34.5 | 46.9 | 50.1 | 40.5 | 36.1 | 41.0 | 29.6 | 33.2 | 39.0 |
| Pavllo et al. [29] | 243 | 1 | 34.2 | 36.8 | 33.9 | 37.5 | 37.1 | 43.2 | 34.4 | 33.5 | 45.3 | 52.7 | 37.7 | 34.1 | 38.0 | 25.8 | 27.7 | 36.8 |
| Zheng et al. [41] | 81 | 1 | 34.1 | 36.1 | 34.4 | 37.2 | 36.4 | 42.2 | 34.4 | 33.6 | 45.0 | 52.5 | 37.4 | 33.8 | 37.8 | 25.6 | 27.3 | 36.5 |
| Liu et al. [22] | 243 | 1 | 32.3 | 35.2 | 33.3 | 35.8 | 35.9 | 41.5 | 33.2 | 32.7 | 44.6 | 50.9 | 37.0 | 32.4 | 37.0 | 25.2 | 27.2 | 35.6 |
| Anatomy3D [4] | 243 | 1 | 32.6 | 35.1 | 32.8 | 35.4 | 36.3 | 40.4 | 32.4 | 32.3 | 42.7 | 49.0 | 36.8 | 32.4 | 36.0 | 24.9 | 26.5 | 35.0 |
| UGCN [35] | 96 | 1 | 31.8 | 34.3 | 35.4 | 33.5 | 35.4 | 41.7 | 31.1 | 31.6 | 44.4 | 49.0 | 36.4 | 32.2 | 35.0 | 24.9 | 23.0 | 34.5 |
| MixSTE [40] | 243 | 1 | 30.8 | 33.1 | 30.3 | 31.8 | 33.1 | 39.1 | 31.1 | 30.5 | 42.5 | 44.5 | 34.0 | 30.8 | 32.7 | 22.1 | 22.9 | 32.6 |
| ManiPose (Ours) | 243 | 5 | 31.9 | 35.7 | 30.8 | 33.5 | 34.0 | 39.8 | 33.0 | 31.4 | 41.1 | 45.9 | 36.0 | 32.3 | 35.4 | 24.7 | 25.8 | 34.1 |
Table 9: Quantitative comparison with the state-of-the-art methods on Human3.6M under Protocol #2 (P-MPJPE in mm), using detected 2D poses. $L$: sequence length. $K$: number of hypotheses. Bold: best; Underlined: second best. ManiPose results are reported using the oracle evaluation.
Figure 10: Detailed results on H36M. Top: Mean position errors per joint. Bottom: Human 3.6M skeleton.

Protocol #2 results. A detailed quantitative comparison in terms of P-MPJPE between ManiPose and regression-based state-of-the-art methods is shown in Tab. 9. We observe the same patterns as in Tab. 3, namely that ManiPose reaches the second-best P-MPJPE performance on average and for most actions. We confirm here again that the substantial improvements in pose consistency brought by ManiPose are not obtained at the expense of traditional metrics derived from MPJPE.

Errors per joint. At the top of Fig. 10, we see that most MixSTE errors come from the feet, elbow and wrist joints, which are the most prone to depth ambiguity. ManiPose helps to reduce the position errors for most of these ambiguous joints, probably as a byproduct of its major consistency improvements shown in Fig. 5.

Impact of hyperparameters. ManiPose introduces two additional hyperparameters when compared to MixSTE: the number $K$ of hypotheses and the score loss weight $\beta$ (cf. Eq. 7). We hence assess the impact of their respective values on MPJPE. For computational cost reasons, we used a smaller version of our model for this study, with transformer blocks of dimension $d_m = 64$ and a time receptive field of $L = 27$ frames.

Figure 11: Parameter study on H36M. Impact of the number $K$ of hypotheses (left) and score loss weight $\beta$ (right) on ManiPose aggregated and oracle performance. Results are obtained with a smaller network ($d_m = 64$) and a shorter sequence ($L = 27$). Left plot obtained with $\beta = 0.1$ and right plot with $K = 5$.

Fig. 11 (left) shows that more hypotheses help, but that the performance improvements saturate around 5 hypotheses. Concerning $\beta$, Fig. 11 (right) shows that lower values help to improve the MPJPE performance.

13 Code

We provide the code to reproduce the 1D-to-2D experiment from Sec. 3 attached to this appendix. The full code of this work will be made openly available upon publication.