ManiPose: Manifold-Constrained Multi-Hypothesis 3D Human Pose Estimation

Cédric Rommel$^{1}$ (corresponding author), Victor Letzelter$^{1,2}$, Nermin Samet$^{1}$, Renaud Marlet$^{1,5}$, Matthieu Cord$^{1,3}$, Patrick Pérez$^{1}$, Eduardo Valle$^{1,4}$

$^{1}$ Valeo.ai, Paris, France

$^{2}$ Telecom Paris, Palaiseau, France

$^{3}$ Sorbonne Université, Paris, France

$^{4}$ Recod.ai Lab, School of Electrical and Computing Engineering, University of Campinas, Brazil

$^{5}$ LIGM, Ecole des Ponts, Univ Gustave Eiffel, CNRS, Marne-la-Vallee, France

Abstract

Monocular 3D human pose estimation (3D-HPE) is an inherently ambiguous task, as a 2D pose in an image might originate from different possible 3D poses. Yet, most 3D-HPE methods rely on regression models, which assume a one-to-one mapping between inputs and outputs. In this work, we provide theoretical and empirical evidence that, because of this ambiguity, common regression models are bound to predict topologically inconsistent poses, and that traditional evaluation metrics, such as the MPJPE, P-MPJPE and PCK, are insufficient to assess this aspect. As a solution, we propose ManiPose, a novel manifold-constrained multi-hypothesis model capable of proposing multiple candidate 3D poses for each 2D input, together with their corresponding plausibility. Unlike previous multi-hypothesis approaches, our solution is completely supervised and does not rely on complex generative models, thus greatly facilitating its training and usage. Furthermore, by constraining our model to lie within the human pose manifold, we can guarantee the consistency of all hypothetical poses predicted with our approach, which was not possible in previous works. We illustrate the usefulness of ManiPose in a synthetic 1D-to-2D lifting setting and demonstrate on real-world datasets that it outperforms state-of-the-art models in pose consistency by a large margin, while still reaching competitive MPJPE performance.

1 Introduction

Figure 1: When faced with depth ambiguity, state-of-the-art regression methods [40] fail to predict poses with consistent segment lengths. In contrast, ManiPose predicts multiple consistent hypotheses and is thus capable of dealing with such ambiguities.

Monocular 3D human pose estimation (3D-HPE) is a significant and challenging learning problem that aims at predicting human poses in 3D space given a single image or video. Most modern approaches to 3D-HPE split the problem into two steps: 2D-HPE is first used to predict keypoint positions in the 2D pixel space, followed by a 2D-to-3D lifting step, predominantly cast as a statistical regression problem. Because of depth ambiguity and occlusions, lifting is well known to be intrinsically ill-posed: multiple 3D poses can correspond to the same 2D pose observed in an image. Despite these difficulties, recent developments have spurred rapid transformations in the field, leading to substantial performance improvements in terms of mean per joint position error (MPJPE) and other derived metrics (e.g., P-MPJPE, PCK) [40, 41, 32, 35].

However, recent studies [37, 8, 30] have noted that poses predicted by state-of-the-art models fail to respect the basic invariances of human morphology, such as bilateral symmetry of left and right-side body parts, or the constant length of rigid segments connecting the joints of a given subject across an image sequence. In this work, we provide theoretical elements clarifying the cause of these issues. Namely, we formally prove that pose consistency and traditional performance metrics, such as MPJPE, are somewhat antagonistic and cannot be optimized simultaneously using a standard regression model. This is because (i) MPJPE ignores the topology of the space of human poses, and (ii) regression models rely on a one-to-one mapping between inputs and outputs, thus overlooking the inherently ambiguous nature of the 3D-HPE problem.

To reconcile both MPJPE performance and pose consistency, and better accommodate the 3D-HPE task ambiguity, we propose ManiPose, a novel manifold-constrained multi-hypothesis human pose lifting method. Unlike previously proposed multi-hypothesis (MH) approaches to 3D-HPE, our solution relies on a completely supervised model instead of a generative one, thus alleviating training complexity. Furthermore, this structural novelty allows us to enforce pose consistency as a hard constraint in the network architecture, guaranteeing that all predicted poses lie on a pose manifold, which we estimate. As a third difference with previous approaches, ManiPose predicts not only a set of hypothetical poses, but also their corresponding likelihood conditional to the 2D input. Together, these two outputs allow ManiPose to estimate deterministically the conditional distribution of 3D poses given a 2D input in a single forward pass. All these ingredients allow ManiPose to reduce pose inconsistency by more than one order of magnitude with regard to state-of-the-art methods, while maintaining competitive performance in terms of MPJPE.

Our contributions can be summarized as follows:

• We prove that the standard regression setting inevitably leads to predicted poses outside the human pose manifold.

• As a corollary, we also prove that regression models constrained to lie on the human pose manifold cannot beat unconstrained models in terms of the traditional performance metrics, which calls for complementing benchmarks with pose consistency measures.

• We motivate our approach and illustrate our theoretical findings on simulated lifting experiments, where we demonstrate that only multi-hypothesis approaches can reconcile MPJPE performance and pose consistency.

• We hence propose a novel fully-supervised multi-hypothesis method for 3D-HPE called ManiPose.

• Finally, we demonstrate on two challenging datasets, Human 3.6M and MPI-INF-3DHP, that ManiPose outperforms state-of-the-art methods by a substantial margin in terms of pose consistency, while reaching comparable MPJPE performance.

2 Related work

Regression-based 2D-to-3D pose lifting. While 2D-to-3D human pose lifting was initially tackled at the frame level [24, 3], the field quickly adopted recurrent [9], convolutional [29] and graph neural networks [2, 43, 10, 39] to move towards video-level predictions. More recently, spatial-temporal transformers were proposed [32, 41], including MixSTE [40], which, arguably, is the state of the art. Our work also processes videos with a transformer architecture. A few previous works have proposed to constrain predicted poses to respect human symmetries [38, 4]. We follow the same idea, but in a multi-hypothesis setting and with a different constraint implementation.

Multi-hypothesis 3D-HPE. The intrinsic ambiguity of the 3D-HPE task led the community to investigate multi-hypothesis approaches (MH). Efforts included the use of Mixture Density Networks [19, 28, 1], variational autoencoders [33], normalizing flows [13, 37] and diffusion models [8]. Contrary to our approach, these methods rely on training a conditional generative model to sample 3D pose hypotheses given the 2D input. A notable exception is MHFormer [21], which, like us, proposes a deterministic MH approach. However, it treats hypotheses as intermediate representations and aggregates them into a single pose at the final layers of the network, thus falling back to a one-to-one mapping. We try to avoid this in this work for reasons detailed in the next sections. Moreover, none of the previous MH approaches constrain hypotheses to lie on the human pose manifold, thus not guaranteeing good pose consistency.

Multiple Choice Learning. Suited for ambiguous tasks, MCL [7] consists in optimizing a set of predictors using the oracle loss. Adapted for deep learning by Lee et al. [15, 16], it results in diverse predictors being specialized in subsets of the data distribution. It has proved its effectiveness in several computer vision tasks [31, 14, 26, 6, 23, 34], and was first applied to 2D-HPE in [31]. Our work is the first to revisit MCL for the 3D-HPE task, building upon recent novelties from Letzelter et al. [17].

3 Need for constrained multiple hypotheses

This section aims to illustrate and motivate our contributions by highlighting the shortcomings of traditional lifting-based 3D-HPE methods, without unnecessary complexities.

To this end, we propose to tackle a 1D-to-2D lifting task. As in human pose lifting, we take a root joint $J_{0}$ as reference, which is equivalent to fixing its position at $(0,0)$ . Hence, for a joint $J_{1}$ , the problem amounts to predicting the Cartesian coordinates $(x,y)$ defining its position, given its 1D projection $u=x$ . To strip the problem to its essence, we consider only those two joints, and omit camera perspective. As in the case of human poses, we suppose that joints $J_{0}$ and $J_{1}$ are connected by a rigid segment of length $s=1$ , assumed to be fixed for the whole dataset (as if there were a single subject in a 3D-HPE problem). As a consequence, we know that $J_{1}$ lies on a manifold: the circle of radius $s$ centered on $J_{0}=(0,0)$ (cf. Fig. 2 left).

To illustrate our point empirically, we create three datasets of input-output pairs $\{(x_{i},(x_{i},y_{i}))\}_{i=1}^{N}$ with different angular distributions as depicted on Fig. 3 (more details about the experiment setting can be found in the supplementary material in Sec. 10). In an easy scenario (Fig. 3 A), a simple 2-layer MLP, trained in a regression setting to predict $(x,y)$ given $x$ , can solve the problem reasonably well. However, this simple solution fails in the two other scenarios (Fig. 3 B-C), leading to predictions that are outside the manifold. This happens because depth ambiguity makes the lifting problem heavily ill-posed in these cases: there are multiple 2D output positions in the training set corresponding to the same projected inputs.

A possible idea to mitigate this issue is to inject prior knowledge about the manifold into the model, such as forcing its predictions to lie on the circle. This can be implemented by training the same MLP to predict the angle $\theta$ instead of the Cartesian coordinates $(x,y)$ directly, and then decoding the prediction $\hat{\theta}$ into ${(\hat{x},\hat{y})=(\cos\hat{\theta},\sin\hat{\theta})}$ . Although this solution leads to predictions on the circle (cf. Fig. 3 B-C), it achieves worse MPJPE performance than the previous unconstrained MLP in the multimodal case (cf. Constrained MLP in Tab. 1).

	MPJPE $\downarrow$	Distance to circle $\downarrow$
Unconst. MLP	0.748	0.411
Constrained MLP	0.759	0.000
ManiPose	0.733	0.000

Table 1: Mean per joint position error (MPJPE) and mean distance to circle for each model on the test set of the bimodal scenario C. Best in bold. MPJPE is computed using Eq. (1) for ManiPose.

This lower performance can be explained by the choice of evaluation metric, MPJPE, which assumes a Euclidean topology, thus leading to its minimum always falling inside the circle, hence outside the manifold. This is shown in the right side of Fig. 2, where we plot the true minimizer of the mean position error for an ambiguous input. We can see that the prediction made by the unconstrained MLP lies very close to the minimizer. These findings are made theoretically precise and generalized to the real 3D-HPE setting in the next section.
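The effect can be reproduced numerically on a single ambiguous input. The sketch below is only an illustration under our own assumptions, not the experiment of Fig. 3: for a fixed input $x=0.6$, the two points of the unit circle that project onto $x$ are taken to be equally likely, and the squared error of Prop. 4.8 is used as the criterion. The best prediction on the circle is found by a simple grid search.

```python
import numpy as np

# Illustrative assumption: both circle points projecting onto x are equally likely.
x = 0.6
y = np.sqrt(1.0 - x**2)
targets = np.array([[x, y], [x, -y]])

def mean_squared_error(pred, targets):
    """Mean squared Euclidean distance between one prediction and all targets."""
    return np.sum((targets - pred) ** 2, axis=1).mean()

# Unconstrained minimizer of the squared error: the conditional mean.
mse_minimizer = targets.mean(axis=0)
dist_to_circle = abs(np.linalg.norm(mse_minimizer) - 1.0)

# Best prediction constrained to the circle, found by grid search over the angle.
thetas = np.linspace(-np.pi, np.pi, 3601)
circle = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)
mse_on_circle = np.array([mean_squared_error(p, targets) for p in circle])

mse_unconstrained = mean_squared_error(mse_minimizer, targets)
mse_constrained = mse_on_circle.min()

print(f"unconstrained minimizer: {mse_minimizer}, distance to circle: {dist_to_circle:.2f}")
print(f"MSE unconstrained: {mse_unconstrained:.2f}, best MSE on circle: {mse_constrained:.2f}")
```

In this toy case the unconstrained minimizer sits strictly inside the circle, and no point of the circle reaches its error, which is the trade-off discussed above.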

A first consequence of this result is that unconstrained regression models are bound to predict inconsistent poses, where segment lengths are not constant, as seen for state-of-the-art 3D-HPE models [37, 8, 30]. A second consequence is that, no matter how good a model constrained to the manifold is, there is always an unconstrained model leading to better MPJPE performance, as illustrated by the minimizer constrained to the circle in Fig. 2 (right).

These realizations raise the question: Is there a way to have the best of both constrained and unconstrained models, namely a good MPJPE performance, while still predicting consistent poses on the manifold? Given the multimodality of human pose distributions (cf. Fig. 7), the only way out is to adopt a multi-hypothesis framework, leaving behind the regression setting, which cannot accommodate the ambiguity of the lifting task. Ideally, we would want our model to predict both possible poses (marked with a ${\color[rgb]{0,0.88,0}\star}$ ) in Fig. 2 (right), with their corresponding likelihood.

To this end, we tested a simplified version of our new multi-hypothesis approach, named ManiPose. Here, it consists in replacing the last linear layer of the constrained MLP model with $K\,{=}\,2$ identical linear heads. Each head $k$ predicts a hypothesis for the angle $\hat{\theta}^{k}$ and a corresponding score $\gamma^{k}$ modeling the probability that the solution $\theta$ is closer to $\hat{\theta}^{k}$ than to the other hypotheses $\{\hat{\theta}^{i}\}_{i\neq k}$ . This is described in detail in Sec. 5 in a general 3D-HPE setting.

As shown in Fig. 3, not only does this method predict hypotheses on the circle, it also yields predictions close to the MLP predictions when hypotheses $\hat{\theta}^{k}$ and scores $\gamma^{k}$ are aggregated:

$$\hat{x}_{\text{aggr}}=\sum_{k=1}^{K}\hat{x}^{k}\gamma^{k}\,,\quad\hat{y}_{\text{aggr}}=\sum_{k=1}^{K}\hat{y}^{k}\gamma^{k}\,,\tag{1}$$

where $(\hat{x}^{k},\hat{y}^{k})=(\cos\hat{\theta}^{k},\sin\hat{\theta}^{k})$ , leading even to superior MPJPE performance here (Tab. 1). Further experiments can be found in the supplementary material.
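As a minimal sketch of the aggregation in Eq. (1), the snippet below decodes $K=2$ angle hypotheses onto the unit circle and averages them with their scores. The angles and scores are hypothetical values chosen for illustration, not outputs of the trained model.

```python
import numpy as np

def aggregate(theta_hat, gamma):
    """Decode angle hypotheses onto the unit circle and average them as in Eq. (1)."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    hyps = np.stack([np.cos(theta_hat), np.sin(theta_hat)], axis=-1)  # (K, 2)
    return hyps, gamma @ hyps                                        # (K, 2), (2,)

# Hypothetical head outputs for one ambiguous input (scores sum to 1).
theta_hat = [np.pi / 3, -np.pi / 3]
gamma = [0.7, 0.3]
hyps, aggr = aggregate(theta_hat, gamma)

print("hypotheses:", hyps)   # each row lies on the unit circle
print("aggregated:", aggr)   # lies inside the circle
```

Each hypothesis stays on the manifold by construction, while the aggregated point generally does not; this is why the aggregate is used for MPJPE only.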

4 Theoretical results: Regression and MPJPE are not enough

In this section, we generalize the intuitions from Sec. 3 to the 3D-HPE setting, proving our claims rigorously in the supplementary material (Sec. 9).

We start with a few definitions needed for our derivations:

Definition 4.1 (Human skeleton).

We define a human skeleton as an undirected connected graph $G=(V,E)$ with $J=|V|$ nodes, called joints, associated with different human body articulation points. We assume a predefined order of joints and denote $A=[A_{ij}]_{0\leq i,j<J}\in\{0,1\}^{J\times J}$ the adjacency matrix of $G$ , defining joints connections.

Definition 4.2 (Human pose and movement).

Let $G$ be a skeleton of $J$ joints. We attach to each joint $i$ a position $\mathrm{p}^{G}_{i}$ in $\mathbb{R}^{3}$ and call the vector ${\mathrm{p}^{G}=[\mathrm{p}^{G}_{0},\dots,\mathrm{p}^{G}_{J-1}]\in(\mathbb{R}^{3})^{J}}$ a human pose. Furthermore, given a series of increasing time steps ${t_{1}<t_{2}<\dots<t_{L}\in\mathbb{R}}$ , we define a human movement $m^{G}$ as a sequence of poses of the same subject at those instants ${m^{G}=[\mathrm{p}^{G}_{t_{1}},\dots,\mathrm{p}^{G}_{t_{L}}]\in(\mathbb{R}^{3})^{J\times L}}$ .

Our theoretical results rely on a number of assumptions described hereafter. The first one states what reference frame is usually used for assessing 3D-HPE models:

Assumption 4.3 (Reference root joint).

For any skeleton $G$ and movement $m^{G}$ of length $L$ , the joint of index $0$ , called the root joint, is at the origin $\mathrm{p}^{G}_{t,0}=[0,0,0]$ at all times $t_{1}\leq t\leq t_{L}$ . This is equivalent to measuring positions $\mathrm{p}^{G}_{t}$ in a reference frame attached to the root joint.

The second assumption concerns the rigidity of human body parts, and was verified for ground-truth 3D poses from the Human 3.6M dataset (cf. Fig. 5):

Assumption 4.4 (Rigid segments).

We assume that the Euclidean distance between adjacent joints is constant within a movement $m^{G}$ : for any pair of instants $t$ and $t^{\prime}$ and for any joints $i,j$ such that $A_{ij}=1$ , we assume that

$$s_{t,i,j}=s_{t^{\prime},i,j}=s_{i,j}\,,\tag{2}$$

where $s_{t,i,j}=\|\mathrm{p}^{G}_{t,i}-\mathrm{p}^{G}_{t,j}\|_{2}>0$ .

Lastly, we assume that the conditional distribution of poses is not reduced to a single point, i.e., we have a one-to-many problem. This was also empirically verified (cf. Fig. 7 in supplementary material).

Assumption 4.5 (Non-degenerate conditional distribution).

Given a joint distribution $\mathrm{P}(\mathrm{x}^{G},\mathrm{p}^{G})$ of 3D poses ${\mathrm{p}^{G}\in(\mathbb{R}^{3})^{J}}$ and corresponding 2D inputs ${\mathrm{x}^{G}\in(\mathbb{R}^{2})^{J}}$ , we assume that the conditional distribution $\mathrm{P}(\mathrm{p}^{G}|\mathrm{x}^{G})$ is non-degenerate, i.e., it is not a single Dirac distribution.

Note that this can be true even when $\mathrm{P}(\mathrm{x}^{G},\mathrm{p}^{G})$ is unimodal (e.g., Fig. 3 B).

A first consequence of our assumptions is that true human poses forming a movement lie on a smooth manifold:

Proposition 4.6 (Human pose manifold).

If Asm. 4.3 and 4.4 are verified, then all poses $\mathrm{p}^{G}_{t}$ forming a human movement $m^{G}$ lie on the same manifold $\mathcal{M}$ of dimension $2(J-1)$ . The latter is homeomorphic to the direct product of 2D unit spheres $(S^{2})^{J-1}$ :

$$\forall t\in\{t_{1},\dots,t_{L}\},\quad\mathrm{p}^{G}_{t}\in\mathcal{M}\cong(S^{2})^{J-1}\,.\tag{3}$$

Note that when $J=2$ and poses have one fewer dimension, we fall back to the unit circle manifold $S^{1}$ , as in Sec. 3.

With this new proposition in mind, we can properly define the notion of pose consistency:

Definition 4.7 (Inconsistent poses and movements).

We say a predicted movement $m^{G}$ is inconsistent if it violates assumption 4.4.

We now have everything needed to state the main theoretical result of this work:

Proposition 4.8 (Inconsistency of MSE minimizer).

If the training poses distribution verifies Asm. 4.3-4.5, predicted poses minimizing the traditional mean-squared-error loss are inconsistent.

Prop. 4.8 has two main take-away messages: (i) MPJPE cannot fully assess 3D-HPE models and must be complemented with pose consistency metrics; (ii) when minimizing the mean joint error with an unconstrained model that assumes a one-to-one mapping, like a regression model, one cannot predict consistent poses. In the next section, we propose a solution to these problems.
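The mechanism behind Prop. 4.8 can be sketched on a single joint $j$ attached to the root by a segment of length $s_{j,0}$. This is only a simplified illustration under our own restriction to one segment; the general statement and its proof are those of Sec. 9. The mean-squared-error minimizer is the conditional expectation, and Asm. 4.3-4.4 give $\|\mathrm{p}^{G}_{j}\|_{2}=s_{j,0}$ for every true pose, so that

$$\hat{\mathrm{p}}^{G}_{j}(\mathrm{x}^{G})=\mathbb{E}[\mathrm{p}^{G}_{j}|\mathrm{x}^{G}]\,,\qquad\|\hat{\mathrm{p}}^{G}_{j}(\mathrm{x}^{G})\|_{2}\leq\mathbb{E}\big[\|\mathrm{p}^{G}_{j}\|_{2}\,\big|\,\mathrm{x}^{G}\big]=s_{j,0}\,.$$

Equality holds only if $\mathrm{p}^{G}_{j}$ is almost surely constant given $\mathrm{x}^{G}$. Whenever the conditional distribution of that joint is non-degenerate (the single-joint counterpart of Asm. 4.5), the predicted segment is therefore strictly shorter than $s_{j,0}$, by an amount that depends on the input and can thus vary along a movement.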

5 Method

As in previous state-of-the-art 3D-HPE approaches, we adopt a 2-step estimation procedure, where we first estimate human keypoints in the pixel space ${[\mathrm{x}^{G}_{1},\dots,\mathrm{x}^{G}_{L}]\in(\mathbb{R}^{2})^{J\times L}}$ from a sequence of $L$ video frames, and then lift them to 3D joint positions ${[\hat{\mathrm{p}}^{G}_{1},\dots,\hat{\mathrm{p}}^{G}_{L}]\in(\mathbb{R}^{3})^{J\times L}}$ . We focus on the second step (i.e., lifting) in the rest of the paper, assuming the availability of predicted keypoints $\mathrm{x}^{G}_{i}$ . In the following, we omit the $G$ superscript from pose and keypoints notations for clarity.

5.1 Constraining predictions to the pose manifold

Rationale. While traditional human pose lifting methods directly predict 3D joint positions using a regression model, the human morphology does not allow joints to occupy the whole 3D space (cf. Prop. 4.6). If we knew the length of each segment connecting pairs of joints for a given subject, we could mimic the procedure adopted in Sec. 3, and guarantee that predicted poses lie on the correct pose manifold by only predicting body parts' rotations with respect to a reference pose. Since we do not have access to ground-truth segment lengths in real use-cases, we propose to predict them, thus estimating the manifold.

Disentangled representations. We constrain model predictions to lie on an estimated manifold by predicting parametrized disentangled transformations of a reference pose $\mathrm{u}$ , for which all segments have unit length. Namely, we propose to split the network into two parts (cf. Fig. 4):

1. Segments module: predicts segment lengths ${s\in\mathbb{R}^{J-1}}$ , shared by all $L$ frames of the input sequence;

2. Rotations module: predicts relative rotation representations ${r=[r_{1,0},\dots,r_{L,J-1}]\in(\mathbb{R}^{d})^{J\times L}}$ of each joint with respect to its parent joint at each time step.

Rotations representation. As proposed in [42], we represent rotations continuously using 6D embeddings (i.e., $d=6$ ). Compared to quaternions or axis-angles, these representations are continuous and hence more amenable to be learned by neural networks, as demonstrated in [42].
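The conversion from a 6D embedding to a rotation matrix can be sketched as follows. This assumes the standard Gram-Schmidt construction associated with this representation in [42]; the paper's own Algorithm 1 is not reproduced here, so the function name, the column convention and the `eps` stabilizer are our illustrative choices.

```python
import numpy as np

def rotation_6d_to_matrix(r, eps=1e-8):
    """Map 6D embeddings of shape (..., 6) to matrices of shape (..., 3, 3).

    The two 3D halves of the embedding are orthonormalized by Gram-Schmidt
    and completed with a cross product (assumed convention).
    """
    r = np.asarray(r, dtype=float)
    a1, a2 = r[..., :3], r[..., 3:]
    b1 = a1 / (np.linalg.norm(a1, axis=-1, keepdims=True) + eps)
    # Remove from a2 its component along b1, then normalize.
    a2_orth = a2 - np.sum(b1 * a2, axis=-1, keepdims=True) * b1
    b2 = a2_orth / (np.linalg.norm(a2_orth, axis=-1, keepdims=True) + eps)
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=-1)  # b1, b2, b3 are the columns

# Usage on random embeddings for 2 frames and 17 joints.
rng = np.random.default_rng(0)
R = rotation_6d_to_matrix(rng.normal(size=(2, 17, 6)))
print(R.shape)  # (2, 17, 3, 3)
```

Any generic 6D vector is mapped to a valid rotation, so the network output needs no further normalization or constraint.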

Pose decoding. In order to deliver pose predictions in $(\mathbb{R}^{3})^{J\times L}$ , the intermediate representations $(s,r)$ need to be decoded, just like $\hat{\theta}$ in Sec. 3. This is achieved in three steps:

1. First, we scale the unit segments of the reference pose $\mathrm{u}\in(\mathbb{R}^{3})^{J}$ using $s$ , forming a scaled reference pose $\mathrm{u}^{\prime}$ :

$$\mathrm{u}^{\prime}_{j}=\mathrm{u}^{\prime}_{\tau(j)}+s_{j}(\mathrm{u}_{j}-\mathrm{u}_{\tau(j)})\,,\quad 0<j\leq J-1\,,\tag{4}$$

where $\tau$ maps the index of a joint to its parent’s, if any.

2. Then, for each time step ${1\leq t\leq L}$ and joint $0\leq j<J$ , we convert the predicted rotation representations $r_{t,j}$ into rotation matrices ${R_{t,j}\in\mathrm{SO}(3)}$ (cf. Algorithm 1).

3. Finally, we apply these rotation matrices $R_{t,j}$ at each time step $t$ to the scaled reference pose $\mathrm{u}^{\prime}$ using the forward kinematics Algorithm 2.
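The three decoding steps above can be sketched as below. Since Algorithm 2 is not reproduced in this section, the code assumes a standard forward-kinematics convention: joints are ordered so that parents come before children, relative rotations are composed along the kinematic chain, and each scaled segment offset is rotated by the accumulated rotation of its child joint. The helper `rot_zx`, the toy skeleton and all numerical values are hypothetical.

```python
import numpy as np

def decode_pose(u, s, R, parents):
    """Decode segment lengths and relative rotations into 3D poses.

    u: (J, 3) reference pose with unit-length segments.
    s: (J-1,) segment lengths, s[j-1] being the segment between j and its parent.
    R: (L, J, 3, 3) relative rotation matrices per frame and joint.
    parents: parent index of each joint, with parents[0] = -1 for the root.
    """
    u = np.asarray(u, dtype=float)
    L, J = R.shape[:2]
    poses = np.zeros((L, J, 3))      # root stays at the origin
    G = np.zeros((L, J, 3, 3))       # accumulated (global) rotations
    G[:, 0] = R[:, 0]
    for j in range(1, J):
        p = parents[j]
        offset = s[j - 1] * (u[j] - u[p])           # scaled reference segment
        G[:, j] = G[:, p] @ R[:, j]                 # compose along the chain
        poses[:, j] = poses[:, p] + G[:, j] @ offset
    return poses

def rot_zx(a, b):
    """Hypothetical helper: rotation about z by a, composed with rotation about x by b."""
    ca, sa, cb, sb = np.cos(a), np.sin(a), np.cos(b), np.sin(b)
    Rz = np.array([[ca, -sa, 0.0], [sa, ca, 0.0], [0.0, 0.0, 1.0]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cb, -sb], [0.0, sb, cb]])
    return Rz @ Rx

# Toy 4-joint skeleton with unit-length reference segments.
parents = [-1, 0, 1, 1]
u = [[0, 0, 0], [0, 0, 1], [1, 0, 1], [-1, 0, 1]]
s = np.array([0.5, 0.3, 0.3])
rng = np.random.default_rng(0)
L, J = 3, 4
R = np.array([[rot_zx(*rng.uniform(-np.pi, np.pi, 2)) for _ in range(J)] for _ in range(L)])
poses = decode_pose(u, s, R, parents)
print(poses.shape)  # (3, 4, 3)
```

Whatever the predicted rotations, the decoded segments keep the lengths in `s` at every frame, which is how the hard constraint of this section is obtained.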

5.2 Multiple Choice Learning

ManiPose architecture. As explained in Sec. 3, given the inherent depth ambiguity of the pose lifting task, the only way to reconcile pose consistency and MPJPE performance is to predict multiple hypotheses. To do so, we leverage the Multiple Choice Learning (MCL) framework [7], drawing upon the resilient multiple choice learning method proposed by Letzelter et al. [17]. This variant of MCL allows us to estimate conditional distributions in regression tasks, enabling our model to predict a variety of plausible 3D poses for each given input. Namely, instead of estimating a single rotation $r_{t}\in(\mathbb{R}^{d})^{J}$ per time step, the rotations module in ManiPose predicts an intermediate representation $e_{t}\in(\mathbb{R}^{d^{\prime}})^{J}$ which is fed to $K$ linear heads with weights $W^{k}_{r}$ and $W^{k}_{\gamma}$ used as follows. Each head $k$ predicts a different rotation hypothesis $r^{k}_{t}\in(\mathbb{R}^{d})^{J}$ together with its corresponding likelihood $\gamma^{k}_{t}\in[0,1]$ , by computing

$$r^{k}_{t}=W^{k}_{r}e_{t}\,,\quad\tilde{\gamma}^{k}_{t}=W^{k}_{\gamma}e_{t}\,,\quad 1\leq t\leq L\,,\tag{5}$$

followed by

$$\gamma^{k}_{t}=\sigma[\tilde{\gamma}_{t}]_{k}\,,\quad\text{with }\tilde{\gamma}_{t}=[\tilde{\gamma}_{t}^{1},\dots,\tilde{\gamma}_{t}^{K}]\in\mathbb{R}^{K},\tag{6}$$

where $\sigma$ is the softmax function.
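A minimal NumPy sketch of Eqs. (5)-(6) is given below. The exact shapes of the head weights are not specified in this section, so we assume that $W^{k}_{r}$ is a $d\times d^{\prime}$ matrix applied to each joint and that $W^{k}_{\gamma}$ acts on the flattened representation $e_{t}$ to produce one scalar per head; the sizes and random weights are placeholders.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_hypothesis_heads(e, W_r, W_gamma):
    """Apply K linear heads to the intermediate representation.

    e: (L, J, d') features; W_r: (K, d, d'); W_gamma: (K, J * d').
    Returns rotations r of shape (K, L, J, d) and scores gamma of shape (L, K).
    """
    L = e.shape[0]
    r = np.einsum("kdc,ljc->kljd", W_r, e)       # Eq. (5), rotation hypotheses
    gamma_tilde = e.reshape(L, -1) @ W_gamma.T   # Eq. (5), unnormalized scores
    gamma = softmax(gamma_tilde, axis=-1)        # Eq. (6), softmax over heads
    return r, gamma

# Placeholder sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
L, J, d_prime, d, K = 4, 17, 32, 6, 5
e = rng.normal(size=(L, J, d_prime))
W_r = rng.normal(size=(K, d, d_prime))
W_gamma = rng.normal(size=(K, J * d_prime))
r, gamma = multi_hypothesis_heads(e, W_r, W_gamma)
print(r.shape, gamma.shape)  # (5, 4, 17, 6) (4, 5)
```

The softmax is taken across heads at each time step, so the $K$ scores of a frame form a probability vector.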

The rotations predicted by each head are then all decoded together with the common predictions of the segments’ length $s$ , just as before (cf. Fig. 4). This produces $K$ different hypothetical pose sequences $\hat{\mathrm{p}}^{k}=(\hat{\mathrm{p}}^{k}_{t})_{t=1}^{L}$ , together with their corresponding likelihood sequences ${\gamma^{k}=(\gamma^{k}_{t})_{t=1}^{L}}$ , called scores hereafter.

Loss function. As in [17], the ManiPose model is trained with a composite loss:

$$\mathcal{L}=\mathcal{L}_{\text{wta}}+\beta\mathcal{L}_{\text{score}}\,.\tag{7}$$

The first term, $\mathcal{L}_{\text{wta}}$ , is the Winner-takes-all loss [16]

$$\mathcal{L}_{\text{wta}}(\hat{\mathrm{p}}(\mathrm{x}),\mathrm{p})=\frac{1}{L}\sum_{t=1}^{L}\min_{k\in\llbracket 1,K\rrbracket}\ell(\hat{\mathrm{p}}_{t}^{k}(\mathrm{x}),\mathrm{p}_{t})\,,\tag{8}$$

where $\ell(\hat{\mathrm{p}}^{k}_{t}(\mathrm{x}),\mathrm{p}_{t})\triangleq\frac{1}{J}\sum_{j=0}^{J-1}\|\mathrm{p}_{t,j}-\hat{\mathrm{p}}^{k}_{t,j}(\mathrm{x})\|_{2}$ , and $\hat{\mathrm{p}}_{t}^{k}(\mathrm{x})$ denotes the pose prediction at time $t$ using the $k^{\text{th}}$ head.

The second term, $\mathcal{L}_{\text{score}}$ , is the scoring loss

$$\mathcal{L}_{\text{score}}(\hat{\mathrm{p}}(\mathrm{x}),\gamma(\mathrm{x}),\mathrm{p})=\frac{1}{L}\sum_{t=1}^{L}\mathcal{H}\big(\delta(\hat{\mathrm{p}}_{t},\mathrm{p}_{t}),\gamma_{t}(\mathrm{x})\big)\,,\tag{9}$$

where $\mathcal{H}(\cdot,\cdot)$ is the cross-entropy, $\hat{\mathrm{p}}_{t}=(\hat{\mathrm{p}}^{k}_{t})_{k=1}^{K}$ , and

$$[\delta(\hat{\mathrm{p}}_{t},\mathrm{p}_{t})]_{k}\triangleq\mathbf{1}\Big[k\in\operatorname*{arg\,min}_{k^{\prime}\in\llbracket 1,K\rrbracket}\;\ell\left(\hat{\mathrm{p}}^{k^{\prime}}_{t},\mathrm{p}_{t}\right)\Big]\,,\tag{10}$$

is the indicator function of the winner pose hypothesis, which is the closest to the ground truth. Eq. (9) corresponds to the average cross-entropy between the target and predicted scores $\gamma_{t}(\mathrm{x})\in[0,1]^{K}$ at each time $t$ .
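The two loss terms can be sketched in NumPy as follows, reading Eq. (9) as a categorical cross-entropy against the one-hot winner indicator of Eq. (10). The value of `beta` used below and the small `eps` inside the logarithm are illustrative assumptions, and ties between hypotheses are broken by taking the first minimizer.

```python
import numpy as np

def manipose_losses(p_hat, gamma, p, beta, eps=1e-12):
    """Winner-takes-all and scoring losses, Eqs. (7)-(10).

    p_hat: (K, L, J, 3) hypotheses; gamma: (L, K) scores; p: (L, J, 3) target.
    """
    # ell[k, t]: mean per-joint Euclidean distance of hypothesis k at frame t.
    ell = np.linalg.norm(p_hat - p[None], axis=-1).mean(axis=-1)
    winners = ell.argmin(axis=0)                  # Eq. (10), first minimizer on ties
    loss_wta = ell.min(axis=0).mean()             # Eq. (8)
    # Cross-entropy between the one-hot winner and the predicted scores, Eq. (9).
    frames = np.arange(len(winners))
    loss_score = -np.log(gamma[frames, winners] + eps).mean()
    return loss_wta + beta * loss_score, loss_wta, loss_score, winners

# Toy example: K = 2 heads, L = 2 frames, J = 3 joints.
p = np.zeros((2, 3, 3))
p_hat = np.zeros((2, 2, 3, 3))
p_hat[0, 1, :, 0] = 1.0   # head 0 is off by 1 along x at frame 1
p_hat[1, 0, :, 0] = 1.0   # head 1 is off by 1 along x at frame 0
gamma = np.array([[0.9, 0.1], [0.2, 0.8]])
total, loss_wta, loss_score, winners = manipose_losses(p_hat, gamma, p, beta=0.1)
print(winners, loss_wta, loss_score)
```

In an actual training loop, only the winning head receives gradients from the first term, while the second term trains all scores.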

These two losses complement each other. The Winner-takes-all loss only updates the best predicted hypothesis, specializing each head on a region of the data distribution [16]. Concurrently, the scoring loss allows the model to learn how likely each head is to win for a given input, thereby avoiding overconfidence [14, 34] of non-winner heads at inference time. Furthermore, as detailed in [17], the joint prediction of modes $\hat{\mathrm{p}}^{k}$ and likelihoods $\gamma^{k}(\mathrm{x})$ allows the derivation of the predicted conditional expectation of 3D poses given 2D inputs:

$$\mathbb{E}[\mathrm{p}|\mathrm{x}]\simeq\sum_{k=1}^{K}\gamma^{k}(\mathrm{x})\hat{\mathrm{p}}^{k}(\mathrm{x})\,.\tag{11}$$

6 Experiments

	$L$	$K$	Orac.	MPJPE $\downarrow$	MPSSE $\downarrow$	MPSCE $\downarrow$
Single-hypothesis methods:
ST-GCN [2]	7	1	N/A	48.8	8.9	10.8
VideoPose3D [29]	243	1	N/A	46.8	6.5	7.8
PoseFormer [41]	81	1	N/A	44.3	4.3	7.2
Anatomy3D [4]	243	1	N/A	44.1	1.4	2.0
MixSTE [40]	243	1	N/A	40.9	8.8	9.9
Multi-hypothesis methods:
Sharma et al. [33]	1	10	✓	46.8	13.0	9.9
Wehrbein et al. [37]	1	200	✓	44.3	12.2	14.8
DiffPose [8]*	1	200	✓	43.3	14.9	-
MHFormer [21]	351	3	N/A	43.0	5.7	8.0
ManiPose (Ours)	243	5	✓	41.9	0.3	0.7

Table 2: Pose consistency evaluation of state-of-the-art methods on H3.6M. MPJPE performance and pose consistency are not correlated. $L$: sequence length. $K$: number of hypotheses. Orac.: metric computed using the oracle hypothesis. N/A: not applicable. Bold: best; Underlined: second best. *: MPSSE values reported in [8]. Missing entries: methods with unavailable code.

	$L$	$K$	Orac.	Dir.	Disc	Eat	Greet	Phone	Photo	Pose	Purch.	Sit	SitD.	Smoke	Wait	WalkD.	Walk	WalkT.	Avg.
Single-hypothesis methods:
GraphSH [39]	1	1	N/A	45.2	49.9	47.5	50.9	54.9	66.1	48.5	46.3	59.7	71.5	51.4	48.6	53.9	39.9	44.1	51.9
MGCN [43]	1	1	N/A	45.4	49.2	45.7	49.4	50.4	58.2	47.9	46.0	57.5	63.0	49.7	46.6	52.2	38.9	40.8	49.4
ST-GCN [2]	7	1	N/A	44.6	47.4	45.6	48.8	50.8	59.0	47.2	43.9	57.9	61.9	49.7	46.6	51.3	37.1	39.4	48.8
VideoPose3D [29]	243	1	N/A	45.2	46.7	43.3	45.6	48.1	55.1	44.6	44.3	57.3	65.8	47.1	44.0	49.0	32.8	33.9	46.8
UGCN [35]	96	1	N/A	41.3	43.9	44.0	42.2	48.0	57.1	42.2	43.2	57.3	61.3	47.0	43.5	47.0	32.6	31.8	45.6
Liu et al. [22]	243	1	N/A	41.8	44.8	41.1	44.9	47.4	54.1	43.4	42.2	56.2	63.6	45.3	43.5	45.3	31.3	32.2	45.1
PoseFormer [41]	81	1	N/A	41.5	44.8	39.8	42.5	46.5	51.6	42.1	42.0	53.3	60.7	45.5	43.3	46.1	31.8	32.2	44.3
Anatomy3D [4]	243	1	N/A	41.4	43.2	40.1	42.9	46.6	51.9	41.7	42.3	53.9	60.2	45.4	41.7	46.0	31.5	32.7	44.1
MixSTE [40]	243	1	N/A	37.6	40.9	37.3	39.7	42.3	49.9	40.1	39.8	51.7	55.0	42.1	39.8	41.0	27.9	27.9	40.9
Multi-hypothesis methods:
Li et al. [19]	1	10	✓	62.0	69.7	64.3	73.6	75.1	84.8	68.7	75.0	81.2	104.3	70.2	72.0	75.0	67.0	69.0	73.9
Li et al. [18]	1	5	✓	43.8	48.6	49.1	49.8	57.6	61.5	45.9	48.3	62.0	73.4	54.8	50.6	56.0	43.4	45.5	52.7
Oikarinen et al. [28]	1	200	✓	40.0	43.2	41.0	43.4	50.0	53.6	40.1	41.4	52.6	67.3	48.1	44.2	44.9	39.5	40.2	46.2
Sharma et al. [33]	1	10	✓	37.8	43.2	43.0	44.3	51.1	57.0	39.7	43.0	56.3	64.0	48.1	45.4	50.4	37.9	39.9	46.8
Wehrbein et al. [37]	1	200	✓	38.5	42.5	39.9	41.7	46.5	51.6	39.9	40.8	49.5	56.8	45.3	46.4	46.8	37.8	40.4	44.3
DiffPose [8]	1	200	✓	38.1	43.1	35.3	43.1	46.6	48.2	39.0	37.6	51.9	59.3	41.7	47.6	45.4	37.4	36.0	43.3
MHFormer [21]	351	3	N/A	39.2	43.1	40.1	40.9	44.9	51.2	40.6	41.3	53.5	60.3	43.7	41.1	43.8	29.8	30.6	43.0
ManiPose (Ours)	243	5	✗	42.6	47.3	39.3	42.5	45.4	53.1	44.3	41.1	53.6	58.9	45.4	43.1	46.3	31.5	33.3	44.5
ManiPose (Ours)	243	5	✓	40.2	44.0	37.0	39.9	42.6	49.6	41.2	38.8	50.0	55.4	43.0	40.4	43.8	30.4	32.0	41.9

Table 3: Quantitative comparison with the state-of-the-art methods on Human3.6M under Protocol #1 (MPJPE in mm), using detected 2D poses. $L$: sequence length. $K$: number of hypotheses. Orac.: metric computed using the oracle hypothesis. N/A: not applicable. Bold: best; Underlined: second best.

	PCK $\uparrow$	AUC $\uparrow$	MPJPE $\downarrow$	MPSSE $\downarrow$	MPSCE $\downarrow$
VideoPose3D [29]	85.5	51.5	84.8	10.4	27.5
PoseFormer [41]	86.6	56.4	77.1	10.8	14.2
MixSTE [40]	94.4	66.5	54.9	17.3	21.6
P-STMO [32]	97.9	75.8	32.2	8.5	11.3
ManiPose (Ours) Aggr.	98.0	75.3	37.7	0.6	1.3
ManiPose (Ours) Orac.	98.4	77.0	34.6	0.6	1.3

Table 4: Quantitative comparison with the state-of-the-art on MPI-INF-3DHP using ground-truth 2D poses.

	MR	MC	$K$	# Params.	MPJPE $\downarrow$	MPSSE $\downarrow$	MPSCE $\downarrow$
ManiPose (Ours)	✗	✓	5	34.44 M	41.9	0.3	0.7
w/o MH	✗	✓	1	34.42 M	44.6	0.3	0.7
w/o MC, w/ MR	✓	✗	1	33.78 M	42.3	5.7	7.3
w/o MR (MixSTE)	✗	✗	1	33.78 M	40.9	8.8	9.9

Table 5: Ablation study: Single-hypothesis cannot optimize both MPJPE performance and consistency. ManiPose uses the same backbone as MixSTE. MR: using manifold regularization. MC: manifold-constrained. Bold: best. Underlined: second best.

6.1 Datasets

We evaluate our model on two 3D-HPE datasets: Human 3.6M [11] and MPI-INF-3DHP [25].

Human 3.6M contains 3.6 million images of 7 actors performing 15 different indoor actions. It is the most widely used dataset for 3D-HPE. Following previous works [40, 21, 41, 29], we train our models on subjects S1, S5, S6, S7, S8, and test on subjects S9 and S11, adopting a 17-joint skeleton (cf. Fig. 5). A pre-trained CPN [5] network is employed to compute the input 2D keypoints as in [29, 40].

MPI-INF-3DHP also adopts a 17-joint skeleton, but is smaller than Human 3.6M and contains both indoor and outdoor scenes. It is hence more challenging. We used ground-truth 2D keypoints for this dataset, as usually done [41, 4, 40].

6.2 Evaluation metrics

Traditional metrics. Performance is usually evaluated on these datasets using the mean per-joint position error (MPJPE) under different protocols. Under protocol #1, the root joint position is set as reference, and the predicted root position is translated to $0$ . Under protocol #2, also denoted P-MPJPE, predictions are additionally Procrustes-corrected. Both are reported in mm.

For MPI-INF-3DHP, other thresholded metrics derived from MPJPE are also often reported, such as AUC and PCK (with a threshold at 150 mm), as explained in [25].
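For reference, the root-relative MPJPE of protocol #1 and the PCK at 150 mm can be sketched as below. This is a simplified illustration: benchmark implementations differ in details such as whether the root joint is counted or how the threshold comparison is written, so those choices here are assumptions.

```python
import numpy as np

def root_relative(poses, root=0):
    """Express joint positions relative to the root joint."""
    return poses - poses[..., root:root + 1, :]

def joint_errors(pred, target):
    """Per-joint Euclidean errors after root alignment, shape (L, J)."""
    return np.linalg.norm(root_relative(pred) - root_relative(target), axis=-1)

def mpjpe(pred, target):
    """Mean per-joint position error (protocol #1), in the units of the input."""
    return joint_errors(pred, target).mean()

def pck(pred, target, threshold=150.0):
    """Percentage of joints whose error is below the threshold (assumed strict)."""
    return 100.0 * (joint_errors(pred, target) < threshold).mean()

# Toy example in mm: one frame, three joints.
target = np.zeros((1, 3, 3))
pred = target + np.array([10.0, 0.0, 0.0])   # global shift, removed by root alignment
pred[0, 2, 0] += 200.0                       # one joint is off by 200 mm
print(mpjpe(pred, target), pck(pred, target))
```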

Pose consistency. In addition to the traditional metrics, we also evaluate the consistency of predicted poses.

Given the results from Sec. 4, we assess to which extent Asm. 4.4 is verified by measuring the average standard deviations of segment lengths across time in predicted action sequences:

$$\text{MPSCE}\triangleq\frac{1}{J-1}\sum_{j=1}^{J-1}\sqrt{\frac{1}{L}\sum_{t=1}^{L}(s_{t,j,\tau(j)}-\bar{s}_{j,\tau(j)})^{2}}\,,\tag{12}$$

$$s_{t,j,i}=\|\hat{\mathrm{p}}_{t,j}-\hat{\mathrm{p}}_{t,i}\|_{2}\,,\quad\bar{s}_{j,i}=\frac{1}{L}\sum_{t=1}^{L}s_{t,j,i}\,,\tag{13}$$

where $\tau$ was defined in Sec. 5.1. We call this metric the Mean Per Segment Consistency Error (MPSCE), also reported in mm.
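Eqs. (12)-(13) translate directly into a few lines of NumPy. In the sketch below, the `parents` array plays the role of $\tau$, with joint 0 as the root; the function names and the toy sequence are illustrative.

```python
import numpy as np

def segment_lengths(poses, parents):
    """Lengths of the segments (j, tau(j)) for j >= 1, Eq. (13). Shape (L, J-1)."""
    poses = np.asarray(poses, dtype=float)
    child = np.arange(1, poses.shape[1])
    parent = np.asarray(parents)[1:]
    return np.linalg.norm(poses[:, child] - poses[:, parent], axis=-1)

def mpsce(poses, parents):
    """Mean Per Segment Consistency Error, Eq. (12).

    Standard deviation over time of each segment length (population form,
    i.e. division by L), averaged over the J-1 segments.
    """
    return segment_lengths(poses, parents).std(axis=0).mean()

# Toy sequence: a single segment whose length alternates between 1 and 3.
parents = [-1, 0]
poses = np.array([[[0, 0, 0], [1, 0, 0]],
                  [[0, 0, 0], [3, 0, 0]]], dtype=float)
print(mpsce(poses, parents))
```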

Following [8, 30], we also assess the bilateral symmetry of predicted skeletons through the Mean Per Segment Symmetry Error (MPSSE) in mm:

$$\text{MPSSE}\triangleq\frac{1}{L\,|\mathcal{J}_{\text{left}}|}\sum_{t=1}^{L}\sum_{j\in\mathcal{J}_{\text{left}}}|s_{t,j,\tau(j)}-s_{t,j^{\prime},\tau(j^{\prime})}|\,,\tag{14}$$

$$\text{with }j^{\prime}=\zeta(j)\,,\tag{15}$$

where $\mathcal{J}_{\text{left}}$ denotes the set of indices of left-side joints and $\zeta$ maps left-side joint indices to their right-side counterpart.
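A corresponding sketch of Eq. (14) is given below, where `left_joints` stands for $\mathcal{J}_{\text{left}}$ and the dictionary `left_to_right` for $\zeta$. The three-joint skeleton is a hypothetical example, not the 17-joint skeleton used in the experiments.

```python
import numpy as np

def mpsse(poses, parents, left_joints, left_to_right):
    """Mean Per Segment Symmetry Error, Eq. (14).

    Mean absolute difference between the length of each left-side segment
    and that of its right-side counterpart, over frames and left joints.
    """
    poses = np.asarray(poses, dtype=float)
    parents = np.asarray(parents)
    left = np.asarray(left_joints)
    right = np.asarray([left_to_right[j] for j in left_joints])
    len_left = np.linalg.norm(poses[:, left] - poses[:, parents[left]], axis=-1)
    len_right = np.linalg.norm(poses[:, right] - poses[:, parents[right]], axis=-1)
    return np.abs(len_left - len_right).mean()

# Hypothetical skeleton: root 0, left hip 1, right hip 2.
parents = [-1, 0, 0]
left_joints = [1]
left_to_right = {1: 2}
poses = np.array([[[0, 0, 0], [1.0, 0, 0], [-1.5, 0, 0]]])
print(mpsse(poses, parents, left_joints, left_to_right))
```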

Multi-hypothesis setting. One must decide how to use multiple hypotheses (MH) to compute the previous metrics.

The most widely used approach [18, 19, 28, 33, 37, 8] is the oracle evaluation, i.e., using the predicted hypothesis closest to the ground truth (Eq. (8) in the case of MPJPE). This metric makes sense for MH methods as it measures the distance between the target and the discrete set of predicted hypotheses. It is hence aligned with the existence of many possible outputs for a given input.

Hypotheses can also be aggregated into a final hypothesis, e.g., through unweighted or weighted averaging as in Eq. (11). The latter has the disadvantage of falling back to a one-to-one mapping scheme, which is precisely what we want to avoid when working in an MH setting.

We hence report both oracle and aggregated metrics in our experiments, favoring oracle results.
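The difference between the two evaluation modes can be illustrated with the following NumPy sketch. It assumes hypotheses stored as a `(K, L, J, 3)` array, a per-frame selection of the best hypothesis for the oracle, and per-frame scores that are renormalized to sum to one; these choices and the function names are illustrative assumptions and may differ from Eqs. (8) and (11).

```python
import numpy as np


def per_frame_mpjpe(pred, gt):
    """Mean joint error for each frame; averages over the joint axis."""
    return np.linalg.norm(pred - gt, axis=-1).mean(axis=-1)


def oracle_mpjpe(hyps, gt):
    """Error of the hypothesis closest to the ground truth, frame by frame."""
    errs = per_frame_mpjpe(hyps, gt[None])      # shape (K, L)
    return errs.min(axis=0).mean()


def aggregated_mpjpe(hyps, gt, scores=None):
    """Error of the (score-weighted) average of all hypotheses."""
    n_hyps = hyps.shape[0]
    if scores is None:                          # unweighted averaging
        scores = np.full(hyps.shape[:2], 1.0 / n_hyps)
    w = scores / scores.sum(axis=0, keepdims=True)
    mean_pose = (w[..., None, None] * hyps).sum(axis=0)
    return per_frame_mpjpe(mean_pose, gt).mean()
```

Note that averaging hypotheses that each lie on the pose manifold generally produces a pose that leaves it, which is one more reason to favor oracle results.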

6.3 Implementation details

The ManiPose method presented in Sec. 5 is compatible with any backbone. Here, we chose to build on the MixSTE [40] network for both the rotation and the segment modules (the latter at a reduced scale). The details about our architecture and training can be found in the supp. material (Sec. 11).

6.4 Comparison to the state-of-the-art

Human 3.6M. Comparisons with state-of-the-art regression and multi-hypothesis methods are presented in Tab. 2. As a general remark, MPJPE and consistency metrics are not positively correlated. As predicted by Sec. 4, our results show that MPJPE improvements achieved by MixSTE come at the cost of poorer consistency compared to previous models. In contrast, the only single-hypothesis constrained model, Anatomy3D [4], achieves good consistency at the expense of inferior MPJPE. As opposed to all previous methods, ManiPose reaches nearly perfect consistency without compromising the MPJPE performance.

Detailed MPJPE results can be found in Tab. 3. ManiPose reaches the second-best MPJPE performance on average and on most actions. Note that while ManiPose is deterministic, previous multi-hypothesis methods are generative and frame-based, except MHFormer. Tab. 3 shows that they require up to two orders of magnitude more hypotheses than ManiPose to reach competitive performance. Protocol #2 results can be found in the appendix (Sec. 12).

We present qualitative results on Figs. 1 and 6.

MPI-INF-3DHP. Similar results were obtained for this dataset (cf. Tab. 4). Not only does ManiPose reach consistency errors close to $0$, but it also achieves the best PCK and AUC performance. As for MPJPE, only [32] achieves slightly better performance, at the cost of large pose consistency errors.

6.5 Ablation study

Impact of components. We evaluate the impact of removing each component of ManiPose on the Human 3.6M performance (Tab. 5). The components tested are the multiple hypotheses (MH) and the manifold constraint (MC). We also compare MC to a more standard manifold regularization (MR), i.e., adding Eq. (12) to the loss. Note that without all these components, we fall back to MixSTE [40].

We see that MR helps to improve pose consistency, but not as much as MC. However, without multiple hypotheses, MC consistency improvements come at the cost of degraded MPJPE performance, as explained in Secs. 3 and 4.

Fine error analysis. We can see in Fig. 5 that, compared to MixSTE, ManiPose reaches substantially superior MPSSE and MPSCE consistency across all skeleton segments. Furthermore, note that larger MixSTE errors occur for segments knee-foot and elbow-wrist, which are the most prone to depth ambiguity. This agrees with coordinate-wise errors depicted in Fig. 5, showing that ManiPose improvements mostly translate into a reduction of MixSTE depth errors, which are twice as large as for other coordinates. Further ablation studies can be found in the supp. material.

7 Conclusion

This work provided empirical and theoretical evidence challenging the common unconstrained single-hypothesis approaches to 3D-HPE. We proved that unconstrained single-hypothesis methods cannot deliver consistent poses and that existing evaluation metrics are insufficient to assess this aspect. We also showed that constraining or regularizing single-hypothesis models is not enough to optimize both joint position error and pose consistency, which are somewhat antagonistic. In response, we presented a new manifold-constrained multi-hypothesis human pose lifting method (ManiPose) and demonstrated its empirical superiority to the existing state-of-the-art on two challenging datasets. One limitation of our method is its reliance on the forward kinematics algorithm, which guarantees its manifold-consistency, but requires non-parallelizable iterations across joints. This warrants further investigations.

References

Bishop [1994] Christopher M Bishop. Mixture density networks. 1994.
Cai et al. [2019] Yujun Cai, Liuhao Ge, Jun Liu, Jianfei Cai, Tat-Jen Cham, Junsong Yuan, and Nadia Magnenat Thalmann. Exploiting spatial-temporal relationships for 3d pose estimation via graph convolutional networks. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2272–2281, 2019.
Chen and Ramanan [2017] Ching-Hang Chen and Deva Ramanan. 3d human pose estimation= 2d pose estimation+ matching. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7035–7043, 2017.
Chen et al. [2021] Tianlang Chen, Chen Fang, Xiaohui Shen, Yiheng Zhu, Zhili Chen, and Jiebo Luo. Anatomy-aware 3d human pose estimation with bone-based pose decomposition. IEEE Transactions on Circuits and Systems for Video Technology, 32(1):198–209, 2021.
Chen et al. [2018] Yilun Chen, Zhicheng Wang, Yuxiang Peng, Zhiqiang Zhang, Gang Yu, and Jian Sun. Cascaded pyramid network for multi-person pose estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7103–7112, 2018.
Firman et al. [2018] Michael Firman, Neill DF Campbell, Lourdes Agapito, and Gabriel J Brostow. Diversenet: When one right answer is not enough. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5598–5607, 2018.
Guzman-Rivera et al. [2012] Abner Guzman-Rivera, Dhruv Batra, and Pushmeet Kohli. Multiple choice learning: Learning to produce multiple structured outputs. Advances in neural information processing systems, 25, 2012.
Holmquist and Wandt [2023] Karl Holmquist and Bastian Wandt. Diffpose: Multi-hypothesis human pose estimation using diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15977–15987, 2023.
Hossain and Little [2018] Mir Rayat Imtiaz Hossain and James J Little. Exploiting temporal information for 3d human pose estimation. In Proceedings of the European conference on computer vision (ECCV), pages 68–84, 2018.
Hu et al. [2021] Wenbo Hu, Changgong Zhang, Fangneng Zhan, Lei Zhang, and Tien-Tsin Wong. Conditional directed graph convolution for 3d human pose estimation. In Proceedings of the 29th ACM International Conference on Multimedia, pages 602–611, 2021.
Ionescu et al. [2014] Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325–1339, 2014.
Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Kolotouros et al. [2021] Nikos Kolotouros, Georgios Pavlakos, Dinesh Jayaraman, and Kostas Daniilidis. Probabilistic modeling for human mesh recovery. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11605–11614, 2021.
Lee et al. [2017] Kimin Lee, Changho Hwang, KyoungSoo Park, and Jinwoo Shin. Confident multiple choice learning. In International Conference on Machine Learning, pages 2014–2023. PMLR, 2017.
Lee et al. [2015] Stefan Lee, Senthil Purushwalkam, Michael Cogswell, David Crandall, and Dhruv Batra. Why m heads are better than one: Training a diverse ensemble of deep networks. arXiv preprint arXiv:1511.06314, 2015.
Lee et al. [2016] Stefan Lee, Senthil Purushwalkam Shiva Prakash, Michael Cogswell, Viresh Ranjan, David Crandall, and Dhruv Batra. Stochastic multiple choice learning for training diverse deep ensembles. Advances in Neural Information Processing Systems, 29, 2016.
Letzelter et al. [2023] Victor Letzelter, Mathieu Fontaine, Mickaël Chen, Patrick Pérez, Slim Essid, and Gael Richard. Resilient multiple choice learning: A learned scoring scheme with application to audio scene analysis. In Advances in Neural Information Processing Systems, 2023.
Li and Lee [2019] Chen Li and Gim Hee Lee. Generating multiple hypotheses for 3d human pose estimation with mixture density network. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9887–9895, 2019.
Li and Lee [2020] Chen Li and Gim Hee Lee. Weakly supervised generative network for multiple 3d human pose hypotheses. In British Machine Vision Conference (BMVC), 2020.
Li et al. [2021] Jiefeng Li, Chao Xu, Zhicun Chen, Siyuan Bian, Lixin Yang, and Cewu Lu. Hybrik: A hybrid analytical-neural inverse kinematics solution for 3d human pose and shape estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3383–3393, 2021.
Li et al. [2022] Wenhao Li, Hong Liu, Hao Tang, Pichao Wang, and Luc Van Gool. Mhformer: Multi-hypothesis transformer for 3d human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13147–13156, 2022.
Liu et al. [2020] Ruixu Liu, Ju Shen, He Wang, Chen Chen, Sen-ching Cheung, and Vijayan Asari. Attention mechanism exploits temporal contexts: Real-time 3d human pose reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5064–5073, 2020.
Makansi et al. [2019] Osama Makansi, Eddy Ilg, Ozgun Cicek, and Thomas Brox. Overcoming limitations of mixture density networks: A sampling and fitting framework for multimodal future prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7144–7153, 2019.
Martinez et al. [2017] Julieta Martinez, Rayat Hossain, Javier Romero, and James J Little. A simple yet effective baseline for 3d human pose estimation. In Proceedings of the IEEE international conference on computer vision, pages 2640–2649, 2017.
Mehta et al. [2017] Dushyant Mehta, Helge Rhodin, Dan Casas, Pascal Fua, Oleksandr Sotnychenko, Weipeng Xu, and Christian Theobalt. Monocular 3d human pose estimation in the wild using improved cnn supervision. In 2017 international conference on 3D vision (3DV), pages 506–516. IEEE, 2017.
Mun et al. [2018] Jonghwan Mun, Kimin Lee, Jinwoo Shin, and Bohyung Han. Learning to specialize with knowledge distillation for visual question answering. Advances in neural information processing systems, 31, 2018.
Murray et al. [2017] Richard M Murray, Zexiang Li, and S Shankar Sastry. A mathematical introduction to robotic manipulation. CRC press, 2017.
Oikarinen et al. [2021] Tuomas Oikarinen, Daniel Hannah, and Sohrob Kazerounian. Graphmdn: Leveraging graph structure and deep learning to solve inverse problems. In 2021 International Joint Conference on Neural Networks (IJCNN), pages 1–9. IEEE, 2021.
Pavllo et al. [2019] Dario Pavllo, Christoph Feichtenhofer, David Grangier, and Michael Auli. 3d human pose estimation in video with temporal convolutions and semi-supervised training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7753–7762, 2019.
Rommel et al. [2023] Cédric Rommel, Eduardo Valle, Mickaël Chen, Souhaiel Khalfaoui, Renaud Marlet, Matthieu Cord, and Patrick Pérez. DiffHPE: Robust, Coherent 3D Human Pose Lifting with Diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3220–3229, 2023.
Rupprecht et al. [2017] Christian Rupprecht, Iro Laina, Robert DiPietro, Maximilian Baust, Federico Tombari, Nassir Navab, and Gregory D Hager. Learning in an uncertain world: Representing ambiguity through multiple hypotheses. In Proceedings of the IEEE international conference on computer vision, pages 3591–3600, 2017.
Shan et al. [2022] Wenkang Shan, Zhenhua Liu, Xinfeng Zhang, Shanshe Wang, Siwei Ma, and Wen Gao. P-STMO: Pre-Trained Spatial Temporal Many-to-One Model for 3D Human Pose Estimation, 2022.
Sharma et al. [2019] Saurabh Sharma, Pavan Teja Varigonda, Prashast Bindal, Abhishek Sharma, and Arjun Jain. Monocular 3d human pose estimation by generation and ordinal ranking. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2325–2334, 2019.
Tian et al. [2019] Kai Tian, Yi Xu, Shuigeng Zhou, and Jihong Guan. Versatile multiple choice learning and its application to vision computing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6349–6357, 2019.
Wang et al. [2020] Jingbo Wang, Sijie Yan, Yuanjun Xiong, and Dahua Lin. Motion guided 3d pose estimation from videos. In European Conference on Computer Vision, pages 764–780. Springer, 2020.
Waskom [2021] Michael L. Waskom. seaborn: statistical data visualization. Journal of Open Source Software, 6(60):3021, 2021.
Wehrbein et al. [2021] Tom Wehrbein, Marco Rudolph, Bodo Rosenhahn, and Bastian Wandt. Probabilistic monocular 3d human pose estimation with normalizing flows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11199–11208, 2021.
Xu et al. [2020] Jingwei Xu, Zhenbo Yu, Bingbing Ni, Jiancheng Yang, Xiaokang Yang, and Wenjun Zhang. Deep kinematics analysis for monocular 3d human pose estimation. In Proceedings of the IEEE/CVF Conference on computer vision and Pattern recognition, pages 899–908, 2020.
Xu and Takano [2021] Tianhan Xu and Wataru Takano. Graph stacked hourglass networks for 3d human pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16105–16114, 2021.
Zhang et al. [2022] Jinlu Zhang, Zhigang Tu, Jianyu Yang, Yujin Chen, and Junsong Yuan. Mixste: Seq2seq mixed spatio-temporal encoder for 3d human pose estimation in video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13232–13242, 2022.
Zheng et al. [2021] Ce Zheng, Sijie Zhu, Matias Mendieta, Taojiannan Yang, Chen Chen, and Zhengming Ding. 3D Human Pose Estimation with Spatial and Temporal Transformers. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 11636–11645, Montreal, QC, Canada, 2021. IEEE.
Zhou et al. [2019] Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representations in neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5745–5753, 2019.
Zou and Tang [2021] Zhiming Zou and Wei Tang. Modulated graph convolutional network for 3d human pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11477–11487, 2021.

Supplementary Material

This appendix is organized as follows:

• Sec. 8 contains empirical verification of our assumptions,
• Sec. 9 presents the proofs of our theoretical results, together with a few corollaries,
• Sec. 10 provides further implementation details concerning the 1D-to-2D experiment, as well as an extension to the 2D-to-3D setting,
• Sec. 11 contains implementation and training details concerning ManiPose, as well as compared baselines,
• Sec. 12 presents further results of the Human 3.6M experiment,
• and finally, Sec. 13 explains the provided experiment code.

8 Assumption verifications

We verified on Human 3.6M [11] ground-truth data that assumptions 4.4 and 4.5 hold for real poses in the training and test sets.

Segments rigidity. As shown on Figs. 5 and 10, ground-truth 3D poses have perfect MPSSE (14) and MPSCE (12) metrics, meaning that ground-truth skeletons are perfectly symmetric, with rigid segments. Assumption 4.4 is hence verified in real training and test data.

Non-degenerate distributions. As shown on Fig. 7, the conditional distribution of ground-truth 3D poses given 2D keypoints position is clearly multimodal, and hence non-degenerate (not reduced to a single Dirac distribution). This validates assumption 4.5 and explains why multi-hypothesis techniques are necessary.

9 Proofs and additional corollaries

This section contains the proofs of the theoretical results presented in Sec. 4, together with a few corollaries.

proof. [Proposition 4.6] Let $i$ be a joint connected to the root $\mathrm{p}_{0}$ (i.e., $A_{i0}=1$ ). From assumptions 4.3 and 4.4, we know that at any instant $t$ , $\mathrm{p}^{G}_{t,i}$ lies on the sphere $S^{2}(0,s_{i,0})$ centered at $0$ with radius $s_{i,0}$ independent of time. Its position can hence be fully parametrized in spherical coordinates by two angles $(\theta_{t,i},\phi_{t,i})$ . Let $j$ be a joint connected to $i$ . As before, assumption 4.4 implies that at any instant $t$ , $\mathrm{p}^{G}_{t,j}$ lies on the moving sphere $S^{2}(\mathrm{p}^{G}_{t,i},s_{j,i})$ centered at $\mathrm{p}^{G}_{t,i}$ with radius $s_{j,i}$ independent of time. Hence, we can fully describe $\mathrm{p}^{G}_{t,j}$ with the position of its center, $\mathrm{p}^{G}_{t,i}$ , and the spherical coordinates $(\theta_{t,j},\phi_{t,j})$ of joint $j$ relative to the center of the sphere, i.e., joint $i$ . This means that there is a bijection between the possible positions attainable by $\mathrm{p}^{G}_{t,j}$ at any instant and the direct product of spheres $S^{2}(0,s_{i,0})\,\times\,S^{2}(0,s_{j,i})$ (note that $S^{2}(0,s_{j,i})$ is homeomorphic to $S^{2}(\mathrm{p}^{G}_{t,i},s_{j,i})$ ). This bijection is a homeomorphism as a composition of homeomorphisms: we can compute $\mathrm{p}^{G}_{t,j}$ from $(\theta_{t,i},\phi_{t,i},\theta_{t,j},\phi_{t,j})$ following the forward kinematics algorithm [27] (cf. algo. 2), i.e., using a composition of rotations and translations.

Now let’s assume for some arbitrary joint $k$ that $\mathrm{p}^{G}_{t,k}$ lies at all times on a space $\mathcal{M}_{2d}$ homeomorphic to a product of spheres of dimension $2d$ . This means that $\mathrm{p}^{G}_{t,k}$ can be fully parametrized using $2d$ spherical angles $(\theta_{1},\phi_{1},\dots,\theta_{d},\phi_{d})$ . Let $l$ be a joint connected to $k$ (typically one further step away from the root joint $\mathrm{p}_{0}$ and not already represented in $\mathcal{M}_{2d}$ ). As before, at any instant $t$ , $\mathrm{p}^{G}_{t,l}$ needs to lie on the sphere centered on $\mathrm{p}^{G}_{t,k}$ of constant radius $s_{k,l}$ . Hence, we can fully describe $\mathrm{p}^{G}_{t,l}$ using the ${2(d+1)}$ -tuple of angles obtained by concatenating its spherical coordinates relative to joint $k$ , together with the $2d$ -tuple describing $\mathrm{p}^{G}_{t,k}$ , i.e. the center of the sphere. So $\mathrm{p}^{G}_{t,l}$ lies on a space $\mathcal{M}_{2(d+1)}$ homeomorphic to a product of spheres of dimension $2(d+1)$ .

We can hence conclude by induction that at any instant $t$ , $\mathrm{p}_{t}=[\mathrm{p}^{G}_{t,1},\dots,\mathrm{p}^{G}_{t,J}]$ lies on the same subspace of $(\mathbb{R}^{3})^{J}$ , which is homeomorphic to a product of spheres centered at the origin:

	$\displaystyle\bigotimes_{i<j/A_{ij}=1}S^{2}(0,s_{i,j})\,.$		(16)

Finally, the previous space is trivially homeomorphic to $(S^{2})^{J-1}$ through the scaling $(1/s_{i,j})_{i<j/A_{ij}=1}$ . $(S^{2})^{J-1}$ is a manifold of dimension $2(J-1)$ as the direct product of $J-1$ manifolds of dimension $2$ . $\blacksquare$

proof. [Proposition 4.8] Let $G$ be a skeleton with $J$ joints. We denote $\mathrm{x}\in(\mathbb{R}^{2})^{J}$ a 2D pose and $\mathrm{p}\in(\mathbb{R}^{3})^{J}$ its corresponding 3D pose. Let $\mathrm{P}(\mathrm{x},\mathrm{p})$ be a joint distribution of poses in 2D and 3D. We define $\ell=(\ell_{j})_{j=1}^{J-1}$ the function allowing us to compute the length of the segments of a pose $\mathrm{p}$ :

	$\displaystyle\ell_{j}:\mathrm{p}\mapsto\|\mathrm{p}_{j}-\mathrm{p}_{\tau(j)}\|_{2}\,,\quad 0<j\leq J-1\,,$		(17)

where $\tau:\{1,\dots,J-1\}\to\{0,\dots,J-1\}$ maps joint indices to the index of their parent joint:

	$\displaystyle\tau(i)=j<i,\quad\text{s.t.}~A_{ij}=1\,.$		(18)

From assumption 4.4, we know that for any pose $\mathrm{p}$ from the training distribution,

	$\displaystyle\forall j\,,\quad\ell_{j}(\mathrm{p})=s_{j,\tau(j)}\,.$		(19)

Given ${D=\{(\mathrm{x}_{i},\mathrm{p}_{i})\}_{i=1}^{N}\sim\mathrm{P}(\mathrm{x},% \mathrm{p})}$ , some i.i.d. evaluation data, the MSE of a model $f$ is defined as:

	$\displaystyle\text{MSE}(f;N)=\frac{1}{N}\sum_{i=1}^{N}\|\mathrm{p}_{i}-f(\mathrm{x}_{i})\|^{2}_{2}\,,$		(20)

and converges to

	$\displaystyle\text{MSE}^{*}(f)=\mathbb{E}_{\mathrm{x},\mathrm{p}}\big[\|\mathrm{p}-f(\mathrm{x})\|^{2}_{2}\big]$		(21)

as the dataset size $N$ goes to infinity. We then define the oracle MSE minimizer as

	$\displaystyle f^{*}=\arg\min_{f}\text{MSE}^{*}(f)\,.$		(22)

The quantity in (21) is known in statistics as the expected $L_{2}$ -risk and it is a well-known fact that its minimizer is the conditional expectation:

	$\displaystyle f^{*}(\mathrm{x})=\mathbb{E}[\mathrm{p}|\mathrm{x}=\mathrm{x}]\,.$		(23)

Hence, since $\ell_{j}^{2}$ are strictly convex and $\mathrm{P}(\mathrm{p}|\mathrm{x})$ is non-degenerate according to assumption 4.5, we can conclude from Jensen’s strict inequality that for all $j$ ,

	$\displaystyle\ell^{2}_{j}(f^{*}(\mathrm{x}))=\ell^{2}_{j}(\mathbb{E}[\mathrm{p}|\mathrm{x}=\mathrm{x}])<\mathbb{E}[\ell^{2}_{j}(\mathrm{p})|\mathrm{x}=\mathrm{x}]=s^{2}_{j,\tau(j)}\,,$		(24)

where the last equality arises from the fact that $\ell^{2}_{j}(\mathrm{p})$ is not random according to (19). Hence, given that $\ell_{j}>0$ and $s_{j,\tau(j)}>0$ , we can say that ${\ell_{j}(f^{*}(\mathrm{x}))}<s_{j,\tau(j)}$ for all joints $j$ . We conclude that the model $f^{*}$ minimizing $\text{MSE}^{*}$ predicts poses that violate assumption 4.4 and are hence inconsistent. $\blacksquare$
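To make the shrinking effect of Jensen's inequality concrete, the toy computation below considers a single segment and two equally likely 3D positions that only differ by the sign of their depth, so that they share the same 2D input. The setup, the choice of the last coordinate as depth and the function name are assumptions made for this illustration only.

```python
import numpy as np


def conditional_mean_length(s=1.0, theta=np.pi / 4):
    """Length of the segment predicted by the MSE minimizer in a toy case.

    A joint is attached to a root at the origin by a segment of length s.
    Given the 2D input, it sits with equal probability at one of two 3D
    positions sharing their first two coordinates and having opposite depth.
    """
    p_a = s * np.array([np.cos(theta), 0.0, np.sin(theta)])
    p_b = s * np.array([np.cos(theta), 0.0, -np.sin(theta)])
    mean_pose = 0.5 * (p_a + p_b)       # conditional expectation E[p | x]
    return np.linalg.norm(mean_pose)    # equals s * |cos(theta)|


length = conditional_mean_length()
print(length)  # strictly smaller than s = 1.0 whenever sin(theta) != 0
```

Both candidate poses have a segment of length $s$, yet their mean has length $s\,|\cos\theta|<s$ : the MSE-optimal prediction is shorter than any pose it averages.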

As an immediate corollary of proposition 4.8, we may state the following result, which was empirically illustrated in many parts of our paper:

Corollary 9.1.

Given a fixed training distribution $\mathrm{P}(\mathrm{x},\mathrm{p})$ respecting assumptions 4.3-4.5, for any 3D-HPE model $f$ predicting consistent poses, i.e., poses that respect assumption 4.4, there is an inconsistent model $f^{\prime}$ with lower mean-squared error.

proof. Let ${f^{\prime}\in\operatorname*{arg\,min}_{\tilde{f}}\text{MSE}^{*}(\tilde{f})}$ . According to proposition 4.8, $f^{\prime}$ is inconsistent. Suppose that the consistent model $f$ is such that

	$\displaystyle\text{MSE}^{*}(f)\leq\text{MSE}^{*}(f^{\prime})\,.$		(25)

Since $\text{MSE}^{*}$ reaches its minimum at $f^{\prime}$ , we have $\text{MSE}^{*}(f)=\text{MSE}^{*}(f^{\prime})$ . Hence, $f\in\operatorname*{arg\,min}_{\tilde{f}}\text{MSE}^{*}(\tilde{f})$ , which means that $f$ is also inconsistent according to proposition 4.8. This is impossible given that we assumed $f$ to be consistent. Thus, Eq. (25) cannot hold and we conclude that

	$\displaystyle\text{MSE}^{*}(f)>\text{MSE}^{*}(f^{\prime})\,.$		(26)

$\blacksquare$

Note that proposition 4.8 and corollary 9.1 assume the use of the MSE loss, which is the most widely used loss in 3D-HPE. We can however extend them to the case where MPJPE serves as optimization criterion under an additional technical assumption:

Corollary 9.2.

The predicted poses minimizing the mean-per-joint-position-error loss are inconsistent if the training poses distribution $\mathrm{P}(\mathrm{x},\mathrm{p})$ verifies Asm. 4.3-4.5 and if the joint-wise residuals’ norm standard deviation is small compared to the joint-wise loss:

	$\displaystyle 0\leq j<J\,,\quad\frac{\sqrt{\mathbb{V}_{\mathrm{x},\mathrm{p}}\big[\|\mathrm{p}_{j}-f_{j}(\mathrm{x})\|_{2}\big]}}{\mathbb{E}_{\mathrm{x},\mathrm{p}}\big[\|\mathrm{p}_{j}-f_{j}(\mathrm{x})\|_{2}\big]}\simeq 0\,.$		(27)

proof. From proposition 4.8 we know that the poses predicted by the minimizer $f^{*}$ of

	$\displaystyle\text{MSE}^{*}(f)=\mathbb{E}_{\mathrm{x},\mathrm{p}}\big[\|\mathrm{p}-f(\mathrm{x})\|^{2}_{2}\big]$		(28)

are inconsistent. Let $f_{j}$ be the component of $f$ corresponding to the $j^{\text{th}}$ joint. We define the $j^{\text{th}}$ mean-per-joint-position-error component as:

	$\displaystyle\text{MPJPE}_{j}^{*}(f)\triangleq\mathbb{E}_{\mathrm{x},\mathrm{p}}\big[\|\mathrm{p}_{j}-f_{j}(\mathrm{x})\|_{2}\big]\,.$		(29)

Under the small variance assumption, we have:

	$\displaystyle\frac{\mathbb{V}_{\mathrm{x},\mathrm{p}}\big[\|\mathrm{p}_{j}-f_{j}(\mathrm{x})\|_{2}\big]}{\mathbb{E}_{\mathrm{x},\mathrm{p}}\big[\|\mathrm{p}_{j}-f_{j}(\mathrm{x})\|_{2}\big]^{2}}$		(30)
	$\displaystyle\quad=\frac{\mathbb{E}_{\mathrm{x},\mathrm{p}}\big[\|\mathrm{p}_{j}-f_{j}(\mathrm{x})\|^{2}_{2}\big]-\mathbb{E}_{\mathrm{x},\mathrm{p}}\big[\|\mathrm{p}_{j}-f_{j}(\mathrm{x})\|_{2}\big]^{2}}{\mathbb{E}_{\mathrm{x},\mathrm{p}}\big[\|\mathrm{p}_{j}-f_{j}(\mathrm{x})\|_{2}\big]^{2}}$		(31)
	$\displaystyle\quad=\frac{\text{MSE}_{j}^{*}(f)-\text{MPJPE}_{j}^{*}(f)^{2}}{\text{MPJPE}_{j}^{*}(f)^{2}}\simeq 0\,,$		(32)

where $\text{MSE}_{j}^{*}(f)\triangleq\mathbb{E}_{\mathrm{x},\mathrm{p}}\big[\|\mathrm{p}_{j}-f_{j}(\mathrm{x})\|^{2}_{2}\big]$ denotes the contribution of the $j^{\text{th}}$ joint to $\text{MSE}^{*}(f)$ . Both criteria, MSE and MPJPE, are hence asymptotically equivalent and have the same minimizer $f^{*}$ , which is inconsistent according to proposition 4.8. $\blacksquare$

10 Further details of 1D-to-2D case study

10.1 Implementation details

Datasets. We created a dataset of input-output pairs $\{(x_{i},(x_{i},y_{i}))\}_{i=1}^{N}$ , divided into $1,000$ training examples, $1,000$ validation examples and $1,000$ test examples. Since the 2D position of $J_{1}$ is fully determined by the angle $\theta$ between the segment $(J_{0},J_{1})$ and the $x$ -axis, the dataset is generated by first sampling $\theta$ from a von Mises mixture distribution, then converting it into Cartesian coordinates $(x_{i},y_{i})$ to form the outputs, and finally projecting them onto the $x$ -axis to obtain the inputs.

Distribution scenarios. We considered three different distribution scenarios with different levels of difficulty:

1. Easy scenario: a unimodal distribution centered at ${\theta=\frac{2\pi}{5}}$ , where the axis of maximum 2D variance is approximately parallel to the $x$ -axis (Figure 3 A).
2. Difficult unimodal scenario: a unimodal distribution centered at $\theta=0$ , where the axis of maximum 2D variance is perpendicular to the $x$ -axis (Figure 3 B).
3. Difficult multimodal scenario: a bimodal distribution, with modes at $\theta_{1}=\frac{\pi}{3}$ and $\theta_{2}=-\frac{\pi}{3}$ and mixture weights $w_{1}=\frac{2}{3}$ and $w_{2}=\frac{1}{3}$ , so that the projections of the modes onto the $x$ -axis are close to each other (Figure 3 C).

All von Mises components in all scenarios had concentrations equal to $20$ .
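The generation procedure described above can be sketched as follows with NumPy. The function name, the unit segment length and the seed handling are assumptions made for this example; only the sampling scheme (von Mises mixture on $\theta$ , conversion to Cartesian coordinates, projection onto the $x$ -axis) follows the text.

```python
import numpy as np


def sample_dataset(n, modes, weights, kappa=20.0, radius=1.0, seed=0):
    """Sample (input, output) pairs for the 1D-to-2D lifting toy problem.

    modes, weights: locations and mixture weights of the von Mises components.
    kappa:          concentration shared by all components.
    radius:         length of the segment (J0, J1); unit length is assumed here.
    """
    rng = np.random.default_rng(seed)
    modes = np.asarray(modes, dtype=float)
    # Pick a mixture component for each example, then sample its angle.
    comp = rng.choice(len(modes), size=n, p=weights)
    theta = rng.vonmises(modes[comp], kappa)
    outputs = radius * np.stack([np.cos(theta), np.sin(theta)], axis=-1)
    inputs = outputs[:, 0]              # projection onto the x-axis
    return inputs, outputs


# Difficult multimodal scenario (third item of the list above).
inputs, outputs = sample_dataset(1000, [np.pi / 3, -np.pi / 3], [2 / 3, 1 / 3])
```

Since both modes share the same cosine, the inputs alone cannot tell them apart, which is what makes this scenario difficult.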

Architectures and training. All three models were based on a multi-layer perceptron (MLP) with 2 hidden layers of $32$ neurons each, using tanh activation.

The constrained and unconstrained MLPs were trained using the mean-squared loss $\frac{1}{N}\sum_{i=1}^{N}((\hat{x}_{i}-x_{i})^{2}+(\hat{y}_{i}-y_{i})^{2})$ . ManiPose was trained with loss (7), and had $K=2$ heads. We trained all models with batches of $100$ examples for a maximum of $50$ epochs. We used the Adam optimizer [12], with default hyperparameters and no weight decay. Learning rates were searched for each model and distribution independently over a small grid: $[10^{-5},10^{-4},10^{-3},10^{-2}]$ (cf. selected values in Tab. 6). They were scheduled during training using a plateau strategy of factor $0.5$ , patience of $10$ epochs and threshold of $10^{-4}$ .

Distribution	A	B	C
Unconstr. MLP	$10^{-3}$	$10^{-3}$	$10^{-2}$
Constrained MLP	$10^{-2}$	$10^{-4}$	$10^{-2}$
ManiPose	$10^{-2}$	$10^{-3}$	$10^{-2}$

Table 6: Selected learning rates for 1D-to-2D synthetic experiment.

10.2 Extension to 2D-to-3D setup with more joints

In this section, we further extend the 2-joint 1D-to-2D lifting experiment from Sec. 3 to the case of 2D-to-3D lifting with three joints. The idea is to get one step closer to the 3D-HPE scenario.

As in Sec. 3, we suppose that joint $J_{0}$ is at the origin at all times, that $J_{1}$ is connected to $J_{0}$ through a rigid segment of length $s_{0}$ and that $J_{2}$ is connected to $J_{1}$ through a second rigid segment of length $s_{1}<s_{0}$ . We further assume that both $J_{1}$ and $J_{2}$ are allowed to rotate around two axes orthogonal to each other. Hence, $J_{1}$ is constrained to lie on a circle $S^{1}(0,s_{0})$ , while $J_{2}$ lies on a torus $T$ homeomorphic to $S^{1}(0,s_{0})\times S^{1}(0,s_{1})$ . Without loss of generality, we set the radii $s_{0}=2$ and $s_{1}=1$ and assume them to be known.

Given this setup, we are interested in learning to predict the 3D pose ${(J_{1},J_{2})=(x_{1},y_{1},z_{1},x_{2},y_{2},z_{2})\in\mathbb{R}^{6}}$ , given its 2D projection ${(K_{1},K_{2})=(x_{1},z_{1},x_{2},z_{2})\in\mathbb{R}^{4}}$ . We create a dataset comprising $20,000$ training, $2,000$ validation, and $2,000$ test examples, sampled using an arbitrary von Mises mixture of poloidal and toroidal angles $(\theta,\phi)$ in $T$ . We set the modes of such a mixture at $[(-\pi,0),(0,\pi/4),(\frac{1}{2},-\pi/4),(2\pi/3,\pi/2)]$ , with concentrations of $[2,4,3,10]$ and weights $[0.3,0.4,0.2,0.1]$ . Similarly to Fig. 3 C, this is a difficult multimodal distribution, which is depicted in Fig. 8.
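One way to generate such data is sketched below. Several choices are assumptions of this example rather than details given in the text: the standard torus embedding with $J_{1}$ moving in the $xy$ -plane, independent von Mises draws for the two angles of each mixture component with a shared concentration, and the function name.

```python
import numpy as np


def sample_torus_poses(n, modes, kappas, weights, s0=2.0, s1=1.0, seed=0):
    """Sample 3D poses (J1, J2) on the torus and their (x, z) projections."""
    rng = np.random.default_rng(seed)
    modes = np.asarray(modes, dtype=float)      # (M, 2): (theta, phi) per mode
    kappas = np.asarray(kappas, dtype=float)
    comp = rng.choice(len(modes), size=n, p=weights)
    theta = rng.vonmises(modes[comp, 0], kappas[comp])
    phi = rng.vonmises(modes[comp, 1], kappas[comp])
    # Assumed embedding: J1 on a circle of radius s0 in the xy-plane.
    radial = np.stack([np.cos(theta), np.sin(theta), np.zeros(n)], axis=-1)
    up = np.array([0.0, 0.0, 1.0])
    j1 = s0 * radial
    # J2 on a circle of radius s1 around J1, in the plane spanned by radial/up.
    j2 = j1 + s1 * (np.cos(phi)[:, None] * radial + np.sin(phi)[:, None] * up)
    poses3d = np.concatenate([j1, j2], axis=-1)     # (n, 6)
    inputs2d = poses3d[:, [0, 2, 3, 5]]             # (x1, z1, x2, z2)
    return inputs2d, poses3d


modes = [(-np.pi, 0.0), (0.0, np.pi / 4), (0.5, -np.pi / 4), (2 * np.pi / 3, np.pi / 2)]
inputs2d, poses3d = sample_torus_poses(
    2000, modes, kappas=[2, 4, 3, 10], weights=[0.3, 0.4, 0.2, 0.1]
)
```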

We train and evaluate the same baselines as in Sec. 3 in this new scenario, using a similar setup (cf. Sec. 10.1, Architectures and training). The corresponding Mean Per Segment Consistency Error (MPSCE) and Mean Per Joint Position Error (MPJPE) results are reported in Tab. 7.

	MPJPE $\downarrow$	MPSCE $\downarrow$
Unconst. MLP	1.1468	0.2539
Constrained MLP	1.1593	0.0000
ManiPose	1.1337	0.0000

Table 7: Mean per joint position error (MPJPE) and mean per segment consistency error (MPSCE) in a 2D-to-3D scenario. ManiPose reaches perfect MPSCE consistency without degrading MPJPE performance.

We see that the same observations as in Sec. 3 also apply here: although the unconstrained MLP yields competitive MPJPE results, its predictions are not consistently aligned with the manifold, as indicated by its poor MPSCE performance. Again, we show here that ManiPose offers an effective balance between maintaining manifold consistency and achieving high joint-position-error performance.

11 Further ManiPose implementation details

11.1 Architectural details

Our architecture is backbone-agnostic, as shown on Fig. 4. Hence, in order to have a fair comparison, we decided to implement it using the most powerful architecture available, i.e., MixSTE [40].

In practice, the rotations module follows the MixSTE architecture with $d_{l}=8$ spatio-temporal transformer blocks of dimension $d_{m}=512$ and time receptive field of $L=243$ frames for Human 3.6M experiments and $L=43$ frames for MPI-INF-3DHP experiments. Contrary to MixSTE, this network outputs rotation embeddings of dimension $6$ for each joint and frame, instead of Cartesian coordinates of dimension $3$ .

The segment module was also implemented with a MixSTE backbone, but a smaller one, of depth $d_{l}=2$ and dimension $d_{m}=128$ .

The ablation study presented in Tab. 5 shows that the increase in the number of parameters between MixSTE and ManiPose is negligible.

11.2 Pose decoding details

The pose decoding block from Fig. 4 is described in Sec. 5.1 and is based on Algorithms 1 and 2. The whole procedure is illustrated on Fig. 9.

Joint	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16
Weight	1	1	2.5	2.5	1	2.5	2.5	1	1	1	1.5	1.5	4	4	1.5	4	4

Table 8: Joint-wise weights used in the Winner-takes-all loss Eq. 8 (as in [40]).

Algorithm 1 6D rotation representation conversion [42]

1: Require: Predicted 6D rotation representation $r\in\mathbb{R}^{6}$
2: $x^{\prime}\leftarrow[r_{0},r_{1},r_{2}]$
3: $y^{\prime}\leftarrow[r_{3},r_{4},r_{5}]$
4: $x\leftarrow x^{\prime}/\|x^{\prime}\|_{2}$
5: $z^{\prime}\leftarrow x\wedge y^{\prime}$
6: $z\leftarrow z^{\prime}/\|z^{\prime}\|_{2}$
7: $y\leftarrow z\wedge x$
8: return $R=[x|y|z]\in\mathbb{R}^{3\times 3}$
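As an illustration, Algorithm 1 can be sketched in plain Python as follows. This is a minimal, dependency-free sketch rather than the implementation used in our experiments: the function names (`cross`, `normalize`, `rotation_from_6d`) are hypothetical, the wedge is read as the usual cross product, and the input is assumed to be non-degenerate (non-zero $x^{\prime}$, and $y^{\prime}$ not collinear with $x^{\prime}$).

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors (the wedge of Algorithm 1)."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def normalize(v):
    """Scale a 3-vector to unit Euclidean norm (assumes a non-zero input)."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]


def rotation_from_6d(r):
    """Map a 6D representation r to a 3x3 rotation matrix R = [x | y | z]."""
    x = normalize(r[0:3])            # x <- x' / ||x'||
    z = normalize(cross(x, r[3:6]))  # z <- (x ^ y') / ||x ^ y'||
    y = cross(z, x)                  # y <- z ^ x, unit norm by construction
    # x, y and z are the COLUMNS of R; the matrix is returned as a list of rows.
    return [[x[i], y[i], z[i]] for i in range(3)]
```

For instance, `rotation_from_6d([2, 0, 0, 1, 3, 0])` returns the identity matrix: the first triplet is only normalized, and the component of the second triplet along $x$ is discarded.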

Algorithm 2 Forward Kinematics [27, 20]

1: Require: Scaled reference pose $u^{\prime}\in(\mathbb{R}^{3})^{J}$, predicted rotation matrices $R_{t,j}$, $0\leq j<J$
2: $R^{\prime}_{t,0}\leftarrow R_{t,0}$
3: $\mathrm{p}_{t,0}\leftarrow\mathrm{u}^{\prime}_{0}$
4: for $j=1,\dots,J-1$ do
5:     $R^{\prime}_{t,j}\leftarrow R_{t,j}R^{\prime}_{t,\tau(j)}$  $\triangleright$ Compose relative rotations
6:     $\mathrm{p}_{t,j}\leftarrow R^{\prime}_{t,j}(\mathrm{u}^{\prime}_{j}-\mathrm{u}^{\prime}_{\tau(j)})+\mathrm{p}_{t,\tau(j)}$
7: end for
8: return $\mathrm{p}_{t}=[\mathrm{p}_{t,j}]_{0\leq j<J}$
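The forward kinematics of Algorithm 2 can be sketched for a single frame $t$ as below. This is an illustrative sketch under stated assumptions, not our released code: the names `matmul`, `matvec`, `forward_kinematics` and `parents` are hypothetical, `parents[j]` plays the role of $\tau(j)$, and joints are assumed to be ordered so that every parent index is smaller than the index of its children (the entry `parents[0]` is never read).

```python
def matmul(a, b):
    """Product of two 3x3 matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec(a, v):
    """Product of a 3x3 matrix with a 3-vector."""
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]


def forward_kinematics(ref_pose, rotations, parents):
    """Decode one frame: ref_pose[j] is u'_j, rotations[j] is R_{t,j}."""
    num_joints = len(ref_pose)
    composed = [None] * num_joints   # R'_{t,j}
    pose = [None] * num_joints       # p_{t,j}
    composed[0] = rotations[0]
    pose[0] = list(ref_pose[0])
    for j in range(1, num_joints):
        parent = parents[j]
        # Compose relative rotations: R'_{t,j} <- R_{t,j} R'_{t,tau(j)}
        composed[j] = matmul(rotations[j], composed[parent])
        # Rotate the reference segment and attach it to the parent joint.
        segment = [ref_pose[j][i] - ref_pose[parent][i] for i in range(3)]
        rotated = matvec(composed[j], segment)
        pose[j] = [rotated[i] + pose[parent][i] for i in range(3)]
    return pose
```

Since each segment is only rotated before being attached to its parent, segment lengths in the decoded pose are those of the scaled reference pose, whatever the predicted rotations.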

11.3 Training details

Training tactics. In order to have a fair comparison with MixSTE [40], we trained ManiPose using the same training tactics, such as pose flip augmentation both at training and test time. Moreover, the training loss (7) was complemented with two additional terms described in [40]:

1. a TCloss term, initially introduced in [9];
2. a velocity loss term, introduced in [29].

We also weighted the Winner-takes-all MPJPE loss (8) as in [40] (cf. weights in Tab. 8). The score loss weight, $\beta$ , was set to $0.1$ according to our hyperparameter study (Sec. 12), while TCloss and velocity loss terms had respective weights of $0.5$ and $2$ (values from [40]).
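The weighting described above can be sketched for a single frame as follows. This is a hedged illustration, not our training code: the function names are hypothetical, the weighted MPJPE is assumed to be the mean over joints of the weighted Euclidean joint errors, the winner is assumed to be the hypothesis with the lowest weighted error, and the terms are assumed to be combined as a plain weighted sum. The score, TCloss and velocity terms are taken as precomputed numbers.

```python
import math

# Joint-wise weights from Tab. 8 (joints 0 to 16).
JOINT_WEIGHTS = [1, 1, 2.5, 2.5, 1, 2.5, 2.5, 1, 1, 1, 1.5, 1.5, 4, 4, 1.5, 4, 4]


def weighted_mpjpe(pred, target, weights=JOINT_WEIGHTS):
    """Mean over joints of the weighted Euclidean error (assumed form)."""
    errors = [w * math.dist(p, t) for w, p, t in zip(weights, pred, target)]
    return sum(errors) / len(errors)


def winner_takes_all(hypotheses, target, weights=JOINT_WEIGHTS):
    """Return (loss, index) of the hypothesis closest to the ground truth."""
    losses = [weighted_mpjpe(h, target, weights) for h in hypotheses]
    best = min(range(len(losses)), key=losses.__getitem__)
    return losses[best], best


def total_loss(wta, score, tc, velocity, beta=0.1, w_tc=0.5, w_vel=2.0):
    """Assumed weighted sum of the four terms, with the weights quoted above."""
    return wta + beta * score + w_tc * tc + w_vel * velocity
```

Only the winning hypothesis contributes to the position term, so the other hypotheses remain free to cover alternative plausible poses.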

Training settings. We trained our model for a maximum of $200$ epochs with the Adam optimizer [12], using default hyperparameters, a weight decay of $10^{-6}$ and an initial learning rate of $4\times 10^{-5}$ . The latter was reduced with a plateau scheduler of factor $0.5$ , patience of $11$ epochs and threshold of $0.1$ mm. Batches contained $3$ sequences of $L=243$ frames each for the Human 3.6M training, and $30$ sequences of $L=43$ frames for MPI-INF-3DHP.
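The learning-rate schedule can be sketched with a small stand-alone class. It mimics the usual behaviour of plateau schedulers and is not the scheduler object used in our code: the class name is hypothetical, the monitored quantity is assumed to be a validation error in mm, and the threshold of $0.1$ mm is assumed to be an absolute improvement margin.

```python
class PlateauScheduler:
    """Halve the learning rate when the monitored error stops improving."""

    def __init__(self, lr=4e-5, factor=0.5, patience=11, threshold=0.1):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.threshold = threshold   # absolute margin, in mm (assumption)
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_error_mm):
        """Update the state with one epoch's error and return the new rate."""
        if val_error_mm < self.best - self.threshold:
            # Significant improvement: record it and reset the counter.
            self.best = val_error_mm
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        if self.bad_epochs > self.patience:
            # More than `patience` epochs without improvement: decay.
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr
```

With these settings, the rate drops from $4\times 10^{-5}$ to $2\times 10^{-5}$ on the twelfth consecutive epoch that fails to improve the best error by more than $0.1$ mm.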

11.4 Baselines evaluation

All Human 3.6M evaluations of MPSSE and MPSCE listed in Tabs. 2 and 5 were performed using the official checkpoints of these methods and their corresponding official evaluation scripts. Concerning MPI-INF-3DHP evaluations from Tab. 4, checkpoints were not available (except for P-STMO). Baseline models were hence retrained from scratch using the official MPI-INF-3DHP training scripts provided by the authors of each work, using hyperparameters reported in their corresponding papers. We checked that we were able to reproduce the reported MPJPE results.

12 Further results on H36M dataset

	$L$	$K$	Dir.	Disc	Eat	Greet	Phone	Photo	Pose	Purch.	Sit	SitD.	Smoke	Wait	WalkD.	Walk	WalkT.	Avg.
MGCN [43]	1	1	35.7	38.6	36.3	40.5	39.2	44.5	37.0	35.4	46.4	51.2	40.5	35.6	41.7	30.7	33.9	39.1
ST-GCN [2]	1	1	35.7	37.8	36.9	40.7	39.6	45.2	37.4	34.5	46.9	50.1	40.5	36.1	41.0	29.6	33.2	39.0
Pavllo et al. [29]	243	1	34.2	36.8	33.9	37.5	37.1	43.2	34.4	33.5	45.3	52.7	37.7	34.1	38.0	25.8	27.7	36.8
Zheng et al. [41]	81	1	34.1	36.1	34.4	37.2	36.4	42.2	34.4	33.6	45.0	52.5	37.4	33.8	37.8	25.6	27.3	36.5
Liu et al. [22]	243	1	32.3	35.2	33.3	35.8	35.9	41.5	33.2	32.7	44.6	50.9	37.0	32.4	37.0	25.2	27.2	35.6
Anatomy3D [4]	243	1	32.6	35.1	32.8	35.4	36.3	40.4	32.4	32.3	42.7	49.0	36.8	32.4	36.0	24.9	26.5	35.0
UGCN [35]	96	1	31.8	34.3	35.4	33.5	35.4	41.7	31.1	31.6	44.4	49.0	36.4	32.2	35.0	24.9	23.0	34.5
MixSTE [40]	243	1	30.8	33.1	30.3	31.8	33.1	39.1	31.1	30.5	42.5	44.5	34.0	30.8	32.7	22.1	22.9	32.6
ManiPose (Ours)	243	5	31.9	35.7	30.8	33.5	34.0	39.8	33.0	31.4	41.1	45.9	36.0	32.3	35.4	24.7	25.8	34.1

Table 9: Quantitative comparison with the state-of-the-art methods on Human3.6M under Protocol #2 (P-MPJPE in mm), using detected 2D poses. $L$: sequence length. $K$: number of hypotheses. Bold: best; Underlined: second best. ManiPose results are reported using the oracle evaluation.

Protocol #2 results. A detailed quantitative comparison in terms of P-MPJPE between ManiPose and regression-based state-of-the-art methods is shown in Tab. 9. We observe the same patterns as in Tab. 3, namely that ManiPose reaches the second-best P-MPJPE performance on average and for most actions. We confirm here again that the substantial improvements in pose consistency brought by ManiPose are not obtained at the expense of traditional metrics derived from MPJPE.

Errors per joint. At the top of Fig. 10, we see that most MixSTE errors come from the foot, elbow and wrist joints, which are the most prone to depth ambiguity. ManiPose helps to reduce the position errors for most of these ambiguous joints, probably as a byproduct of its major consistency improvements shown in Fig. 5.

Impact of hyperparameters. ManiPose introduces two additional hyperparameters when compared to MixSTE: the number $K$ of hypotheses and the score loss weight $\beta$ (cf. Eq. 7). We hence assess the impact of their respective values on MPJPE. For computational cost reasons, we used a smaller version of our model for this study, with transformer blocks of dimension $d_{m}=64$ and time receptive field of $L=27$ frames.

Fig. 11 (left) shows that more hypotheses help, but that the performance improvements saturate around 5 hypotheses. Concerning $\beta$ , Fig. 11 (right) shows that lower values help to improve the MPJPE performance.

13 Code

The code to reproduce the 1D-to-2D experiment from Sec. 3 is attached to this appendix. The full code of this work will be made openly available upon publication.