EPOCH: Jointly Estimating the 3D Pose of Cameras and Humans

Nicola Garau† (University of Trento, Epic Games), Giulia Martinelli (University of Trento), Niccolò Bisagno (University of Trento), Denis Tomè (Epic Games), Carsten Stoll (Epic Games)
Abstract

Monocular Human Pose Estimation (HPE) aims at determining the 3D positions of human joints from a single 2D image captured by a camera. However, a single 2D point in the image may correspond to multiple points in 3D space. Typically, the uniqueness of the 2D-3D relationship is approximated using an orthographic or weak-perspective camera model. In this study, instead of relying on approximations, we advocate for utilizing the full perspective camera model. This involves estimating camera parameters and establishing a precise, unambiguous 2D-3D relationship.

To do so, we introduce the EPOCH framework, comprising two main components: the pose lifter network (LiftNet) and the pose regressor network (RegNet). LiftNet utilizes the full perspective camera model to precisely estimate the 3D pose in an unsupervised manner. It takes a 2D pose and camera parameters as inputs and produces the corresponding 3D pose estimation. These inputs are obtained from RegNet, which starts from a single image and provides estimates for the 2D pose and camera parameters. RegNet utilizes only 2D pose data as weak supervision. Internally, RegNet predicts a 3D pose, which is then projected to 2D using the estimated camera parameters. This process enables RegNet to establish the unambiguous 2D-3D relationship.

Our experiments show that modeling the lifting as an unsupervised task with a camera in-the-loop results in better generalization to unseen data. We obtain state-of-the-art results for the 3D HPE on the Human3.6M and MPI-INF-3DHP datasets. Our code is available at: [Github link upon acceptance, see supplementary materials].

† Work primarily done during an internship at Epic Games.
Figure 1: (a) In human pose estimation, classical approaches regress the 2D/3D joint locations directly from an image. If the ground truth is available, the camera parameters can be used/learned to refine the accuracy. (b) Lifting approaches aim at retrieving the depth of each 2D joint to obtain the 3D pose. (c) We propose a novel paradigm, directly estimating the 3D pose and the camera from images. The 2D pose can be calculated by projecting the 3D coordinates to image space using the camera parameters. (d) Starting from the estimated 2D poses and camera parameters, we perform the lifting to 3D, improving performance with respect to current approaches.

1 Introduction

There are two main approaches to monocular 3D human pose estimation (HPE) from a single RGB image [18]. One class of algorithms uses a single-stage approach, where the aim is to regress the 3D position of human joints directly from an image [38, 39, 45]. The other class uses two distinct stages: the first step infers 2D poses from monocular RGB images, and is followed by a lifter network that predicts the 3D displacement for each of the 2D joints. Two-stage approaches typically outperform single-stage approaches [57].

Estimating 3D human poses from a single RGB image is difficult as the problem is inherently ill-posed. For any 2D observation there exist multiple plausible 3D poses that will lead to the same 2D projection [56, 33]. Additionally, collecting reliable ground-truth 3D data is difficult. Annotating 3D ground truth on 2D images inevitably introduces inaccuracies, and collecting actual ground truth requires a complex and expensive controlled environment using multi-view camera systems or additional capture modalities. Even using multiple views, triangulation can lead to errors, as there are inherent ambiguities in the position of the joint under the body surface. While limited datasets providing 3D data are available [17, 19], 2D datasets still provide more data in more general scenarios and environments.

To address the ill-posed nature of the problem, past approaches have relied on different strategies, like fully supervised training using either real [6, 44] or synthetic 3D ground truth [28, 26], weakly supervised training relying on multiple views: either paired [47, 51] or unpaired [50], 2D supervision [46], or video motion consistency [16, 10, 15]. Unsupervised approaches have relied on cycle consistency coupled with a weak perspective camera projection to lift 2D poses to 3D [49, 2]. Relying on a weak perspective camera projection is not ideal because it does not accurately capture the perspective transformation [14], leading to depth inaccuracy and scale ambiguity when projecting a 2D skeleton in the 3D space. Recent works [23] have shown that using fully perspective cameras reduces this ambiguity.

In this paper, we introduce EPOCH, a novel unsupervised framework that effectively addresses the challenges of data scarcity by harnessing unsupervised techniques and mitigates the inherent ill-posed nature of the problem through explicit camera modeling, as shown in Fig. 1. Our approach stands out due to its capability to estimate the full perspective camera parameters leveraging only 2D human poses, without relying on any camera ground truth. We claim that by incorporating the estimated camera into the 3D lifting operation, it is possible to enhance the accuracy and the consistency of 3D unsupervised human pose estimation while generalizing to unseen data. Our method consists of an unsupervised 3D human pose lifter network (LiftNet) and a lightweight capsule-based regressor network (RegNet) that estimates the camera pose and 2D joint positions.

LiftNet performs the 3D lifting from estimated 2D poses and camera parameters. Inspired by [2], we employ a self-supervising cycle-consistent framework. Unlike [49, 2], our approach uses a full perspective camera, allowing us to use a wider range of camera transformations for supervision in our cycle consistency, and improving the accuracy of the model.

We estimate the camera pose and 2D poses used as input for the lifting stage using RegNet, a lightweight capsule-based regressor network that is trained on weakly supervised 2D pose data instead of fully supervised data as in [23]. It uses contrastive pre-training and heatmap-free joint position regression to estimate the 2D poses as well as the intrinsic (the camera matrix $[K]$) and extrinsic parameters (rotation matrix $[R]$ and translation vector $T$) of a camera based on the standard pinhole camera model. To supervise the camera estimation with the 2D pose data, we internally predict a 3D pose that is then projected to 2D with the estimated camera. The internally predicted 3D pose is not quite as accurate as the refined output of LiftNet but helps regularize the camera estimation.

With no prior about 3D human appearance, both LiftNet and RegNet estimate a single 2D projection which is not enough to guarantee a plausible 3D pose. Thus, we employ Normalizing Flows (NF) to ensure the plausibility of multiple 2D projections of a single 3D estimate. Different from previous approaches [49], our NF is based on simple 1x1 convolutions [20], which can be applied to the full feature representation of the poses, without the need to reduce their dimensionality using the Principal Component Analysis (PCA).

The EPOCH framework is the sequential combination of RegNet and LiftNet, which allows for the direct inference of accurate 3D poses from images. RegNet estimates 3D poses with weak 2D pose supervision, while the camera parameters are estimated without any ground truth camera data. LiftNet predicts 3D poses based on the estimates of 2D poses and camera poses. We argue that RegNet is weakly supervised, whereas LiftNet is fully unsupervised as it relies solely on estimates, without any ground truth data. This reasoning applies to both poses and cameras.

The novelty of our work can be summarized as:

  • We define an innovative EPOCH framework to address the challenges of data scarcity and the ill-posed nature of the 3D HPE problem by harnessing the full camera perspective projection, enabling the direct lifting of accurate 3D poses from input images.

  • We present LiftNet, a novel 3D unsupervised HPE framework that leverages perspective camera projection to improve the accuracy of 2D pose lifting.

  • We introduce an original capsule-based regressor, called RegNet, which jointly estimates 3D joint positions and camera parameters. Final 2D joints are computed using the estimated camera by perspective projection.

  • We adopt a lightweight Normalizing Flows [20] model to enforce anthropomorphic constraints. Our NF accepts the entire 2D skeleton as input without the need for dimensionality reduction using PCA.

  • We obtain state-of-the-art results on both 3D HPE direct regression and 3D unsupervised HPE on the common benchmark datasets Human3.6M and MPI-INF-3DHP.

2 Related work

3D human pose estimation (HPE) from monocular 2D images has been extensively researched through supervised, weakly-supervised, and unsupervised approaches [57, 32]. This section gives an overview of the different methods, also focusing on the challenging in-the-wild approaches.

Fully supervised approaches. In this paradigm the 3D ground truth is readily available. Gathering such data requires collecting vast datasets such as Human3.6M [17], 3DPW [48], MPI-INF-3DHP [36] and CMU Panoptic [19]. In supervised methods, two primary strategies exist: direct regression of 3D coordinates from the image [38, 39, 45] (Fig. 1(a)), or 2D pose estimation followed by lifting to 3D [52, 53, 58, 5] (Fig. 1(b)). Direct regression proves more challenging because it involves the simultaneous estimation of 3D coordinates for each joint, often leading to inferior results compared to lifting-based techniques [57]. Supervised networks achieve the best results on multiple datasets [58, 5], but often struggle to generalize to different scenarios like out-of-distribution poses, challenging camera angles and in-the-wild pose estimation [57].

Weakly-supervised approaches rely on the lifting framework [47, 51, 50, 46, 15, 55, 30, 27, 8] using various supervision signals without directly accessing the 3D ground truth paired with the corresponding 2D image. For instance, multiple paired or unpaired views of the same subject provide a supervision signal through the consistency of the estimated 3D pose seen from different viewpoints [47, 51, 50, 42, 27, 24]. In monocular approaches, temporally correlated 2D poses can be estimated from an input video and used as a supervision signal for a frame-specific 3D pose estimation [15, 55, 30]. In-the-wild approaches have mostly relied on 2D poses as ground truth to supervise intermediate 3D estimates [54, 12]. Other approaches perform monocular 3D pose estimation using only 2D pose supervision [27, 9, 2, 50].

Following this line of work, our regressor network (RegNet) is a novel weakly-supervised approach that uses 2D poses for supervision, computing the 3D to 2D projection via estimated camera parameters. Moreover, our approach is the first that jointly estimates the full perspective camera parameters without relying on any ground truth camera.

Unsupervised approaches. Unsupervised approaches usually rely on the lifting paradigm, employing multiple different signals to regularise their 3D predictions. In [2], the authors proposed an unsupervised lifting network grounded in closure and invariance properties, incorporating a geometric self-consistency loss. The closure property for a lifted 3D skeleton means that, after random rotation and re-projection, the resulting 2D skeleton will lie within the distribution of valid 2D poses. The invariance property means that, when changing the viewpoint of the 2D projection from a 3D skeleton, the re-lifted 3D skeleton should be the same. Following a similar concept, [49] introduces a weak camera projection to model the lift-reproject-lift process. The weak camera projection is coupled with the elevation estimation of the camera, providing an approximation of the full camera perspective model. Moreover, they introduce the use of Normalizing Flows (NFs) [7], which are used to ensure the closure property more accurately than GAN-based methods [2]. In [13] a similar framework is extended to a multi-person scenario, where relative positions are also used as supervision signals.

Our lifter network (LiftNet) follows this line of work while introducing the following novelties: (i) we employ a full perspective camera model for the projection, making it much more accurate and robust to varying focal lengths, (ii) we drop the need for an additional elevation prediction branch in the lifting network [49], (iii) we avoid applying PCA to reduce data dimensionality by training a normalizing flow based on Glow [20] instead of RealNVP, (iv) we add a geometric constraint on unnatural joint folds.

3 Method

3.1 Preliminaries

Camera model. While many prior works for 3D human pose estimation rely on simplified weak perspective camera models, we use a full perspective camera model, consisting of intrinsic parameters $K$ (focal length and center of projection) and extrinsic parameters $R$ and $T$ (rotation and translation of the camera respectively). We transform a 3D point $(X,Y,Z)$ into camera coordinates by multiplying with the extrinsic and intrinsic matrices, and get the final image space coordinates $I=[u/w,v/w]$:

$$\begin{bmatrix}u\\ v\\ w\end{bmatrix}_{2D}=[\mathbf{K}][\mathbf{R}|\mathbf{T}]\begin{bmatrix}X\\ Y\\ Z\\ 1\end{bmatrix}_{3D} \tag{1}$$

To invert the projection given image coordinates $\tilde{I}$, we need to estimate the unknown depth $w$ of the point. Similar to [31] we estimate the intrinsic parameters directly from the full image size and the bounding box crop using a model $\Psi$. While this is only an approximation of the real field-of-view of the camera, prior work [21] has shown this to be a good approximation for most real life cameras used. The estimated intrinsic camera matrix $[K]$ includes the focal length $f=(f_w,f_h)$ and the principal point $c=(c_w,c_h)$. Moreover, we estimate a scaling factor $s=(s_w,s_h)$ for the regularisation of the image size and skeleton height. See Supplementary Materials for further details on model $\Psi$.
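As an illustration, the projection of Eq. 1 and its inversion for a known depth $w$ can be sketched in NumPy. This is a minimal sketch under stated assumptions, not the implementation used in EPOCH: the function names `project` and `backproject` are hypothetical, points are stored as rows, and $K$ is assumed to be a 3x3 matrix whose last row is $[0, 0, 1]$, so that $w$ equals the depth in the camera frame.

```python
import numpy as np

def project(points_3d, K, R, T):
    """Eq. 1: map (N, 3) world points to (N, 2) pixels and (N,) depths w."""
    cam = points_3d @ R.T + T          # extrinsics: world -> camera frame
    uvw = cam @ K.T                    # intrinsics: camera frame -> (u, v, w)
    w = uvw[:, 2]
    return uvw[:, :2] / w[:, None], w  # I = [u/w, v/w]

def backproject(pixels, w, K, R, T):
    """Invert Eq. 1 once the depth w of every point is known."""
    ones = np.ones((pixels.shape[0], 1))
    uvw = np.hstack([pixels, ones]) * w[:, None]   # recover (u, v, w)
    cam = uvw @ np.linalg.inv(K).T                 # undo the intrinsics
    return (cam - T) @ R                           # undo the extrinsics: R^T (cam - T)
```

Without $w$ the call to `backproject` is under-determined, which is why the depth of each joint has to be estimated before the projection can be inverted.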

Normalizing flows loss. Normalizing flows (NFs) are a class of generative neural networks capable of mapping a complex distribution into a simpler one using invertible functions [22]. They are trained to learn the probability density function (PDF) of a given dataset relying on an invertible function $f$. See the Supplementary Materials for further details about the modeling of this function. When presented with a novel sample, NFs can estimate the likelihood (plausibility) that the given sample belongs to the learned dataset distribution.

In [49], the learnable function $f$ is based on RealNVP [7], which is not suitable for high dimensional data like 2D poses, necessitating a PCA reduction of the input vector $x$ for an optimal convergence during training. In contrast, our NFs are based on the Glow framework [20], relying on 1x1 convolutions. Small and fast convolutions allow the use of the full 2D joints' coordinates without the need for a feature reduction, as well as reducing the computational costs.

During the training of our architecture, the NFs are used to verify whether multiple projections of 3D poses are all plausible 2D poses without relying on multi-view data. To achieve this, we define the normalizing flow loss $\mathcal{L}_{NF}$ as the negative log-likelihood of the PDF:

$$\mathcal{L}_{NF}(x)=-\log(p_x(\mathbf{x})) \tag{2}$$
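To make Eq. 2 concrete, the following sketch evaluates the negative log-likelihood of a 2D pose under a single invertible 1x1 convolution, using the change-of-variables rule $\log p_x(x)=\log p_z(f(x))+\log|\det \partial f/\partial x|$. It is only an illustration: a Glow-style model stacks many such layers together with other invertible blocks, and here we assume a standard-normal base density, a pose stored as a $(J, 2)$ array whose two coordinates act as channels, and hypothetical function names `conv1x1_flow` and `nf_loss`.

```python
import numpy as np

def conv1x1_flow(x, W):
    """Apply an invertible 1x1 convolution to a (J, 2) pose: one shared 2x2 map per joint."""
    z = x @ W.T
    # The same W acts on each of the J joints, so the log-determinants add up.
    log_det = x.shape[0] * np.log(np.abs(np.linalg.det(W)))
    return z, log_det

def nf_loss(x, W):
    """Negative log-likelihood of Eq. 2 under a standard-normal base density (assumption)."""
    z, log_det = conv1x1_flow(x, W)
    log_pz = -0.5 * np.sum(z ** 2) - 0.5 * z.size * np.log(2.0 * np.pi)
    return -(log_pz + log_det)   # -log p_x(x)
```

A low value of `nf_loss` marks a pose that the flow considers likely, which is how the loss can score re-projected 2D poses for plausibility.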
Figure 2: In (I), we define two vectors, denoted as $A$ and $B$, connecting the spine and the hip joints. The cross product of these vectors yields the normal vector $N$, which aligns with the walking direction. In (II) and (III), we show the outcome of the dot product between $N$ and the proximal $p_l$ and distal $d_l$ components, resulting in their projections $P_l$ and $D_l$. In (II), $\mathcal{L}_{limbs}$ gives an output of 0, indicating an anthropomorphically compliant prediction. In (III), $\mathcal{L}_{limbs}$ returns a positive value, signaling the need for further correction.

Anthropomorphic constraints. In supervised 3D HPE, the neural network has explicit access to the 3D ground truth to learn the appearance of a 3D human pose. In unsupervised or semi-supervised settings, we introduce regularising losses to ensure that the estimated 3D poses respect anthropomorphic constraints, such as proportional bone lengths and articulation angle limits.

As in [49], we use a bone ratio loss $\mathcal{L}_{bone}(y)$ to ensure that the ratios between the bone lengths of a 3D pose $y\in\mathbb{R}^{3J}$ are respected. This loss leverages the observed nearly constant ratio between bones across different individuals [40], without fixing the bone length to a pre-defined value.

Additionally, we define a novel $\mathcal{L}_{limbs}$ loss which ensures that joints like knees and elbows do not bend in unrealistic manners (e.g. facing backward with respect to the normal walking direction). It is defined as:

$$\mathcal{L}_{limbs}(y)=\frac{1}{L}\sum_{l}^{L}\max(0,P_l-D_l) \tag{3}$$

where $L$ represents the number of limbs, $P_l=N\cdot p_l$ and $D_l=N\cdot d_l$ denote the normal components of the proximal ($p_l$) and distal ($d_l$) components of each limb, and $N=A\times B$ represents the normal vector for the plane defined by the hips and spine joints. In Fig. 2 we show a visual representation of $\mathcal{L}_{limbs}$ to better convey the intuitive reasoning behind its mathematical formulation.
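Eq. 3 can be sketched for a single pose as follows. This is a hedged illustration: the function name `limbs_loss` is hypothetical, the way $A$, $B$, $p_l$ and $d_l$ are built from specific joints is left to the caller, and normalizing $N$ to unit length is an assumption made here so that the loss does not depend on the skeleton scale.

```python
import numpy as np

def limbs_loss(A, B, proximal, distal):
    """Eq. 3 for one pose.

    A, B: (3,) vectors connecting the spine and hip joints (Fig. 2, panel I).
    proximal, distal: (L, 3) vectors for the two segments of each limb.
    """
    N = np.cross(A, B)                 # normal of the hips/spine plane
    N = N / np.linalg.norm(N)          # unit length (assumption, not stated in the text)
    P = proximal @ N                   # P_l = N . p_l
    D = distal @ N                     # D_l = N . d_l
    return np.mean(np.maximum(0.0, P - D))
```

The hinge $\max(0, P_l-D_l)$ means that limbs with $P_l \le D_l$ contribute nothing, so only the limbs that violate the constraint produce a gradient.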

3.2 Pose Lifter Network (LiftNet)

LiftNet is our lifter module which introduces the paradigm shown in Fig. 1(d). The overall detailed architecture is shown in Fig. 3.

Figure 3: LiftNet architecture. The red ($2D\rightarrow 3D$), orange ($\circlearrowleft$ and $\circlearrowright$) and yellow ($3D\rightarrow 2D$) blocks describe the Lift, Rotate, Project operations respectively. The symbol $x$ denotes a 2D pose, $y$ denotes a 3D pose. The decorator $\hat{\ }$ symbolizes a prediction in the forward pass while $\widetilde{\ }$ marks a prediction in the backward pass. The subscript $r$ stands for rotated. The solid arrows describe the flow of the network, while the dashed arrows connect each intermediate datum to its loss.

LiftNet aims at retrieving the 3D pose $y\in\mathbb{R}^{3J}$, starting from a 2D pose $\hat{x}\in\mathbb{R}^{2J}$ and its estimated camera parameters $[K]$ and $[R|T]$. As shown in Fig. 3, our architecture consists of a cycle consistency structure which can be split into two symmetric branches: a forward branch ($\hat{x}\rightarrow Lift\rightarrow\hat{y}\rightarrow Rotate\rightarrow\hat{y}_r\rightarrow Project\rightarrow\hat{x}_r$) and a backward branch ($\hat{x}_r\rightarrow Lift\rightarrow\widetilde{y}_r\rightarrow InverseRotate\rightarrow\widetilde{y}\rightarrow Project\rightarrow\widetilde{x}$). Each step and its input/output are described in Alg. 1. The lift operation is performed by a lifter network while the projection is a mathematical operation. Both of these operations rely on the full perspective model using camera parameters $[K][R|t]$. All the losses provide self-supervision to the cycle consistency, which does not access either 2D or 3D ground truths, making it a fully unsupervised process.

Algorithm 1 Cycle consistency. Both $2D\rightarrow 3D$ and $3D\rightarrow 2D$ rely on the camera parameters $[K][R|t]$ to be solved.
Input: $\hat{x}$, $[K][R|t]$
Forward branch
1. Lift: $2D\rightarrow 3D$: $\hat{x}\rightarrow\hat{y}$
2. Rotate: $\hat{y}\rightarrow\hat{y}_r$
3. Project: $3D\rightarrow 2D$: $\hat{y}_r\rightarrow\hat{x}_r$
Backward branch
4. Lift: $2D\rightarrow 3D$: $\hat{x}_r\rightarrow\tilde{y}_r$
5. InverseRotate: $\tilde{y}_r\rightarrow\tilde{y}$
6. Project: $3D\rightarrow 2D$: $\tilde{y}\rightarrow\tilde{x}$

Differently from previous approaches using the weak camera model [49], our lifter leverages the full perspective camera model to recover the third dimension $w$ for each input 2D pose $\hat{x}$. Using the estimated $w$ we can invert the projection of Eq. 1 and recover the 3D joint positions. This Lift operation is symbolized as $2D\rightarrow 3D$.

Given a 3D pose $\hat{y}$ we can perform the Project operation, symbolized as $3D\rightarrow 2D$. That means computing the 2D pose $\hat{x}$ using the full camera projection in Eq. 1.
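The six steps of Alg. 1 can be sketched as follows. This is a simplified illustration rather than the EPOCH implementation: poses are expressed in the camera frame (i.e. $R=I$, $T=0$), the rotation is taken about the vertical axis through the pose centroid, and `predict_depth` is a hypothetical placeholder for the lifter network that returns one depth $w$ per joint.

```python
import numpy as np

def lift(x, depth, K):
    """2D -> 3D: back-project (J, 2) pixels with per-joint depth w (camera frame)."""
    homog = np.hstack([x, np.ones((x.shape[0], 1))]) * depth[:, None]
    return homog @ np.linalg.inv(K).T

def project(y, K):
    """3D -> 2D: perspective projection of (J, 3) camera-frame points."""
    uvw = y @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def rotate_y(y, angle):
    """Rotate a pose about the vertical axis through its centroid (assumption)."""
    c, s = np.cos(angle), np.sin(angle)
    Ry = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    centre = y.mean(axis=0)
    return (y - centre) @ Ry.T + centre

def cycle(x_hat, K, predict_depth, angle):
    """Run the six steps of Alg. 1 and return every intermediate pose."""
    y_hat = lift(x_hat, predict_depth(x_hat), K)        # 1. Lift
    y_hat_r = rotate_y(y_hat, angle)                    # 2. Rotate
    x_hat_r = project(y_hat_r, K)                       # 3. Project
    y_til_r = lift(x_hat_r, predict_depth(x_hat_r), K)  # 4. Lift
    y_til = rotate_y(y_til_r, -angle)                   # 5. InverseRotate
    x_til = project(y_til, K)                           # 6. Project
    return y_hat, y_hat_r, x_hat_r, y_til_r, y_til, x_til
```

With a zero rotation angle the cycle closes exactly for any depth predictor, so the self-supervision signal comes from the non-trivial rotations, where an inaccurate depth makes $\tilde{x}$ drift away from $\hat{x}$.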

Inspired by previous approaches [35, 49], our lifter network consists of a simple MLP structure. The MLP receives as input a 2D pose $x$ concatenated with the flattened version of the extrinsic parameters $[R|t]$ (12 total values), the intrinsic parameters $f=(f_w,f_h)$, $c=(c_w,c_h)$ and the scaling factor $s=(s_w,s_h)$, resulting in a vector of size $2J+18$. The input vector is fed (I) to a linear layer to obtain an embedded vector of size $dim_l$, (II) to 3 residual blocks each containing 2 fully connected layers, (III) to a linear layer to obtain the output vector of size $J$ representing the depth parameter $w$ for each joint. The vector is concatenated with the input $\hat{x}$, resulting in the estimated 3D pose $\hat{y}$.
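The layer layout described above can be sketched as a plain NumPy forward pass. The sketch only mirrors the stated structure (linear embedding, 3 residual blocks of 2 fully connected layers, linear output of size $J$); the ReLU activations, the random initialization, the absence of normalization or dropout, and the names `init_lifter` and `lifter_forward` are assumptions made for illustration.

```python
import numpy as np

def init_lifter(J, dim_l, rng):
    """Random weights for: linear -> 3 residual blocks (2 FC layers each) -> linear."""
    def layer(n_in, n_out):
        return rng.normal(0.0, np.sqrt(2.0 / n_in), (n_in, n_out)), np.zeros(n_out)
    return {
        "embed": layer(2 * J + 18, dim_l),
        "blocks": [(layer(dim_l, dim_l), layer(dim_l, dim_l)) for _ in range(3)],
        "out": layer(dim_l, J),
    }

def lifter_forward(params, x, Rt, f, c, s):
    """Predict per-joint depth w from a (J, 2) pose and the camera parameters."""
    relu = lambda a: np.maximum(a, 0.0)                       # activation is an assumption
    inp = np.concatenate([x.ravel(), Rt.ravel(), f, c, s])    # size 2J + 18
    W, b = params["embed"]
    h = inp @ W + b                                           # (I) embedding of size dim_l
    for (W1, b1), (W2, b2) in params["blocks"]:               # (II) residual blocks
        h = h + relu(relu(h @ W1 + b1) @ W2 + b2)
    W, b = params["out"]
    w = h @ W + b                                             # (III) one depth per joint
    return np.hstack([x, w[:, None]])                         # concatenate w with the input pose
```

Here `Rt` is the 3x4 extrinsic matrix and `f`, `c`, `s` are length-2 arrays, which together with the $2J$ pose coordinates give the $2J+18$ inputs mentioned in the text.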

To train the LiftNet, we minimize the following loss:

$$\mathcal{L}_{lift}=\mathcal{L}_{2D}(\hat{x},\tilde{x})+\mathcal{L}_{3D}(\hat{y}_r,\tilde{y}_r)+\mathcal{L}_{NF}(\tilde{x}_r)+\mathcal{L}_{bone}(\hat{y})+\mathcal{L}_{limbs}(\hat{y})+\mathcal{L}_{def} \tag{4}$$

where:

  • $\mathcal{L}_{2D}(\hat{x},\tilde{x})=\|\hat{x}-\tilde{x}\|_1$ is the norm-1 distance between the initial prediction $\hat{x}$ and its version after the cycle consistency loop $\tilde{x}$;

  • $\mathcal{L}_{3D}(\hat{y}_r,\tilde{y}_r)=\|\hat{y}_r-\tilde{y}_r\|_2$ is the norm-2 distance between the rotated 3D pose $\hat{y}_r$ and its version after it has been projected and lifted, $\tilde{y}_r$;

  • $\mathcal{L}_{NF}(\tilde{x}_r)$, $\mathcal{L}_{bone}(\hat{y})$ and $\mathcal{L}_{limbs}(\hat{y})$ are the losses defined in Sec. 3.1;

  • $\mathcal{L}_{def}$ is a deformation loss computed between two poses $y_a$ and $y_b$ belonging to the same batch, defined as

    $$\mathcal{L}_{def}=\|(\hat{y}_a-\hat{y}_b)-(\tilde{y}_a-\tilde{y}_b)\|_2 \tag{5}$$

    which ensures that 2 poses from the same batch have not been deformed in completely different manners by the same Project and Lift operations, providing a supervision similar to the temporal consistency defined in [54], but without relying on temporally-related data.
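As a rough illustration of the cycle-consistency terms listed above, the following sketch computes $\mathcal{L}_{2D}$, $\mathcal{L}_{3D}$ and $\mathcal{L}_{def}$ for poses stored as flat lists of floats. The function names and the flat-list layout are assumptions made for this example; they are not taken from the authors' code.

```python
import math

def sub(a, b):
    """Element-wise difference of two equally sized flat vectors."""
    return [p - q for p, q in zip(a, b)]

def l1_norm(v):
    """Norm-1 of a flat vector."""
    return sum(abs(a) for a in v)

def l2_norm(v):
    """Norm-2 of a flat vector."""
    return math.sqrt(sum(a * a for a in v))

def loss_2d(x_hat, x_tilde):
    # L_2D: norm-1 distance between the 2D pose and its cycled version.
    return l1_norm(sub(x_hat, x_tilde))

def loss_3d(y_hat, y_tilde):
    # L_3D: norm-2 distance between the 3D pose and its
    # projected-and-lifted version.
    return l2_norm(sub(y_hat, y_tilde))

def loss_def(ya_hat, yb_hat, ya_tilde, yb_tilde):
    # L_def (Eq. 5): the offset between two poses of the same batch
    # should be preserved by the Project and Lift operations.
    return l2_norm(sub(sub(ya_hat, yb_hat), sub(ya_tilde, yb_tilde)))

# Toy example with a single 3D joint per pose (hypothetical values).
ya_hat, yb_hat = [0.0, 0.0, 1.0], [1.0, 0.0, 1.0]
# Both poses are shifted by the same amount, so the deformation loss vanishes
# even though each individual pose has moved.
ya_tilde, yb_tilde = [0.5, 0.0, 1.0], [1.5, 0.0, 1.0]
print(loss_def(ya_hat, yb_hat, ya_tilde, yb_tilde))  # 0.0
print(loss_3d(ya_hat, ya_tilde))                     # 0.5
```

The toy values show why $\mathcal{L}_{def}$ complements $\mathcal{L}_{3D}$: a shift shared by both poses is penalized by $\mathcal{L}_{3D}$ but not by $\mathcal{L}_{def}$, which only reacts when the two poses are altered differently.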

Figure 4: RegNet architecture. The $W\times H$ input image is fed to (a) a contrastive-pretrained encoder and a separate module $\Psi$ that estimates the intrinsic parameters. The output features are then concatenated and (b) fed into our attention-based capsule decoder. The outputs are three separate capsule vectors, representing an estimation of the 3D pose $\hat{y}$, of the camera $[K][R|t]$ and a joint presence vector $\Sigma$. (c) Each of the outputs needs to be further processed before the loss computation. A copy of $\hat{y}$ is randomly rotated around the vertical axis, obtaining $\hat{y}_{r}$. $\hat{y}$ and $\hat{y}_{r}$ are projected into the camera plane and $\Sigma$ goes through a sigmoid activation function. (d) $\hat{y}$, $\hat{x}_{r}$, $\hat{x}$ and $\hat{\sigma}$ are fed to the loss functions.

3.3 Pose Regressor Network (RegNet)

RegNet is our direct regression module (Fig. 1(c)), which is used to estimate the camera pose and the initial 2D pose used for the lifting stage. The detailed architecture is shown in Fig. 4. The input to RegNet is a single square image $I$ of size $W\times H$ pixels, roughly centered on the pelvis, similar to [49, 2, 54]. The objective is to retrieve the 2D pose $x\in\mathbb{R}^{2J}$. Additionally, RegNet estimates the intrinsic camera parameters $K$, consisting of the focal length $f=(f_{w},f_{h})$, the principal point $c=(c_{w},c_{h})$, and a scaling factor $s=(s_{w},s_{h})$, as well as the extrinsic camera parameters $R$ and $t$. The intrinsic camera parameters are estimated directly from the image size as shown in [31].

RegNet consists of an encoder-decoder architecture, where the encoder is pre-trained using contrastive learning and the decoder is based on capsules. The output of the decoder yields both 3D poses and camera parameters, which are then used to compute the 2D pose $x\in\mathbb{R}^{2J}$, with each joint represented as $[u/w,v/w]$ on the image plane $I$. It is worth noting that we employ 2D poses for the loss computation, thus introducing a form of weak supervision to the 3D pose estimation process.
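To make the $[u/w,v/w]$ representation concrete, the sketch below projects 3D joints with a full perspective camera $[K][R|t]$ and applies the perspective division. It assumes the standard pinhole convention (camera-space point $Ry+t$, homogeneous image point $K(Ry+t)$); the function name `project` and the numeric values are illustrative only.

```python
def project(y, K, R, t):
    """Project 3D joints y (list of [X, Y, Z]) with the camera K [R|t].

    Returns a list of [u/w, v/w] image-plane coordinates.
    """
    x = []
    for joint in y:
        # Camera-space point: R * joint + t
        cam = [sum(R[i][k] * joint[k] for k in range(3)) + t[i]
               for i in range(3)]
        # Homogeneous image point: K * cam = [u, v, w]
        u, v, w = (sum(K[i][k] * cam[k] for k in range(3)) for i in range(3))
        # Perspective division yields the 2D joint location.
        x.append([u / w, v / w])
    return x

# Hypothetical camera: focal length 1000 px, principal point (112, 112),
# no rotation, subject placed 5 units in front of the camera.
K = [[1000.0, 0.0, 112.0], [0.0, 1000.0, 112.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 5.0]
y = [[0.0, 0.0, 0.0], [0.5, -0.25, 0.0]]
print(project(y, K, R, t))  # [[112.0, 112.0], [212.0, 62.0]]
```

Because the division by $w$ depends on each joint's depth, the same 3D offset produces a smaller 2D offset when the subject is farther away, which is the effect that weak-perspective models approximate with a single scale.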

Contrastive Encoder. Our encoder employs a ResNet50 pre-trained using contrastive learning on ImageNet as in [1]. Despite its general image focus, the use of contrastive learning allows for faster convergence and better generalization, outperforming supervised pre-training methods [3].

The output vector of the encoder is concatenated with the intrinsic parameters $[K]$ that are estimated directly from the full image. The vector resulting from the concatenation, of size $dim$, is given as input to the decoder.

Capsule-based decoder. The work in [25] first showed how a Soft Attention mechanism can be effectively used to split a feature vector into different capsule features, and [43] performs an equivalent operation with a fully connected layer and a Softmax layer. Inspired by [43], we design our capsule-based decoder using a Conv2D fully connected layer to transform the latent space vector of size $dim$ into a vector of size $9\times J$.

Given $J$ values used for the attention mechanism, the remaining $8\times J$ values are divided into the following capsules:

  • $\hat{y}\in\mathbb{R}^{3J}$, representing the estimated 3D pose;

  • $\Gamma\in\mathbb{R}^{3J}$, from which we can compute $[R|t]$ representing the extrinsic parameters, namely the estimated rotation matrix $R$ and translation vector $t$ with respect to the input image’s viewpoint. The mathematical calculations to derive $[R|t]$ from $\Gamma$ are reported in the Supplementary Materials. To ensure invertibility, an orthogonality constraint is enforced on the rotation matrix $[R]$. The extrinsic parameters are combined with the intrinsic camera matrix $[K]$ computed by $\Psi$ at the encoding stage to obtain the full perspective camera model descriptor $[K][R|t]$;

  • $\Sigma\in\mathbb{R}^{2J}$, a vector indicating the presence of each joint. Low values indicate uncertain detection of joints, often due to occlusions or joints being outside the image space.
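The bookkeeping of the $9\times J$ decoder output can be sketched as plain slicing. The ordering of the parts inside the vector is an assumption made for this example (the text only gives their sizes), and the way the $J$ attention values are applied is not modeled here.

```python
import math

def split_capsules(vec, J):
    """Split a flat decoder output of length 9*J into its parts.

    Assumed layout (hypothetical): [attention | y_hat | gamma | sigma].
    """
    if len(vec) != 9 * J:
        raise ValueError("expected a vector of length 9*J")
    attention = vec[:J]               # J attention values
    y_hat = vec[J:4 * J]              # 3J values: estimated 3D pose
    gamma = vec[4 * J:7 * J]          # 3J values: source of [R|t]
    sigma_caps = vec[7 * J:9 * J]     # 2J values: joint presence capsule
    # The presence capsule goes through a sigmoid to obtain sigma_hat.
    sigma_hat = [1.0 / (1.0 + math.exp(-s)) for s in sigma_caps]
    return attention, y_hat, gamma, sigma_hat

J = 17  # number of joints used in the experiments
attention, y_hat, gamma, sigma_hat = split_capsules([0.0] * (9 * J), J)
print(len(attention), len(y_hat), len(gamma), len(sigma_hat))  # 17 51 51 34
```

The sizes add up as stated in the text: $J + 3J + 3J + 2J = 9J$.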

Outputs. To ensure that the estimated 3D pose is plausible from multiple viewpoints, we require both the 2D pose $\hat{x}$ of the original image $I$, as well as the 2D pose $\hat{x}_{r}$ corresponding to the same pose from a different viewpoint.

For $\hat{x}$, we perform the Project operation as used in the lifting architecture to obtain the 2D pose $\hat{x}$ using the full camera projection in Eq. 1 given the 3D pose $\hat{y}$.

Similar to the lifting stage, we observed that we can increase the accuracy of the estimated 3D poses by using a Normalizing Flow loss to ensure the plausibility of multiple 2D projections of the 3D pose seen from different viewpoints. To this end we also calculate a rotated projection $\hat{x}_{r}$, where we rotate the model around the vertical world axis. The Rotate operation is applied randomly to each 3D pose in the range of $[10^{\circ},350^{\circ}]$ to simulate a viewpoint change, before projecting the rotated 3D pose $\hat{y}_{r}$ to obtain $\hat{x}_{r}$.
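The Rotate operation can be sketched as a rotation about the vertical axis by an angle drawn uniformly from $[10^{\circ},350^{\circ}]$. The sketch assumes that the vertical world axis is $Y$ and that poses are lists of $[X,Y,Z]$ joints; both are assumptions of this example, as are the function names.

```python
import math
import random

def rotate_vertical(y, angle_deg):
    """Rotate every joint [X, Y, Z] about the Y axis (assumed vertical)."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [[c * X + s * Z, Y, -s * X + c * Z] for X, Y, Z in y]

def random_viewpoint(y, rng=random):
    """Simulate a viewpoint change with a random angle in [10, 350] degrees."""
    angle = rng.uniform(10.0, 350.0)
    return rotate_vertical(y, angle), angle

# Hypothetical two-joint pose.
y_hat = [[0.0, 0.0, 0.0], [0.3, 1.0, 0.2]]
y_hat_r, angle = random_viewpoint(y_hat)
# The distance between joints is unchanged by the rotation.
print(math.dist(y_hat[0], y_hat[1]), math.dist(y_hat_r[0], y_hat_r[1]))
```

Projecting the rotated pose $\hat{y}_{r}$ with the estimated camera then gives $\hat{x}_{r}$. Excluding angles close to $0^{\circ}$ and $360^{\circ}$ keeps the rotated view from coinciding with the original one.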

Regression of the joints’ position can be combined with an uncertainty measure, which leads to better results than both direct regression of joints and heatmap-based methods [29]. Moreover, regression is more efficient than heatmaps in terms of computation and memory. To employ a similar method, we estimate the deviation $\hat{\sigma}$ of the predicted joint’s position from the ground truth by applying a sigmoid function to the estimated presence capsule $\Sigma$.

Losses. To train RegNet, we need to minimize the following loss:

$$\mathcal{L}_{reg}=\mathcal{L}_{bone}(\hat{y})+\mathcal{L}_{limbs}(\hat{y})+\mathcal{L}_{NF}(\hat{x}_{r})+\mathcal{L}_{RLE}(\hat{x},\hat{\sigma}) \tag{6}$$

where:

  • $\mathcal{L}_{NF}(\hat{x}_{r})$, $\mathcal{L}_{bone}(\hat{y})$ and $\mathcal{L}_{limbs}(\hat{y})$ are the losses defined in Sec. 3.1;

  • $\mathcal{L}_{RLE}(x,I)$ is the residual log-likelihood estimation loss, defined as

    $$\mathcal{L}_{RLE}(x,I)=-\log P_{\Theta,\phi}(x,I)|_{x=\hat{x}}=-\log P_{\phi}(\hat{x})+\log(\hat{\sigma}) \tag{7}$$

    As in [29], this loss aims at estimating the joints’ position $\hat{x}$ via direct regression coupled with the learned error distribution $\hat{\sigma}$. We refer to the original work [29] for the full mathematical explanation of the loss.
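To give some intuition for the $\log(\hat{\sigma})$ term in Eq. 7, the sketch below implements a plain Laplace negative log-likelihood with a predicted per-coordinate scale. This is a deliberate simplification and not the RLE loss of [29]: the learned flow-based density $P_{\phi}$ is replaced by a fixed Laplace density, and all names are hypothetical.

```python
import math

def laplace_nll(x_hat, sigma_hat, x_gt):
    """Mean negative log-likelihood of x_gt under Laplace(x_hat, sigma_hat).

    Simplified stand-in for Eq. 7: a fixed Laplace density is used where
    the original loss relies on a learned density P_phi.
    """
    total = 0.0
    for mu, s, g in zip(x_hat, sigma_hat, x_gt):
        # -log( 1/(2s) * exp(-|g - mu| / s) ) = log(2s) + |g - mu| / s
        total += math.log(2.0 * s) + abs(g - mu) / s
    return total / len(x_hat)

# A confident prediction (small sigma) is rewarded when it is accurate ...
print(laplace_nll([0.0], [0.5], [0.0]))  # 0.0
# ... and penalized more strongly when it is wrong.
print(laplace_nll([0.0], [0.5], [1.0]))  # 2.0
```

The two terms pull in opposite directions: the logarithm favors small $\hat{\sigma}$, while the scaled residual favors large $\hat{\sigma}$ for joints that are hard to localize, so the network is encouraged to report uncertainty where its error is large.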

For further details on loss balancing during training see the Supplementary Materials.

4 Results

We perform our experiments on the common Human3.6M [17] and MPI-INF-3DHP [36] datasets (all datasets were obtained and used only by the authors affiliated with academic institutions). We follow standard test protocols for both datasets [49]. We also report extensive ablation studies and qualitative results on the unseen 3DPW dataset [48] to demonstrate generalization.

Implementation details. The input images are of size $H=224$ and $W=224$. The skeleton model has $J=17$ joints. RegNet is trained for 45 epochs using the AdamW optimizer [34], with $dim=2048+6$, learning rate 1e-3, and weight decay 1e-4. LiftNet is trained for 100 epochs using the AdamW optimizer, with learning rate 2e-4 and weight decay 1e-5. Both RegNet and LiftNet are trained with batch size 256 and bfloat16 precision on a single NVIDIA RTX 3090. Inference runs on the same GPU at $\approx 45$ fps.

Metrics. We adopt the standard mean per joint position error (MPJPE) in two forms that are common in unsupervised 3D HPE settings: PA-MPJPE, where the reconstructed 3D pose is Procrustes-aligned, and N-MPJPE, where the 3D pose is normalized to the same scale as the ground truth [41]. As in [49], for the MPI-INF-3DHP dataset we report the scale-normalized percentage of correct keypoints (N-PCK), i.e. joints predicted within 150 mm of the original position, and its corresponding area under the curve (AUC).
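The scale-normalized metrics can be sketched as follows. The example assumes root-centered poses given as lists of $[X,Y,Z]$ joints in millimetres and uses the least-squares optimal scale as the normalization, which is one common choice; the exact protocol follows [41, 49]. PA-MPJPE additionally requires a Procrustes (rotation, translation and scale) alignment and is omitted here.

```python
import math

def mpjpe(pred, gt):
    """Mean per joint position error between two poses."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

def scale_normalize(pred, gt):
    """Rescale pred by the scalar s minimizing ||s * pred - gt||^2."""
    num = sum(a * b for p, g in zip(pred, gt) for a, b in zip(p, g))
    den = sum(a * a for p in pred for a in p)
    s = num / den
    return [[s * a for a in p] for p in pred]

def n_mpjpe(pred, gt):
    """MPJPE after scale normalization."""
    return mpjpe(scale_normalize(pred, gt), gt)

def n_pck(pred, gt, threshold=150.0):
    """Percentage of joints within `threshold` mm after scale normalization."""
    pred_n = scale_normalize(pred, gt)
    hits = sum(math.dist(p, g) <= threshold for p, g in zip(pred_n, gt))
    return 100.0 * hits / len(gt)

# Hypothetical pose: the prediction is the ground truth at twice the scale.
gt = [[0.0, 0.0, 0.0], [0.0, 100.0, 0.0], [0.0, 200.0, 0.0]]
pred = [[0.0, 0.0, 0.0], [0.0, 200.0, 0.0], [0.0, 400.0, 0.0]]
print(mpjpe(pred, gt))    # 100.0
print(n_mpjpe(pred, gt))  # 0.0
print(n_pck(pred, gt))    # 100.0
```

The example shows why scale normalization matters for unsupervised methods: a pose that is correct up to a global scale has a large MPJPE but a zero N-MPJPE.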

4.1 Quantitative results

| Supervision | Method | PA-MPJPE↓ | N-MPJPE↓ |
|---|---|---|---|
| Full-3D | Pavlakos [38] | 41.8 | - |
| 10%-3D | Kundu [28] | 49.6 | 59.4 |
| | Kundu [26] | 48.2 | 57.6 |
| | Gong [11] | 39.1 | 50.2 |
| Multi-view | Rhodin [41] | 98.2 | 122.6 |
| | Kundu [27] | 85.8 | - |
| | Usman [47] | 44.0 | 55.0 |
| Full-2D | Fish [9] | 97.2 | - |
| | Kundu [27] | 62.4 | - |
| | Ours (RegNet) | 45.4 | 69.9 |

Table 1: Quantitative results for direct regression from images on Human3.6M.

| Supervision | Method | PA-MPJPE↓ | N-MPJPE↓ |
|---|---|---|---|
| Full | Martinez [35] | 37.1 | 45.5 |
| Weak | Fish [9] | 79.0 | - |
| | Wandt [50] | 38.2 | 50.9 |
| | Drover [8] | 38.2 | - |
| | Kundu [28] | 62.4 | - |
| Multi-view | Kocabas [24] | 47.9 | 54.9 |
| | Wandt [51] | 51.4 | 65.9 |
| Unsupervised | Chen [2] | 58.0 | - |
| | Yu [54] | 42.0 | 85.3 |
| | Wandt [49] | 36.7 | 64.0 |
| | Ours (LiftNet) | 30.7 | 50.8 |

Table 2: Quantitative results for lifting from 2D ground truth on Human3.6M.

| Method | Backbone | PA-MPJPE↓ | N-MPJPE↓ |
|---|---|---|---|
| Chen [2] | SH [37] | 68.0 | - |
| Wandt [50] | SH [37] | 65.1 | 89.9 |
| Kundu [27] | ResNet-50 | 63.8 | - |
| Kundu [27] | ResNet-50 | 62.4 | - |
| Yu [54] | CPN [4] | 52.3 | 92.4 |
| Wandt [49] | CPN [4] | 50.2 | 74.4 |
| Ours (LiftNet) | Ours (RegNet) | 43.8 | 67.1 |

Table 3: Quantitative results for lifting from 2D predictions on Human3.6M.

| Supervision | Method | 2D Input | PA-MPJPE↓ | N-PCK↑ | AUC↑ |
|---|---|---|---|---|---|
| Weak | Kundu [27] | GT | 93.9 | 84.6 | 60.8 |
| Unsup. | Yu [54] | GT | - | 86.2 | 51.7 |
| | Wandt [49] | GT | 54.0 | 86.0 | 50.1 |
| | Ours (LiftNet) | Pred | 46.8 | 93.6 | 61.3 |
| | Ours (LiftNet) | GT | 33.6 | 97.6 | 69.5 |
| Unseen data | Ours (LiftNet) | Pred | 57.3 | 77.5 | 46.4 |

Table 4: Quantitative results for 3D HPE on MPI-INF-3DHP.

Direct regression from images. In Tab. 1, we report the results of weakly-supervised direct regression of 3D pose from images on the Human3.6M dataset. Weakly-supervised approaches include those using only a small portion of 3D annotated data (10%-3D) or multi-view supervision. Both rely on 3D spatial information for their supervision, leading to results close to the fully supervised baseline [38]. Among approaches using only 2D information, our RegNet outperforms the others, obtaining results in line with the 3D-supervised baselines. In contrast to the previous approaches, RegNet also performs the unsupervised estimation of the full perspective camera parameters $[K][R|t]$.

Lifting from 2D ground truth (GT). In Tab. 2, we report the results of lifting GT 2D poses to 3D on the Human3.6M dataset. We compare against the fully supervised baseline [35], multi-view supervised approaches [24, 51], and weakly supervised approaches adopting different strategies such as domain adaptation [28], 2D GT poses [8, 9] and partial 3D GT [50]. Our unsupervised LiftNet outperforms all previous weakly supervised approaches and obtains the best results among unsupervised methods. We also outperform [49], which adopts a weak perspective camera model combined with elevation estimation, demonstrating the efficacy of our LiftNet, which explicitly leverages the full perspective camera model.

Lifting from 2D predictions. In Tab. 3, we report the results of lifting 2D pose predictions to 3D on the Human3.6M dataset. In contrast to other unsupervised approaches that use either ResNet50 [27], Stacked Hourglass (SH) [2, 50] or Cascaded Pyramid Network (CPN) [4, 49] as the backbone to extract the 2D predictions to be lifted, we use RegNet, which also provides an unsupervised estimation of the camera parameters. The full EPOCH approach combining both RegNet and LiftNet outperforms all other methods, demonstrating its efficacy as an end-to-end approach to estimate 3D poses from images.

Generalising to unseen data. In Tab. 4, we report the results of 3D HPE on the MPI-INF-3DHP dataset. When trained on MPI-INF-3DHP, LiftNet outperforms both weakly-supervised [27] and unsupervised [49, 54] lifting methods, starting either from the ground truth or from poses predicted by RegNet. When not trained on MPI-INF-3DHP (last row), EPOCH trained only on Human3.6M achieves results comparable to fine-tuned approaches, indicating its ability to generalize to unseen data.

Ablation studies. In Tab. 5, we report the results of the ablation study for RegNet. First, we show how using a backbone trained with supervised learning (S) leads to poorer features (first row) and to a performance degradation compared to the full model trained with contrastive learning (C) (last row). Next, we ablate the different losses, showing that removing $\mathcal{L}_{NF}$ causes the largest drop in performance, since this loss ensures that the network does not estimate 3D poses that are plausible only from a single viewpoint. $\mathcal{L}_{bone}$ and $\mathcal{L}_{limbs}$ have a similar effect on the results, as they both enforce anthropomorphic constraints.

In Tab. 6, we report the results of the ablation study for LiftNet. We first ablate the camera model, showing how using a weak perspective camera leads to a performance degradation, because it models the 2D/3D relation less precisely than the full perspective camera. We then study the effect of each loss on the performance. As for RegNet, removing $\mathcal{L}_{NF}$ causes the largest drop in performance. The deformation loss $\mathcal{L}_{def}$ and the two anthropomorphic constraints $\mathcal{L}_{bone}$ and $\mathcal{L}_{limbs}$ have similar effects on the numerical results, as they all enforce comparable constraints on the deformation and proportions of 3D poses. $\mathcal{L}_{2D}$ and $\mathcal{L}_{3D}$ both provide regularisation between different stages of the cycle consistency, so removing either of them degrades the performance by a comparable amount, since LiftNet is still regularised by the remaining one.

| Backbone | $\mathcal{L}_{NF}$ | $\mathcal{L}_{bone}$ | $\mathcal{L}_{limbs}$ | PA-MPJPE↓ | MPJPE↓ |
|---|---|---|---|---|---|
| S | | | | 60.1 | 82.7 |
| | | | | 239.2 | 342.8 |
| | | | | 49.1 | 71.4 |
| | | | | 48.3 | 70.9 |
| C | | | | 45.4 | 69.9 |

Table 5: Ablation study for RegNet. The last row corresponds to our full model. S = Supervised. C = Contrastive.

| Camera model | $\mathcal{L}_{NF}$ | $\mathcal{L}_{bone}$ | $\mathcal{L}_{def}$ | $\mathcal{L}_{limbs}$ | $\mathcal{L}_{2D}$ | $\mathcal{L}_{3D}$ | PA-MPJPE↓ | MPJPE↓ |
|---|---|---|---|---|---|---|---|---|
| W | | | | | | | 47.7 | 84.9 |
| | | | | | | | 232.6 | 293.0 |
| | | | | | | | 41.5 | 56.4 |
| | | | | | | | 41.3 | 55.8 |
| | | | | | | | 40.9 | 54.9 |
| | | | | | | | 40.1 | 54.3 |
| | | | | | | | 39.5 | 53.4 |
| F | | | | | | | 30.7 | 50.8 |

Table 6: Ablation study for LiftNet. The last row corresponds to our complete model. W = Weak perspective camera. F = Full perspective camera.
Figure 5: EPOCH qualitative results on MPI-INF-3DHP [36] (columns 1, 2, 3, 4), 3DPW [48] (columns 5, 6). Rows: input images, RegNet output, LiftNet output (front and side view). Our method can generalize to unseen in-the-wild data (3DPW) even if only trained on Human3.6M data.

4.2 Qualitative results

Fig. 5 shows our qualitative results on challenging poses for Human3.6M and MPI-INF-3DHP. Even in the presence of occlusions and rare poses (e.g. sitting on a chair, lying on the floor with crossed legs), both RegNet and LiftNet obtain visually plausible 3D poses. Moreover, we display results on 3DPW which is unseen at training time. Even if the scale of the 3D pose is different, we still obtain plausible poses from challenging images, demonstrating the ability of our EPOCH approach to generalize to unseen scenarios.

5 Conclusions

In this paper, we presented EPOCH, a novel framework consisting of LiftNet and RegNet that jointly estimates the 3D pose of cameras and humans. LiftNet performs unsupervised 3D lifting starting from estimations of both 2D poses and camera parameters. To address the unavailability of camera parameters in real-world scenarios, we design RegNet, a novel human pose regressor that can jointly estimate 2D and 3D poses as well as perspective camera parameters using weak 2D pose supervision. We show that an estimated full perspective camera allows us to substantially improve the unsupervised 3D human pose estimation accuracy and consistency over state-of-the-art results. By estimating the camera only from 2D poses without any 3D or camera ground truth, we can generalise to unseen data, making a step forward towards fully unsupervised 3D HPE in-the-wild.

Supplementary Materials

6 Code

The code will be made available in a Github repository upon acceptance.

7 Details about the Normalizing Flows’ invertible function

As in [49], we design a NF to estimate the plausibility of 2D poses. We train the NF to map each pose $x\in\mathbb{R}^{2J}$ onto a sample point $z$ of a Gaussian distribution $p_{z}(z)$. Given a 2D pose $x$ and an invertible function $f$ such that $\mathbf{f}(z)=x$, the PDF of $x$ can be computed as

$$p_{x}(\mathbf{x})=p_{z}(\mathbf{f}^{-1}(x))\left|\det\left(\frac{\partial\mathbf{f}^{-1}}{\partial\mathbf{x}}\right)\right|, \tag{8}$$

where $\frac{\partial\mathbf{f}^{-1}}{\partial\mathbf{x}}$ is the Jacobian matrix of the inverse transformation. The NF is trained offline to learn the PDF $p_{z}(\mathbf{f}^{-1}(x))$ of the 2D poses in a chosen training dataset. Once trained, the probability $p_{x}(\mathbf{x})$ can be computed for a new sample $x$ and used as a measure of its plausibility to lie within the distribution of the learned dataset.
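The change of variables in Eq. 8 can be checked on the simplest possible invertible function, an element-wise affine map $\mathbf{f}(z)=a\odot z+b$. This toy flow is an assumption of the example and is far simpler than the NF used in the paper; it only illustrates how the base density and the Jacobian determinant combine.

```python
import math

def log_standard_normal(z):
    """Log-density of a standard Gaussian evaluated at the vector z."""
    return sum(-0.5 * (v * v + math.log(2.0 * math.pi)) for v in z)

def log_px(x, scale, shift):
    """log p_x(x) for the toy flow f(z) = scale * z + shift (element-wise).

    Implements Eq. 8 in log form:
    log p_x(x) = log p_z(f^{-1}(x)) + log |det(d f^{-1} / d x)|.
    """
    # Inverse transformation f^{-1}(x) = (x - shift) / scale.
    z = [(xi - b) / a for xi, a, b in zip(x, scale, shift)]
    # The Jacobian of f^{-1} is diagonal with entries 1 / scale.
    log_det_inv = -sum(math.log(abs(a)) for a in scale)
    return log_standard_normal(z) + log_det_inv

# Hypothetical 2D "pose" with a single joint.
x = [0.0, 0.0]
print(log_px(x, [1.0, 1.0], [0.0, 0.0]))  # identity flow: Gaussian log-density
print(log_px(x, [2.0, 2.0], [0.0, 0.0]))  # stretched flow: lower density at x
```

Stretching the distribution by a factor of two per axis spreads the probability mass over a larger area, so the density at the same point drops by $2\log 2$ in log space, which is exactly the contribution of the determinant term.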

8 Camera matrix notation

In the paper we refer to the intrinsic $[K]$ and extrinsic $[R|t]$ camera matrices as detailed below.

The intrinsic matrix, representing the internal parameters of a camera, is defined as:

$$[K]=\begin{bmatrix}f_{w}&sk&c_{w}\\ 0&f_{h}&c_{h}\\ 0&0&1\end{bmatrix}$$

where:

  • $f_{w}$ and $f_{h}$ are the focal lengths in the x and y directions.

  • $sk$ represents skew, which is usually zero in most cameras. In the paper we assume $sk=0$ for all cameras.

  • $(c_{w},c_{h})$ is the principal point, the optical center of the image.

The extrinsic matrix, denoting the rotation and translation of the camera with respect to the world coordinate system, can be expressed as:

$$[R|t]=\left[\begin{array}{ccc|c}r_{11}&r_{12}&r_{13}&t_{X}\\ r_{21}&r_{22}&r_{23}&t_{Y}\\ r_{31}&r_{32}&r_{33}&t_{Z}\end{array}\right]$$

where:

  • $[R]$ represents the rotation matrix.

  • $t$ represents the translation vector.

9 Mathematical derivation of the intrinsic parameters from the image

The scaling parameters $s_{w}$ and $s_{h}$ in the paper are computed as follows:

$$s_{w}=\frac{W}{W_{BB}\cdot\mu_{h}}$$
$$s_{h}=\frac{H}{H_{BB}\cdot\mu_{h}}$$

where $\mu_{h}$ is the mean length of the vector from the root joint to the head joint, as in [49]. $W_{BB},H_{BB}$ are the width and height of the bounding box in pixels prior to scaling it to $224\times 224$.

Taking inspiration from [31], we estimate the focal length starting from the uncropped input image width $W_{full}$ and height $H_{full}$ as follows:

$$f=\sqrt{W_{full}^{2}+H_{full}^{2}}$$

and scale it along each axis:

$$f_{w}=f\cdot s_{w}$$
$$f_{h}=f\cdot s_{h}$$

The principal point is computed as:

$$c_{w}=\Big(C_{w}-LEFT-\frac{W}{2}\Big)\cdot s_{w}$$
$$c_{h}=\Big(C_{h}-TOP-\frac{H}{2}\Big)\cdot s_{h}$$

where $C_{w}=\frac{W_{full}}{2}$, $C_{h}=\frac{H_{full}}{2}$. $LEFT$ and $TOP$ are the pixel coordinates of the top-left corner of the unscaled bounding box in the full-size image.
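The formulas of this section can be collected into a single function that builds $[K]$ with $sk=0$. This is a direct transcription of the equations above rather than the authors' code; the function and argument names are chosen for this example, and the numeric values are hypothetical.

```python
import math

def intrinsics(w_full, h_full, left, top, w_bb, h_bb, mu_h, w=224, h=224):
    """Build the intrinsic matrix [K] from image and bounding-box sizes.

    w_full, h_full: size of the uncropped image in pixels.
    left, top:      top-left corner of the unscaled bounding box.
    w_bb, h_bb:     size of the bounding box before rescaling to w x h.
    mu_h:           mean root-to-head length used for normalization.
    """
    # Scaling parameters.
    s_w = w / (w_bb * mu_h)
    s_h = h / (h_bb * mu_h)
    # Focal length from the full image diagonal, scaled along each axis.
    f = math.sqrt(w_full ** 2 + h_full ** 2)
    f_w, f_h = f * s_w, f * s_h
    # Principal point relative to the crop.
    c_w = (w_full / 2 - left - w / 2) * s_w
    c_h = (h_full / 2 - top - h / 2) * s_h
    # Skew is assumed to be zero.
    return [[f_w, 0.0, c_w], [0.0, f_h, c_h], [0.0, 0.0, 1.0]]

# Hypothetical 1000 x 1000 image with a centered 224 x 224 bounding box.
K = intrinsics(1000, 1000, left=388, top=388, w_bb=224, h_bb=224, mu_h=1.0)
print(K[0][0], K[0][2], K[1][2])
```

For a bounding box centered in the full image, the principal point offsets $c_{w}$ and $c_{h}$ are zero, and they grow as the crop moves away from the image center.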

10 Mathematical derivation of the extrinsic parameters from the capsule

When predicting capsules, we estimate $\Gamma \in \mathbb{R}^{3J}$, from which we can compute the rotation matrix $[R]$ and translation vector $t$ with respect to the input image's viewpoint.

Starting from $\Gamma$ we compute the average vector $\bar{\Gamma} \in \mathbb{R}^{3}$. $\bar{\Gamma}$ is expressed as $[\theta_{X}, \theta_{Y}, w_{p}]$, where $\theta_{X}$ and $\theta_{Y}$ are the Euler rotation angles about the $X$ and $Y$ world-space axes and $w_{p}$ is the distance from the pelvis to the camera along the $Z$ axis. We discard the rotation about the $Z$ axis because it would lead to infinitely many configurations of the predicted camera and 3D pose. Thus, we set $\theta_{Z} = 0$.

Starting from $[\theta_{X}, \theta_{Y}, 0]$, we can derive the rotation matrix as:

\[
[R] =
\begin{bmatrix}
1 & 0 & 0 \\
0 & \cos(\theta_{X}) & -\sin(\theta_{X}) \\
0 & \sin(\theta_{X}) & \cos(\theta_{X})
\end{bmatrix}
\begin{bmatrix}
\cos(\theta_{Y}) & 0 & \sin(\theta_{Y}) \\
0 & 1 & 0 \\
-\sin(\theta_{Y}) & 0 & \cos(\theta_{Y})
\end{bmatrix}
\]
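As a sanity check on the factorization above, the product of the two axis rotations can be written directly in NumPy. This is only a sketch: the function name `rotation_from_angles` is hypothetical, and the angles are assumed to be in radians.

```python
import numpy as np

def rotation_from_angles(theta_x, theta_y):
    """Build [R] = R_X(theta_x) @ R_Y(theta_y), with theta_Z fixed to 0.

    Angles are assumed to be in radians.
    """
    cx, sx = np.cos(theta_x), np.sin(theta_x)
    cy, sy = np.cos(theta_y), np.sin(theta_y)

    # Rotation about the X axis.
    R_x = np.array([[1.0, 0.0, 0.0],
                    [0.0,  cx, -sx],
                    [0.0,  sx,  cx]])
    # Rotation about the Y axis.
    R_y = np.array([[ cy, 0.0,  sy],
                    [0.0, 1.0, 0.0],
                    [-sy, 0.0,  cy]])
    return R_x @ R_y

R = rotation_from_angles(0.3, -0.7)
# A valid rotation is orthonormal with determinant +1.
is_rotation = np.allclose(R @ R.T, np.eye(3)) and np.isclose(np.linalg.det(R), 1.0)
```

Because both factors are proper rotations, their product is orthonormal with unit determinant for any pair of angles, which is what `is_rotation` verifies.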

Starting from $w_{p}$, we can derive the translation vector. We define a setting in which the pelvis is centered at $(X, Y, Z) = (0, 0, 0)$ in world coordinates and at $(u, v) = (0, 0)$ on the image plane. Thus we can express the relation between the pelvis in 2D and 3D as:

\[
\begin{bmatrix} 0 \\ 0 \\ w_{p} \end{bmatrix}
=
\begin{bmatrix}
f_{w} & 0 & c_{w} \\
0 & f_{h} & c_{h} \\
0 & 0 & 1
\end{bmatrix}
\left[\begin{array}{ccc|c}
r_{11} & r_{12} & r_{13} & t_{X} \\
r_{21} & r_{22} & r_{23} & t_{Y} \\
r_{31} & r_{32} & r_{33} & t_{Z}
\end{array}\right]
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}
\]

from which we can obtain

\[
\begin{bmatrix} 0 \\ 0 \\ w_{p} \end{bmatrix}
=
\begin{bmatrix}
f_{w} * t_{X} + c_{w} * t_{Z} \\
f_{h} * t_{Y} + c_{h} * t_{Z} \\
t_{Z}
\end{bmatrix}
\]

thus deriving

\[
\begin{bmatrix} t_{X} \\ t_{Y} \\ t_{Z} \end{bmatrix}
=
\begin{bmatrix}
-\frac{c_{w} * w_{p}}{f_{w}} \\
-\frac{c_{h} * w_{p}}{f_{h}} \\
w_{p}
\end{bmatrix}
\]
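The closed form for the translation can be checked numerically. The sketch below uses illustrative values only; the function name `translation_from_depth` is hypothetical. It confirms that projecting the world origin (the pelvis) with the recovered translation gives $(0, 0, w_p)$. The rotation does not enter the check because the pelvis sits at the world origin.

```python
import numpy as np

def translation_from_depth(w_p, f_w, f_h, c_w, c_h):
    """Translation that places the pelvis at (u, v) = (0, 0) and depth w_p."""
    return np.array([-c_w * w_p / f_w,
                     -c_h * w_p / f_h,
                     w_p])

# Illustrative intrinsics and pelvis depth (not taken from the paper).
f_w, f_h, c_w, c_h, w_p = 500.0, 500.0, 50.0, 25.0, 4.0
K = np.array([[f_w, 0.0, c_w],
              [0.0, f_h, c_h],
              [0.0, 0.0, 1.0]])

t = translation_from_depth(w_p, f_w, f_h, c_w, c_h)

# For the world origin, K [R|t] [0, 0, 0, 1]^T reduces to K @ t.
projected = K @ t
```

Here `projected` should match `[0, 0, w_p]` up to floating-point error, mirroring the first equation of the derivation.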

11 Optimization of Loss Function Weights

In RegNet and LiftNet, the components of the loss function differ in scale. Specifically, losses such as $\mathcal{L}_{RLE}$ and $\mathcal{L}_{NF}$ are formulated as negative log-likelihoods and take predominantly large negative values, whereas the other components yield positive values close to zero. To make optimization more stable and efficient, we rescale $\mathcal{L}_{RLE}$ and $\mathcal{L}_{NF}$ so that their value ranges align with those of the other loss components. In more detail, we shift each of these losses by its expected lowest value and divide it by its expected span, so that the scaled loss fits the interval $[0, 1]$.
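The shift-and-divide rescaling described above amounts to a min-max normalization. The following is a minimal sketch under stated assumptions: the function name `scale_loss` and the bounds `expected_min` and `expected_max` are hypothetical, and the text does not say how the expected lowest value and span are obtained or whether out-of-range values are clipped (no clipping is applied here).

```python
def scale_loss(value, expected_min, expected_max):
    """Map a loss value from [expected_min, expected_max] to [0, 1].

    expected_min and expected_max are assumed, user-supplied bounds;
    values outside that range map outside [0, 1] (no clipping).
    """
    span = expected_max - expected_min
    if span <= 0:
        raise ValueError("expected_max must be greater than expected_min")
    return (value - expected_min) / span

# Example with made-up bounds: a negative log-likelihood of -150
# on an assumed range of [-200, 0].
scaled = scale_loss(-150.0, -200.0, 0.0)
```

In this example the loss sits a quarter of the way up the assumed range, so `scaled` is 0.25.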

Additionally, we introduce a non-trainable hyperparameter $\lambda$, set to $10$, to weight certain loss components more heavily in the overall loss function, placing greater emphasis on key aspects of the learning objective.

For RegNet, we apply this weighting strategy to $\mathcal{L}_{RLE}$, recognizing its critical role in the network's performance. In the case of LiftNet, the weights are applied to $\mathcal{L}_{2D}$ and $\mathcal{L}_{bone}$. This ensures the integrity of the pose cycle consistency and maintains the correct proportions of the 3D skeletal model in terms of bone lengths.

In both RegNet and LiftNet, $\mathcal{L}_{limbs}$ is treated as a regularization term. Accordingly, it is scaled down by a factor of $0.1$, reflecting its role in preventing overfitting and ensuring generalization, rather than directly driving the primary learning objective.
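The weighting scheme of this section can be summarized as a weighted sum of loss terms. The sketch below is an assumption-laden illustration rather than the authors' code: the dictionary keys, the function name `total_loss`, and the default weight of 1 for every term not mentioned in this section are all hypothetical, and the loss values in the example are made up.

```python
LAMBDA = 10.0        # non-trainable weight from the text
LIMBS_WEIGHT = 0.1   # regularization weight for L_limbs

# Per-network weights; terms not listed are assumed to have weight 1.
REGNET_WEIGHTS = {"rle": LAMBDA, "limbs": LIMBS_WEIGHT}
LIFTNET_WEIGHTS = {"2d": LAMBDA, "bone": LAMBDA, "limbs": LIMBS_WEIGHT}

def total_loss(losses, weights):
    """Weighted sum of named loss terms (default weight 1.0, an assumption)."""
    return sum(weights.get(name, 1.0) * value for name, value in losses.items())

# Made-up LiftNet loss values, with L_NF already rescaled to [0, 1].
liftnet_losses = {"2d": 0.2, "bone": 0.1, "limbs": 0.5, "nf": 0.3}
liftnet_total = total_loss(liftnet_losses, LIFTNET_WEIGHTS)
```

Keeping the weights in a per-network dictionary makes it explicit which terms are emphasized in RegNet versus LiftNet while sharing the same summation code.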

12 Additional qualitative results

In Fig. 6 we show additional qualitative results for LiftNet. Our method shows promising results even when the original annotation $x$ is not correct (see Fig. 7), since the estimated camera ($[K][R|t]$) and the estimated 2D pose ($\hat{x}$) output by RegNet are robust enough to provide a good pseudo-ground truth.

Figure 6: Intermediate results of our LiftNet architecture on MPI-INF-3DHP [36], as detailed in Fig. 3 in the main paper.
Figure 7: Additional LiftNet intermediate results on MPI-INF-3DHP [36]. In this particular example, the 2D joint predictions $\hat{x}$ coming from RegNet correct an annotation error (right and left legs are flipped) in the original annotation $x$, leading to a possibly better 3D annotation as well.

References

  • [1] Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., Joulin, A.: Unsupervised learning of visual features by contrasting cluster assignments. Advances in neural information processing systems 33, 9912–9924 (2020)
  • [2] Chen, C.H., Tyagi, A., Agrawal, A., Drover, D., Mv, R., Stojanov, S., Rehg, J.M.: Unsupervised 3d pose estimation with geometric self-supervision. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2019)
  • [3] Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: Simclr: A simple framework for contrastive learning of visual representations. In: International Conference on Learning Representations. vol. 2 (2020)
  • [4] Chen, Y., Wang, Z., Peng, Y., Zhang, Z., Yu, G., Sun, J.: Cascaded pyramid network for multi-person pose estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7103–7112 (2018)
  • [5] Chun, S., Park, S., Chang, J.Y.: Learnable human mesh triangulation for 3d human pose and shape estimation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 2850–2859 (2023)
  • [6] Dabral, R., Mundhada, A., Kusupati, U., Afaque, S., Sharma, A., Jain, A.: Learning 3d human pose from structure and motion. In: Proceedings of the European conference on computer vision (ECCV). pp. 668–683 (2018)
  • [7] Dinh, L., Sohl-Dickstein, J., Bengio, S.: Density estimation using real nvp. arXiv preprint arXiv:1605.08803 (2016)
  • [8] Drover, D., MV, R., Chen, C.H., Agrawal, A., Tyagi, A., Phuoc Huynh, C.: Can 3d pose be learned from 2d projections alone? In: Proceedings of the European Conference on Computer Vision (ECCV) Workshops (2018)
  • [9] Fish Tung, H.Y., Harley, A.W., Seto, W., Fragkiadaki, K.: Adversarial inverse graphics networks: Learning 2d-to-3d lifting and image-to-image translation from unpaired supervision. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 4354–4362 (2017)
  • [10] Gholami, M., Wandt, B., Rhodin, H., Ward, R., Wang, Z.J.: Adaptpose: Cross-dataset adaptation for 3d human pose estimation by learnable motion generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13075–13085 (2022)
  • [11] Gong, K., Zhang, J., Feng, J.: Poseaug: A differentiable pose augmentation framework for 3d human pose estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8575–8584 (2021)
  • [12] Habibie, I., Xu, W., Mehta, D., Pons-Moll, G., Theobalt, C.: In the wild human pose estimation using explicit 2d features and intermediate 3d representations. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10905–10914 (2019)
  • [13] Hardy, P., Kim, H.: Unsupervised reconstruction of 3d human pose interactions from 2d poses alone. arXiv preprint arXiv:2309.14865 (2023)
  • [14] Hartley, R., Zisserman, A.: Multiple view geometry in computer vision. Cambridge university press (2003)
  • [15] Hu, X., Ahuja, N.: Unsupervised 3d pose estimation for hierarchical dance video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 11015–11024 (2021)
  • [16] Huang, B., Shu, Y., Ju, J., Wang, Y.: Occluded human body capture with self-supervised spatial-temporal motion prior. arXiv preprint arXiv:2207.05375 (2022)
  • [17] Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE transactions on pattern analysis and machine intelligence 36(7), 1325–1339 (2013)
  • [18] Ji, X., Fang, Q., Dong, J., Shuai, Q., Jiang, W., Zhou, X.: A survey on monocular 3d human pose estimation. Virtual Reality & Intelligent Hardware 2(6), 471–500 (2020)
  • [19] Joo, H., Liu, H., Tan, L., Gui, L., Nabbe, B., Matthews, I., Kanade, T., Nobuhara, S., Sheikh, Y.: Panoptic studio: A massively multiview system for social motion capture. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 3334–3342 (2015)
  • [20] Kingma, D.P., Dhariwal, P.: Glow: Generative flow with invertible 1x1 convolutions. Advances in neural information processing systems 31 (2018)
  • [21] Kissos, I., Fritz, L., Goldman, M., Meir, O., Oks, E., Kliger, M.: Beyond weak perspective for monocular 3d human pose estimation. In: ECCV Workshops (2020)
  • [22] Kobyzev, I., Prince, S.J., Brubaker, M.A.: Normalizing flows: An introduction and review of current methods. IEEE transactions on pattern analysis and machine intelligence 43(11), 3964–3979 (2020)
  • [23] Kocabas, M., Huang, C.H.P., Tesch, J., Müller, L., Hilliges, O., Black, M.J.: Spec: Seeing people in the wild with an estimated camera. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 11035–11045 (2021)
  • [24] Kocabas, M., Karagoz, S., Akbas, E.: Self-supervised learning of 3d human pose using multi-view geometry. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 1077–1086 (2019)
  • [25] Kosiorek, A., Sabour, S., Teh, Y.W., Hinton, G.E.: Stacked capsule autoencoders. Advances in neural information processing systems 32 (2019)
  • [26] Kundu, J.N., Seth, S., Jamkhandi, A., YM, P., Jampani, V., Chakraborty, A., et al.: Non-local latent relation distillation for self-adaptive 3d human pose estimation. Advances in Neural Information Processing Systems 34, 158–171 (2021)
  • [27] Kundu, J.N., Seth, S., Jampani, V., Rakesh, M., Babu, R.V., Chakraborty, A.: Self-supervised 3d human pose estimation via part guided novel image synthesis. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6152–6162 (2020)
  • [28] Kundu, J.N., Seth, S., YM, P., Jampani, V., Chakraborty, A., Babu, R.V.: Uncertainty-aware adaptation for self-supervised 3d human pose estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 20448–20459 (2022)
  • [29] Li, J., Bian, S., Zeng, A., Wang, C., Pang, B., Liu, W., Lu, C.: Human pose regression with residual log-likelihood estimation. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 11025–11034 (2021)
  • [30] Li, W., Liu, H., Tang, H., Wang, P., Van Gool, L.: Mhformer: Multi-hypothesis transformer for 3d human pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13147–13156 (2022)
  • [31] Li, Z., Liu, J., Zhang, Z., Xu, S., Yan, Y.: Cliff: Carrying location information in full frames into human pose and shape estimation. In: European Conference on Computer Vision. pp. 590–606. Springer (2022)
  • [32] Liu, W., Bao, Q., Sun, Y., Mei, T.: Recent advances of monocular 2d and 3d human pose estimation: a deep learning perspective. ACM Computing Surveys 55(4), 1–41 (2022)
  • [33] Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: A skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia) 34(6), 248:1–248:16 (Oct 2015)
  • [34] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv (2018)
  • [35] Martinez, J., Hossain, R., Romero, J., Little, J.J.: A simple yet effective baseline for 3d human pose estimation. In: Proceedings of the IEEE international conference on computer vision. pp. 2640–2649 (2017)
  • [36] Mehta, D., Rhodin, H., Casas, D., Fua, P., Sotnychenko, O., Xu, W., Theobalt, C.: Monocular 3d human pose estimation in the wild using improved cnn supervision. In: 3D Vision (3DV), 2017 Fifth International Conference on. IEEE (2017). https://doi.org/10.1109/3dv.2017.00064, http://gvv.mpi-inf.mpg.de/3dhp_dataset
  • [37] Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII 14. pp. 483–499. Springer (2016)
  • [38] Pavlakos, G., Zhou, X., Daniilidis, K.: Ordinal depth supervision for 3d human pose estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7307–7316 (2018)
  • [39] Pavlakos, G., Zhou, X., Derpanis, K.G., Daniilidis, K.: Coarse-to-fine volumetric prediction for single-image 3d human pose. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7025–7034 (2017)
  • [40] Pietak, A., Ma, S., Beck, C.W., Stringer, M.D.: Fundamental ratios and logarithmic periodicity in human limb bones. Journal of anatomy 222(5), 526–537 (2013)
  • [41] Rhodin, H., Salzmann, M., Fua, P.: Unsupervised geometry-aware representation for 3d human pose estimation. In: Proceedings of the European conference on computer vision (ECCV). pp. 750–767 (2018)
  • [42] Rhodin, H., Spörri, J., Katircioglu, I., Constantin, V., Meyer, F., Müller, E., Salzmann, M., Fua, P.: Learning monocular 3d human pose estimation from multi-view images. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 8437–8446 (2018)
  • [43] Sabour, S., Tagliasacchi, A., Yazdani, S., Hinton, G., Fleet, D.J.: Unsupervised part representation by flow capsules. In: International Conference on Machine Learning. pp. 9213–9223. PMLR (2021)
  • [44] Shan, W., Liu, Z., Zhang, X., Wang, Z., Han, K., Wang, S., Ma, S., Gao, W.: Diffusion-based 3d human pose estimation with multi-hypothesis aggregation. arXiv preprint arXiv:2303.11579 (2023)
  • [45] Sun, X., Shang, J., Liang, S., Wei, Y.: Compositional human pose regression. In: Proceedings of the IEEE international conference on computer vision. pp. 2602–2611 (2017)
  • [46] Tripathi, S., Ranade, S., Tyagi, A., Agrawal, A.: Posenet3d: Learning temporally consistent 3d human pose via knowledge distillation. In: 2020 International Conference on 3D Vision (3DV). pp. 311–321. IEEE (2020)
  • [47] Usman, B., Tagliasacchi, A., Saenko, K., Sud, A.: Metapose: Fast 3d pose from multiple views without 3d supervision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6759–6770 (2022)
  • [48] Von Marcard, T., Henschel, R., Black, M.J., Rosenhahn, B., Pons-Moll, G.: Recovering accurate 3d human pose in the wild using imus and a moving camera. In: Proceedings of the European conference on computer vision (ECCV). pp. 601–617 (2018)
  • [49] Wandt, B., Little, J.J., Rhodin, H.: Elepose: Unsupervised 3d human pose estimation by predicting camera elevation and learning normalizing flows on 2d poses. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6635–6645 (2022)
  • [50] Wandt, B., Rosenhahn, B.: Repnet: Weakly supervised training of an adversarial reprojection network for 3d human pose estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 7782–7791 (2019)
  • [51] Wandt, B., Rudolph, M., Zell, P., Rhodin, H., Rosenhahn, B.: Canonpose: Self-supervised monocular 3d human pose estimation in the wild. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 13294–13304 (2021)
  • [52] Xu, Y., Zhang, J., Zhang, Q., Tao, D.: Vitpose: Simple vision transformer baselines for human pose estimation. arXiv preprint arXiv:2204.12484 (2022)
  • [53] Yang, S., Quan, Z., Nie, M., Yang, W.: Transpose: Keypoint localization via transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 11802–11812 (2021)
  • [54] Yu, Z., Ni, B., Xu, J., Wang, J., Zhao, C., Zhang, W.: Towards alleviating the modeling ambiguity of unsupervised monocular 3d human pose estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8651–8660 (2021)
  • [55] Zhang, J., Tu, Z., Yang, J., Chen, Y., Yuan, J.: Mixste: Seq2seq mixed spatio-temporal encoder for 3d human pose estimation in video. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 13232–13242 (2022)
  • [56] Zhang, S., Wang, C., Dong, W., Fan, B.: A survey on depth ambiguity of 3d human pose estimation. Applied Sciences 12(20), 10591 (2022)
  • [57] Zheng, C., Wu, W., Chen, C., Yang, T., Zhu, S., Shen, J., Kehtarnavaz, N., Shah, M.: Deep learning-based human pose estimation: A survey. ACM Computing Surveys 56(1), 1–37 (2023)
  • [58] Zhu, W., Ma, X., Liu, Z., Liu, L., Wu, W., Wang, Y.: Motionbert: A unified perspective on learning human motion representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 15085–15099 (2023)