EPOCH: Jointly Estimating the 3D Pose of Cameras and Humans
Abstract
Monocular Human Pose Estimation (HPE) aims at determining the 3D positions of human joints from a single 2D image captured by a camera. However, a single 2D point in the image may correspond to multiple points in 3D space. Typically, the uniqueness of the 2D-3D relationship is approximated using an orthographic or weak-perspective camera model. In this study, instead of relying on approximations, we advocate for utilizing the full perspective camera model. This involves estimating camera parameters and establishing a precise, unambiguous 2D-3D relationship.
To do so, we introduce the EPOCH framework, comprising two main components: the pose lifter network (LiftNet) and the pose regressor network (RegNet). LiftNet utilizes the full perspective camera model to precisely estimate the 3D pose in an unsupervised manner. It takes a 2D pose and camera parameters as inputs and produces the corresponding 3D pose estimation. These inputs are obtained from RegNet, which starts from a single image and provides estimates for the 2D pose and camera parameters. RegNet utilizes only 2D pose data as weak supervision. Internally, RegNet predicts a 3D pose, which is then projected to 2D using the estimated camera parameters. This process enables RegNet to establish the unambiguous 2D-3D relationship.
Our experiments show that modeling the lifting as an unsupervised task with a camera in-the-loop results in better generalization to unseen data. We obtain state-of-the-art results for the 3D HPE on the Human3.6M and MPI-INF-3DHP datasets. Our code is available at: [Github link upon acceptance, see supplementary materials].
![Refer to caption](x1.png)
1 Introduction
There are two main approaches to monocular 3D human pose estimation (HPE) from a single RGB image [18]. One class of algorithms uses a single-stage approach, where the aim is to regress the 3D position of human joints directly from an image [38, 39, 45]. The other class uses two distinct stages: the first infers 2D poses from monocular RGB images, and the second is a lifter network that predicts the 3D displacement for each of the 2D joints. Two-stage approaches typically outperform single-stage approaches [57].
Estimating 3D human poses from a single RGB image is difficult as the problem is inherently ill-posed. For any 2D observation there exist multiple plausible 3D poses that will lead to the same 2D projection [56, 33]. Additionally, collecting reliable ground-truth 3D data is difficult. Annotating 3D ground truth on 2D images inevitably introduces inaccuracies, and collecting actual ground truth requires a complex and expensive controlled environment using multi-view camera systems or additional capture modalities. Even using multiple views, triangulation can lead to errors, as there are inherent ambiguities in the position of the joint under the body surface. While limited datasets providing 3D data are available [17, 19], 2D datasets still provide more data in more general scenarios and environments.
To address the ill-posed nature of the problem, past approaches have relied on different strategies, like fully supervised training using either real [6, 44] or synthetic 3D ground truth [28, 26], weakly supervised training relying on multiple views: either paired [47, 51] or unpaired [50], 2D supervision [46], or video motion consistency [16, 10, 15]. Unsupervised approaches have relied on cycle consistency coupled with a weak perspective camera projection to lift 2D poses to 3D [49, 2]. Relying on a weak perspective camera projection is not ideal because it does not accurately capture the perspective transformation [14], leading to depth inaccuracy and scale ambiguity when projecting a 2D skeleton in the 3D space. Recent works [23] have shown that using fully perspective cameras reduces this ambiguity.
In this paper, we introduce EPOCH, a novel unsupervised framework that effectively addresses the challenges of data scarcity by harnessing unsupervised techniques and mitigates the inherent ill-posed nature of the problem through explicit camera modeling, as shown in Fig. 1. Our approach stands out due to its capability to estimate the full perspective camera parameters leveraging only 2D human poses, without relying on any camera ground truth. We claim that by incorporating the estimated camera into the 3D lifting operation, it is possible to enhance the accuracy and the consistency of 3D unsupervised human pose estimation while generalizing to unseen data. Our method consists of an unsupervised 3D human pose lifter network (LiftNet) and a lightweight capsule-based regressor network (RegNet) that estimates the camera pose and 2D joint positions.
LiftNet performs the 3D lifting from estimated 2D poses and camera parameters. Inspired by [2], we employ a self-supervising cycle-consistent framework. Unlike [49, 2], our approach uses a full perspective camera, allowing us to use a wider range of camera transformations for supervision in our cycle consistency, and improving the accuracy of the model.
We estimate the camera pose and 2D poses used as input for the lifting stage using RegNet, a lightweight capsule-based regressor network that is trained on weakly supervised 2D pose data instead of fully supervised data as in [23]. It uses contrastive pre-training and heatmap-free joint position regression to estimate the 2D poses as well as the intrinsic (the camera matrix K) and extrinsic parameters (rotation matrix R and translation vector T) of a camera based on the standard pinhole camera model. To supervise the camera estimation with the 2D pose data, we internally predict a 3D pose that is then projected to 2D with the estimated camera. The internally predicted 3D pose is not as accurate as the refined output of LiftNet, but it helps regularize the camera estimation.
With no prior about 3D human appearance, both LiftNet and RegNet estimate a single 2D projection which is not enough to guarantee a plausible 3D pose. Thus, we employ Normalizing Flows (NF) to ensure the plausibility of multiple 2D projections of a single 3D estimate. Different from previous approaches [49], our NF is based on simple 1x1 convolutions [20], which can be applied to the full feature representation of the poses, without the need to reduce their dimensionality using the Principal Component Analysis (PCA).
The EPOCH framework is the sequential combination of RegNet and LiftNet, which allows for the direct inference of accurate 3D poses from images. RegNet estimates 3D poses with weak 2D pose supervision, while the camera parameters are estimated without any ground truth camera data. LiftNet predicts 3D poses based on the estimates of 2D poses and camera poses. We argue that RegNet is weakly supervised, whereas LiftNet is fully unsupervised as it relies solely on estimates, without any ground truth data. This reasoning applies to both poses and cameras.
The novelty of our work can be summarized as:
- We define an innovative EPOCH framework to address the challenges of data scarcity and the ill-posed nature of the 3D HPE problem by harnessing the full camera perspective projection, enabling the direct lifting of accurate 3D poses from input images.
- We present LiftNet, a novel 3D unsupervised HPE framework that leverages perspective camera projection to improve the accuracy of 2D pose lifting.
- We introduce an original capsule-based regressor, called RegNet, which jointly estimates 3D joint positions and camera parameters. Final 2D joints are computed using the estimated camera by perspective projection.
- We adopt a lightweight Normalizing Flows [20] model to enforce anthropomorphic constraints. Our NF accepts the entire 2D skeleton as input without the need for dimensionality reduction using PCA.
- We obtain state-of-the-art results on both 3D HPE direct regression and 3D unsupervised HPE on the common benchmark datasets Human3.6M and MPI-INF-3DHP.
2 Related work
3D human pose estimation (HPE) from monocular 2D images has been extensively researched through supervised, weakly-supervised, and unsupervised approaches [57, 32]. This section gives an overview of the different methods, also focusing on the challenging in-the-wild approaches.
Fully supervised approaches. In this paradigm the 3D ground truth is readily available. Gathering such data requires collecting vast datasets such as Human3.6M [17], 3DPW [48], MPI-INF-3DHP [36] and CMU Panoptic [19]. In supervised methods, two primary strategies exist: direct regression of 3D coordinates from the image [38, 39, 45] (Fig. 1(a)), or 2D pose estimation followed by lifting to 3D [52, 53, 58, 5] (Fig. 1(b)). Direct regression proves more challenging because it involves the simultaneous estimation of 3D coordinates for each joint, often leading to inferior results compared to lifting-based techniques [57]. Supervised networks achieve the best results on multiple datasets [58, 5], but often struggle to generalize to different scenarios like out-of-distribution poses, challenging camera angles and in-the-wild pose estimation [57].
Weakly-supervised approaches rely on the lifting framework [47, 51, 50, 46, 15, 55, 30, 27, 8] using various supervision signals without directly accessing the 3D ground truth paired with the corresponding 2D image. For instance, multiple paired or unpaired views of the same subject provide a supervision signal, through the consistency of the estimated 3D pose seen from different viewpoints [47, 51, 50, 42, 27, 24]. In monocular approaches, temporally correlated 2D poses can be estimated from an input video and used as a supervision signal for a frame-specific 3D pose estimation [15, 55, 30]. In-the-wild approaches have mostly relied on 2D pose as ground truths to supervise intermediate 3D estimates [54, 12]. Other approaches perform monocular 3D pose estimation using only 2D pose supervision [27, 9, 2, 50].
Following this line of work, our regressor network (RegNet) is a novel weakly-supervised approach that uses 2D poses for supervision, computing the 3D to 2D projection via estimated camera parameters. Moreover, our approach is the first that jointly estimates the full perspective camera parameters without relying on any ground truth camera.
Unsupervised approaches. Unsupervised approaches usually rely on the lifting paradigm, employing multiple different signals to regularise their 3D predictions. In [2], the authors proposed an unsupervised lifting network grounded in closure and invariance properties, incorporating a geometric self-consistency loss. The closure property for a lifted 3D skeleton means that, after random rotation and re-projection, the resulting 2D skeleton will lie within the distribution of valid 2D poses. The invariance property means that, when changing the viewpoint of the 2D projection from a 3D skeleton, the re-lifted 3D skeleton should be the same. Following a similar concept, [49] introduces a weak camera projection to model the lift-reproject-lift process. The weak camera projection is coupled with the elevation estimation of the camera, providing an approximation of the full camera perspective model. Moreover, they introduce the use of Normalizing Flows (NFs) [7], which are used to ensure the closure property more accurately than GAN-based methods [2]. In [13], a similar framework is extended to a multi-person scenario, where relative positions are also used as supervision signals.
Our lifter network (LiftNet) follows this line of work while introducing the following novelties: (i) we employ a full perspective camera model for the projection, making it much more accurate and robust to varying focal lengths, (ii) we drop the need for an additional elevation prediction branch in the lifting network [49], (iii) we avoid applying PCA to reduce data dimensionality by training a normalizing flow based on Glow [20] instead of RealNVP, (iv) we add a geometric constraint on unnatural joint folds.
3 Method
3.1 Preliminaries
Camera model. While many prior works for 3D human pose estimation rely on simplified weak perspective camera models, we use a full perspective camera model, consisting of intrinsic parameters K (focal length and center of projection) and extrinsic parameters R and T (rotation and translation of the camera respectively). We transform a 3D point $\mathbf{X}$ into camera coordinates by multiplying with the extrinsic and intrinsic matrices, and get the final image space coordinates $\mathbf{x}$ (both in homogeneous coordinates, with $\mathbf{x}$ defined up to the depth scale):

$$\mathbf{x} = K \, [R \mid T] \, \mathbf{X} \qquad (1)$$
To invert the projection given image coordinates , we need to estimate the unknown depth of the point. Similar to [31] we estimate the intrinsic parameters directly from the full image size and the bounding box crop using a model . While this is only an approximation of the real field-of-view of the camera, prior work [21] has shown this to be a good approximation for most real life cameras used. The estimated intrinsic camera matrix includes the focal length and the principal point . Moreover, we estimate a scaling factor for the regularisation of the image size and skeleton height. See Supplementary Materials for further details on model .
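The projection of Eq. 1 and its inversion can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `project` and `backproject`, the row-vector layout, and the example camera values are all hypothetical. It shows why the inversion needs a per-joint depth: a pixel only fixes a viewing ray, and the depth selects one point on it.

```python
import numpy as np

def project(points_world, K, R, T):
    """Perspective projection of (J, 3) world points to (J, 2) pixel coordinates."""
    cam = points_world @ R.T + T           # world -> camera coordinates (R X + T)
    uvw = cam @ K.T                         # camera -> homogeneous image coordinates
    return uvw[:, :2] / uvw[:, 2:3]         # perspective divide by the camera-space depth

def backproject(pixels, depth, K, R, T):
    """Invert project() given the camera-space depth of every joint."""
    homog = np.hstack([pixels, np.ones((len(pixels), 1))])
    rays = homog @ np.linalg.inv(K).T       # viewing rays normalised to z = 1
    cam = rays * depth[:, None]             # pick the point at the given depth on each ray
    return (cam - T) @ R                    # camera -> world, i.e. R^T (cam - T)

# Hypothetical camera: 1000 px focal length, principal point at (500, 500),
# no rotation, subject placed 5 units in front of the camera.
K = np.array([[1000.0, 0.0, 500.0],
              [0.0, 1000.0, 500.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
T = np.array([0.0, 0.0, 5.0])

joints_3d = np.array([[0.0, 0.0, 0.0], [0.5, -0.25, 1.0]])
joints_2d = project(joints_3d, K, R, T)
depth = (joints_3d @ R.T + T)[:, 2]         # depth that a lifter would have to estimate
recovered = backproject(joints_2d, depth, K, R, T)
```

In this sketch the depth is read from the known 3D points only to check the round trip; in a lifting setting it is the unknown quantity that has to be predicted.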
Normalizing flows loss. Normalizing flows (NFs) are a class of generative neural networks capable of mapping a complex distribution into a simpler one using invertible functions [22]. They are trained to learn the probability density function (PDF) of a given dataset relying on an invertible function. See the Supplementary Materials for further details about the modeling of this function. When presented with a novel sample, NFs can estimate the likelihood (plausibility) that the given sample belongs to the learned dataset distribution.
In [49], the learnable function is based on RealNVP [7] which is not suitable for high dimensional data like 2D poses, necessitating a PCA reduction of the input vector for an optimal convergence during training. In contrast, our NFs are based on the Glow framework [20] relying on 1x1 convolutions. Small and fast convolutions allow the use of the full 2D joints’ coordinates without the need for a feature reduction as well as reducing the computational costs.
During the training of our architecture, the NFs are used to verify whether multiple projections of 3D poses are all plausible 2D poses without relying on multi-view data. To achieve this, we define the normalizing flow loss $\mathcal{L}_{NF}$ as the negative log-likelihood of a projected 2D pose $\mathbf{x}$ under the learned PDF $p_X$:

$$\mathcal{L}_{NF} = -\log p_X(\mathbf{x}) \qquad (2)$$
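As a rough illustration of how a flow turns a 2D pose into a plausibility score, the sketch below evaluates the negative log-likelihood of a flattened pose under a single invertible linear map with a standard Gaussian base distribution. A single linear layer is what an invertible 1x1 convolution reduces to when every pose coordinate is treated as a channel; a Glow-style model stacks several learned layers, so this is a simplification. The names `nf_nll`, `W`, and `b`, the 17-joint skeleton, and the untrained placeholder weights are assumptions made for this example.

```python
import numpy as np

def nf_nll(pose_2d, W, b):
    """Negative log-likelihood of a 2D pose under z = W x + b with z ~ N(0, I)."""
    x = pose_2d.reshape(-1)                          # flatten (J, 2) -> (2J,)
    z = W @ x + b                                    # invertible map into the latent space
    log_pz = -0.5 * (z @ z + x.size * np.log(2.0 * np.pi))
    sign, logabsdet = np.linalg.slogdet(W)           # log |det(dz/dx)|
    if sign == 0:
        raise ValueError("W must be invertible")
    return -(log_pz + logabsdet)                     # change of variables, negated

num_joints = 17                                      # assumed skeleton size
W = np.eye(2 * num_joints)                           # placeholder weights, not trained
b = np.zeros(2 * num_joints)

typical = np.zeros((num_joints, 2))                  # pose at the mode of the base density
outlier = np.full((num_joints, 2), 3.0)              # pose far from the mode
loss_typical = nf_nll(typical, W, b)
loss_outlier = nf_nll(outlier, W, b)
```

With these placeholder weights the far-away pose receives the larger loss, which is the behaviour the loss relies on once the flow has been trained on real 2D poses.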
![Refer to caption](x2.png)
Anthropomorphic constraints. In supervised 3D HPE, the neural network has explicit access to the 3D ground truth to learn the appearance of a 3D human pose. In unsupervised or semi-supervised settings, we introduce regularising losses to ensure that the estimated 3D poses respect anthropomorphic constraints, such as proportional bone lengths and articulation angle limits.
As in [49], we use a bone ratio loss to ensure that the ratios between the bone lengths of the 3D pose are respected. This loss leverages the observed nearly constant ratio between bones across different individuals [40], without fixing the bone length to a pre-defined value.
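One way to express such a scale-free bone constraint is sketched below. The exact formulation is not given in this section, so the normalisation by the total bone length, the squared-error penalty, the toy 4-joint skeleton, and all names are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical toy skeleton: (parent, child) joint index pairs.
BONES = [(0, 1), (1, 2), (2, 3)]

def relative_bone_lengths(pose_3d, bones=BONES):
    """Bone lengths divided by their sum, so the result ignores global scale."""
    lengths = np.array([np.linalg.norm(pose_3d[c] - pose_3d[p]) for p, c in bones])
    return lengths / lengths.sum()

def bone_ratio_loss(pose_3d, reference_ratios, bones=BONES):
    """Mean squared deviation of the relative bone lengths from reference ratios."""
    return np.mean((relative_bone_lengths(pose_3d, bones) - reference_ratios) ** 2)

pose = np.array([[0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 3.0, 0.0],
                 [0.0, 4.0, 0.0]])
reference = np.array([0.25, 0.5, 0.25])      # relative bone lengths of the pose above

loss_same = bone_ratio_loss(pose, reference)
loss_scaled = bone_ratio_loss(2.5 * pose, reference)   # a taller person, same proportions
stretched = pose.copy()
stretched[3] = [0.0, 6.0, 0.0]                         # last bone three times too long
loss_stretched = bone_ratio_loss(stretched, reference)
```

Because only ratios are compared, uniformly rescaling the skeleton leaves the loss unchanged, while changing a single bone is penalised.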
Additionally, we define a novel loss which ensures that joints like knees and elbows do not bend in unrealistic manners (e.g. facing backward with respect to the normal walking direction). It is defined as:
(3) |
where represents the number of limbs, and denote the normal components of the proximal () and distal () components of each limb, and represents the normal vector for the plane defined by the hips and spine joints. In Fig. 2 we show a visual representation of to better convey the intuitive reasoning behind its mathematical formulation.
3.2 Pose Lifter Network (LiftNet)
LiftNet is our lifter module which introduces the paradigm shown in Fig. 1(d). The overall detailed architecture is shown in Fig. 3.
![Refer to caption](x3.png)
LiftNet aims at retrieving the 3D pose , starting from a 2D pose and its estimated camera parameters and . As shown in Fig. 3, our architecture consists of a cycle consistency structure which can be split into two symmetric branches: a forward branch () and a backward branch (). Each step and its input/output are described in Alg. 1. The lift operation is performed by a lifter network while the projection is a mathematical operation. Both of these operations rely on the full perspective model using camera parameters . All the losses provide self-supervision to the cycle consistency, which does not access either 2D or 3D ground truths, making it a fully unsupervised process.
Differently from previous approaches using the weak camera model [49], our lifter leverages the full perspective camera model to recover the 3rd dimension for each input 2D pose . Using the estimated we can solve the inverse of the projection of (Eq. 1) and recover the 3D joint positions. This Lift operation is symbolized as .
Given a 3D pose we can perform the Project operation symbolized as . That means computing the 2D pose using the full camera projection in Eq. 1.
Inspired by previous approaches [35, 49], our lifter network consists of a simple MLP structure. The MLP receives as input a 2D pose concatenated with the flattened version of the extrinsic parameters (12 total values), the intrinsic parameters , and the scaling factor , resulting in a vector of size . The input vector is fed (I) to a linear layer to obtain an embedded vector of size , (II) to 3 residual blocks each containing 2 fully connected layers, (III) to a linear layer to obtain the output vector of size representing the depth parameter for each joint. The vector is concatenated with the input resulting in the estimated 3D pose .
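The data flow through such a lifter can be sketched with a plain NumPy forward pass. The layer sizes are not recoverable from the text above, so the joint count (17), the embedding width (256), the four intrinsic values (fx, fy, cx, cy), the ReLU activations, and the random weights are all assumptions; only the overall structure (embedding layer, 3 residual blocks of 2 fully connected layers, depth head, concatenation with the 2D input) follows the description.

```python
import numpy as np

J, HIDDEN = 17, 256                  # assumed joint count and embedding width
IN_DIM = 2 * J + 12 + 4 + 1          # 2D pose + flattened [R|T] + (fx, fy, cx, cy) + scale

def linear(rng, n_in, n_out):
    """Randomly initialised weights and zero bias for one fully connected layer."""
    return rng.normal(0.0, 0.01, (n_in, n_out)), np.zeros(n_out)

def init_lifter(seed=0):
    """Untrained parameters: embedding layer, 3 residual blocks, depth head."""
    rng = np.random.default_rng(seed)
    return {
        "embed": linear(rng, IN_DIM, HIDDEN),
        "blocks": [(linear(rng, HIDDEN, HIDDEN), linear(rng, HIDDEN, HIDDEN))
                   for _ in range(3)],
        "head": linear(rng, HIDDEN, J),
    }

def lift(pose_2d, extrinsics, intrinsics, scale, params):
    """Predict one depth value per joint and append it to the input 2D pose."""
    x = np.concatenate([pose_2d.reshape(-1), extrinsics.reshape(-1), intrinsics, [scale]])
    W, b = params["embed"]
    h = x @ W + b
    for (W1, b1), (W2, b2) in params["blocks"]:
        r = np.maximum(h @ W1 + b1, 0.0)      # first fully connected layer + ReLU
        r = np.maximum(r @ W2 + b2, 0.0)      # second fully connected layer + ReLU
        h = h + r                              # residual connection
    W, b = params["head"]
    depth = h @ W + b                          # (J,) depth parameter per joint
    return np.concatenate([pose_2d, depth[:, None]], axis=1)

params = init_lifter()
pose_2d = np.random.default_rng(1).uniform(-1.0, 1.0, (J, 2))
extrinsics = np.hstack([np.eye(3), [[0.0], [0.0], [5.0]]])    # 3x4 matrix [R | T]
intrinsics = np.array([1000.0, 1000.0, 500.0, 500.0])
pose_3d = lift(pose_2d, extrinsics, intrinsics, 1.0, params)
```

The output keeps the input 2D coordinates untouched in its first two columns, so the network only has to learn the missing third dimension.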
To train the LiftNet, we minimize the following loss:
(4) |
where the individual terms are:
- the norm-1 distance between the initial prediction and its version after the cycle consistency loop;
- the norm-2 distance between the 3D pose and its version after it has been projected and lifted;
- the three losses defined in Sec. 3.1 (normalizing flow, bone ratio, and joint bending);
- a deformation loss computed between two poses belonging to the same batch (Eq. 5), which ensures that two poses from the same batch have not been deformed in completely different manners by the same Project and Lift operations, providing a supervision similar to the temporal consistency defined in [54], but without relying on temporally-related data.
![Refer to caption](x4.png)
3.3 Pose Regressor Network (RegNet)
RegNet is our direct regression module (Fig. 1(c)) which is used to estimate the camera pose and initial 2D pose used for the lifting stage. The overall detailed architecture is shown in Fig. 4. The input to RegNet is a single square image of size pixels, roughly centered on the pelvis similar to [49, 2, 54]. The objective is to retrieve the 2D pose . Additionally, it estimates the intrinsic camera parameters consisting of focal length , principal point , and a scaling factor , as well as the extrinsic camera parameters and . The intrinsic camera parameters are estimated directly from the image size as shown in [31].
RegNet consists of an encoder-decoder architecture, where the encoder is pre-trained using contrastive learning and the decoder is based on capsules. The output of the decoder yields both 3D poses and camera parameters, which are then used to compute the 2D pose , with each joint represented as on the image plane . It is worth noting that we employ 2D poses for the loss computation, thus introducing a form of weak supervision to the 3D pose estimation process.
Contrastive Encoder. Our encoder employs a ResNet50 pre-trained using contrastive learning on ImageNet as in [1]. Despite its general image focus, the use of contrastive learning allows for faster convergence and better generalization, outperforming supervised pre-training methods [3].
The output vector of the encoder is concatenated with the intrinsic parameters that are estimated directly from the full image. The vector resulting from the concatenation of size is given as input to the decoder.
Capsule-based decoder. In [25], they first showed how a Soft Attention mechanism can be effectively used to split a feature vector in different capsule features. In [43], they perform an equivalent operation with a fully connected and a Softmax layer. Inspired by [43], we design our capsule-based decoder using a Conv2D fully connected layer to transform the latent space vector of size to a vector of size .
Given values used for the attention mechanism, the remaining values are divided in the following capsules:
- a capsule representing the estimated 3D pose;
- a capsule from which we can compute the extrinsic parameters, namely the estimated rotation matrix and translation vector with respect to the input image’s viewpoint. The mathematical calculations for this derivation are reported in the Supplementary Materials. To ensure invertibility, an orthogonality constraint is enforced on the rotation matrix. The extrinsic parameters are combined with the intrinsic camera matrix computed at the encoding stage to obtain the full perspective camera model descriptor;
- a presence capsule, a vector indicating the presence of each joint. Low values indicate uncertain detection of joints, often due to occlusions or joints being outside the image space.
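The text does not specify how the orthogonality constraint is imposed. One common option, shown here purely as an assumed sketch, is a penalty on the deviation of R Rᵀ from the identity; the extrinsics can then be composed with the intrinsic matrix into a 3x4 camera matrix. The function names and example values are hypothetical.

```python
import numpy as np

def orthogonality_penalty(R):
    """Squared Frobenius norm of R R^T - I; zero exactly for orthogonal matrices."""
    return np.sum((R @ R.T - np.eye(3)) ** 2)

def camera_matrix(K, R, T):
    """Compose the 3x4 full perspective camera matrix K [R | T]."""
    return K @ np.hstack([R, T.reshape(3, 1)])

theta = 0.3
R_good = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(theta), 0.0, np.cos(theta)]])   # a valid rotation
R_bad = 2.0 * np.eye(3)                                      # scaled, hence not a rotation

K = np.array([[1000.0, 0.0, 500.0],
              [0.0, 1000.0, 500.0],
              [0.0, 0.0, 1.0]])                              # hypothetical intrinsics
P = camera_matrix(K, R_good, np.array([0.0, 0.0, 5.0]))

penalty_good = orthogonality_penalty(R_good)
penalty_bad = orthogonality_penalty(R_bad)
```

A penalty of this kind only encourages orthogonality during training; a parameterisation that builds R from rotation angles satisfies it by construction.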
Outputs. To ensure that the estimated 3D pose is plausible from multiple viewpoints, we require both the 2D pose of the original image , as well as the 2D pose corresponding to the same pose from a different viewpoint.
For , we perform the Project operation as used in the lifting architecture to obtain 2D pose using the full camera projection in Eq. 1 given the 3D pose .
Similar to the lifting stage, we observed that we can increase the accuracy of the estimated 3D poses by using a Normalizing Flow loss to ensure the plausibility of multiple 2D projections of the 3D pose seen from different viewpoints. To this end we also calculate a rotated projection , where we rotate the model around the vertical world axis. The Rotate operation is applied randomly to each 3D pose in the range of to simulate a viewpoint change, before projecting the rotated 3D pose to obtain .
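The Rotate and Project steps can be sketched as below. The sampling range is not recoverable from the text, so the full-circle range used here is an assumption, as are the choice of Y as the vertical world axis, the pelvis-centred toy pose, the camera values, and the function names.

```python
import numpy as np

def rotation_about_y(angle):
    """Rotation matrix for a turn of `angle` radians about the vertical (Y) axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def rotate_and_project(pose_3d, angle, K, R, T):
    """Rotate a root-centred (J, 3) pose about Y, then project it with K [R | T]."""
    rotated = pose_3d @ rotation_about_y(angle).T   # simulated viewpoint change
    cam = rotated @ R.T + T                          # world -> camera coordinates
    uvw = cam @ K.T                                  # homogeneous image coordinates
    return uvw[:, :2] / uvw[:, 2:3]                  # perspective divide

K = np.array([[1000.0, 0.0, 500.0],
              [0.0, 1000.0, 500.0],
              [0.0, 0.0, 1.0]])                      # hypothetical intrinsics
R, T = np.eye(3), np.array([0.0, 0.0, 5.0])

pose_3d = np.array([[0.0, 0.0, 0.0],                 # pelvis at the world origin
                    [0.2, 0.5, 0.0],
                    [-0.2, 0.5, 0.1]])

rng = np.random.default_rng(0)
angle = rng.uniform(-np.pi, np.pi)                   # assumed sampling range
pose_2d = rotate_and_project(pose_3d, 0.0, K, R, T)
pose_2d_rot = rotate_and_project(pose_3d, angle, K, R, T)
half_turn = rotate_and_project(pose_3d, np.pi, K, R, T)
```

Both `pose_2d` and `pose_2d_rot` would then be scored by the normalizing flow, so a 3D pose that only looks plausible from the original viewpoint is penalised.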
Regression of the joints’ position can be combined with an uncertainty measure and leads to better results when compared to direct regression of joints and heatmap-based methods [29]. Moreover, the regression is more computationally and memory efficient than heatmaps. To employ a similar method, we estimate the deviation of the predicted joint’s position from the ground truth by applying a sigmoid function on the estimated presence capsule .
Losses. To train RegNet, we need to minimize the following loss:
(6) |
where:
- the normalizing flow, bone ratio, and joint bending losses are those defined in Sec. 3.1;
- •
For further details on loss balancing during training see the Supplementary Materials.
4 Results
We perform our experiments on the common Human3.6M [17] and MPI-INF-3DHP [36] datasets (all datasets were obtained and used only by the authors affiliated with academic institutions). We follow standard test protocols for both datasets [49]. We also report extensive ablation studies and qualitative results on the unseen 3DPW dataset [48] to demonstrate generalization.
Implementation details. The input images are of size and . The skeleton model has joints. RegNet is trained for epochs using the optimizer AdamW [34], , learning rate , and weight decay . LiftNet is trained for epochs using the optimizer AdamW, learning rate , and weight decay . Both RegNet and LiftNet are trained with batch size and bfloat16 precision on a single NVIDIA RTX 3090. Inference runs on the same GPU at fps.
Metrics. We adopt the standard mean per joint position error (MPJPE) in two common forms for the 3D HPE unsupervised settings: the PA-MPJPE, where the reconstructed 3D pose is Procrustes aligned, and the N-MPJPE, where the 3D pose is normalized to the same scale as the ground truth [41]. As in [49], for the MPI-INF-3DHP dataset we report the scale normalized percentage of correct key points (N-PCK) predicted within mm to the original position and its corresponding area under curve (AUC).
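The two MPJPE variants can be sketched in NumPy as follows. This is an illustrative version rather than the evaluation code used for the tables: the function names are hypothetical, the N-MPJPE scale is taken as the least-squares optimal factor (an assumption, other scale normalisations exist), and the Procrustes step uses the standard SVD solution with a reflection check.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Euclidean distance between corresponding joints of two (J, 3) poses."""
    return np.linalg.norm(pred - gt, axis=1).mean()

def n_mpjpe(pred, gt):
    """MPJPE after rescaling the prediction with the least-squares optimal scale."""
    scale = np.sum(pred * gt) / np.sum(pred * pred)
    return mpjpe(scale * pred, gt)

def pa_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment (optimal scale, rotation and translation)."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    P, G = pred - mu_p, gt - mu_g                  # remove translation
    U, S, Vt = np.linalg.svd(P.T @ G)              # SVD of the cross-covariance
    d = np.sign(np.linalg.det(U @ Vt))             # -1 if the best fit would be a reflection
    D = np.diag([1.0, 1.0, d])
    rot = U @ D @ Vt                               # proper rotation such that P @ rot ~ G
    scale = np.sum(S * np.diag(D)) / np.sum(P * P)
    return mpjpe(scale * P @ rot + mu_g, gt)

rng = np.random.default_rng(0)
gt = rng.normal(size=(17, 3))                      # assumed 17-joint skeleton
c, s = np.cos(0.7), np.sin(0.7)
rot_z = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
pred = 2.0 * gt @ rot_z + np.array([0.3, -0.1, 1.0])   # scaled, rotated, shifted copy

err_raw = mpjpe(pred, gt)
err_pa = pa_mpjpe(pred, gt)
err_n = n_mpjpe(3.0 * gt, gt)
```

In this toy case the prediction differs from the ground truth only by a similarity transform, so the raw MPJPE is large while the Procrustes-aligned error vanishes; this is why PA-MPJPE is usually the lower of the reported numbers.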
4.1 Quantitative results
Supervision | Method | PA-MPJPE | N-MPJPE |
---|---|---|---|
Full-3D | Pavalkos [38] | 41.8 | - |
10%-3D | Kundu [28] | 49.6 | 59.4 |
10%-3D | Kundu [26] | 48.2 | 57.6 |
10%-3D | Gong [11] | 39.1 | 50.2 |
Multi-view | Rhodin [41] | 98.2 | 122.6 |
Multi-view | Kundu [27] | 85.8 | - |
Multi-view | Usman [47] | 44.0 | 55.0 |
Full-2D | Fish [9] | 97.2 | - |
Full-2D | Kundu [27] | 62.4 | - |
Full-2D | Ours (RegNet) | 45.4 | 69.9 |

Supervision | Method | PA-MPJPE | N-MPJPE |
---|---|---|---|
Full | Martinez [35] | 37.1 | 45.5 |
Weak | Fish [9] | 79.0 | - |
Weak | Wandt [50] | 38.2 | 50.9 |
Weak | Drover [8] | 38.2 | - |
Weak | Kundu [28] | 62.4 | - |
Multi-view | Kocabas [24] | 47.9 | 54.9 |
Multi-view | Wandt [51] | 51.4 | 65.9 |
Unsupervised | Chen [2] | 58.0 | - |
Unsupervised | Yu [54] | 42.0 | 85.3 |
Unsupervised | Wandt [49] | 36.7 | 64.0 |
Unsupervised | Ours (LiftNet) | 30.7 | 50.8 |

Method | Backbone | PA-MPJPE | N-MPJPE |
---|---|---|---|
Chen [2] | SH [37] | 68.0 | - |
Wandt [50] | SH [37] | 65.1 | 89.9 |
Kundu [27] | ResNet-50 | 63.8 | - |
Kundu [27] | ResNet-50 | 62.4 | - |
Yu [54] | CPN [4] | 52.3 | 92.4 |
Wandt [49] | CPN [4] | 50.2 | 74.4 |
Ours (LiftNet) | Ours (RegNet) | 43.8 | 67.1 |

Supervision | Method | 2D Input | PA-MPJPE | N-PCK | AUC |
---|---|---|---|---|---|
Weak | Kundu [27] | GT | 93.9 | 84.6 | 60.8 |
Unsup. | Yu [54] | GT | - | 86.2 | 51.7 |
Unsup. | Wandt [49] | GT | 54.0 | 86.0 | 50.1 |
Unsup. | Ours (LiftNet) | Pred | 46.8 | 93.6 | 61.3 |
Unsup. | Ours (LiftNet) | GT | 33.6 | 97.6 | 69.5 |
Unseen data | Ours (LiftNet) | Pred | 57.3 | 77.5 | 46.4 |

Direct regression from images. In Tab. 1, we report the results of weakly-supervised direct regression of 3D pose from images on the Human3.6M dataset. Weakly-supervised approaches include ones using only a small portion of 3D annotated data (10%-3D) or multi-view supervision. Both approaches rely on 3D spatial information for their supervision, leading to results close to the baseline fully supervised approach [38]. Among approaches using only 2D information, our RegNet outperforms the others, obtaining results in line with the 3D-supervised baselines. In contrast to the previous approaches, RegNet also performs the unsupervised estimation of the full perspective camera parameters.
Lifting from 2D ground truth (GT). In Tab. 2, we report the results of lifting GT 2D pose to 3D on Human3.6M dataset. We compare against baseline fully supervised approaches [35], multi-view supervised approaches [24, 51], and weakly supervised approaches adopting different strategies like domain adaptation [28], 2D GT poses [8, 9] and partial 3D GT [50]. Our unsupervised LiftNet outperforms all previous weakly supervised approaches, also obtaining the best results among unsupervised methods. We also outperform [49] which adopts a weak perspective camera modeling combined with elevation estimation, demonstrating the efficacy of our LiftNet that explicitly leverages the full perspective camera model.
Lifting from 2D predictions. In Tab. 3, we report the results of lifting 2D pose predictions to 3D on the Human3.6M dataset. In contrast to other unsupervised approaches that use either ResNet50 [27], Stacked Hourglass (SH) [2, 50] or Cascaded Pyramid Network (CPN) [4, 49] as the backbone to extract 2D predictions to be lifted, we use RegNet which also provides an unsupervised estimation of the camera parameters. The full EPOCH approach combining both RegNet and LiftNet outperforms all other methods, demonstrating its efficacy as an end-to-end approach to estimate 3D poses from images.
Generalising to unseen data. In Tab. 4, we report the results of 3D HPE on MPI-INF-3DHP dataset. When trained on MPI-INF-3DHP, LiftNet outperforms both weakly-supervised [27] and unsupervised [49, 54] lifting methods starting from either the ground truth or from poses predicted by RegNet. When not trained on MPI-INF-3DHP (last row), EPOCH trained only on Human3.6M achieves results comparable to fine-tuned approaches, proving its ability to generalize to unseen data.
Ablation studies. In Tab. 5, we report the results of the ablation study for RegNet. First, we show how using a backbone trained with supervised learning (S) leads to poorer features (first row), and thus to performance degradation compared to the full model trained with contrastive learning (C) (last row). Next, we ablate the different losses, showing that removing the normalizing flow loss causes the biggest drop in performance, since this loss ensures the network does not estimate 3D poses that are plausible only from a single viewpoint. The two anthropomorphic losses have a similar effect on the results, as they both enforce anthropomorphic constraints.
In Tab. 6, we report the results of the ablation study for LiftNet. We perform an ablation study on the camera modeling, showing how using a weak perspective camera (W) leads to performance degradation compared to the full perspective camera (F), due to its less precise modeling of the 2D/3D relation. Moreover, we study the effect of each loss on the performance. As for RegNet, removing the normalizing flow loss causes the biggest drop in performance. The deformation loss and the two anthropomorphic constraints have similar effects on the numerical results, as they all enforce comparable constraints on the deformation and proportions of 3D poses. The two cycle consistency distances both provide regularisation between different stages of the cycle consistency, so removing one of them has a comparable effect on the performance degradation, since LiftNet is still regularised by the remaining one.

| Backbone | | | | PA-MPJPE | MPJPE |
|---|---|---|---|---|---|
| S | ✓ | ✓ | ✓ | 60.1 | 82.7 |
| | ✗ | ✓ | ✓ | 239.2 | 342.8 |
| | ✓ | ✗ | ✓ | 49.1 | 71.4 |
| | ✓ | ✓ | ✗ | 48.3 | 70.9 |
| C | ✓ | ✓ | ✓ | 45.4 | 69.9 |

| Camera | | | | | | | PA-MPJPE | MPJPE |
|---|---|---|---|---|---|---|---|---|
| W | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 47.7 | 84.9 |
| | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | 232.6 | 293.0 |
| | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ | 41.5 | 56.4 |
| | ✓ | ✓ | ✗ | ✓ | ✓ | ✓ | 41.3 | 55.8 |
| | ✓ | ✓ | ✓ | ✗ | ✓ | ✓ | 40.9 | 54.9 |
| | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ | 40.1 | 54.3 |
| | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | 39.5 | 53.4 |
| F | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 30.7 | 50.8 |

![Refer to caption](x5.png)
4.2 Qualitative results
Fig. 5 shows our qualitative results on challenging poses for Human3.6M and MPI-INF-3DHP. Even in the presence of occlusions and rare poses (e.g. sitting on a chair, lying on the floor with crossed legs), both RegNet and LiftNet obtain visually plausible 3D poses. Moreover, we display results on 3DPW which is unseen at training time. Even if the scale of the 3D pose is different, we still obtain plausible poses from challenging images, demonstrating the ability of our EPOCH approach to generalize to unseen scenarios.
5 Conclusions
In this paper, we presented EPOCH, a novel framework that jointly estimates the 3D pose of cameras and humans consisting of LiftNet and RegNet. LiftNet performs the unsupervised 3D lifting starting from estimations of both 2D poses and camera parameters. To address the unavailability of camera parameters in real world scenarios, we design RegNet, a novel human pose regressor that can jointly estimate 2D and 3D poses as well as perspective camera parameters using weak 2D pose supervision. We show that an estimated full perspective camera allows us to substantially improve the unsupervised 3D human pose estimation accuracy and consistency over state-of-the-art results. By estimating the camera only from 2D poses without any 3D or camera ground truth, we can generalise to unseen data, making a step forward towards fully unsupervised 3D HPE in-the-wild.
Supplementary Materials
6 Code
The code will be made available in a Github repository upon acceptance.
7 Details about the Normalizing Flows’ invertible function
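As in [49], we design a NF to estimate the plausibility of 2D poses. We train the NF to map each pose onto a sample point of a Gaussian distribution. Given a 2D pose $\mathbf{x}$ and an invertible function $f$ such that $\mathbf{z} = f(\mathbf{x})$, the PDF of $\mathbf{x}$ can be computed as

$$p_X(\mathbf{x}) = p_Z(f(\mathbf{x})) \left| \det \frac{\partial f(\mathbf{x})}{\partial \mathbf{x}} \right| \qquad (8)$$

where $p_Z$ is the Gaussian density and $\partial f(\mathbf{x}) / \partial \mathbf{x}$ is the Jacobian matrix of the inverse transformation $f$, which maps poses to the Gaussian space. The NF is trained offline to learn the PDF of the 2D poses in a chosen training dataset. Once trained, the probability can be computed for a new sample and used as a measure of its plausibility to lie within the distribution of the learned dataset.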
8 Camera matrix notation
In the paper we refer to the intrinsic and extrinsic camera matrices as detailed below.
The intrinsic matrix, representing the internal parameters of a camera, is defined as:
$$K = \begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$$
where:
- $f_x$ and $f_y$ are the focal lengths in the x and y directions.
- $s$ represents skew, which is zero for most cameras. In the paper we assume $s = 0$ for all cameras.
- $(c_x, c_y)$ is the principal point, the optical center of the image.
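The extrinsic matrix, denoting the rotation and translation of the camera with respect to the world coordinate system, can be expressed as:
$$[R \mid \mathbf{t}] = \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \end{bmatrix}$$
where:
- $R \in \mathbb{R}^{3 \times 3}$ represents the rotation matrix.
- $\mathbf{t} \in \mathbb{R}^{3}$ represents the translation vector.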
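To make the role of the two matrices concrete, the sketch below projects a single 3D world point to pixel coordinates with the full perspective model: the point is first moved into the camera frame with the rotation and translation, then mapped to the image with the intrinsic matrix and divided by depth. The helper names and the numeric values are illustrative assumptions, not values from the paper.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def intrinsic_matrix(fx, fy, cx, cy, skew=0.0):
    """Build the 3x3 intrinsic matrix; skew defaults to zero."""
    return [[fx, skew, cx],
            [0.0, fy, cy],
            [0.0, 0.0, 1.0]]


def project_point(point_world, K, R, t):
    """Perspective projection of a world point: x ~ K (R X + t)."""
    point_cam = [p + ti for p, ti in zip(mat_vec(R, point_world), t)]
    u, v, w = mat_vec(K, point_cam)
    return u / w, v / w  # perspective division by depth


K = intrinsic_matrix(fx=1000.0, fy=1000.0, cx=320.0, cy=240.0)
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # identity rotation
t = [0.0, 0.0, 5.0]                                      # camera 5 units away
pixel = project_point([0.5, -0.25, 0.0], K, R, t)        # (420.0, 190.0)
```

The division by `w` is what distinguishes the full perspective model from orthographic or weak-perspective approximations, where depth does not vary per joint.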
9 Mathematical derivation of the intrinsic parameters from the image
The scaling parameters and in the paper are computed as follows:
where is the mean length of the vector from the root joint to the head joint, as in [49]. are the width and height of the bounding box in pixels prior to scaling it to .
Taking inspiration from [31], we estimate the focal length starting from the uncropped input image width and height as follows:
and scale it along each axis
The principal point is computed as:
where , . and are the pixel coordinates of the top-left corner of the unscaled bounding box in the full-size image.
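The following sketch illustrates one way such intrinsics can be assembled from image and bounding-box geometry. It assumes the image-diagonal heuristic for the focal length in the spirit of [31], and applies the standard crop-and-resize adjustment to the focal length and principal point. The function names, the choice of the image center as the full-image principal point, and the square output size `out_size` are assumptions for illustration and may differ from the exact equations used in the paper.

```python
import math


def estimate_focal_length(img_w, img_h):
    """Heuristic focal length in pixels: the diagonal of the uncropped image."""
    return math.hypot(img_w, img_h)


def crop_intrinsics(img_w, img_h, box_x0, box_y0, box_w, box_h, out_size):
    """Intrinsics of a bounding-box crop resized to out_size x out_size.

    (box_x0, box_y0) is the top-left corner of the box in the full image.
    Assumes the full-image principal point lies at the image center.
    """
    f = estimate_focal_length(img_w, img_h)
    sx = out_size / box_w           # horizontal resize factor
    sy = out_size / box_h           # vertical resize factor
    fx, fy = f * sx, f * sy         # focal length scaled along each axis
    cx = (img_w / 2.0 - box_x0) * sx  # principal point shifted into the crop
    cy = (img_h / 2.0 - box_y0) * sy
    return fx, fy, cx, cy


# Hypothetical 640x480 image with a 200x200 box whose corner is at (220, 140).
fx, fy, cx, cy = crop_intrinsics(640, 480, 220, 140, 200, 200, out_size=256)
```

Shifting the principal point by the box corner keeps the crop consistent with the original camera, so a person far from the image center is not treated as if they were on the optical axis.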
10 Mathematical derivation of the extrinsic parameters from the capsule
When predicting capsules, we estimate from which we can compute the rotation matrix and translation vector with respect to the input image’s viewpoint.
Starting from we compute the average vector . is expressed as , where and are the Euler rotation angles about the and world-space axes and is the distance from the pelvis to the camera along the Z axis. We discard the rotation along the axis because it would lead to infinitely many configurations of the predicted camera and 3D pose. Thus, we set .
Starting from , we can derive the rotation matrix as:
Starting from , we can derive the translation vector. We define a setting in which the pelvis is centered in in world coordinates and in on the image plane. Thus we can express the relation between the pelvis in 2D and 3D as:
from which we can obtain
thus deriving
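The derivation above can be sketched numerically as follows. The example assumes that the two retained Euler angles are rotations about the x and y axes, that they are composed as a y-rotation applied after an x-rotation, and that the remaining angle is fixed to zero; these choices, together with all function names and values, are assumptions for illustration. The translation follows from the pinhole model with zero skew: a pelvis at the world origin maps to the camera-space point `t`, which must project onto the given image location.

```python
import math


def mat_mul(a, b):
    """Product of two 3x3 matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rotation_from_angles(angle_x, angle_y):
    """R = R_y(angle_y) R_x(angle_x); the third Euler angle is fixed to zero."""
    cx, sx = math.cos(angle_x), math.sin(angle_x)
    cy, sy = math.cos(angle_y), math.sin(angle_y)
    rot_x = [[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]]
    rot_y = [[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]]
    return mat_mul(rot_y, rot_x)


def translation_from_pelvis(u, v, depth, fx, fy, cx, cy):
    """Translation placing a pelvis at the world origin onto pixel (u, v).

    With X = 0 the camera-space point is t itself, so
    u = fx * t_x / t_z + cx and v = fy * t_y / t_z + cy with t_z = depth.
    """
    return [(u - cx) * depth / fx, (v - cy) * depth / fy, depth]


R = rotation_from_angles(math.radians(10.0), math.radians(-25.0))
t = translation_from_pelvis(u=140.0, v=120.0, depth=5.0,
                            fx=1024.0, fy=1024.0, cx=128.0, cy=128.0)
```

Reprojecting `t` with the same intrinsics returns the pixel `(140, 120)`, which is a quick consistency check for the recovered translation.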
11 Optimization of Loss Function Weights
In RegNet and LiftNet, we encounter a discrepancy in the scale of different components of the loss function. Specifically, certain losses such as and are formulated as negative-log-likelihoods, resulting in predominantly highly negative values. In contrast, other components of the loss function yield positive values close to zero. To facilitate a more stable and efficient optimization process, we implement a scaling mechanism for and to align their value ranges with the other loss components. In more detail, we shift the loss function vertically by its expected lowest value and divide it by its expected vertical span, to make the scaled loss fit the interval .
Additionally, we introduce a non-trainable hyperparameter , set to , to strategically weight certain loss components more heavily in the overall loss function. This approach allows for greater emphasis on key aspects of the learning objective.
For RegNet, we apply this weighting strategy to , recognizing its critical role in the network’s performance. In the case of LiftNet, the weights are applied to and . This ensures the integrity of the pose cycle consistency and maintains the correct proportions of the 3D skeletal model in terms of bone lengths.
In both RegNet and LiftNet, is treated as a regularization term. Accordingly, it is scaled down by a factor of , reflecting its role in preventing overfitting and ensuring generalization, rather than directly driving the primary learning objective.
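The scaling and weighting scheme described in this section can be sketched as below. All numeric values (the expected bounds of the negative log-likelihood, the emphasis weight, and the regularizer weight) are placeholders chosen for illustration, since they depend on the trained models; the function names are likewise hypothetical.

```python
def rescale_loss(value, expected_min, expected_max):
    """Shift a loss by its expected minimum and divide by its expected span."""
    span = expected_max - expected_min
    if span <= 0:
        raise ValueError("expected_max must exceed expected_min")
    return (value - expected_min) / span


def combine_losses(plain, emphasized, regularizer,
                   emphasis_weight, regularizer_weight):
    """Weighted sum: plain terms keep weight 1, emphasized terms are
    scaled up, and the regularization term is scaled down."""
    return (sum(plain)
            + emphasis_weight * sum(emphasized)
            + regularizer_weight * regularizer)


# Placeholder values: a raw NLL of -150 expected to lie in [-200, 0].
scaled_nll = rescale_loss(-150.0, expected_min=-200.0, expected_max=0.0)
total = combine_losses(plain=[scaled_nll, 0.02],
                       emphasized=[0.04],
                       regularizer=0.5,
                       emphasis_weight=10.0,     # placeholder, > 1
                       regularizer_weight=0.1)   # placeholder, < 1
```

Rescaling first means the emphasis and regularization weights act on terms of comparable magnitude, so a single weight has a similar effect regardless of which loss it multiplies.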
12 Additional qualitative results
In Fig. 6 we show additional qualitative results for LiftNet. Our method shows promising results even when the original annotation is not correct (see Fig. 7), since the estimated camera () and the estimated 2D pose () that RegNet gives as output are robust enough to provide a good pseudo-ground truth.
![Refer to caption](x6.png)
![Refer to caption](x7.png)
References
- [1] Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., Joulin, A.: Unsupervised learning of visual features by contrasting cluster assignments. Advances in neural information processing systems 33, 9912–9924 (2020)
- [2] Chen, C.H., Tyagi, A., Agrawal, A., Drover, D., Mv, R., Stojanov, S., Rehg, J.M.: Unsupervised 3d pose estimation with geometric self-supervision. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2019)
- [3] Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: Simclr: A simple framework for contrastive learning of visual representations. In: International Conference on Learning Representations. vol. 2 (2020)
- [4] Chen, Y., Wang, Z., Peng, Y., Zhang, Z., Yu, G., Sun, J.: Cascaded pyramid network for multi-person pose estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7103–7112 (2018)
- [5] Chun, S., Park, S., Chang, J.Y.: Learnable human mesh triangulation for 3d human pose and shape estimation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 2850–2859 (2023)
- [6] Dabral, R., Mundhada, A., Kusupati, U., Afaque, S., Sharma, A., Jain, A.: Learning 3d human pose from structure and motion. In: Proceedings of the European conference on computer vision (ECCV). pp. 668–683 (2018)
- [7] Dinh, L., Sohl-Dickstein, J., Bengio, S.: Density estimation using real nvp. arXiv preprint arXiv:1605.08803 (2016)
- [8] Drover, D., MV, R., Chen, C.H., Agrawal, A., Tyagi, A., Phuoc Huynh, C.: Can 3d pose be learned from 2d projections alone? In: Proceedings of the European Conference on Computer Vision (ECCV) Workshops (2018)
- [9] Fish Tung, H.Y., Harley, A.W., Seto, W., Fragkiadaki, K.: Adversarial inverse graphics networks: Learning 2d-to-3d lifting and image-to-image translation from unpaired supervision. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 4354–4362 (2017)
- [10] Gholami, M., Wandt, B., Rhodin, H., Ward, R., Wang, Z.J.: Adaptpose: Cross-dataset adaptation for 3d human pose estimation by learnable motion generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13075–13085 (2022)
- [11] Gong, K., Zhang, J., Feng, J.: Poseaug: A differentiable pose augmentation framework for 3d human pose estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8575–8584 (2021)
- [12] Habibie, I., Xu, W., Mehta, D., Pons-Moll, G., Theobalt, C.: In the wild human pose estimation using explicit 2d features and intermediate 3d representations. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10905–10914 (2019)
- [13] Hardy, P., Kim, H.: Unsupervised reconstruction of 3d human pose interactions from 2d poses alone. arXiv preprint arXiv:2309.14865 (2023)
- [14] Hartley, R., Zisserman, A.: Multiple view geometry in computer vision. Cambridge university press (2003)
- [15] Hu, X., Ahuja, N.: Unsupervised 3d pose estimation for hierarchical dance video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 11015–11024 (2021)
- [16] Huang, B., Shu, Y., Ju, J., Wang, Y.: Occluded human body capture with self-supervised spatial-temporal motion prior. arXiv preprint arXiv:2207.05375 (2022)
- [17] Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE transactions on pattern analysis and machine intelligence 36(7), 1325–1339 (2013)
- [18] Ji, X., Fang, Q., Dong, J., Shuai, Q., Jiang, W., Zhou, X.: A survey on monocular 3d human pose estimation. Virtual Reality & Intelligent Hardware 2(6), 471–500 (2020)
- [19] Joo, H., Liu, H., Tan, L., Gui, L., Nabbe, B., Matthews, I., Kanade, T., Nobuhara, S., Sheikh, Y.: Panoptic studio: A massively multiview system for social motion capture. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 3334–3342 (2015)
- [20] Kingma, D.P., Dhariwal, P.: Glow: Generative flow with invertible 1x1 convolutions. Advances in neural information processing systems 31 (2018)
- [21] Kissos, I., Fritz, L., Goldman, M., Meir, O., Oks, E., Kliger, M.: Beyond weak perspective for monocular 3d human pose estimation. In: ECCV Workshops (2020)
- [22] Kobyzev, I., Prince, S.J., Brubaker, M.A.: Normalizing flows: An introduction and review of current methods. IEEE transactions on pattern analysis and machine intelligence 43(11), 3964–3979 (2020)
- [23] Kocabas, M., Huang, C.H.P., Tesch, J., Müller, L., Hilliges, O., Black, M.J.: Spec: Seeing people in the wild with an estimated camera. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 11035–11045 (2021)
- [24] Kocabas, M., Karagoz, S., Akbas, E.: Self-supervised learning of 3d human pose using multi-view geometry. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 1077–1086 (2019)
- [25] Kosiorek, A., Sabour, S., Teh, Y.W., Hinton, G.E.: Stacked capsule autoencoders. Advances in neural information processing systems 32 (2019)
- [26] Kundu, J.N., Seth, S., Jamkhandi, A., YM, P., Jampani, V., Chakraborty, A., et al.: Non-local latent relation distillation for self-adaptive 3d human pose estimation. Advances in Neural Information Processing Systems 34, 158–171 (2021)
- [27] Kundu, J.N., Seth, S., Jampani, V., Rakesh, M., Babu, R.V., Chakraborty, A.: Self-supervised 3d human pose estimation via part guided novel image synthesis. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6152–6162 (2020)
- [28] Kundu, J.N., Seth, S., YM, P., Jampani, V., Chakraborty, A., Babu, R.V.: Uncertainty-aware adaptation for self-supervised 3d human pose estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 20448–20459 (2022)
- [29] Li, J., Bian, S., Zeng, A., Wang, C., Pang, B., Liu, W., Lu, C.: Human pose regression with residual log-likelihood estimation. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 11025–11034 (2021)
- [30] Li, W., Liu, H., Tang, H., Wang, P., Van Gool, L.: Mhformer: Multi-hypothesis transformer for 3d human pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13147–13156 (2022)
- [31] Li, Z., Liu, J., Zhang, Z., Xu, S., Yan, Y.: Cliff: Carrying location information in full frames into human pose and shape estimation. In: European Conference on Computer Vision. pp. 590–606. Springer (2022)
- [32] Liu, W., Bao, Q., Sun, Y., Mei, T.: Recent advances of monocular 2d and 3d human pose estimation: a deep learning perspective. ACM Computing Surveys 55(4), 1–41 (2022)
- [33] Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: A skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia) 34(6), 248:1–248:16 (Oct 2015)
- [34] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv (2018)
- [35] Martinez, J., Hossain, R., Romero, J., Little, J.J.: A simple yet effective baseline for 3d human pose estimation. In: Proceedings of the IEEE international conference on computer vision. pp. 2640–2649 (2017)
- [36] Mehta, D., Rhodin, H., Casas, D., Fua, P., Sotnychenko, O., Xu, W., Theobalt, C.: Monocular 3d human pose estimation in the wild using improved cnn supervision. In: 3D Vision (3DV), 2017 Fifth International Conference on. IEEE (2017). https://doi.org/10.1109/3dv.2017.00064, http://gvv.mpi-inf.mpg.de/3dhp_dataset
- [37] Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII 14. pp. 483–499. Springer (2016)
- [38] Pavlakos, G., Zhou, X., Daniilidis, K.: Ordinal depth supervision for 3d human pose estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7307–7316 (2018)
- [39] Pavlakos, G., Zhou, X., Derpanis, K.G., Daniilidis, K.: Coarse-to-fine volumetric prediction for single-image 3d human pose. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7025–7034 (2017)
- [40] Pietak, A., Ma, S., Beck, C.W., Stringer, M.D.: Fundamental ratios and logarithmic periodicity in human limb bones. Journal of anatomy 222(5), 526–537 (2013)
- [41] Rhodin, H., Salzmann, M., Fua, P.: Unsupervised geometry-aware representation for 3d human pose estimation. In: Proceedings of the European conference on computer vision (ECCV). pp. 750–767 (2018)
- [42] Rhodin, H., Spörri, J., Katircioglu, I., Constantin, V., Meyer, F., Müller, E., Salzmann, M., Fua, P.: Learning monocular 3d human pose estimation from multi-view images. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 8437–8446 (2018)
- [43] Sabour, S., Tagliasacchi, A., Yazdani, S., Hinton, G., Fleet, D.J.: Unsupervised part representation by flow capsules. In: International Conference on Machine Learning. pp. 9213–9223. PMLR (2021)
- [44] Shan, W., Liu, Z., Zhang, X., Wang, Z., Han, K., Wang, S., Ma, S., Gao, W.: Diffusion-based 3d human pose estimation with multi-hypothesis aggregation. arXiv preprint arXiv:2303.11579 (2023)
- [45] Sun, X., Shang, J., Liang, S., Wei, Y.: Compositional human pose regression. In: Proceedings of the IEEE international conference on computer vision. pp. 2602–2611 (2017)
- [46] Tripathi, S., Ranade, S., Tyagi, A., Agrawal, A.: Posenet3d: Learning temporally consistent 3d human pose via knowledge distillation. In: 2020 International Conference on 3D Vision (3DV). pp. 311–321. IEEE (2020)
- [47] Usman, B., Tagliasacchi, A., Saenko, K., Sud, A.: Metapose: Fast 3d pose from multiple views without 3d supervision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6759–6770 (2022)
- [48] Von Marcard, T., Henschel, R., Black, M.J., Rosenhahn, B., Pons-Moll, G.: Recovering accurate 3d human pose in the wild using imus and a moving camera. In: Proceedings of the European conference on computer vision (ECCV). pp. 601–617 (2018)
- [49] Wandt, B., Little, J.J., Rhodin, H.: Elepose: Unsupervised 3d human pose estimation by predicting camera elevation and learning normalizing flows on 2d poses. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6635–6645 (2022)
- [50] Wandt, B., Rosenhahn, B.: Repnet: Weakly supervised training of an adversarial reprojection network for 3d human pose estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 7782–7791 (2019)
- [51] Wandt, B., Rudolph, M., Zell, P., Rhodin, H., Rosenhahn, B.: Canonpose: Self-supervised monocular 3d human pose estimation in the wild. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 13294–13304 (2021)
- [52] Xu, Y., Zhang, J., Zhang, Q., Tao, D.: Vitpose: Simple vision transformer baselines for human pose estimation. arXiv preprint arXiv:2204.12484 (2022)
- [53] Yang, S., Quan, Z., Nie, M., Yang, W.: Transpose: Keypoint localization via transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 11802–11812 (2021)
- [54] Yu, Z., Ni, B., Xu, J., Wang, J., Zhao, C., Zhang, W.: Towards alleviating the modeling ambiguity of unsupervised monocular 3d human pose estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8651–8660 (2021)
- [55] Zhang, J., Tu, Z., Yang, J., Chen, Y., Yuan, J.: Mixste: Seq2seq mixed spatio-temporal encoder for 3d human pose estimation in video. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 13232–13242 (2022)
- [56] Zhang, S., Wang, C., Dong, W., Fan, B.: A survey on depth ambiguity of 3d human pose estimation. Applied Sciences 12(20), 10591 (2022)
- [57] Zheng, C., Wu, W., Chen, C., Yang, T., Zhu, S., Shen, J., Kehtarnavaz, N., Shah, M.: Deep learning-based human pose estimation: A survey. ACM Computing Surveys 56(1), 1–37 (2023)
- [58] Zhu, W., Ma, X., Liu, Z., Liu, L., Wu, W., Wang, Y.: Motionbert: A unified perspective on learning human motion representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 15085–15099 (2023)