DPoser: Diffusion Model as
Robust 3D Human Pose Prior
Abstract
This work aims to construct a robust human pose prior, which remains a persistent challenge due to biomechanical constraints and the diversity of human movements. Traditional priors like VAEs and NDFs often exhibit shortcomings in realism and generalization, notably with unseen noisy poses. To address these issues, we introduce DPoser, a robust and versatile human pose prior built upon diffusion models. DPoser regards various pose-centric tasks as inverse problems and employs variational diffusion sampling to solve them efficiently. Designed to plug into optimization frameworks, DPoser benefits human mesh recovery, pose generation, pose completion, and motion denoising. Furthermore, because articulated poses differ from structured images, we propose truncated timestep scheduling to enhance the effectiveness of DPoser. Compared with the uniform scheduling common in image domains, our approach yields improvements of 5.4%, 17.2%, and 3.8% on human mesh recovery, pose completion, and motion denoising, respectively. Comprehensive experiments demonstrate the superiority of DPoser over existing state-of-the-art pose priors across multiple tasks.
Corresponding authors: Yulun Zhang and Haoqian Wang.
Keywords: Human Pose Prior, Diffusion Model
1 Introduction
Accurate modeling of human pose is a fundamental research topic that can benefit various applications, from human-robot interaction to augmented and virtual reality experiences. Many real-world applications rely on a prior distribution of valid human poses to perform tasks like body model fitting, motion capture, and gesture recognition. The complexity of human biomechanics, coupled with the extensive kinematic variability in movement patterns, presents a significant challenge in constructing a robust and realistic human pose prior.
Previous efforts to model human pose prior have mainly employed techniques such as Gaussian Mixture Models (GMMs) [1], Variational Autoencoders (VAEs) [37], and Neural Distance Fields (NDFs) [49]. Each technique, however, faces its own set of limitations. GMMs, for instance, might lead to the generation of implausible poses due to their unbounded nature. VAEs, restricted by their Gaussian assumptions, tend to generate average poses that may not accurately capture the full spectrum of human actions. Meanwhile, NDFs have shown promise in 3D surface modeling but struggle with generalizing across the complex, high-dimensional landscape of human pose manifolds. These limitations highlight a pressing need for a more comprehensive and dependable approach to modeling human pose priors, an endeavor this work seeks to address.
Recently, diffusion models [16, 48, 11, 20] have gained traction for their ability to capture complex, high-dimensional data distributions and to support versatile sampling techniques. They have been applied to generating lifelike human motion sequences [56, 42] and to multi-hypothesis pose estimation from 2D inputs [17, 8]. However, these models are designed for specific generation tasks or tailored to conditional input data, which limits their applicability in broader contexts. The potential of diffusion models as a universal human pose prior remains largely untapped, and how to optimize effectively with such a prior across diverse tasks remains an open question.
In this work, we propose DPoser, a novel approach that leverages a time-dependent denoiser learned from expansive motion capture datasets to construct a robust human pose prior. We regard various pose-centric tasks as inverse problems and integrate DPoser via variational diffusion sampling techniques [33] as a regularization component within optimization frameworks like SMPLify [1]. Furthermore, our investigations reveal that significant pose-related information during diffusion is predominantly located at the latter stages of the diffusion trajectory. This observation motivates a novel truncated timestep scheduling strategy for optimization. Our method outperforms the standard uniform scheduling, showing gains of 5.4%, 17.2%, and 3.8% in human mesh recovery, pose completion, and motion denoising, respectively.
In summary, our main contributions are as follows:
• We introduce DPoser, a novel framework based on diffusion models to craft a robust and flexible human pose prior, geared for seamless integration across diverse pose-related tasks via test-time optimization.
• We analyze the impact of diffusion timesteps in the pose domain and propose truncated scheduling for more efficient optimization.
• Through extensive experiments, we establish that DPoser outshines state-of-the-art (SOTA) pose priors in a variety of downstream tasks.
2 Related Work
2.1 Human Pose Priors
Human body models such as SMPL [31] serve as powerful tools for parameterizing both pose and shape, thereby offering a comprehensive framework for describing human gestures. Within the SMPL model, body poses are captured using rotation matrices or joint angles linked to a kinematic skeleton. Adjusting these parameters enables the representation of a diverse range of human actions. Nonetheless, feeding unrealistic poses into these models can result in non-viable human figures, primarily because plausible human poses are confined within a complex, high-dimensional manifold due to biomechanical constraints.
Various strategies [1, 37, 49, 9] have been put forward to build human pose priors. Generative frameworks like GMMs, VAEs [22], and Generative Adversarial Networks (GANs) [13] have shown promise in encapsulating the multifaceted pose distribution, facilitating advancements in tasks like human mesh recovery [19, 12]. Further, some studies have delved into conditional pose priors tailored to specific tasks, incorporating extra information such as image features [39, 3], 2D joint coordinates [8], or sequences of preceding poses [28, 40]. Our initiative leans towards an unconditional pose prior approach, training DPoser on extensive motion capture data without relying on additional inputs like images or text, aiming for a versatile application across various pose-related scenarios.
2.2 Diffusion Models for Pose-centric Tasks
Diffusion models [47, 48, 16, 44] have emerged as powerful tools for capturing intricate data distributions, aligning particularly well with the demands of multi-hypothesis estimation in ambiguous human poses. Notable works include DiffPose [17], which leverages a Gaussian Mixture Model-guided forward diffusion process [36] and employs a Graph Convolutional Network (GCN) [23] architecture conditioned on 2D pose sequences for 3D pose estimation through a learned reverse process (i.e., generation). In a similar vein, DiffusionPose [39] and GFPose [8] employ the generation-based pipeline but take different approaches to conditioning. Further, ZeDO [18] concentrates on 2D-to-3D pose lifting, while Diff-HMR [3] and DiffHand [24] explore estimating SMPL parameters and hand mesh vertices, respectively. BUDDI [34] stands out for using diffusion models to capture the joint distribution of interacting individuals and leveraging an SDS loss [38, 52] for optimization at test time.
While DPoser shares a similar optimization implementation with BUDDI, it sets itself apart by introducing a wider perspective of inverse problems and equipping an innovative timestep scheduling strategy tailored to the characteristics of human poses. Unlike other approaches [18, 39, 8, 17] that primarily focus on 3D location-based representation, DPoser takes on the more demanding task of modeling SMPL-based rotation pose representation. This adds complexity due to the intricacies involved in representing rotations, positioning DPoser as a more versatile solution within the realm of pose-centric tasks.
3 Methods
3.1 Preliminary: Score-based Diffusion Models
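Diffusion models [43, 47, 48, 16] operationalize generative processes by inverting a predefined forward diffusion process, typically formulated as a linear stochastic differential equation (SDE). Formally, the data trajectory $\{\mathbf{x}(t)\}_{t \in [0,1]}$ follows the forward SDE given by:
$$\mathrm{d}\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}, \tag{1}$$
where $\mathbf{f}(\cdot, t)$ and $g(t)$ represent the drift and diffusion coefficients, while $\mathbf{w}$ is a standard Wiener process.
The affine drift coefficients ensure analytically tractable Gaussian perturbation kernels, denoted by $p_{0t}(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \alpha_t \mathbf{x}_0, \sigma_t^2 \mathbf{I})$, where the exact coefficients $\alpha_t$ and $\sigma_t$ can be obtained with standard techniques [41]. Using appropriately designed $\mathbf{f}$ and $g$, this allows the data distribution $p_0$ to morph into a tractable isotropic Gaussian distribution $p_1 \approx \mathcal{N}(\mathbf{0}, \mathbf{I})$ via forward diffusion.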
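To recover the data distribution $p_0$ from the Gaussian distribution $p_1$, we can simulate the corresponding reverse SDE of Eq. (1) [48]:
$$\mathrm{d}\mathbf{x} = \left[\mathbf{f}(\mathbf{x}, t) - g(t)^2\, \nabla_{\mathbf{x}} \log p_t(\mathbf{x})\right]\mathrm{d}t + g(t)\,\mathrm{d}\bar{\mathbf{w}}, \tag{2}$$
where $\bar{\mathbf{w}}$ is a Wiener process running backward in time. The so-called score function [29], $\nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t)$, serves as an unknown term in Eq. (2) and can be approximated by a neural network parameterized as $-\boldsymbol{\epsilon}_\theta(\mathbf{x}_t; t)/\sigma_t$. (Footnote 1: This parameterization is obtained from the deep connection between the noise prediction in diffusion models and score function estimation in score-based models. We provide a brief recap in the Appendix.) To learn the score functions, employing denoising score matching techniques [50], we perturb the data points with noise as per: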
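$$\mathbf{x}_t = \alpha_t \mathbf{x}_0 + \sigma_t \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}). \tag{3}$$
Subsequently, feeding $\mathbf{x}_t$ and $t$ as input, we train the time-dependent noise predictor $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t; t)$ using an L2-loss defined as [16]:
$$\mathbb{E}_{t,\,\mathbf{x}_0,\,\boldsymbol{\epsilon}}\left[ w(t)\, \lVert \boldsymbol{\epsilon}_\theta(\mathbf{x}_t; t) - \boldsymbol{\epsilon} \rVert_2^2 \right], \tag{4}$$
where $w(t)$ denotes a positive weighting function.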
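As a minimal sketch of the perturbation in Eq. (3) and the noise-prediction loss in Eq. (4), the snippet below evaluates the loss for a single timestep. It writes the mean and standard-deviation coefficients of the perturbation kernel as `alpha_t` and `sigma_t` and passes them in as plain numbers rather than deriving them from a particular SDE. The names `perturb`, `dsm_loss`, and `zero_predictor` are hypothetical, the 63-dimensional input is only an illustrative pose size, and the placeholder predictor stands in for a trained network.

```python
import numpy as np

def perturb(x0, alpha_t, sigma_t, rng):
    """Perturb clean data: x_t = alpha_t * x0 + sigma_t * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(x0.shape)
    return alpha_t * x0 + sigma_t * eps, eps

def dsm_loss(predictor, x0, t, alpha_t, sigma_t, rng, weight=1.0):
    """Weighted squared error between predicted and true noise at one timestep."""
    x_t, eps = perturb(x0, alpha_t, sigma_t, rng)
    eps_hat = predictor(x_t, t)
    return weight * np.mean(np.sum((eps_hat - eps) ** 2, axis=-1))

def zero_predictor(x_t, t):
    """Placeholder for a trained network: always predicts zero noise."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((256, 63))  # illustrative batch of 63-D pose vectors
loss = dsm_loss(zero_predictor, x0, t=0.5, alpha_t=0.8, sigma_t=0.6, rng=rng)
```

A predictor that always outputs zero incurs a loss close to the input dimension, while a predictor that recovers the injected noise exactly drives the loss to zero.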
Upon successful training, the score functions can be estimated and used to solve the reverse SDE (Eq. (2)). Through techniques like Euler-Maruyama discretization, we can generate novel samples by simulating the reverse SDE.
3.2 Learning Pose Prior with Unconditional Diffusion Models
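SMPL-based pose representation. To build a flexible 3D human pose prior, we propose to utilize the SMPL body model [31], which can be viewed as a differentiable function $\mathcal{M}(\theta, \beta)$ that maps body joint angles $\theta$ and shape parameters $\beta$ to mesh vertices and joint positions. Our target is to model the distribution of joint angles $\theta$.
Training of unconditional diffusion models. To this end, we adopt an unconditional diffusion model to learn the pose representation $\theta$ (i.e., $\mathbf{x}_0 = \theta$). This approach aligns with a task-agnostic strategy, focusing solely on the distribution of 3D poses. We employ sub-VP SDEs as outlined in [48], which have demonstrated efficacy in sampling quality, for constructing our diffusion model. Specifically, our chosen forward SDE (Eq. (1)) is given by:
$$\mathrm{d}\mathbf{x} = -\tfrac{1}{2}\beta(t)\,\mathbf{x}\,\mathrm{d}t + \sqrt{\beta(t)\left(1 - e^{-2\int_0^t \beta(s)\,\mathrm{d}s}\right)}\,\mathrm{d}\mathbf{w}, \tag{5}$$
where $\beta(t)$ denotes linear scheduled noise scales (not to be confused with the SMPL shape parameters $\beta$). The coefficients needed in Eq. (3) can be obtained as $\alpha_t = e^{-\frac{1}{2}\int_0^t \beta(s)\,\mathrm{d}s}$ and $\sigma_t = 1 - e^{-\int_0^t \beta(s)\,\mathrm{d}s}$.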
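To make the role of the sub-VP coefficients concrete, the following minimal sketch simulates the reverse SDE with Euler-Maruyama steps on a one-dimensional toy problem. It assumes a linear schedule with endpoints 0.1 and 20 (illustrative values, not taken from this paper) and the sub-VP kernel coefficients `alpha_t = exp(-B(t)/2)` and `sigma_t = 1 - exp(-B(t))`, where `B(t)` is the integrated noise scale. The data distribution is a Gaussian whose score is known in closed form, so `toy_score` stands in for a trained network. All function names are hypothetical.

```python
import numpy as np

BETA_MIN, BETA_MAX = 0.1, 20.0  # assumed endpoints of the linear schedule

def beta(t):
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)

def int_beta(t):
    """Integral of beta(s) from 0 to t."""
    return BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t ** 2

def kernel_coeffs(t):
    """Sub-VP perturbation kernel: mean scale alpha_t and std sigma_t."""
    b = int_beta(t)
    return np.exp(-0.5 * b), 1.0 - np.exp(-b)

def toy_score(x, t, mean=2.0, std=0.5):
    """Exact score of p_t when the data distribution is N(mean, std^2)."""
    alpha, sigma = kernel_coeffs(t)
    var_t = (alpha * std) ** 2 + sigma ** 2
    return -(x - alpha * mean) / var_t

def reverse_sde_sample(score_fn, n_samples, n_steps=1000, t_end=1e-3, seed=0):
    """Euler-Maruyama simulation of the reverse SDE from t = 1 down to t_end."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_samples)           # start from the Gaussian prior
    ts = np.linspace(1.0, t_end, n_steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        dt = t - t_next                          # positive step size
        drift = -0.5 * beta(t) * x
        g2 = beta(t) * (1.0 - np.exp(-2.0 * int_beta(t)))
        noise = rng.standard_normal(n_samples)
        x = x - (drift - g2 * score_fn(x, t)) * dt + np.sqrt(g2 * dt) * noise
    return x

samples = reverse_sde_sample(toy_score, n_samples=5000)
```

With the exact score, the samples should approximately follow the toy data distribution (mean 2, standard deviation 0.5). A learned noise predictor would replace `toy_score` through the score parameterization described in Sec. 3.1.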
3.3 Optimization Leveraging Diffusion Priors
The acquired score functions or noise predictors, denoted as $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t; t)$, permit the direct generation of plausible poses through Eq. (2). Yet, the broader integration of diffusion priors into general optimization frameworks remains an open avenue. We address this by reframing pose-related tasks as inverse problems and applying variational diffusion sampling techniques [33] for efficient resolution.
Inverse problem formulation. Consider an original signal $\mathbf{x}_0$. Inverse problems can be written as:
$$\mathbf{y} = \mathcal{A}(\mathbf{x}_0) + \mathbf{n}, \tag{6}$$
where $\mathcal{A}$ symbolizes the measurement operator and $\mathbf{n}$ constitutes noise, assumed to be white Gaussian, $\mathbf{n} \sim \mathcal{N}(\mathbf{0}, \sigma_v^2 \mathbf{I})$. In the context targeted in this study, $\mathbf{x}_0$ always refers to body poses in SMPL [31]. This formulation allows us to approach various pose-centric tasks by adapting $\mathcal{A}$ and interpreting $\mathbf{y}$ accordingly:
• Pose completion: Here, $\mathcal{A}$ serves as a mask matrix to simulate partially observed poses, with $\mathbf{y}$ being the incomplete pose data.
• Motion denoising: In this scenario, $\mathcal{A}$ applies SMPL's forward kinematics, treating $\mathbf{y}$ as the observed noisy 3D joints.
• Human mesh recovery: $\mathcal{A}$ integrates SMPL's forward kinematics and camera projection to relate $\mathbf{x}_0$ to 2D joint observations $\mathbf{y}$ in images.
The aim is to recover the original signal $\mathbf{x}_0$ from $\mathbf{y}$; within the Bayesian framework, our objective shifts to sampling from the posterior distribution $p(\mathbf{x}_0 \mid \mathbf{y})$.
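The pose completion case of the measurement model can be sketched as follows, writing the measurement operator as `A`, the clean pose as `x0`, and the observation as `y`. This is only an illustration under stated assumptions: the pose is a flat 63-dimensional vector (21 joints with 3 axis-angle values each, an illustrative size), the mask hides an arbitrary set of entries, and the helper names `make_mask_operator` and `observe` are hypothetical.

```python
import numpy as np

def make_mask_operator(visible):
    """Return A(x), a linear mask that keeps only the entries flagged visible."""
    visible = np.asarray(visible, dtype=bool)

    def A(x):
        return x[..., visible]

    return A

def observe(A, x0, noise_std, rng):
    """Simulate y = A(x0) + n with white Gaussian noise n."""
    clean = A(x0)
    return clean + noise_std * rng.standard_normal(clean.shape)

rng = np.random.default_rng(0)
x0 = rng.standard_normal(63)        # illustrative pose vector
visible = np.ones(63, dtype=bool)
visible[:9] = False                 # hide the first three joints (arbitrary choice)
A = make_mask_operator(visible)
y = observe(A, x0, noise_std=0.0, rng=rng)
```

For motion denoising or human mesh recovery, `A` would instead wrap forward kinematics (and camera projection), which requires a body model and is not reproduced here.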
Solving inverse problems with diffusion models. Various techniques [14, 21, 6, 7, 45, 33] have been explored to simulate this posterior sampling process based on unconditional diffusion priors $p(\mathbf{x}_0)$. Among them, the sampling-based scheme is widely explored and applied in tasks like image restoration. These methods incorporate the observation information $\mathbf{y}$ into the generation process of $\mathbf{x}_t$ through techniques like gradient guidance [6, 7] and back projection [48, 21, 7]. However, such methods rooted in generation are inconvenient for handling diverse pose-related tasks. To navigate these challenges, we adopt variational diffusion sampling [33] to build general optimization frameworks. Specifically, it employs a variational distribution $q(\mathbf{x}_0 \mid \mathbf{y}) = \mathcal{N}(\boldsymbol{\mu}, \sigma^2 \mathbf{I})$ and aims to minimize the Kullback-Leibler (KL) divergence between this variational distribution and the true posterior, mathematically expressed as $\mathrm{KL}\left(q(\mathbf{x}_0 \mid \mathbf{y}) \,\Vert\, p(\mathbf{x}_0 \mid \mathbf{y})\right)$. Further, under the assumption of zero variance ($\sigma \to 0$), the optimization problem of seeking $\boldsymbol{\mu}$ (i.e., $\mathbf{x}_0$) can be formulated as minimizing [46, 33]:
$$\frac{\lVert \mathbf{y} - \mathcal{A}(\mathbf{x}_0) \rVert_2^2}{2\sigma_v^2} + \mathbb{E}_{t,\,\boldsymbol{\epsilon}}\left[ w_t\, \mathrm{sg}\!\left[\boldsymbol{\epsilon}_\theta(\mathbf{x}_t; t) - \boldsymbol{\epsilon}\right]^{\top} \mathbf{x}_0 \right], \tag{7}$$
where $w_t$ denotes the loss weights and $\boldsymbol{\epsilon}$ is sampled from the standard Gaussian distribution. Here, $\mathrm{sg}[\cdot]$ signifies the stopped-gradient operator, indicating that backpropagation through the trained diffusion models is not required. The optimization procedure initiates by selecting a timestep $t$ and applying a perturbation to the target $\mathbf{x}_0$ as per Eq. (3), resulting in $\mathbf{x}_t$. Subsequently, the gradients are applied to the optimization variable $\mathbf{x}_0$. In a nutshell, this framework [33] provides a flexible yet robust strategy for employing diffusion priors in generic optimization problems, serving as a cornerstone for our work.
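Introducing DPoser regularization. To shed more light on the working mechanism, we propose to reformulate the regularization term as:
$$L_{\text{DPoser}} = w_t\, \lVert \mathbf{x}_0 - \mathrm{sg}[\hat{\mathbf{x}}_0(t)] \rVert_2^2, \tag{8}$$
$$\hat{\mathbf{x}}_0(t) = \frac{\mathbf{x}_t - \sigma_t\, \boldsymbol{\epsilon}_\theta(\mathbf{x}_t; t)}{\alpha_t}. \tag{9}$$
Here, $\hat{\mathbf{x}}_0(t)$ functions as a one-step denoising prediction using the diffusion model $\boldsymbol{\epsilon}_\theta$. This approach effectively encourages the current pose towards a denoised, plausible pose distribution, employing a straightforward L2-loss within the DPoser regularization framework. Further, the theoretical foundation of our regularization demonstrates its alignment with the gradient direction of variational diffusion sampling (Eq. (7)).
Proof: Differentiating Eq. (8) with respect to $\mathbf{x}_0$ and substituting Eqs. (3) and (9) yields:
$$\nabla_{\mathbf{x}_0} L_{\text{DPoser}} = 2 w_t \left(\mathbf{x}_0 - \hat{\mathbf{x}}_0(t)\right) = 2 w_t \frac{\sigma_t}{\alpha_t} \left(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t; t) - \boldsymbol{\epsilon}\right), \tag{10}$$
which matches the gradient of the regularizer in Eq. (7) up to a positive, timestep-dependent factor.
Thus, $L_{\text{DPoser}}$ represents a more intuitive approach to variational diffusion sampling. By incorporating $L_{\text{DPoser}}$ alongside task-specific loss functions, this regularization term enhances the plausibility of the resultant poses.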
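The gradient identity behind this proof can be checked numerically. The sketch below treats the regularizer as a weighted squared distance between the current pose `x0` and a one-step denoised estimate `x0_hat = (x_t - sigma_t * eps_hat) / alpha_t`, held constant under the stop-gradient, with `x_t = alpha_t * x0 + sigma_t * eps`. It uses arbitrary coefficient values and a random vector `eps_hat` as a stand-in for the network output. The function name `one_step_denoise` and all numbers are illustrative assumptions.

```python
import numpy as np

def one_step_denoise(x_t, eps_hat, alpha_t, sigma_t):
    """One-step estimate of the clean pose from a noisy pose and predicted noise."""
    return (x_t - sigma_t * eps_hat) / alpha_t

rng = np.random.default_rng(0)
alpha_t, sigma_t, w_t = 0.9, 0.3, 1.5   # arbitrary illustrative values
x0 = rng.standard_normal(63)
eps = rng.standard_normal(63)
x_t = alpha_t * x0 + sigma_t * eps       # perturbation of the current pose
eps_hat = rng.standard_normal(63)        # stand-in for the network output (held constant)

x0_hat = one_step_denoise(x_t, eps_hat, alpha_t, sigma_t)

# Gradient of w_t * ||x0 - sg[x0_hat]||^2 with respect to x0.
grad_l2 = 2.0 * w_t * (x0 - x0_hat)
# The same quantity expressed through the noise-prediction residual.
grad_eps = 2.0 * w_t * (sigma_t / alpha_t) * (eps_hat - eps)
```

The two gradients coincide for any choice of `eps_hat`, which is why the L2 form and the noise-residual form point in the same direction.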
DPoser across pose-related tasks. DPoser excels in versatility, enabling its seamless application in a spectrum of human pose-related tasks. Its adaptability is especially evident in our human mesh recovery approach, as depicted in Fig. 2. For an exhaustive examination of DPoser’s utility across tasks like pose completion and motion denoising, we direct the reader to our Appendix.
Human mesh recovery aims to deduce the human pose and shape from single-image inputs. In this context, we refine the optimization function derived from the SMPLify framework [1], integrating DPoser as a regularization term, $L_{\text{DPoser}}$, and streamlining the process by omitting the intricate interpenetration error component. The modified optimization objective, engaging both pose $\theta$ and shape $\beta$ parameters from the SMPL model [31], is defined as:
$$L(\theta, \beta) = L_J + w_\theta L_{\text{DPoser}} + w_\alpha L_\alpha + w_\beta L_\beta. \tag{11}$$
The reprojection loss $L_J$, acting as the data fidelity measure, is defined by:
$$L_J(\theta, \beta) = \sum_{i} \omega_i\, \rho\!\left( \Pi_K\!\left(J(\theta, \beta)_i\right) - \hat{J}_{\text{est},i} \right), \tag{12}$$
where $J(\theta, \beta)$ calculates the 3D joint coordinates through SMPL's forward kinematics. The function $\Pi_K$ maps these 3D coordinates into 2D space, aligning with the camera's perspective. $\hat{J}_{\text{est}}$ refers to the 2D keypoints estimated using an off-the-shelf 2D pose estimator (in our case, ViTPose [55]), with $\omega_i$ reflecting the confidence score for each joint $i$. The Geman-McClure error function ($\rho$) is employed to assess the discrepancy in 2D joint locations reliably.
To mitigate the issue of overfitting, which often leads to unrealistic poses when solely minimizing reprojection loss, several regularization terms are introduced. Specifically, alongside our body prior $L_{\text{DPoser}}$, the bending term $L_\alpha$ is incorporated to penalize excessive bending at the elbows and knees, formulated as $L_\alpha = \sum_{i \in (\text{elbows},\, \text{knees})} \exp(\theta_i)$. Additionally, the shape regularization term $L_\beta$ is employed to maintain the body shape within plausible bounds. The weights for prior terms are denoted as $w_\theta$, $w_\alpha$, and $w_\beta$, respectively.
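The data term and the bending penalty can be sketched in a few lines. The snippet assumes one common scaled form of the Geman-McClure function, `sigma^2 * e^2 / (sigma^2 + e^2)`, applied per coordinate, and takes already-projected 2D joints as input so that no body model or camera is needed. The scale `sigma=100.0`, the sign convention of the bending angles, and the function names are illustrative assumptions rather than the settings used in this paper.

```python
import numpy as np

def geman_mcclure(residual, sigma=100.0):
    """Robust penalty that grows quadratically near zero and saturates at sigma^2."""
    sq = residual ** 2
    return (sigma ** 2) * sq / (sigma ** 2 + sq)

def reprojection_loss(joints_2d, keypoints_2d, confidence, sigma=100.0):
    """Confidence-weighted robust distance between projected and detected joints."""
    residual = joints_2d - keypoints_2d                      # shape (J, 2)
    per_joint = geman_mcclure(residual, sigma).sum(axis=-1)  # shape (J,)
    return float(np.sum(confidence * per_joint))

def bending_penalty(bend_angles):
    """Sum of exp(angle) over the selected elbow and knee rotation components."""
    return float(np.sum(np.exp(bend_angles)))

detected = np.array([[10.0, 20.0], [30.0, 40.0]])
projected = detected + np.array([[1.0, 0.0], [0.0, 500.0]])  # second joint is an outlier
conf = np.array([1.0, 0.5])
loss = reprojection_loss(projected, detected, conf)
```

Because the penalty saturates, the outlier joint contributes at most `sigma^2` per coordinate (scaled by its confidence) instead of dominating the objective.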
Given the structure of $L_{\text{DPoser}}$ (as seen in Eq. (8)), a crucial aspect lies in judiciously selecting the diffusion timestep $t$ during the iterative optimization process. In the subsequent section, we address this concern by introducing our novel truncated timestep scheduling strategy.
3.4 Test-time Truncated Timestep Scheduling
Motivation from pose generation. Adapting techniques from the image domain to pose data requires a nuanced understanding of the differences between the two. Previous image-based research [4] shows that initial timesteps (larger $t$) correspond to the perceptual content, while later timesteps refine details. Pose data, however, lacks this structured layering and spatial redundancy, indicating a need for a tailored timestep approach in the diffusion process.
As depicted in Fig. 3, we find that pose generation doesn’t benefit from the early timesteps as image generation does. The significant stages of pose refinement occur at smaller , specifically when . A uniform distribution of timesteps, as tested in (b) with only five steps, proves less effective for pose data. In contrast, allocating these steps toward the latter end of the diffusion process, as in (c), yields significantly better samples, implying the critical information is not evenly distributed but rather is concentrated toward the end.
Truncated timestep scheduling. Based on these insights, we propose a shift from standard uniform timestep sampling to a truncated strategy, especially for pose data. By focusing on the last timesteps, particularly between 0.2 and 0.0, we target the interval rich in pose-specific information. Specifically, based on the linear descending scheduling, the truncated timestep for each optimization step can be expressed as:
$$t = t_{\max} - (t_{\max} - t_{\min}) \cdot \frac{\mathrm{iter}}{N}, \tag{13}$$
where $[t_{\min}, t_{\max}]$ is the truncated range, $N$ denotes the total number of optimization iterations, and iter signifies the current iteration. This formulation is integral to our proposed optimization framework, which is comprehensively summarized in Algorithm 1. The practical implementation typically involves setting the truncated range to .
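A linearly descending truncated schedule can be sketched as a one-line function of the iteration counter. The snippet assumes the timestep moves from `t_max` at the first iteration toward `t_min`, with defaults 0.2 and 0.0 chosen to mirror the interval mentioned in the text. The defaults, the exact endpoint convention, and the function name are assumptions for illustration.

```python
def truncated_timestep(it, total_iters, t_max=0.2, t_min=0.0):
    """Timestep for iteration `it`, descending linearly from t_max toward t_min."""
    if not 0 <= it < total_iters:
        raise ValueError("it must satisfy 0 <= it < total_iters")
    return t_max - (t_max - t_min) * it / total_iters

# Five optimization iterations inside the truncated range.
schedule = [truncated_timestep(i, 5) for i in range(5)]
# The same function with t_max=1.0 covers the full range, as uniform scheduling would.
uniform = [truncated_timestep(i, 5, t_max=1.0) for i in range(5)]
```

Under this convention the last iteration stops one step short of `t_min`, so the timestep stays strictly positive when `t_min` is zero.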
4 Experiments
In this section, we showcase the robustness and versatility of DPoser across a spectrum of pose-centric tasks, including pose generation, human mesh recovery, pose completion, and motion denoising. Due to the page limit, we defer experimental details and further qualitative assessments to the Appendix.
4.1 Experimental Setup
Implementation details. We train our DPoser model on the AMASS dataset [32], adhering to the same training partition as previous works [37, 49]. The model employs axis-angle representation for joint rotations, which we normalize to have zero mean and unit variance. The architecture consists of a fully connected neural network with approximately 8.28M parameters. It draws inspiration from GFPose [8] but omits conditional input pathways for our unconditional setting. To stabilize training, we use an exponential moving average with a decay factor of 0.9999, as advised by [48]. The Adam optimizer, a learning rate of , and a batch size of 1280 govern the optimization process. The training of 800,000 iterations takes roughly 8 hours on a single Nvidia RTX 3090Ti GPU.
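The exponential moving average mentioned above keeps a slowly updated copy of the network weights. A minimal sketch, assuming parameters are stored in a plain dictionary and using the hypothetical helper name `ema_update`, looks as follows. The small decay in the usage lines is only there to make the effect visible after a few steps.

```python
def ema_update(ema_params, params, decay=0.9999):
    """Return updated EMA weights: decay * ema + (1 - decay) * current."""
    return {name: decay * ema_params[name] + (1.0 - decay) * value
            for name, value in params.items()}

# Toy usage with a single scalar "weight" and an exaggerated decay of 0.5.
ema = {"w": 0.0}
for _ in range(3):
    ema = ema_update(ema, {"w": 1.0}, decay=0.5)
```

With a decay of 0.9999, each update moves the averaged weights only 0.01% of the way toward the current weights, which smooths out iteration-to-iteration noise.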
Evaluation metrics. To comprehensively evaluate our models across various tasks, following Pose-NDF [49], we adopt task-specific metrics:
• Pose Generation: Diversity and fidelity are evaluated using Average Pairwise Distance (APD) and Self-Intersection rates (SI), respectively.
• Human Mesh Recovery: The Procrustes-aligned Mean Per Joint Position Error (PA-MPJPE) measures the accuracy of recovered human meshes.
• Pose Completion: The Mean Per Joint Position Error (MPJPE) for masked body joints serves as the metric, focusing on the inferred occluded parts.
• Motion Denoising: Both MPJPE and the Mean Per-Vertex Position Error (MPVPE) are calculated to assess the denoising effectiveness.
All errors are reported in millimeter units.
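For reference, MPJPE and PA-MPJPE can be computed as in the sketch below. It assumes joints are given as arrays of shape `(J, 3)` and implements the Procrustes step as a standard similarity alignment (scale, rotation, translation) via SVD. The function names are hypothetical and this is not the evaluation code used for the reported numbers.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Euclidean distance between corresponding joints."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

def procrustes_align(pred, gt):
    """Similarity-align pred (J, 3) to gt (J, 3) with scale, rotation, translation."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    u, s, vt = np.linalg.svd(p.T @ g)
    d = np.sign(np.linalg.det(u @ vt))      # guard against reflections
    diag = np.array([1.0, 1.0, d])
    rot = u @ np.diag(diag) @ vt            # chosen so that p @ rot best matches g
    scale = np.sum(s * diag) / np.sum(p ** 2)
    return scale * p @ rot + mu_g

def pa_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment of the prediction to the ground truth."""
    return mpjpe(procrustes_align(pred, gt), gt)
```

A prediction that differs from the ground truth only by a global rotation, scale, and translation has a PA-MPJPE of (numerically) zero, while its plain MPJPE can be large. This is why PA-MPJPE isolates errors in the articulated pose itself.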
4.2 Pose Generation
Sample source | APD | SI |
Real-world (AMASS) [32] | 15.44 | 0.79 |
GMM [1] | 16.28 | 1.54 |
VPoser [37] | 10.75 | 1.51 |
Pose-NDF [49] | 18.75 | 1.97 |
GAN-S [9] | 15.68 | 1.27 |
DPoser (ours) | 14.28 | 1.21 |
DPoser (ours)* | 19.03 | 1.13 |
To commence, we delve into the capabilities of our DPoser model by generating samples from the learned manifold. Employing a standard Euler-Maruyama discretization with 1000 steps, we assess both the diversity and realism of the generated poses (Fig. 4). While DPoser’s outputs are visually diverse and realistic, poses generated from competing methods like GMM [1] and Pose-NDF [49] fall short in naturalism, and VPoser [37] exhibits limited diversity.
Interestingly, quantitative metrics such as APD and SI (Tab. 1) do not always corroborate our qualitative findings. For instance, a 10-step DDIM sampler [44]—suboptimal by design—outperformed real-world data [32] in APD, which we attribute to the generation of exaggerated poses. In summary, our findings underscore the need for a balanced evaluation strategy that merges quantitative metrics with qualitative observations.
4.3 Human Mesh Recovery
Initialization | No fitting | GMM [1] | VPoser [37] | Pose-NDF [49] | GAN-S [9] | DPoser(Ours) |
from scratch | 108.57 | 58.32 | 58.08 | 57.87 | 57.26 | 56.05 |
CLIFF [25] | 56.62 | 51.02 | 49.39 | 49.50 | 49.58 | 49.05 |
We probe the efficacy of DPoser in human mesh recovery (HMR), focusing on estimating human pose and shape from monocular images. We conduct experiments on the EHF dataset [37] and benchmark our method against existing SOTA priors. Our optimization-based framework incorporates two initialization paradigms: (1) a baseline initialization that utilizes mean pose values and a default camera setup, and (2) an advanced initialization scheme that leverages CLIFF [25], a pre-trained regression-based model tailored for HMR. Moreover, GAN-S [9] implementations require a GAN-inversion phase to convert initial poses into their latent representations, which is notably time-consuming.
Tab. 2 and Fig. 5 showcase the comparative performance of DPoser, highlighting its exceptional ability in HMR tasks. Notably, when fitting from scratch, it surpasses established SOTA priors like GAN-S [9] and Pose-NDF [49] and rivals the specific regression-based model [25]. The integration of CLIFF as initialization further amplifies DPoser’s performance, underscoring its efficiency and the benefits of employing refined starting conditions. Fig. 6 further confirms DPoser’s superior efficacy and adaptability across multiple datasets including EHF [37], MSCOCO [27], 3DPW [51], and UBody [26].
4.4 Pose Completion
In practical scenarios like those encountered in the UBody dataset [26] (refer to Fig. 5(d)), HMR algorithms often grapple with occlusions leading to incomplete 3D pose estimates. In this context, our ambition is to recover full 3D poses from partially observed data, initializing the occluded parts with random noise. Our DPoser model is employed to refine these initially implausible poses into feasible ones, utilizing an L2 loss on the visible parts to ensure data consistency.
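The completion procedure can be illustrated end to end on a toy problem. The sketch below replaces the learned prior with a per-dimension Gaussian `N(2, 0.1^2)` whose exact noise predictor is known in closed form, so `toy_eps_predictor` stands in for a trained network. It combines an L2 loss on the visible entries with the L2-to-denoised-estimate regularizer, and draws timesteps from a linearly descending range from 0.2 toward 0. The schedule endpoints (0.1 and 20), step size, weights, and all function names are illustrative assumptions, not the paper's settings.

```python
import numpy as np

BETA_MIN, BETA_MAX = 0.1, 20.0        # assumed linear noise schedule
PRIOR_MEAN, PRIOR_STD = 2.0, 0.1      # toy per-dimension "pose" prior

def coeffs(t):
    """Sub-VP kernel coefficients alpha_t and sigma_t."""
    b = BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t ** 2
    return np.exp(-0.5 * b), 1.0 - np.exp(-b)

def toy_eps_predictor(x_t, t):
    """Exact noise predictor for the toy Gaussian prior (stands in for a network)."""
    alpha, sigma = coeffs(t)
    var_t = (alpha * PRIOR_STD) ** 2 + sigma ** 2
    return sigma * (x_t - alpha * PRIOR_MEAN) / var_t

def complete(y, visible, n_iters=200, lr=0.1, w_reg=1.0,
             t_max=0.2, t_min=0.0, seed=0):
    """Gradient descent on a visible-part L2 loss plus the denoising regularizer."""
    rng = np.random.default_rng(seed)
    x = np.where(visible, y, 0.0)                 # hidden entries start at zero
    for it in range(n_iters):
        t = t_max - (t_max - t_min) * it / n_iters
        alpha, sigma = coeffs(t)
        x_t = alpha * x + sigma * rng.standard_normal(x.shape)
        # One-step denoised estimate, treated as a constant (stop-gradient).
        x0_hat = (x_t - sigma * toy_eps_predictor(x_t, t)) / alpha
        grad_data = 2.0 * visible * (x - y)
        grad_reg = 2.0 * w_reg * (x - x0_hat)
        x = x - lr * (grad_data + grad_reg)
    return x

visible = np.array([True, True, False, False])
y = np.array([2.05, 1.95, 0.0, 0.0])              # hidden entries of y are ignored
x_hat = complete(y, visible)
```

In this toy setting the hidden entries are pulled from their zero initialization toward the prior mean, while the visible entries stay close to the observations. Different random seeds give different completions, which is how multiple hypotheses can be produced.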
Initialization | VPoser | Pose-NDF | DPoser |
Zeros | 180.90 | 157.50 | 73.92 |
10mm noise | 181.86 | 172.50 | 74.69 |
100mm noise | 180.25 | 511.51 | 74.19 |
In parallel, we employ a comparable optimization strategy for both Pose-NDF [49] and VPoser [37]. Notably, Tab. 3 reveals that Pose-NDF struggles with poorly initialized poses unseen during its training phase. To mitigate this issue, we have to initialize the occluded poses near zero (close to rest pose) for Pose-NDF to prevent optimization divergence. Additionally, as a task-specific baseline, we adapt the original VPoser model into CVPoser by incorporating conditional inputs within its VAE framework [22]. This modification enables the encoder and decoder to process additional partial poses, facilitating end-to-end conditional sampling.
Methods | Occ. left leg | Occ. legs | Occ. arms | Occ. trunk |
PoseNDF () [49] | 158.21 | 159.19 | 201.00 | 75.42 |
PoseNDF () | 147.66/158.11/7.62 | 151.86/159.21/5.33 | 196.36/200.92/3.30 | 70.88/75.39/3.25 |
PoseNDF () | 144.38/158.06/8.31 | 149.38/159.14/5.90 | 194.79/200.87/3.63 | 69.45/75.38/3.54 |
VPoser (S=1) [37] | 180.78 | 198.18 | 159.86 | 37.75 |
VPoser () | 167.92/181.30/10.53 | 178.77/198.15/14.51 | 148.17/159.65/8.64 | 31.83/37.79/4.54 |
VPoser () | 162.82/181.09/12.21 | 172.83/198.31/16.30 | 144.53/159.80/9.69 | 30.06/37.78/4.99 |
CVPoser () | 71.66/145.52/51.68 | 90.49/148.30/38.46 | 83.02/136.82/36.47 | 18.77/37.83/13.12 |
DPoser(ours) () | 74.48 | 97.39 | 81.49 | 28.58 |
DPoser(ours) () | 42.64/73.85/24.36 | 67.70/97.06/22.29 | 58.52/82.37/18.33 | 17.11/28.59/8.92 |
DPoser(ours) () | 35.37/74.01/26.47 | 59.25/96.77/24.55 | 51.27/81.76/20.04 | 13.95/28.57/9.85 |
Given the inherent uncertainties within this task, we generate multiple solutions and evaluate them based on their minimum, mean, and standard deviation errors against the ground truth. As illustrated in Tab. 4, DPoser exhibits superior performance across different occlusion scenarios compared to existing pose priors and even the task-specific CVPoser, highlighting its effectiveness in pose completion. The qualitative evaluations are presented in Fig. 7. Here, we observe that DPoser can generate a multitude of plausible poses, a capability lacking in VPoser [37]. Pose-NDF [49], meanwhile, struggles with generalizing to unseen noisy poses and making plausible adjustments from its rest pose initialization.
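The min/mean/std summary over multiple hypotheses can be computed as sketched below. The snippet assumes that errors are stored per test sample and per hypothesis, that the three statistics are taken over the hypotheses of each sample, and that they are then averaged over the dataset. This aggregation order and the function name `hypothesis_stats` are assumptions, since the exact protocol is not spelled out here.

```python
import numpy as np

def hypothesis_stats(errors):
    """Summarize per-hypothesis errors of shape (num_samples, S).

    Returns the dataset averages of the per-sample minimum, mean, and
    (population) standard deviation over the S hypotheses.
    """
    errors = np.asarray(errors, dtype=float)
    return (float(errors.min(axis=1).mean()),
            float(errors.mean(axis=1).mean()),
            float(errors.std(axis=1).mean()))

# Two test samples with two hypotheses each (values in mm, made up).
stats = hypothesis_stats([[1.0, 3.0], [2.0, 2.0]])
```

The minimum rewards a method for producing at least one hypothesis close to the ground truth, whereas the mean and standard deviation describe the typical quality and spread of its solutions.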
4.5 Motion Denoising
Though not initially designed for temporal tasks, DPoser shows remarkable proficiency in motion denoising. The task aims to estimate clean body poses from noisy 3D joint positions in motion capture sequences. Adhering to the setup outlined in HuMoR [40], we utilize 60-frame sequences from the AMASS [32] dataset and artificially introduce Gaussian noise with a standard deviation of 40 mm to the 3D joint positions. Moreover, we conduct experiments on HPS datasets [15] without additional training to validate the generalization.
As presented in Tab. 5, DPoser sets a new standard in motion denoising, outperforming even specialized motion priors like HuMoR [40]. To further confirm the robustness of DPoser, we conduct evaluations under varying conditions to gauge DPoser’s denoising capabilities. The results, detailed in Tab. 6, reveal that DPoser consistently achieves significant reductions in MPJPE, maintaining robust performance under extreme noise conditions.
4.6 Ablation Study
Timestep scheduling | HMR: PA-MPJPE | Pose Completion: MPJPE () | Motion Denoising: MPVPE | Motion Denoising: MPJPE |
Random | 58.84 | 86.23/121.57/23.16 | 43.33 | 23.87 |
Fixed | 56.55 | 36.99/71.68/23.41 | 45.69 | 22.54 |
Uniform | 59.28 | 42.72/75.70/21.84 | 39.72 | 20.80 |
Truncated | 56.05 | 35.37/74.01/26.47 | 38.21 | 19.87 |
In our ablation study, we initially focus on the impact of truncated timestep scheduling on DPoser’s performance. This involves contrasting our proposed scheduling strategy against three established methods—random, fixed, and uniform scheduling [34, 33, 6, 48]. As Tab. 7 demonstrates, our strategy consistently outperforms these alternatives across all evaluated tasks. Additionally, we delve into the training aspects of DPoser, such as rotation representations and the integration of an auxiliary loss akin to HuMoR [40]. Using the same trained prior, we also compare DPoser’s capabilities with SOTA diffusion-based solvers [48, 7, 6] on pose completion, revealing its superior versatility and performance. Detailed findings and analyses from these ablation studies are presented in the Appendix.
5 Conclusion
We introduce DPoser, to the best of our knowledge the first unconditional diffusion-based pose prior, tailored for an expansive array of pose-related tasks. Engineered for flexibility, DPoser can be implemented as a straightforward L2-loss regularizer and enhanced by our truncated timestep scheduling for test-time optimization. Comprehensive experiments substantiate DPoser's superior performance over existing state-of-the-art pose priors.
Limitation and future work. While our framework benefits from variational diffusion sampling [33], it also shares its limitations, such as the mode-seeking behavior. Future research could look into enhancing solution diversity via techniques like particle-based variational inference [30, 53]. Furthermore, within the broader context of inverse problems we have framed, a plethora of existing methods [45, 2, 5, 35] could be adapted to leverage our diffusion-based prior. Exploring these methods holds great potential for future progress.
Ethical Considerations. For a discussion on the potential negative impacts of our work, please refer to the Appendix.
References
- [1] Bogo, F., Kanazawa, A., Lassner, C., Gehler, P., Romero, J., Black, M.J.: Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. In: ECCV (2016)
- [2] Boys, B., Girolami, M., Pidstrigach, J., Reich, S., Mosca, A., Akyildiz, O.D.: Tweedie moment projected diffusions for inverse problems. arXiv preprint arXiv:2310.06721 (2023)
- [3] Cho, H., Kim, J.: Generative approach for probabilistic human mesh recovery using diffusion models. In: ICCV (2023)
- [4] Choi, J., Lee, J., Shin, C., Kim, S., Kim, H., Yoon, S.: Perception prioritized training of diffusion models. In: CVPR (2022)
- [5] Chung, H., Kim, J., Kim, S., Ye, J.C.: Parallel diffusion models of operator and image for blind inverse problems. In: CVPR (2023)
- [6] Chung, H., Kim, J., Mccann, M.T., Klasky, M.L., Ye, J.C.: Diffusion posterior sampling for general noisy inverse problems. arXiv preprint arXiv:2209.14687 (2022)
- [7] Chung, H., Sim, B., Ryu, D., Ye, J.C.: Improving diffusion models for inverse problems using manifold constraints. NeurIPS (2022)
- [8] Ci, H., Wu, M., Zhu, W., Ma, X., Dong, H., Zhong, F., Wang, Y.: Gfpose: Learning 3d human pose prior with gradient fields. In: CVPR (2023)
- [9] Davydov, A., Remizova, A., Constantin, V., Honari, S., Salzmann, M., Fua, P.: Adversarial parametric pose prior. In: CVPR (2022)
- [10] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. pp. 248–255. Ieee (2009)
- [11] Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis. NeurIPS (2021)
- [12] Georgakis, G., Li, R., Karanam, S., Chen, T., Košecká, J., Wu, Z.: Hierarchical kinematic human mesh recovery. In: ECCV (2020)
- [13] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM (2020)
- [14] Graikos, A., Malkin, N., Jojic, N., Samaras, D.: Diffusion models as plug-and-play priors. NeurIPS (2022)
- [15] Guzov, V., Mir, A., Sattler, T., Pons-Moll, G.: Human poseitioning system (hps): 3d human pose estimation and self-localization in large scenes from body-mounted sensors. In: CVPR (2021)
- [16] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. NeurIPS (2020)
- [17] Holmquist, K., Wandt, B.: Diffpose: Multi-hypothesis human pose estimation using diffusion models. In: ICCV (2023)
- [18] Jiang, Z., Zhou, Z., Li, L., Chai, W., Yang, C.Y., Hwang, J.N.: Back to optimization: Diffusion-based zero-shot 3d human pose estimation. arXiv preprint arXiv:2307.03833 (2023)
- [19] Kanazawa, A., Black, M.J., Jacobs, D.W., Malik, J.: End-to-end recovery of human shape and pose. In: CVPR (2018)
- [20] Karras, T., Aittala, M., Aila, T., Laine, S.: Elucidating the design space of diffusion-based generative models. NeurIPS (2022)
- [21] Kawar, B., Elad, M., Ermon, S., Song, J.: Denoising diffusion restoration models. NeurIPS (2022)
- [22] Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)
- [23] Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
- [24] Li, L., Zhuo, L., Zhang, B., Bo, L., Chen, C.: Diffhand: End-to-end hand mesh reconstruction via diffusion models. arXiv preprint arXiv:2305.13705 (2023)
- [25] Li, Z., Liu, J., Zhang, Z., Xu, S., Yan, Y.: Cliff: Carrying location information in full frames into human pose and shape estimation. In: ECCV (2022)
- [26] Lin, J., Zeng, A., Wang, H., Zhang, L., Li, Y.: One-stage 3d whole-body mesh recovery with component aware transformer. In: CVPR (2023)
- [27] Lin, T.Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Zitnick, C.L., Dollár, P.: Microsoft coco: Common objects in context. arXiv preprint arXiv:1405.0312 (2014)
- [28] Ling, H.Y., Zinno, F., Cheng, G., Van De Panne, M.: Character controllers using motion vaes. TOG (2020)
- [29] Liu, Q., Lee, J., Jordan, M.: A kernelized stein discrepancy for goodness-of-fit tests. In: ICML (2016)
- [30] Liu, Q., Wang, D.: Stein variational gradient descent: A general purpose bayesian inference algorithm. NeurIPS (2016)
- [31] Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: Smpl: A skinned multi-person linear model. ACM Transactions on Graphics 34(6) (2015)
- [32] Mahmood, N., Ghorbani, N., Troje, N.F., Pons-Moll, G., Black, M.J.: Amass: Archive of motion capture as surface shapes. In: ICCV (2019)
- [33] Mardani, M., Song, J., Kautz, J., Vahdat, A.: A variational perspective on solving inverse problems with diffusion models. arXiv preprint arXiv:2305.04391 (2023)
- [34] Müller, L., Ye, V., Pavlakos, G., Black, M., Kanazawa, A.: Generative proxemics: A prior for 3d social interaction from images. arXiv preprint arXiv:2306.09337 (2023)
- [35] Murata, N., Saito, K., Lai, C.H., Takida, Y., Uesaka, T., Mitsufuji, Y., Ermon, S.: Gibbsddrm: A partially collapsed gibbs sampler for solving blind inverse problems with denoising diffusion restoration. arXiv preprint arXiv:2301.12686 (2023)
- [36] Nachmani, E., Roman, R.S., Wolf, L.: Non gaussian denoising diffusion models. arXiv preprint arXiv:2106.07582 (2021)
- [37] Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A., Tzionas, D., Black, M.J.: Expressive body capture: 3d hands, face, and body from a single image. In: CVPR (2019)
- [38] Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988 (2022)
- [39] Qiu, Z., Yang, Q., Wang, J., Wang, X., Xu, C., Fu, D., Yao, K., Han, J., Ding, E., Wang, J.: Learning structure-guided diffusion model for 2d human pose estimation. arXiv preprint arXiv:2306.17074 (2023)
- [40] Rempe, D., Birdal, T., Hertzmann, A., Yang, J., Sridhar, S., Guibas, L.J.: Humor: 3d human motion model for robust pose estimation. In: ICCV (2021)
- [41] Särkkä, S., Solin, A.: Applied stochastic differential equations, vol. 10. Cambridge University Press (2019)
- [42] Shafir, Y., Tevet, G., Kapon, R., Bermano, A.H.: Human motion diffusion as a generative prior. arXiv preprint arXiv:2303.01418 (2023)
- [43] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S.: Deep unsupervised learning using nonequilibrium thermodynamics. In: ICML (2015)
- [44] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)
- [45] Song, J., Vahdat, A., Mardani, M., Kautz, J.: Pseudoinverse-guided diffusion models for inverse problems. In: ICLR (2022)
- [46] Song, Y., Durkan, C., Murray, I., Ermon, S.: Maximum likelihood training of score-based diffusion models. NeurIPS (2021)
- [47] Song, Y., Ermon, S.: Generative modeling by estimating gradients of the data distribution. NeurIPS (2019)
- [48] Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020)
- [49] Tiwari, G., Antić, D., Lenssen, J.E., Sarafianos, N., Tung, T., Pons-Moll, G.: Pose-ndf: Modeling human pose manifolds with neural distance fields. In: ECCV (2022)
- [50] Vincent, P.: A connection between score matching and denoising autoencoders. Neural computation (2011)
- [51] Von Marcard, T., Henschel, R., Black, M.J., Rosenhahn, B., Pons-Moll, G.: Recovering accurate 3d human pose in the wild using imus and a moving camera. In: ECCV (2018)
- [52] Wang, H., Du, X., Li, J., Yeh, R.A., Shakhnarovich, G.: Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: CVPR (2023)
- [53] Wang, Z., Lu, C., Wang, Y., Bao, F., Li, C., Su, H., Zhu, J.: Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. arXiv preprint arXiv:2305.16213 (2023)
- [54] Wu, J., Gao, X., Liu, X., Shen, Z., Zhao, C., Feng, H., Liu, J., Ding, E.: Hd-fusion: Detailed text-to-3d generation leveraging multiple noise estimation. arXiv preprint arXiv:2307.16183 (2023)
- [55] Xu, Y., Zhang, J., Zhang, Q., Tao, D.: Vitpose: Simple vision transformer baselines for human pose estimation. NeurIPS (2022)
- [56] Zhao, M., Liu, M., Ren, B., Dai, S., Sebe, N.: Modiff: Action-conditioned 3d motion generation with denoising diffusion probabilistic models. arXiv preprint arXiv:2301.03949 (2023)
- [57] Zhou, Y., Barnes, C., Lu, J., Yang, J., Li, H.: On the continuity of rotation representations in neural networks. In: CVPR (2019)
- [58] Zhu, J., Zhuang, P.: Hifa: High-fidelity text-to-3d with advanced diffusion guidance. arXiv preprint arXiv:2305.18766 (2023)
Appendix for DPoser: Diffusion Model as Robust 3D Human Pose Prior
In this appendix, we first briefly recap the parameterization of diffusion models and their connection to score functions in Sec. A, followed by the perspective of Score Distillation Sampling (SDS) to understand our DPoser regularization in Sec. B. We detail the experimental setup and nuances in Sec. C and dissect various training aspects of DPoser in Sec. D. The exploration of extended optimization techniques is discussed in Sec. E, and considerations for truncated timestep scheduling in image domains are presented in Sec. F. Additional qualitative results are showcased in Sec. G. Lastly, potential negative impacts such as biases in data and ethical concerns in application are considered in Sec. H.
A Parameterization of Score-based Diffusion Models
In the seminal work by Song et al. [48], it is demonstrated that both score-based generative models [47] and diffusion probabilistic models [16] can be understood as discretized versions of stochastic differential equations (SDEs) defined by score functions. This unification allows the training objective to be interpreted either as learning a time-dependent denoiser or as learning a sequence of score functions that describe increasingly noisy versions of the data.
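We begin by revisiting the training objective for score-based models [47] to elucidate the link with diffusion models [16]. Consider the transition kernel of the forward diffusion process $p_{0t}(x_t \mid x_0) = \mathcal{N}(x_t; \alpha_t x_0, \sigma_t^2 I)$. Our goal is to learn the score functions $\nabla_{x_t} \log p_t(x_t)$ through a neural network $s_\theta(x_t; t)$ by minimizing the L2 loss as follows (we omit the expectation operator for conciseness):
$$\left\| s_\theta(x_t; t) - \nabla_{x_t} \log p_{0t}(x_t \mid x_0) \right\|_2^2 = \left\| s_\theta(x_t; t) + \frac{\epsilon}{\sigma_t} \right\|_2^2 \tag{14}$$
Here, $x_t = \alpha_t x_0 + \sigma_t \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$.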
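As a numerical illustration of the objective above, the following sketch checks that the score of a Gaussian transition kernel equals $-\epsilon/\sigma_t$, which is why regressing the score and regressing the noise differ only by a scale factor. The kernel form $\mathcal{N}(\alpha_t x_0, \sigma_t^2 I)$, the helper names, the 63-dimensional toy vector, and the schedule values are assumptions chosen for illustration and are not taken from the DPoser implementation.

```python
import numpy as np

def gaussian_kernel_score(x_t, x_0, alpha_t, sigma_t):
    """Score of N(x_t; alpha_t * x_0, sigma_t^2 I) with respect to x_t."""
    return -(x_t - alpha_t * x_0) / sigma_t**2

def dsm_loss(score_pred, x_t, x_0, alpha_t, sigma_t):
    """Squared L2 distance between a predicted score and the kernel score."""
    target = gaussian_kernel_score(x_t, x_0, alpha_t, sigma_t)
    return float(np.sum((score_pred - target) ** 2))

rng = np.random.default_rng(0)
x_0 = rng.normal(size=63)      # toy pose vector (dimension is an assumption)
alpha_t, sigma_t = 0.8, 0.6    # assumed schedule values at some timestep t
eps = rng.normal(size=63)
x_t = alpha_t * x_0 + sigma_t * eps

score = gaussian_kernel_score(x_t, x_0, alpha_t, sigma_t)
# A predictor that outputs -eps / sigma_t attains (numerically) zero loss.
loss_at_optimum = dsm_loss(-eps / sigma_t, x_t, x_0, alpha_t, sigma_t)
```

Under these assumptions, a noise predictor $\epsilon_\theta$ can be turned into a score estimate through $s_\theta = -\epsilon_\theta / \sigma_t$.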
B View DPoser as Score Distillation Sampling
Tab. S-1: One-step versus multi-step denoising.
Strategy | HMR: PA-MPJPE | Pose Completion: MPJPE () | Motion Denoising: MPVPE | Motion Denoising: MPJPE |
1 step | 56.05 | 35.37/74.01/26.47 | 38.21 | 19.87 |
5 steps | 56.16 | 36.59/80.82/31.22 | 40.22 | 21.21 |
10 steps | 56.18 | 36.78/82.59/32.32 | 40.69 | 21.34 |
Interestingly, the gradient of DPoser (Eq. (10) in the main text) coincides with Score Distillation Sampling (SDS) [38, 52], which can be interpreted as aiming to minimize the following KL divergence:
$$\mathrm{KL}\left( q(x_t \mid x_0) \,\|\, p_t^{\theta}(x_t) \right) \tag{17}$$
where $q(x_t \mid x_0)$ is the Gaussian perturbation kernel centered on the optimized variable $x_0$ and $p_t^{\theta}$ denotes the marginal distribution whose score function is estimated by $\epsilon_\theta$. For the specific case where $t \to 0$, this term encourages the Dirac distribution $\delta(x - x_0)$ (i.e., the optimized variable) to gravitate toward the learned data distribution $p_0^{\theta}$, while the Gaussian perturbation in Eq. (17) softens the constraint. Building on this understanding, we can borrow advanced techniques from SDS [38, 52], a rapidly evolving area ripe for methodological innovations [53, 54, 58]. To this end, we experiment with a multi-step denoising strategy adapted from HiFA [58], substituting our original one-step denoising process. This alternative, however, yields worse results across all evaluation metrics, as shown in Tab. S-1. A plausible explanation is that our proposed truncated timestep scheduling already operates at low noise levels (i.e., small $t$), which removes the need for more denoising steps. In addition, iterative denoising within each optimization step may cause error accumulation, leading to inaccurate gradients.
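The pull toward the learned distribution can be sketched numerically. The snippet below assumes the standard SDS gradient form $w(t)\,(\epsilon_\theta(x_t; t) - \epsilon)$ from [38] and replaces the trained network with an analytic noise predictor for toy data drawn from $\mathcal{N}(0, I)$ under a variance-preserving schedule. The function names, the noise range, the step size, and the toy data distribution are all illustrative assumptions, not DPoser's actual settings.

```python
import numpy as np

def toy_eps_model(x_t, sigma_t):
    """Analytic noise predictor for data ~ N(0, I) with alpha_t^2 + sigma_t^2 = 1.

    The noisy marginal is then N(0, I), its score is -x_t, and the optimal
    noise prediction is -sigma_t * score = sigma_t * x_t.
    """
    return sigma_t * x_t

def sds_gradient(x_0, sigma_t, rng, weight=1.0, n_samples=64):
    """Monte Carlo estimate of weight * E_eps[eps_model(x_t) - eps]."""
    alpha_t = np.sqrt(1.0 - sigma_t**2)
    eps = rng.normal(size=(n_samples,) + x_0.shape)
    x_t = alpha_t * x_0 + sigma_t * eps
    return weight * np.mean(toy_eps_model(x_t, sigma_t) - eps, axis=0)

rng = np.random.default_rng(0)
x_0 = np.full(6, 5.0)                 # optimized variable, started far from the data mode
initial_norm = np.linalg.norm(x_0)
for _ in range(300):
    sigma_t = rng.uniform(0.1, 0.5)   # small noise levels only (assumed range)
    x_0 = x_0 - 0.1 * sds_gradient(x_0, sigma_t, rng)
final_norm = np.linalg.norm(x_0)
```

In this toy setting the expected gradient is $\sigma_t \alpha_t x_0$, so gradient descent contracts $x_0$ toward the mode of the toy data distribution while the sampled noise adds jitter around it.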
C Experimental Details
This section elaborates on the specifics of our pose completion and motion denoising experiments.
C.1 Pose Completion
For partial observations $y$, the measurement operator $\mathcal{A}$ is modeled as a mask matrix $M$. Based on our optimization framework (Algorithm 1 in the main text), we define the task-specific loss, $L_{\text{comp}}$, as follows:
$$L_{\text{comp}} = \left\| M x_0 - y \right\|_2^2 \tag{18}$$
Here, $x_0$ denotes the complete body pose we try to recover, where the unseen parts are initialized as random noise. In the following ablation studies, unless otherwise specified, the evaluation is performed using 10 hypotheses on the AMASS [32] dataset with left leg occlusion.
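A minimal sketch of such a masked data term follows, assuming a squared-error loss and an elementwise 0/1 mask (equivalent to a diagonal mask matrix). The joint count, the occluded indices, and the function name are hypothetical and chosen only for illustration.

```python
import numpy as np

def completion_loss(x_0, mask, y):
    """Squared error on observed entries only, plus its gradient w.r.t. x_0.

    mask is 1 where a pose entry is observed and 0 where it is occluded.
    """
    residual = mask * x_0 - y
    return float(np.sum(residual**2)), 2.0 * mask * residual

rng = np.random.default_rng(0)
n_joints, dof = 21, 3
x_true = rng.normal(size=(n_joints, dof))

mask = np.ones((n_joints, dof))
occluded = [0, 3, 6]               # hypothetical indices for the occluded joints
mask[occluded] = 0.0
y = mask * x_true                  # partial observation

# Visible parts start at the observation; unseen parts start as random noise.
x_0 = np.where(mask == 1, y, rng.normal(size=y.shape))
loss, grad = completion_loss(x_0, mask, y)
```

Because the gradient vanishes on occluded entries, the data term alone cannot constrain them; in an optimization framework of this kind, that role falls to the pose prior.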
C.2 Motion Denoising (Noisy Input)
Tab. S-2: Motion denoising under different noise levels.
Methods | AMASS [32]: 20mm | AMASS [32]: 100mm | HPS [15]: 20mm | HPS [15]: 100mm |
No prior | 15.33 | 51.48 | 16.26 | 50.87 |
VPoser [37] | 15.20 | 49.10 | 17.24 | 46.69 |
Pose-NDF [49] | 13.84 | 46.10 | 15.62 | 47.50 |
DPoser (ours) | 13.64 | 33.18 | 13.45 | 35.32 |
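Adhering to Pose-NDF settings [49], we aim to refine noisy joint positions $J_{\text{obs}}^{1:N}$ over $N$ frames to obtain clean poses $x_0^{1:N}$, initialized from mean poses in SMPL with small noise. We formulate the task-specific loss combining an observation fidelity term $L_{\text{obs}}$ and a temporal consistency term $L_{\text{temp}}$:
$$L_{\text{obs}} = \sum_{i=1}^{N} \left\| \mathcal{M}_J(x_0^{i}, \beta_0) - J_{\text{obs}}^{i} \right\|_2^2 \tag{19}$$
$$L_{\text{temp}} = \sum_{i=2}^{N} \left\| \mathcal{M}_J(x_0^{i}, \beta_0) - \mathcal{M}_J(x_0^{i-1}, \beta_0) \right\|_2^2 \tag{20}$$
where $\mathcal{M}_J$ denotes the 3D joint positions regressed from SMPL [31] and $\beta_0$ denotes the constant mean shape parameters.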
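The two terms can be sketched as follows, assuming squared-error penalties and a relative weight `w_temp`; both are illustrative choices. A random linear map stands in for SMPL joint regression so that the example stays self-contained; it is a hypothetical placeholder, not the SMPL model.

```python
import numpy as np

def joints_from_pose(poses, regressor):
    """Hypothetical linear stand-in for SMPL joint regression: (N, D) -> (N, J, 3)."""
    return (poses @ regressor).reshape(poses.shape[0], -1, 3)

def motion_denoising_loss(poses, noisy_joints, regressor, w_temp=1.0):
    """Return (total, observation term, temporal term) for a pose sequence."""
    joints = joints_from_pose(poses, regressor)
    l_obs = float(np.sum((joints - noisy_joints) ** 2))       # fidelity to observations
    l_temp = float(np.sum((joints[1:] - joints[:-1]) ** 2))   # smoothness across frames
    return l_obs + w_temp * l_temp, l_obs, l_temp

rng = np.random.default_rng(0)
n_frames, pose_dim, n_joints = 8, 63, 22
regressor = rng.normal(size=(pose_dim, n_joints * 3))

# A static sequence: every frame holds the same pose.
static_poses = np.tile(rng.normal(size=(1, pose_dim)), (n_frames, 1))
clean_joints = joints_from_pose(static_poses, regressor)
noisy_joints = clean_joints + 0.02 * rng.normal(size=clean_joints.shape)

total, l_obs, l_temp = motion_denoising_loss(static_poses, noisy_joints, regressor)
```

For the static sequence the temporal term is numerically zero and only the observation term contributes, which is a quick sanity check on the frame differencing.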
To complement the comparative analysis presented in Table 4 of our main text, we extend our evaluation to scenarios with varying noise levels. The results in Tab. S-2 show that DPoser outperforms state-of-the-art (SOTA) pose priors at both noise levels, with the largest margin under high noise: at 100mm, it reduces the error from 46.10 (Pose-NDF) to 33.18 on AMASS and from 46.69 (VPoser) to 35.32 on HPS, demonstrating DPoser's resilience to noise.
C.3 Motion Denoising (Partial Input)
This task focuses on reconstructing clean poses from partially observed joint positions across frames, employing a known mask matrix to identify visible joints. The optimization objective mirrors that of motion denoising (Sec. C.2), but incorporates a mask in Eq. (19) so that only the visible joints guide the recovery process.
We conducted experiments on the AMASS dataset [32] to assess our model’s performance on this task with two types of occlusions: legs and left arm. The quantitative results of these experiments are detailed in Tab. S-3, and the accompanying visualizations are provided in Sec. G.
In leg occlusion scenarios, the AMASS dataset primarily showcases straight poses, offering minimal diversity. This scenario permits decent outcomes without incorporating a pose prior, since the optimization’s starting point closely aligns with these prevalent poses. However, VPoser’s mean-centered characteristic hinders its ability to faithfully replicate the visible areas. On the other hand, Pose-NDF falls short in enhancing the occluded parts. DPoser accurately handles visible parts and guides occluded ones for more realistic poses. For left arm occlusions, which involve more varied movements, DPoser markedly surpasses other methods, underlining its adaptability and precision in handling diverse motion patterns.
Tab. S-3: Motion denoising with partial input.
Methods | Occlusion | MPJPE (Vis.) | MPJPE (Occ.) | MPJPE (All) | MPVPE (All) |
No prior | Legs | 0.26 | 14.72 | 5.52 | 5.45 |
VPoser | Legs | 1.75 | 14.29 | 6.31 | 7.38 |
Pose-NDF | Legs | 0.25 | 15.71 | 5.87 | 5.64 |
DPoser (ours) | Legs | 0.28 | 12.24 | 4.63 | 3.65 |
No prior | Left Arm | 0.26 | 24.87 | 4.74 | 9.91 |
VPoser | Left Arm | 1.21 | 13.23 | 3.40 | 7.68 |
Pose-NDF | Left Arm | 0.25 | 17.70 | 3.42 | 7.86 |
DPoser (ours) | Left Arm | 0.27 | 7.80 | 1.64 | 3.81 |
D Ablated DPoser’s Training
Tab. S-4: Normalization strategies (axis-angle representation).
Normalization | HMR: PA-MPJPE | Pose Completion: MPJPE () | Motion Denoising: MPVPE | Motion Denoising: MPJPE |
w/o norm | 57.88 | 45.37/102.28/41.08 | 44.82 | 24.04 |
min-max | 59.17 | 47.41/107.00/43.42 | 42.70 | 21.29 |
z-score | 56.49 | 34.37/72.47/26.32 | 38.57 | 20.24 |
Tab. S-5: Rotation representations.
Representation | HMR: PA-MPJPE | Pose Completion: MPJPE () | Motion Denoising: MPVPE | Motion Denoising: MPJPE |
axis-angle | 56.05 | 34.76/72.41/26.09 | 38.21 | 19.87 |
6D rotations | 57.54 | 40.89/81.43/27.31 | 38.44 | 20.12 |
This section dissects the impact of different rotation representations and normalization techniques on DPoser’s performance. Initially, we examine axis-angle representation, comparing various normalization strategies: min-max scaling, z-score normalization, and no normalization. Our findings, summarized in Tab. S-4, indicate that z-score normalization is generally the most effective. Subsequently, using this optimal normalization, we explore 6D rotations [57] as an alternative. As evidenced by Tab. S-5, axis-angle representation offers superior performance. This preference can be attributed to the effective modeling capabilities of diffusion models, along with the inherent advantages of axis-angle in capturing bounded joint rotations for regression tasks like human mesh recovery.
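The two normalization schemes compared above can be sketched as follows. The toy data, the $[-1, 1]$ target range for min-max scaling, and the helper names are assumptions for illustration; the source does not specify these details.

```python
import numpy as np

def zscore_fit(poses, eps=1e-8):
    """Per-dimension mean and standard deviation (eps guards against zero std)."""
    return poses.mean(axis=0), poses.std(axis=0) + eps

def zscore(poses, mean, std):
    return (poses - mean) / std

def minmax(poses, lo, hi):
    """Scale each dimension to [-1, 1] (the target range is an assumption)."""
    return 2.0 * (poses - lo) / (hi - lo) - 1.0

rng = np.random.default_rng(0)
# Toy axis-angle data: 1000 poses of 21 joints x 3, with deliberately uneven per-dimension scales.
poses = rng.normal(size=(1000, 63)) * rng.uniform(0.05, 1.5, size=63)

mean, std = zscore_fit(poses)
z = zscore(poses, mean, std)
m = minmax(poses, poses.min(axis=0), poses.max(axis=0))
z_back = z * std + mean            # inverse transform applied after sampling
```

Z-score statistics depend on all samples, whereas min-max bounds are set by the two most extreme values per dimension, so a few outlier poses can compress the remaining data into a narrow band. This is one possible reason for the gap in Tab. S-4, offered here as a hypothesis rather than a result from the paper.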
Inspired by HuMoR [40], we experiment with integrating the SMPL body model [31] as a regularization term during training. Alongside the prediction of additive noise, as outlined in Eq. (4) in the main text, we employ a 10-step DDIM sampler [44] to recover a "clean" version of the pose, denoted as $\hat{x}_0$, from the diffused $x_t$. The regularization loss $L_{\text{reg}}$ aims to minimize the discrepancy between the original and recovered poses under the SMPL body model $\mathcal{M}$:
$$L_{\text{reg}} = \left\| \mathcal{M}(x_0, \beta_0) - \mathcal{M}(\hat{x}_0, \beta_0) \right\|_2^2 \tag{21}$$
Here, $\beta_0$ represents the mean shape parameters in SMPL. To account for denoising errors, we scale the regularization loss by a timestep-dependent weight, thereby increasing the weight for samples with smaller $t$ values (less noise).
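The recover-then-compare structure of this regularizer can be sketched as below. Everything other than the 10-step deterministic DDIM loop is a placeholder: the cosine schedule, the oracle noise predictor (exact only for a point-mass data distribution), the linear "body model", and the constant weight are all hypothetical stand-ins for the trained network, SMPL, and the paper's weighting.

```python
import numpy as np

def schedule(t):
    """Assumed variance-preserving cosine schedule on t in [0, 1]."""
    return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)

def ddim_denoise(x_t, t_start, eps_model, n_steps=10):
    """Deterministic DDIM (eta = 0) from t_start down to 0."""
    ts = np.linspace(t_start, 0.0, n_steps + 1)
    x = x_t
    for t, s in zip(ts[:-1], ts[1:]):
        a_t, s_t = schedule(t)
        a_s, s_s = schedule(s)
        eps_hat = eps_model(x, t)
        x0_hat = (x - s_t * eps_hat) / a_t   # predicted clean sample
        x = a_s * x0_hat + s_s * eps_hat     # move to the lower noise level s
    return x

def body_model(pose, basis):
    """Hypothetical linear stand-in for the SMPL forward pass."""
    return pose @ basis

def smpl_regularizer(x_0, x_0_hat, basis, weight=1.0):
    """Weighted squared discrepancy between original and recovered poses."""
    diff = body_model(x_0, basis) - body_model(x_0_hat, basis)
    return weight * float(np.sum(diff**2))

rng = np.random.default_rng(0)
x_0 = rng.normal(size=63)
basis = rng.normal(size=(63, 30))
t = 0.6
alpha_t, sigma_t = schedule(t)
x_t = alpha_t * x_0 + sigma_t * rng.normal(size=63)

def oracle_eps(x, t):
    """Noise predictor that is exact when the data is a point mass at x_0."""
    a, s = schedule(t)
    return (x - a * x_0) / s

x_0_hat = ddim_denoise(x_t, t, oracle_eps)
reg = smpl_regularizer(x_0, x_0_hat, basis)
```

With the oracle predictor the recovered pose matches the original and the regularizer is numerically zero. With a learned predictor, the residual would grow with the noise level, which is consistent with the motivation for down-weighting large $t$.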
Fig. S-1 visualizes the impact of this regularization on MPJPE during the training, specifically for pose completion tasks with occlusion of both legs.
We observe that weighted regularization offers slight performance gains in the early training process, while the absence of weighting introduces instability and deterioration in results. Despite these insights, the computational cost of incorporating the SMPL model—especially for our large batch size of 1280—makes the training approximately 8 times slower. Therefore, we opted not to include this regularization in our main experiments.
E Extended DPoser’s Optimization
Tab. S-6: Comparison of diffusion-based inverse problem solvers on pose completion.
Methods | Occ. left leg | Occ. legs | Occ. arms | Occ. trunk |
ScoreSDE [48] | 48.73/106.32/41.30 | 74.68/128.32/37.27 | 66.89/127.86/48.15 | 16.69/34.54/12.21 |
DPS [6] | 40.51/104.32/54.57 | 64.26/113.46/33.71 | 60.63/119.85/42.78 | 15.10/33.90/13.27 |
MCG [7] | 49.04/106.37/41.07 | 74.90/128.53/37.40 | 66.17/127.72/48.15 | 16.69/34.66/12.23 |
DPoser (ours) | 35.37/74.01/26.47 | 59.25/96.77/24.55 | 51.27/81.76/20.04 | 13.95/28.57/9.85 |
In addressing pose-centric tasks as inverse problems, we propose a versatile optimization framework built on variational diffusion sampling [33]. We also explore other diffusion-based methods for solving these inverse problems, namely ScoreSDE [48], MCG [7], and DPS [6]. These methods augment standard generative processes with observational data, either by employing gradient-based guidance or back-projection techniques. We compare them with DPoser on pose completion. Our findings, reported in Tab. S-6, show that DPoser outperforms the competitors under most occlusion conditions. Consequently, DPoser emerges not merely as a broadly applicable solution to pose-related tasks, but also as an efficient one.
It is worth mentioning that methods rooted in generative frameworks [48, 7, 6, 21] can be harder to apply broadly in pose-centric tasks. For instance, in blind inverse problems, where certain parameters of the measurement operator (e.g., camera models in HMR) are unknown, generative methods are less straightforward to implement. ZeDO [18], a recent study focusing on the 2D-3D lifting task, adopts the ScoreSDE [48] framework and refines camera translations by solving an optimization sub-problem after each generative step. However, directly porting this strategy to HMR is non-trivial, owing to the added complexity of body shape parameter optimization, a feature currently absent in our DPoser model. Although some state-of-the-art techniques [5, 35] offer solutions by jointly modeling operator and data distributions, a full discussion of this subject is beyond the scope of this paper and remains an open question for future work.
F Truncated Timestep Scheduling on Images
Exploring truncated timestep scheduling for image-based tasks, we find its suitability for human poses doesn’t translate well to images. Initial timesteps are critical in image domains for generating foundational perceptual content.
In our study, we employed a 256x256 unconditional diffusion model [11] trained on ImageNet [10] with variational diffusion sampling [33] for image inpainting. Comparing standard (timesteps 990 to 0) and truncated scheduling (timesteps 495 to 0), both with 100 steps, the experiments confirmed that truncation compromises image quality (Fig. S-2). The standard approach preserved perceptual content, while truncation produced disjointed patches, misaligned with the original image context.
These results affirm that truncated timestep scheduling excels in pose data where key information emerges in later stages but falls short in image tasks where early timesteps are essential. This scheduling is thus bespoke to the characteristics of human pose estimation and is unsuitable for image processes that rely on the full diffusion timeline for content fidelity.
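For concreteness, the two schedules compared in this section can be written out as below. The source states only the endpoints (990 to 0 and 495 to 0) and the step count (100); the even spacing and the helper name are assumptions.

```python
import numpy as np

def timestep_schedule(t_max, n_steps):
    """Evenly spaced integer timesteps from t_max down to 0 (inclusive)."""
    return np.linspace(t_max, 0, n_steps).round().astype(int)

standard = timestep_schedule(990, 100)    # 990, 980, ..., 0
truncated = timestep_schedule(495, 100)   # 495, 490, ..., 0
```

Both schedules spend the same number of optimization steps, but the truncated one never visits the high-noise half of the diffusion timeline and covers the low-noise half at twice the resolution.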
G More Qualitative Results
H Potential Negative Impacts
- Bias and Fairness Concerns: Human pose prior learning models may inadvertently encode biases present in the training data, leading to biased predictions or discriminatory outcomes. This can perpetuate existing societal biases and inequalities, particularly if the training data is not representative or balanced across diverse demographics.
- Ethical Considerations: The use of human pose prior learning models in applications such as surveillance, security, or healthcare raises ethical concerns regarding individual privacy, autonomy, and consent. There are debates about the appropriate use of such technologies and the potential for unintended consequences or misuse.
- Dependency on Data Quality: Human pose prior learning models heavily rely on the quality and diversity of the training data. Poorly annotated or biased datasets can negatively impact the performance and reliability of these models, leading to inaccurate or unreliable predictions.