
DPoser: Diffusion Model as Robust 3D Human Pose Prior

Junzhe Lu$^{1}$, Jing Lin$^{2}$, Hongkun Dou$^{1}$, Ailing Zeng$^{3}$, Yue Deng$^{1}$, Yulun Zhang$^{4}$, and Haoqian Wang$^{2}$

$^{1}$Beihang University, $^{2}$Tsinghua University, $^{3}$International Digital Economy Academy (IDEA), $^{4}$ETH Zürich
https://dposer.github.io

Abstract

This work aims to construct a robust human pose prior, which remains a persistent challenge due to biomechanical constraints and the diversity of human movement. Traditional priors like VAEs and NDFs often fall short in realism and generalization, notably on unseen noisy poses. To address these issues, we introduce DPoser, a robust and versatile human pose prior built upon diffusion models. DPoser regards various pose-centric tasks as inverse problems and employs variational diffusion sampling to solve them efficiently. Designed to plug into optimization frameworks, DPoser benefits human mesh recovery, pose generation, pose completion, and motion denoising. Furthermore, because articulated poses differ in structure from images, we propose truncated timestep scheduling to enhance the effectiveness of DPoser. Compared with the uniform scheduling common in image domains, it yields improvements of 5.4%, 17.2%, and 3.8% on human mesh recovery, pose completion, and motion denoising, respectively. Comprehensive experiments demonstrate the superiority of DPoser over existing state-of-the-art pose priors across multiple tasks.

†Corresponding authors: Yulun Zhang and Haoqian Wang.

Keywords: Human Pose Prior, Diffusion Model

1 Introduction

Accurate modeling of human pose is a fundamental research topic that can benefit various applications, from human-robot interaction to augmented and virtual reality experiences. Many real-world applications rely on a prior distribution of valid human poses to perform tasks like body model fitting, motion capture, and gesture recognition. The complexity of human biomechanics, coupled with the extensive kinematic variability in movement patterns, presents a significant challenge in constructing a robust and realistic human pose prior.

Previous efforts to model human pose prior have mainly employed techniques such as Gaussian Mixture Models (GMMs) [1], Variational Autoencoders (VAEs) [37], and Neural Distance Fields (NDFs) [49]. Each technique, however, faces its own set of limitations. GMMs, for instance, might lead to the generation of implausible poses due to their unbounded nature. VAEs, restricted by their Gaussian assumptions, tend to generate average poses that may not accurately capture the full spectrum of human actions. Meanwhile, NDFs have shown promise in 3D surface modeling but struggle with generalizing across the complex, high-dimensional landscape of human pose manifolds. These limitations highlight a pressing need for a more comprehensive and dependable approach to modeling human pose priors, an endeavor this work seeks to address.

Recently, diffusion models [16, 48, 11, 20] have gained traction for their ability to capture complex, high-dimensional data distributions and to support versatile sampling techniques. They have been applied to generating lifelike human motion sequences [56, 42] and to multi-hypothesis pose estimation from 2D inputs [17, 8]. However, these models are designed for specific generation tasks or tailored to conditional inputs, which limits their applicability in broader contexts. The potential of diffusion models as a universal human pose prior remains largely untapped, and how to optimize with them effectively across diverse tasks remains an open question.

Figure 1: An overview of DPoser’s versatility and performance across multiple pose-related tasks. Built on diffusion models, DPoser serves as a robust and adaptable pose prior. Shown are scenarios in (a) pose generation, (b) human mesh recovery, (c) motion denoising, and (d) pose completion. DPoser consistently outstrips existing priors like VPoser [37] in performance benchmarks.

In this work, we propose DPoser, a novel approach that leverages a time-dependent denoiser learned from expansive motion capture datasets to construct a robust human pose prior. We regard various pose-centric tasks as inverse problems and integrate DPoser via variational diffusion sampling techniques [33] as a regularization component within optimization frameworks like SMPLify [1]. Furthermore, our investigations reveal that significant pose-related information during diffusion is predominantly located at the latter stages of the diffusion trajectory. This observation motivates a novel truncated timestep scheduling strategy for optimization. Our method outperforms the standard uniform scheduling, showing gains of 5.4%, 17.2%, and 3.8% in human mesh recovery, pose completion, and motion denoising, respectively.

In summary, our main contributions are as follows:

• We introduce DPoser, a novel framework based on diffusion models to craft a robust and flexible human pose prior, geared for seamless integration across diverse pose-related tasks via test-time optimization.
• We analyze the impact of diffusion timesteps in the pose domain and propose truncated scheduling for more efficient optimization.
• Through extensive experiments, we establish that DPoser outshines state-of-the-art (SOTA) pose priors in a variety of downstream tasks.

2 Related Work

2.1 Human Pose Priors

Human body models such as SMPL [31] serve as powerful tools for parameterizing both pose and shape, thereby offering a comprehensive framework for describing human gestures. Within the SMPL model, body poses are captured using rotation matrices or joint angles linked to a kinematic skeleton. Adjusting these parameters enables the representation of a diverse range of human actions. Nonetheless, feeding unrealistic poses into these models can result in non-viable human figures, primarily because plausible human poses are confined within a complex, high-dimensional manifold due to biomechanical constraints.

Various strategies [1, 37, 49, 9] have been put forward to build human pose priors. Generative frameworks like GMMs, VAEs [22], and Generative Adversarial Networks (GANs) [13] have shown promise in encapsulating the multifaceted pose distribution, facilitating advancements in tasks like human mesh recovery [19, 12]. Further, some studies have delved into conditional pose priors tailored to specific tasks, incorporating extra information such as image features [39, 3], 2D joint coordinates [8], or sequences of preceding poses [28, 40]. Our initiative leans towards an unconditional pose prior approach, training DPoser on extensive motion capture data without relying on additional inputs like images or text, aiming for a versatile application across various pose-related scenarios.

2.2 Diffusion Models for Pose-centric Tasks

Diffusion models [47, 48, 16, 44] have emerged as powerful tools for capturing intricate data distributions, aligning particularly well with the demands of multi-hypothesis estimation in ambiguous human poses. Notable works include DiffPose [17], which leverages a Gaussian Mixture Model-guided forward diffusion process [36] and employs a Graph Convolutional Network (GCN) [23] architecture conditioned on 2D pose sequences to estimate 3D poses via the learned reverse process (i.e., generation). In a similar vein, DiffusionPose [39] and GFPose [8] employ the generation-based pipeline but differ in how they condition the model. Further, ZeDO [18] concentrates on 2D-to-3D pose lifting, while Diff-HMR [3] and DiffHand [24] explore estimating SMPL parameters and hand mesh vertices, respectively. BUDDI [34] stands out for using diffusion models to capture the joint distribution of interacting individuals and leveraging SDS loss [38, 52] for optimization at test time.

While DPoser shares a similar optimization implementation with BUDDI, it sets itself apart by adopting the wider perspective of inverse problems and by introducing a timestep scheduling strategy tailored to the characteristics of human poses. Unlike other approaches [18, 39, 8, 17] that primarily focus on 3D location-based representation, DPoser takes on the more demanding task of modeling SMPL-based rotation pose representation. This adds complexity due to the intricacies involved in representing rotations, positioning DPoser as a more versatile solution within the realm of pose-centric tasks.

3 Methods

3.1 Preliminary: Score-based Diffusion Models

Diffusion models [43, 47, 48, 16] operationalize generative processes by inverting a predefined forward diffusion process, typically formulated as a linear stochastic differential equation (SDE). Formally, the data trajectory $\left\{\mathbf{x}(t)\in\mathbb{R}^{n}\right\}_{t\in[0,1]}$ follows the forward SDE given by:

\mathrm{d}\mathbf{x}=\mu(t)\mathbf{x}\mathrm{d}t+g(t)\mathrm{d}\mathbf{w},

(1)

where $\mu(t)\mathbf{x}\in\mathbb{R}^{n}$ and $g(t)\in\mathbb{R}$ represent the drift and diffusion coefficients, while $\mathbf{w}$ is a standard Wiener process.

The affine drift coefficients ensure analytically tractable Gaussian perturbation kernels, denoted by $p_{0t}(\mathbf{x}_{t}\mid\mathbf{x}_{0})=\mathcal{N}(\mathbf{x}_{t};\alpha_{t}\mathbf{x}_{0},\sigma_{t}^{2}\mathbf{I})$, where the exact coefficients $\alpha_{t},\sigma_{t}$ can be obtained with standard techniques [41]. Using appropriately designed $\alpha_{t}$ and $\sigma_{t}$, this allows the data distribution $\mathbf{x}_{0}\sim p_{data}$ to morph into a tractable isotropic Gaussian distribution $\mathbf{x}_{1}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ via forward diffusion.

To recover data distribution $p_{data}$ from the Gaussian distribution $\mathcal{N}(\mathbf{0},\mathbf{I})$ , we can simulate the corresponding reverse SDE of Eq. (1) [48]:

\mathrm{d}\mathbf{x}=[\mu(t)\mathbf{x}-g(t)^{2}\nabla_{\mathbf{x}_{t}}\log p_{t}\left(\mathbf{x}_{t}\right)]\mathrm{d}t+g(t)\mathrm{d}\bar{\mathbf{w}}.

(2)

The so-called score function [29], $\nabla_{\mathbf{x}_{t}}\log p_{t}\left(\mathbf{x}_{t}\right)$, serves as an unknown term in Eq. (2) and can be approximated by a neural network parameterized as $\epsilon_{\phi}(\mathbf{x}_{t};t)\approx-\sigma_{t}\nabla_{\mathbf{x}_{t}}\log p_{t}\left(\mathbf{x}_{t}\right)$. (This parameterization is obtained from the deep connection between the noise prediction in diffusion models and score function estimation in score-based models; we provide a brief recap in the Appendix.) To learn the score functions, employing denoising score matching techniques [50], we perturb the data points with noise as per:

\mathbf{x}_{t}=\alpha_{t}\mathbf{x}_{0}+\sigma_{t}\epsilon,\quad\epsilon\sim\mathcal{N}(\mathbf{0},\mathbf{I}).

(3)

Subsequently, feeding $\mathbf{x}_{t}$ and $t$ as input, we train the time-dependent noise predictor $\epsilon_{\phi}(\mathbf{x}_{t};t)$ using an L2-loss defined as [16]:

\mathbb{E}_{\mathbf{x}_{0}\sim p_{\mathrm{data}},\epsilon\sim\mathcal{N}(\mathbf{0},\mathbf{I}),t\sim\mathcal{U}[0,1]}\left[w(t)\|\epsilon-\epsilon_{\phi}(\mathbf{x}_{t};t)\|_{2}^{2}\right],

(4)

where $w(t)$ denotes a positive weighting function.

Upon successful training, the score functions can be estimated and used to solve the reverse SDE (Eq. (2)). Through techniques like Euler-Maruyama discretization, we can generate novel samples by simulating the reverse SDE.
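To make the sampling step concrete, the following is a minimal sketch of Euler-Maruyama integration of the reverse SDE in Eq. (2). It is a toy illustration rather than the paper's sampler: it assumes a constant noise scale `XI` (so that $\mu(t)=-\frac{1}{2}\xi$ and $g(t)=\sqrt{\xi}$), a one-dimensional Gaussian "data" distribution with mean `M` and standard deviation `S`, and it uses the closed-form score of that toy distribution in place of a trained network. All of these names and values are assumptions introduced for the example.

```python
import math
import random

XI = 8.0          # constant noise scale (assumed for this toy example)
M, S = 2.0, 0.5   # mean and std of the toy 1-D "data" distribution (assumed)


def alpha(t):
    """Mean coefficient of the perturbation kernel for mu(t) = -XI / 2."""
    return math.exp(-0.5 * XI * t)


def sigma2(t):
    """Variance of the perturbation kernel for g(t)^2 = XI."""
    return 1.0 - math.exp(-XI * t)


def score(x, t):
    """Exact score of p_t when the data are N(M, S^2); stands in for a network."""
    var_t = alpha(t) ** 2 * S ** 2 + sigma2(t)
    return -(x - alpha(t) * M) / var_t


def sample(n_steps, rng):
    """Integrate the reverse SDE from t = 1 down to t = 0 with Euler-Maruyama."""
    dt = 1.0 / n_steps
    x = rng.gauss(0.0, 1.0)  # start from the (approximately) Gaussian terminal state
    for i in range(n_steps):
        t = 1.0 - i * dt
        # Reverse drift: mu(t) * x - g(t)^2 * score(x, t)
        drift = -0.5 * XI * x - XI * score(x, t)
        # Stepping backward in time flips the sign of the drift term
        x = x - drift * dt + math.sqrt(XI * dt) * rng.gauss(0.0, 1.0)
    return x
```

With enough steps, the empirical mean and standard deviation of the returned samples should approach `M` and `S`, which is a quick sanity check on the discretization.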

3.2 Learning Pose Prior with Unconditional Diffusion Models

SMPL-based pose representation. To build a flexible 3D human pose prior, we propose to utilize the SMPL body model [31], which can be viewed as a differentiable function $[J,V]=M(\theta,\beta)$ that maps body joint angles $\theta\in\mathbb{R}^{3\times 21}$ and shape parameters $\beta\in\mathbb{R}^{10}$ to mesh vertices $V\in\mathbb{R}^{3\times 6890}$ and joint positions $J\in\mathbb{R}^{3\times 22}$ . Our target is to model the distribution of joint angles $p(\theta)$ .

Training of unconditional diffusion models. To this end, we adopt an unconditional diffusion model to learn the pose representation $\theta$ . This approach aligns with a task-agnostic strategy, focusing solely on the distribution of 3D poses. We employ sub-VP SDEs as outlined in [48], which have demonstrated efficacy in sampling quality, for constructing our diffusion model. Specifically, our chosen forward SDE (Eq. (1)) is given by:

\mathrm{d}\mathbf{x}=-\frac{1}{2}\xi(t)\mathbf{x}\mathrm{d}t+\sqrt{\xi(t)(1-e^{-2\int_{0}^{t}\xi(s)\mathrm{d}s})}\mathrm{d}\mathbf{w},

(5)

where $\xi(t)$ denotes linearly scheduled noise scales. The coefficients needed in Eq. (3) can be obtained as $\alpha_{t}=e^{-\frac{1}{2}\int_{0}^{t}\xi(s)\mathrm{d}s}$ and $\sigma_{t}=1-e^{-\int_{0}^{t}\xi(s)\mathrm{d}s}$.

During training, we initiate with a clean data point $\mathbf{x}_{0}$ —essentially, our pose representation $\theta$ —and introduce noise to generate samples $\mathbf{x}_{t}$ according to the forward process detailed in Eq. (3). Then we apply the objective in Eq. (4) to train the noise predictor $\epsilon_{\phi}(\mathbf{x}_{t};t)$ with weights $w(t)=\sigma_{t}^{2}$ as suggested in [48].
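The coefficients and the training objective above can be sketched in a few lines. The example below assumes a linear schedule $\xi(t)=\xi_{\min}+t(\xi_{\max}-\xi_{\min})$ with endpoints 0.1 and 20; these endpoint values are an assumption for illustration and are not stated in the text. The helper names (`subvp_coeffs`, `perturb`, `weighted_loss`) are likewise hypothetical, and the noise prediction is passed in as a plain list rather than produced by a network.

```python
import math
import random

XI_MIN, XI_MAX = 0.1, 20.0  # assumed schedule endpoints, not given in the text


def integral_xi(t):
    """Closed form of the integral of the linear schedule xi(s) over [0, t]."""
    return XI_MIN * t + 0.5 * (XI_MAX - XI_MIN) * t * t


def subvp_coeffs(t):
    """Return (alpha_t, sigma_t) of the sub-VP perturbation kernel."""
    a = integral_xi(t)
    return math.exp(-0.5 * a), 1.0 - math.exp(-a)


def perturb(x0, t, rng):
    """Eq. (3): x_t = alpha_t * x_0 + sigma_t * eps, with eps ~ N(0, I)."""
    alpha_t, sigma_t = subvp_coeffs(t)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [alpha_t * x + sigma_t * e for x, e in zip(x0, eps)]
    return xt, eps


def weighted_loss(eps, eps_pred, t):
    """One-sample estimate of Eq. (4) with the weighting w(t) = sigma_t^2."""
    _, sigma_t = subvp_coeffs(t)
    sq_err = sum((e - p) ** 2 for e, p in zip(eps, eps_pred))
    return sigma_t ** 2 * sq_err
```

Under this schedule, $\alpha_{0}=1$ and $\sigma_{0}=0$ (no perturbation), while at $t=1$ the signal coefficient is close to zero and $\sigma_{1}$ is close to one, matching the requirement that $\mathbf{x}_{1}$ be approximately standard Gaussian.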

3.3 Optimization Leveraging Diffusion Priors

The acquired score functions or noise predictors, denoted as $\epsilon_{\phi}(\mathbf{x}_{t};t)$ , permit the direct generation of plausible poses through Eq. (2). Yet, the broader integration of diffusion priors into general optimization frameworks remains an open avenue. We address this by reframing pose-related tasks as inverse problems and applying variational diffusion sampling techniques [33] for efficient resolution.

Inverse problem formulation. Consider an original signal $\mathbf{x}_{0}$ . Inverse problems can be encapsulated by Eq. (6) as:

\mathbf{y}=\mathcal{A}(\mathbf{x}_{0})+\mathbf{n},\quad\mathbf{y},\mathbf{n}\in\mathbb{R}^{d},~\mathbf{x}_{0}\in\mathbb{R}^{n},

(6)

where $\mathcal{A}$ symbolizes the measurement operator and $\mathbf{n}$ constitutes noise, assumed to be white Gaussian $\mathcal{N}(\mathbf{0},\sigma_{n}^{2}\mathbf{I})$ . In the context targeted in this study, $\mathbf{x}_{0}$ always refers to body poses in SMPL [31]. This formulation allows us to approach various pose-centric tasks by adapting $\mathcal{A}$ and interpreting $\mathbf{y}$ accordingly:

• Pose completion: Here, $\mathcal{A}$ serves as a mask matrix to simulate partially observed poses, with $\mathbf{y}$ being the incomplete pose data.
• Motion denoising: In this scenario, $\mathcal{A}$ applies SMPL’s forward kinematics, treating $\mathbf{y}$ as the observed noisy 3D joints.
• Human mesh recovery: $\mathcal{A}$ integrates SMPL’s forward kinematics and camera projection to relate $\mathbf{y}$ to 2D joint observations in images.

The aim is to recover the original signal $\mathbf{x}_{0}$ , where, within the Bayesian framework, our objective shifts to sampling from the posterior distribution $p\left(\mathbf{x}_{0}\mid\mathbf{y}\right)$ .
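As a small illustration of the formulation in Eq. (6), the sketch below instantiates the measurement operator $\mathcal{A}$ for pose completion as a coordinate mask over a flattened $21\times 3$ axis-angle pose vector. The joint indices used to mark occlusion are placeholders, not the actual SMPL indices of any body part, and the function names are hypothetical.

```python
N_JOINTS = 21  # body joints in the pose vector, 3 axis-angle values each


def visible_coords(occluded_joints):
    """Indices of pose coordinates whose joints are NOT occluded."""
    hidden = set(occluded_joints)
    return [3 * j + k for j in range(N_JOINTS) if j not in hidden for k in range(3)]


def make_mask_operator(occluded_joints):
    """Build A(x): keep only the visible coordinates of a flat 63-D pose."""
    keep = visible_coords(occluded_joints)

    def measure(x):
        return [x[i] for i in keep]

    return measure


def data_fidelity(y, measure, x0):
    """Squared residual ||y - A(x0)||^2 used as the task loss."""
    return sum((yi - ai) ** 2 for yi, ai in zip(y, measure(x0)))
```

The other two tasks follow the same pattern with a different `measure`: forward kinematics for motion denoising, and forward kinematics followed by camera projection for human mesh recovery.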

Solving inverse problems with diffusion models. Various techniques [14, 21, 6, 7, 45, 33] have been explored to simulate this posterior sampling process based on unconditional diffusion priors $p\left(\mathbf{x}_{0};\phi\right)$. Among them, the sampling-based scheme is widely explored and applied in tasks like image restoration. These methods incorporate the observation information $\mathbf{y}$ into the generation process of $\mathbf{x}_{0}$ through techniques like gradient guidance [6, 7] and back projection [48, 21, 7]. However, such methods rooted in generation are inconvenient for handling diverse pose-related tasks. To navigate these challenges, we adopt variational diffusion sampling [33] to build general optimization frameworks. Specifically, it employs a variational distribution $q\left(\mathbf{x}_{0}\mid\mathbf{y}\right):=\mathcal{N}(\mu,\sigma^{2}\mathbf{I})$ and aims to minimize the Kullback-Leibler (KL) divergence between this variational distribution and the true posterior, mathematically expressed as $\mathrm{KL}\big(q\left(\mathbf{x}_{0}\mid\mathbf{y}\right)\parallel p\left(\mathbf{x}_{0}\mid\mathbf{y}\right)\big)$. Further, under the assumption of zero variance ($\sigma\approx 0$), the optimization problem of seeking $\mathbf{x}_{0}$ (i.e., $\mu$) can be formulated as minimizing [46, 33]:

\|\mathbf{y}-\mathcal{A}(\mathbf{x}_{0})\|^{2}+w_{t}(\mathtt{sg}[\epsilon_{\phi}(\mathbf{x}_{t};t)-\epsilon])^{\top}\mathbf{x}_{0},

(7)

where $w_{t}$ denotes the loss weights and $\epsilon$ is sampled from the standard Gaussian distribution. Here, $\mathtt{sg}$ signifies the stopped-gradient operator, indicating that backpropagation through the trained diffusion models is not required. The optimization procedure initiates by selecting a timestep $t$ and applying a perturbation to the target $\mathbf{x}_{0}$ as per Eq. (3), resulting in $\mathbf{x}_{t}$ . Subsequently, the gradients $[\epsilon_{\phi}(\mathbf{x}_{t};t)-\epsilon]$ are applied to the optimization variable $\mathbf{x}_{0}$ . In a nutshell, this framework [33] provides a flexible yet robust strategy for employing diffusion priors in generic optimization problems, serving as a cornerstone for our work.

Introducing DPoser regularization. To shed more light on the working mechanism, we propose to reformulate the regularization term as:

L_{\mathrm{DPoser}}=w_{t}\|\mathbf{x}_{0}-\mathtt{sg}[\mathbf{\hat{x}}_{0}(t)]\|_{2}^{2},\quad\text{where}

(8)

\mathbf{\hat{x}}_{0}(t)=\frac{\mathbf{x}_{t}-\sigma_{t}\epsilon_{\phi}(\mathbf{x}_{t};t)}{\alpha_{t}}.

(9)

Here, $\mathbf{\hat{x}}_{0}(t)$ is the one-step denoising prediction of the diffusion model $\epsilon_{\phi}(\mathbf{x}_{t};t)$. The regularizer pulls the current pose $\mathbf{x}_{0}$ toward this denoised, plausible pose using a straightforward L2 loss. Further, the gradient of this regularizer aligns with the gradient direction of variational diffusion sampling (Eq. (7)), as the following derivation shows.

Proof: Differentiating Eq. (8) with respect to $\mathbf{x}_{0}$, treating $\mathtt{sg}[\mathbf{\hat{x}}_{0}(t)]$ as a constant and substituting $\mathbf{x}_{0}=(\mathbf{x}_{t}-\sigma_{t}\epsilon)/\alpha_{t}$ from Eq. (3), yields:

\begin{aligned}
\nabla_{\mathbf{x}_{0}}L_{\mathrm{DPoser}}&=2w_{t}(\mathbf{x}_{0}-\mathbf{\hat{x}}_{0}(t))\\
&=2w_{t}\left(\frac{\mathbf{x}_{t}-\sigma_{t}\epsilon}{\alpha_{t}}-\frac{\mathbf{x}_{t}-\sigma_{t}\epsilon_{\phi}(\mathbf{x}_{t};t)}{\alpha_{t}}\right)\\
&=2w_{t}\frac{\sigma_{t}}{\alpha_{t}}(\epsilon_{\phi}(\mathbf{x}_{t};t)-\epsilon)\\
&\propto(\epsilon_{\phi}(\mathbf{x}_{t};t)-\epsilon).
\end{aligned}

(10)

Thus, $L_{\mathrm{DPoser}}$ offers a more intuitive form of variational diffusion sampling. When incorporated alongside task-specific loss functions, this regularization term enhances the plausibility of the resulting poses.
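The identity in Eq. (10) can be checked numerically. The sketch below is only an illustration: it uses plain Python lists, emulates the stopped gradient by treating $\mathbf{\hat{x}}_{0}(t)$ as a constant, and takes an arbitrary vector as the noise prediction instead of a trained network. The function names are hypothetical.

```python
def one_step_denoise(xt, eps_pred, alpha_t, sigma_t):
    """Eq. (9): x0_hat = (x_t - sigma_t * eps_pred) / alpha_t."""
    return [(x - sigma_t * e) / alpha_t for x, e in zip(xt, eps_pred)]


def dposer_loss(x0, x0_hat, w_t):
    """Eq. (8), with x0_hat held fixed (this plays the role of sg[.])."""
    return w_t * sum((a - b) ** 2 for a, b in zip(x0, x0_hat))


def dposer_grad(x0, x0_hat, w_t):
    """Gradient of Eq. (8) w.r.t. x0 when x0_hat is a constant."""
    return [2.0 * w_t * (a - b) for a, b in zip(x0, x0_hat)]
```

If `xt` is built from `x0` and a noise draw `eps` via Eq. (3), then `dposer_grad` equals `2 * w_t * sigma_t / alpha_t * (eps_pred - eps)` element-wise, which is the third line of the derivation.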

DPoser across pose-related tasks. DPoser excels in versatility, enabling its seamless application in a spectrum of human pose-related tasks. Its adaptability is especially evident in our human mesh recovery approach, as depicted in Fig. 2. For an exhaustive examination of DPoser’s utility across tasks like pose completion and motion denoising, we direct the reader to our Appendix.

Human mesh recovery aims to deduce the human pose and shape from single-image inputs. In this context, we refine the optimization function derived from the SMPLify framework [1], integrating DPoser as a regularization term, $L_{\mathrm{DPoser}}$ , and streamlining the process by omitting the intricate interpenetration error component. The modified optimization objective, engaging both pose $\theta$ and shape $\beta$ parameters from the SMPL model [31], is defined as:

L(\theta,\beta)=L_{J}+w_{\theta}L_{\theta}+w_{\beta}L_{\beta}+w_{\alpha}L_{\mathrm{DPoser}}.

(11)

The reprojection loss $L_{J}$ , acting as the data fidelity measure, is defined by:

L_{J}=\sum_{i\in\text{Joints}}\lambda_{i}\rho\left(\Pi_{C}\left(M_{J}(\theta,\beta)_{i}\right)-J^{\text{est}}_{i}\right),

(12)

where $M_{J}(\theta,\beta)$ calculates the 3D joint coordinates through SMPL’s forward kinematics. The function $\Pi_{C}$ maps these 3D coordinates into 2D space, aligning with the camera’s perspective. $J^{\text{est}}$ refers to the 2D keypoints estimated using an off-the-shelf 2D pose estimator (in our case, ViTPose [55]), with $\lambda_{i}$ reflecting the confidence score for each joint $i$ . The Geman-McClure error function ( $\rho$ ) is employed to assess the discrepancy in 2D joint locations reliably.
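A minimal sketch of the reprojection term in Eq. (12) is given below. Several details are assumptions made for illustration: the Geman-McClure penalty is written in the form $\rho(e)=s^{2}e^{2}/(s^{2}+e^{2})$ and applied to the squared 2D distance per joint, $\Pi_{C}$ is modeled as a simple pinhole camera with hypothetical `focal` and `center` parameters, and the 3D joints are passed in directly rather than computed from SMPL.

```python
def geman_mcclure(sq_residual, scale):
    """Robust penalty: near-quadratic for small residuals, bounded by scale^2."""
    return scale ** 2 * sq_residual / (sq_residual + scale ** 2)


def project(point3d, focal, center):
    """Pinhole projection of a camera-space 3D point to pixel coordinates."""
    x, y, z = point3d
    return focal * x / z + center[0], focal * y / z + center[1]


def reprojection_loss(joints3d, keypoints2d, conf, focal, center, scale):
    """Confidence-weighted robust 2D error summed over joints."""
    total = 0.0
    for p3, k2, lam in zip(joints3d, keypoints2d, conf):
        u, v = project(p3, focal, center)
        sq = (u - k2[0]) ** 2 + (v - k2[1]) ** 2
        total += lam * geman_mcclure(sq, scale)
    return total
```

Because the penalty saturates at `scale ** 2`, a single badly detected keypoint contributes a bounded amount to the loss, which is the reason for using a robust function instead of a plain squared error.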

To mitigate the issue of overfitting, which often leads to unrealistic poses when solely minimizing reprojection loss, several regularization terms are introduced. Specifically, alongside our body prior $L_{\mathrm{DPoser}}$ , the bending term $L_{\theta}$ is incorporated to penalize excessive bending at the elbows and knees, formulated as $L_{\theta}=\sum_{i\in\text{(elbows, knees)}}\exp(\boldsymbol{\theta}_{i})$ . Additionally, the shape regularization term $L_{\beta}=\|\beta\|_{2}^{2}$ is employed to maintain the body shape within plausible bounds. The weights for prior terms are denoted as $w_{\theta},w_{\beta}$ and $w_{\alpha}$ , respectively.

Given the structure of $L_{\mathrm{DPoser}}$ (as seen in Eq. (8)), a crucial aspect lies in judiciously selecting the diffusion timestep $t$ during the iterative optimization process. In the subsequent section, we address this concern by introducing our novel truncated timestep scheduling strategy.

3.4 Test-time Truncated Timestep Scheduling

Motivation from pose generation. Adapting techniques from the image domain to pose data requires a nuanced understanding of the differences between the two. Previous image-based research [4] shows that initial timesteps (larger $t$ ) correspond to the perceptual content, while later timesteps refine details. Pose data, however, lacks this structured layering and spatial redundancy, indicating a need for a tailored timestep approach in the diffusion process.

As depicted in Fig. 3, we find that pose generation doesn’t benefit from the early timesteps as image generation does. The significant stages of pose refinement occur at smaller $t$ , specifically when $t\leq 0.3$ . A uniform distribution of timesteps, as tested in (b) with only five steps, proves less effective for pose data. In contrast, allocating these steps toward the latter end of the diffusion process, as in (c), yields significantly better samples, implying the critical information is not evenly distributed but rather is concentrated toward the end.

Truncated timestep scheduling. Based on these insights, we propose a shift from standard uniform timestep sampling to a truncated strategy, especially for pose data. By focusing on the last timesteps, particularly between 0.2 and 0.0, we target the interval rich in pose-specific information. Specifically, based on the linear descending scheduling, the truncated timestep $t$ for each optimization step can be expressed as:

t=t_{\text{max}}-\frac{(t_{\text{max}}-t_{\text{min}})\times\text{iter}}{N-1},

(13)

where $N$ denotes the total number of optimization iterations, and iter signifies the current iteration. This formulation is integral to our proposed optimization framework, which is comprehensively summarized in Algorithm 1. The practical implementation typically involves setting the truncated range to $[0.2,0.05]$ .
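Eq. (13) is a linear interpolation from $t_{\text{max}}$ down to $t_{\text{min}}$ over the optimization iterations. A minimal sketch follows; the function name is hypothetical, the default range is the $[0.2,0.05]$ setting mentioned above, and the single-iteration case (where Eq. (13) would divide by zero) is handled by returning $t_{\text{max}}$, which is a choice made here rather than something the text specifies.

```python
def truncated_timesteps(n_iters, t_max=0.2, t_min=0.05):
    """Linearly descending timesteps from t_max to t_min, one per iteration."""
    if n_iters < 1:
        raise ValueError("n_iters must be at least 1")
    if n_iters == 1:
        return [t_max]  # Eq. (13) is undefined for N = 1; fall back to t_max
    step = (t_max - t_min) / (n_iters - 1)
    return [t_max - step * it for it in range(n_iters)]
```

For example, five iterations produce timesteps that start at 0.2, pass through 0.125 at the midpoint, and end at 0.05 (up to floating-point rounding).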

Algorithm 1 Test-time Optimization with DPoser

1: Input: A trained diffusion model $\epsilon_{\phi}(\mathbf{x}_{t};t)$, task-specific loss $L_{\text{task}}$, range of diffusion timesteps $[t_{\text{max}},t_{\text{min}}]$, number of optimization iterations $N$
2: Initialization of SMPL body pose parameters $\mathbf{x}_{0}$
3: for $\text{iter}=0,1,\ldots,N-1$ do
4:     $t\leftarrow t_{\text{max}}-\frac{(t_{\text{max}}-t_{\text{min}})\times\text{iter}}{N-1}$ ▷ Timestep scheduling
5:     Sample $\epsilon\sim\mathcal{N}(0,I)$
6:     $\mathbf{x}_{t}\leftarrow\alpha_{t}\mathbf{x}_{0}+\sigma_{t}\epsilon$ ▷ Forward diffusion
7:     $\mathbf{\hat{x}}_{0}(t)\leftarrow\frac{\mathbf{x}_{t}-\sigma_{t}\epsilon_{\phi}(\mathbf{x}_{t};t)}{\alpha_{t}}$ ▷ One-step denoiser
8:     $L_{\text{DPoser}}\leftarrow w_{t}\lVert\mathbf{x}_{0}-\text{sg}[\mathbf{\hat{x}}_{0}(t)]\rVert_{2}^{2}$ ▷ DPoser regularization
9:     $L_{\text{total}}\leftarrow L_{\text{task}}+L_{\text{DPoser}}$
10:     Update $\mathbf{x}_{0}$ via backpropagation on $L_{\text{total}}$
11: end for
12: return $\mathbf{x}_{0}$
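Algorithm 1 can be traced end to end on a toy problem. The sketch below is not the authors' implementation. It assumes a Gaussian "pose" prior $\mathcal{N}(m, s^{2}\mathbf{I})$ whose exact noise predictor is available in closed form (standing in for the trained network), a linear sub-VP schedule with assumed endpoints 0.1 and 20, a masked L2 task loss as in pose completion, a constant weight $w_{t}$, and plain gradient descent with hand-written gradients in place of backpropagation. All names and hyperparameter values are hypothetical.

```python
import math
import random

XI_MIN, XI_MAX = 0.1, 20.0          # assumed linear schedule endpoints
PRIOR_MEAN, PRIOR_STD = 0.0, 0.1    # toy Gaussian prior (assumption)


def subvp_coeffs(t):
    """(alpha_t, sigma_t) for the sub-VP SDE with a linear schedule."""
    a = XI_MIN * t + 0.5 * (XI_MAX - XI_MIN) * t * t
    return math.exp(-0.5 * a), 1.0 - math.exp(-a)


def toy_noise_predictor(xt, t):
    """Exact eps-prediction (-sigma_t * score) for the toy Gaussian prior."""
    alpha_t, sigma_t = subvp_coeffs(t)
    var_t = (alpha_t * PRIOR_STD) ** 2 + sigma_t ** 2
    return [sigma_t * (x - alpha_t * PRIOR_MEAN) / var_t for x in xt]


def optimize(x0, y, observed, n_iters=200, t_max=0.2, t_min=0.05,
             w_t=1.0, lr=0.1, seed=0):
    """Minimize masked L2 task loss plus DPoser regularization (Algorithm 1)."""
    rng = random.Random(seed)
    x0 = list(x0)
    for it in range(n_iters):
        # Truncated timestep scheduling, Eq. (13)
        t = t_max - (t_max - t_min) * it / (n_iters - 1)
        alpha_t, sigma_t = subvp_coeffs(t)
        # Forward diffusion, Eq. (3)
        eps = [rng.gauss(0.0, 1.0) for _ in x0]
        xt = [alpha_t * x + sigma_t * e for x, e in zip(x0, eps)]
        # One-step denoiser, Eq. (9); treated as a constant (stop-gradient)
        eps_pred = toy_noise_predictor(xt, t)
        x0_hat = [(a - sigma_t * p) / alpha_t for a, p in zip(xt, eps_pred)]
        # Gradient of L_DPoser, first line of Eq. (10)
        grad = [2.0 * w_t * (x - h) for x, h in zip(x0, x0_hat)]
        # Gradient of the task loss sum_i (x0[i] - y[i])^2 over observed coords
        for i in observed:
            grad[i] += 2.0 * (x0[i] - y[i])
        x0 = [x - lr * g for x, g in zip(x0, grad)]
    return x0
```

In this toy setting, an unobserved coordinate initialized far from the prior is pulled toward `PRIOR_MEAN` by the regularizer alone, while an observed coordinate settles between its measurement and the prior mean.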

4 Experiments

In this section, we showcase the robustness and versatility of DPoser across a spectrum of pose-centric tasks, including pose generation, human mesh recovery, pose completion, and motion denoising. Due to the page limit, we leave experimental details and more qualitative assessments in the Appendix.

4.1 Experimental Setup

Implementation details. We train our DPoser model on the AMASS dataset [32], adhering to the same training partition as previous works [37, 49]. The model employs axis-angle representation for joint rotations, which we normalize to have zero mean and unit variance. The architecture consists of a fully connected neural network with approximately 8.28M parameters. It draws inspiration from GFPose [8] but omits conditional input pathways for our unconditional setting. To stabilize training, we use an exponential moving average with a decay factor of 0.9999, as advised by [48]. The Adam optimizer, a learning rate of $2\times 10^{-4}$ , and a batch size of 1280 govern the optimization process. The training of 800,000 iterations takes roughly 8 hours on a single Nvidia RTX 3090Ti GPU.

Evaluation metrics. To comprehensively evaluate our models across various tasks, following Pose-NDF [49], we adopt task-specific metrics:

• Pose Generation: Diversity and fidelity are evaluated using Average Pairwise Distance (APD) and Self-Intersection rates (SI), respectively.
• Human Mesh Recovery: The Procrustes-aligned Mean Per Joint Position Error (PA-MPJPE) measures the accuracy of recovered human meshes.
• Pose Completion: The Mean Per Joint Position Error (MPJPE) for masked body joints serves as the metric, focusing on the inferred occluded parts.
• Motion Denoising: Both MPJPE and the Mean Per-Vertex Position Error (MPVPE) are calculated to assess the denoising effectiveness.

All errors are reported in millimeter units.
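For reference, MPJPE is the mean Euclidean distance between predicted and ground-truth joint positions. The sketch below is a minimal illustration with a hypothetical `joints` argument for restricting the average to a subset, such as the masked joints in pose completion. It omits the Procrustes alignment step needed for PA-MPJPE.

```python
import math


def mpjpe(pred, gt, joints=None):
    """Mean per-joint position error over 3D joints given in the same unit.

    pred, gt: sequences of (x, y, z) tuples, one per joint.
    joints:   optional indices to evaluate (e.g. only the occluded joints).
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must contain the same number of joints")
    idx = list(range(len(gt))) if joints is None else list(joints)
    dists = [math.dist(pred[j], gt[j]) for j in idx]
    return sum(dists) / len(dists)
```

MPVPE follows the same formula with mesh vertices in place of joints.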

4.2 Pose Generation

Sample source	APD $\uparrow$	SI $\downarrow$
Real-world (AMASS) [32]	15.44	0.79
GMM [1]	16.28	1.54
VPoser [37]	10.75	1.51
Pose-NDF [49]	18.75	1.97
GAN-S [9]	15.68	1.27
DPoser (ours)	14.28	1.21
DPoser (ours)*	19.03	1.13

Table 1: Comparative analysis of pose generation metrics. The discrepancy between visual impressions and APD/SI metrics is discussed, with reference to Fig. 4. *Indicates the use of a reduced 10-step sampler.

To commence, we delve into the capabilities of our DPoser model by generating samples from the learned manifold. Employing a standard Euler-Maruyama discretization with 1000 steps, we assess both the diversity and realism of the generated poses (Fig. 4). While DPoser’s outputs are visually diverse and realistic, poses generated from competing methods like GMM [1] and Pose-NDF [49] fall short in naturalism, and VPoser [37] exhibits limited diversity.

Interestingly, quantitative metrics such as APD and SI (Tab. 1) do not always corroborate our qualitative findings. For instance, a 10-step DDIM sampler [44]—suboptimal by design—outperformed real-world data [32] in APD, which we attribute to the generation of exaggerated poses. In summary, our findings underscore the need for a balanced evaluation strategy that merges quantitative metrics with qualitative observations.

4.3 Human Mesh Recovery

Initialization	No fitting	GMM [1]	VPoser [37]	Pose-NDF [49]	GAN-S [9]	DPoser(Ours)
from scratch	108.57	58.32	58.08	57.87	57.26	56.05
CLIFF [25]	56.62	51.02	49.39	49.50	49.58	49.05

Table 2: Performance comparison of human mesh recovery on the EHF dataset [37] using two initialization methods. PA-MPJPE is reported as the metric.

We probe the efficacy of DPoser in human mesh recovery (HMR), focusing on estimating human pose and shape from monocular images. We conduct experiments on the EHF dataset [37] and benchmark our method against existing SOTA priors. Our optimization-based framework incorporates two initialization paradigms: (1) a baseline initialization that utilizes mean pose values and a default camera setup, and (2) an advanced initialization scheme that leverages CLIFF [25], a pre-trained regression-based model tailored for HMR. Moreover, GAN-S [9] implementations require a GAN-inversion phase to convert initial poses into their latent representations, which is notably time-consuming.

Tab. 2 and Fig. 5 showcase the comparative performance of DPoser, highlighting its exceptional ability in HMR tasks. Notably, when fitting from scratch, it surpasses established SOTA priors like GAN-S [9] and Pose-NDF [49] and rivals the specific regression-based model [25]. The integration of CLIFF as initialization further amplifies DPoser’s performance, underscoring its efficiency and the benefits of employing refined starting conditions. Fig. 6 further confirms DPoser’s superior efficacy and adaptability across multiple datasets including EHF [37], MSCOCO [27], 3DPW [51], and UBody [26].

4.4 Pose Completion

In practical scenarios like those encountered in the UBody dataset [26] (refer to Fig. 5(d)), HMR algorithms often grapple with occlusions leading to incomplete 3D pose estimates. In this context, our ambition is to recover full 3D poses from partially observed data, initializing the occluded parts with random noise. Our DPoser model is employed to refine these initially implausible poses into feasible ones, utilizing an L2 loss on the visible parts to ensure data consistency.

Initialization	VPoser	Pose-NDF	DPoser
Zeros	180.90	157.50	73.92
10mm noise	181.86	172.50	74.69
100mm noise	180.25	511.51	74.19

Table 3: Pose completion on the AMASS [32] dataset (left leg under occlusion, single hypothesis) using various initialization strategies. DPoser demonstrates its effectiveness across all conditions.

In parallel, we employ a comparable optimization strategy for both Pose-NDF [49] and VPoser [37]. Notably, Tab. 3 reveals that Pose-NDF struggles with poorly initialized poses unseen during its training phase. To mitigate this issue, we have to initialize the occluded poses near zero (close to rest pose) for Pose-NDF to prevent optimization divergence. Additionally, as a task-specific baseline, we adapt the original VPoser model into CVPoser by incorporating conditional inputs within its VAE framework [22]. This modification enables the encoder and decoder to process additional partial poses, facilitating end-to-end conditional sampling.

Methods	Occ. left leg	Occ. legs	Occ. arms	Occ. trunk
Pose-NDF ( $S=1$ ) [49]	158.21	159.19	201.00	75.42
Pose-NDF ( $S=5$ )	147.66/158.11/7.62	151.86/159.21/5.33	196.36/200.92/3.30	70.88/75.39/3.25
Pose-NDF ( $S=10$ )	144.38/158.06/8.31	149.38/159.14/5.90	194.79/200.87/3.63	69.45/75.38/3.54
VPoser ( $S=1$ ) [37]	180.78	198.18	159.86	37.75
VPoser ( $S=5$ )	167.92/181.30/10.53	178.77/198.15/14.51	148.17/159.65/8.64	31.83/37.79/4.54
VPoser ( $S=10$ )	162.82/181.09/12.21	172.83/198.31/16.30	144.53/159.80/9.69	30.06/37.78/4.99
CVPoser ( $S=10$ ) ${}^{\dagger}$	71.66/145.52/51.68	90.49/148.30/38.46	83.02/136.82/36.47	18.77/37.83/13.12
DPoser(ours) ( $S=1$ )	74.48	97.39	81.49	28.58
DPoser(ours) ( $S=5$ )	42.64/73.85/24.36	67.70/97.06/22.29	58.52/82.37/18.33	17.11/28.59/8.92
DPoser(ours) ( $S=10$ )	35.37/74.01/26.47	59.25/96.77/24.55	51.27/81.76/20.04	13.95/28.57/9.85

Table 4: Performance metrics (min/mean/std of MPJPE across multiple hypotheses) for pose completion under varying occlusion scenarios. $S$ denotes the number of hypotheses. ${}^{\dagger}$ Task-specific baseline trained with partial poses as conditional input.

Given the inherent uncertainties within this task, we generate multiple solutions and evaluate them based on their minimum, mean, and standard deviation errors against the ground truth. As illustrated in Tab. 4, DPoser exhibits superior performance across different occlusion scenarios compared to existing pose priors and even the task-specific CVPoser, highlighting its effectiveness in pose completion. The qualitative evaluations are presented in Fig. 7. Here, we observe that DPoser can generate a multitude of plausible poses, a capability lacking in VPoser [37]. Pose-NDF [49], meanwhile, struggles with generalizing to unseen noisy poses and making plausible adjustments from its rest pose initialization.
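The min/mean/std protocol described above can be sketched for a single test sample as follows. This is a minimal illustration, not the evaluation code used for Tab. 4: the function names, the toy data, the 22-joint layout, and the use of the population standard deviation are all assumptions, and how per-sample statistics are aggregated over the dataset is not specified here.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: Euclidean distance averaged over joints."""
    return np.linalg.norm(pred - gt, axis=-1).mean(axis=-1)

def multi_hypothesis_stats(hyps, gt):
    """hyps: (S, J, 3) joints for S hypotheses; gt: (J, 3) ground-truth joints.
    Returns (min, mean, std) of the per-hypothesis MPJPE."""
    errors = mpjpe(hyps, gt[None])  # shape (S,)
    return errors.min(), errors.mean(), errors.std()

# Toy data (hypothetical): three hypotheses shifted by 10, 20, and 30 mm along x.
gt = np.zeros((22, 3))
hyps = np.stack([gt + np.array([d, 0.0, 0.0]) for d in (10.0, 20.0, 30.0)])
best, mean, std = multi_hypothesis_stats(hyps, gt)
```

The minimum rewards a method for covering the ground truth with at least one hypothesis, while the mean and standard deviation describe the typical quality and the spread of the generated solutions.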

4.5 Motion Denoising

Though not initially designed for temporal tasks, DPoser shows remarkable proficiency in motion denoising. The task aims to estimate clean body poses from noisy 3D joint positions in motion capture sequences. Adhering to the setup outlined in HuMoR [40], we utilize 60-frame sequences from the AMASS [32] dataset and artificially add Gaussian noise with a standard deviation of 40 mm to the 3D joint positions. Moreover, we conduct experiments on the HPS dataset [15] without additional training to assess generalization.

As presented in Tab. 5, DPoser sets a new standard in motion denoising, outperforming even specialized motion priors like HuMoR [40]. To further confirm the robustness of DPoser, we conduct evaluations under varying conditions to gauge DPoser’s denoising capabilities. The results, detailed in Tab. 6, reveal that DPoser consistently achieves significant reductions in MPJPE, maintaining robust performance under extreme noise conditions.
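As a sanity check on the noise setup, the "before" errors reported in Tab. 6 can be compared with what isotropic Gaussian noise predicts. The norm of a 3D Gaussian vector with per-axis standard deviation $\sigma$ has mean $2\sqrt{2/\pi}\,\sigma\approx 1.596\,\sigma$, i.e., about 31.9, 63.8, and 159.6 mm for $\sigma$ = 20, 40, and 100 mm, close to the reported 31.93, 63.81, and 159.78. The sketch below simulates this; the joint count and number of sequences are assumptions and do not affect the expected value.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_mpjpe(sigma_mm, n_frames=60, n_joints=22, n_seqs=200):
    """MPJPE (mm) between clean joints and joints perturbed by isotropic Gaussian noise."""
    noise = rng.normal(0.0, sigma_mm, size=(n_seqs, n_frames, n_joints, 3))
    return np.linalg.norm(noise, axis=-1).mean()

def expected_noise_norm(sigma_mm):
    """Mean of a chi distribution with 3 degrees of freedom, scaled by sigma."""
    return 2.0 * np.sqrt(2.0 / np.pi) * sigma_mm

for sigma in (20.0, 40.0, 100.0):
    print(sigma, round(noisy_mpjpe(sigma), 2), round(expected_noise_norm(sigma), 2))
```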

4.6 Ablation Study

Methods	AMASS [32]	HPS [15]
No prior	24.19	23.67
VPoser [37]	23.42	22.78
Pose-NDF [49]	22.13	21.60
MVAE [28]	26.80	N/A
HuMoR [40]	22.69	N/A
DPoser (ours)	19.87	20.54

Table 5: Performance metrics (MPJPE) for motion denoising under 40 mm noise.

Noise std (mm)	AMASS [32]	HPS [15]
20.00	31.93/13.64	31.93/13.45
40.00	63.81/19.87	63.81/20.54
100.00	159.78/33.18	159.78/35.32

Table 6: DPoser in motion denoising under varying noise scales. MPJPE is reported as before/after applying DPoser denoising.

Timestep scheduling	HMR	Pose Completion	Motion Denoising
	PA-MPJPE $\downarrow$	MPJPE ( $S=10$ ) $\downarrow$	MPVPE $\downarrow$	MPJPE $\downarrow$
Random	58.84	86.23/121.57/23.16	43.33	23.87
Fixed	56.55	36.99/71.68/23.41	45.69	22.54
Uniform	59.28	42.72/75.70/21.84	39.72	20.80
Truncated	56.05	35.37/74.01/26.47	38.21	19.87

Table 7: Evaluation of timestep scheduling strategies on key pose-related tasks, highlighting the superior efficacy of the proposed truncated scheduling.

In our ablation study, we initially focus on the impact of truncated timestep scheduling on DPoser’s performance. This involves contrasting our proposed scheduling strategy against three established methods—random, fixed, and uniform scheduling [34, 33, 6, 48]. As Tab. 7 demonstrates, our strategy consistently outperforms these alternatives across all evaluated tasks. Additionally, we delve into the training aspects of DPoser, such as rotation representations and the integration of an auxiliary loss akin to HuMoR [40]. Using the same trained prior, we also compare DPoser’s capabilities with SOTA diffusion-based solvers [48, 7, 6] on pose completion, revealing its superior versatility and performance. Detailed findings and analyses from these ablation studies are presented in the Appendix.

5 Conclusion

We introduce DPoser, to the best of our knowledge the first unconditional diffusion-based pose prior, tailored for a wide array of pose-related tasks. Engineered for flexibility, DPoser can be implemented as a straightforward L2-loss regularizer and enhanced by our truncated timestep scheduling for test-time optimization. Comprehensive experiments substantiate DPoser’s superior performance over existing state-of-the-art pose priors.

Limitation and future work. While our framework benefits from variational diffusion sampling [33], it also shares its limitations, such as the mode-seeking behavior. Future research could look into enhancing solution diversity via techniques like particle-based variational inference [30, 53]. Furthermore, within the broader context of inverse problems we have framed, a plethora of existing methods [45, 2, 5, 35] could be adapted to leverage our diffusion-based prior. Exploring these methods holds great potential for future progress.

Ethical Considerations. For a discussion on the potential negative impacts of our work, please refer to the Appendix.

References

[1] Bogo, F., Kanazawa, A., Lassner, C., Gehler, P., Romero, J., Black, M.J.: Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. In: ECCV (2016)
[2] Boys, B., Girolami, M., Pidstrigach, J., Reich, S., Mosca, A., Akyildiz, O.D.: Tweedie moment projected diffusions for inverse problems. arXiv preprint arXiv:2310.06721 (2023)
[3] Cho, H., Kim, J.: Generative approach for probabilistic human mesh recovery using diffusion models. In: ICCV (2023)
[4] Choi, J., Lee, J., Shin, C., Kim, S., Kim, H., Yoon, S.: Perception prioritized training of diffusion models. In: CVPR (2022)
[5] Chung, H., Kim, J., Kim, S., Ye, J.C.: Parallel diffusion models of operator and image for blind inverse problems. In: CVPR (2023)
[6] Chung, H., Kim, J., Mccann, M.T., Klasky, M.L., Ye, J.C.: Diffusion posterior sampling for general noisy inverse problems. arXiv preprint arXiv:2209.14687 (2022)
[7] Chung, H., Sim, B., Ryu, D., Ye, J.C.: Improving diffusion models for inverse problems using manifold constraints. NeurIPS (2022)
[8] Ci, H., Wu, M., Zhu, W., Ma, X., Dong, H., Zhong, F., Wang, Y.: Gfpose: Learning 3d human pose prior with gradient fields. In: CVPR (2023)
[9] Davydov, A., Remizova, A., Constantin, V., Honari, S., Salzmann, M., Fua, P.: Adversarial parametric pose prior. In: CVPR (2022)
[10] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: CVPR (2009)
[11] Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis. NeurIPS (2021)
[12] Georgakis, G., Li, R., Karanam, S., Chen, T., Košecká, J., Wu, Z.: Hierarchical kinematic human mesh recovery. In: ECCV (2020)
[13] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM (2020)
[14] Graikos, A., Malkin, N., Jojic, N., Samaras, D.: Diffusion models as plug-and-play priors. NeurIPS (2022)
[15] Guzov, V., Mir, A., Sattler, T., Pons-Moll, G.: Human poseitioning system (hps): 3d human pose estimation and self-localization in large scenes from body-mounted sensors. In: CVPR (2021)
[16] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. NeurIPS (2020)
[17] Holmquist, K., Wandt, B.: Diffpose: Multi-hypothesis human pose estimation using diffusion models. In: ICCV (2023)
[18] Jiang, Z., Zhou, Z., Li, L., Chai, W., Yang, C.Y., Hwang, J.N.: Back to optimization: Diffusion-based zero-shot 3d human pose estimation. arXiv preprint arXiv:2307.03833 (2023)
[19] Kanazawa, A., Black, M.J., Jacobs, D.W., Malik, J.: End-to-end recovery of human shape and pose. In: CVPR (2018)
[20] Karras, T., Aittala, M., Aila, T., Laine, S.: Elucidating the design space of diffusion-based generative models. NeurIPS (2022)
[21] Kawar, B., Elad, M., Ermon, S., Song, J.: Denoising diffusion restoration models. NeurIPS (2022)
[22] Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)
[23] Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
[24] Li, L., Zhuo, L., Zhang, B., Bo, L., Chen, C.: Diffhand: End-to-end hand mesh reconstruction via diffusion models. arXiv preprint arXiv:2305.13705 (2023)
[25] Li, Z., Liu, J., Zhang, Z., Xu, S., Yan, Y.: Cliff: Carrying location information in full frames into human pose and shape estimation. In: ECCV (2022)
[26] Lin, J., Zeng, A., Wang, H., Zhang, L., Li, Y.: One-stage 3d whole-body mesh recovery with component aware transformer. In: CVPR (2023)
[27] Lin, T.Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Zitnick, C.L., Dollár, P.: Microsoft coco: common objects in context (2014). arXiv preprint arXiv:1405.0312 (2019)
[28] Ling, H.Y., Zinno, F., Cheng, G., Van De Panne, M.: Character controllers using motion vaes. TOG (2020)
[29] Liu, Q., Lee, J., Jordan, M.: A kernelized stein discrepancy for goodness-of-fit tests. In: ICML (2016)
[30] Liu, Q., Wang, D.: Stein variational gradient descent: A general purpose bayesian inference algorithm. NeurIPS (2016)
[31] Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: Smpl: A skinned multi-person linear model. ACM Transactions on Graphics 34(6) (2015)
[32] Mahmood, N., Ghorbani, N., Troje, N.F., Pons-Moll, G., Black, M.J.: Amass: Archive of motion capture as surface shapes. In: ICCV (2019)
[33] Mardani, M., Song, J., Kautz, J., Vahdat, A.: A variational perspective on solving inverse problems with diffusion models. arXiv preprint arXiv:2305.04391 (2023)
[34] Müller, L., Ye, V., Pavlakos, G., Black, M., Kanazawa, A.: Generative proxemics: A prior for 3d social interaction from images. arXiv preprint arXiv:2306.09337 (2023)
[35] Murata, N., Saito, K., Lai, C.H., Takida, Y., Uesaka, T., Mitsufuji, Y., Ermon, S.: Gibbsddrm: A partially collapsed gibbs sampler for solving blind inverse problems with denoising diffusion restoration. arXiv preprint arXiv:2301.12686 (2023)
[36] Nachmani, E., Roman, R.S., Wolf, L.: Non gaussian denoising diffusion models. arXiv preprint arXiv:2106.07582 (2021)
[37] Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A., Tzionas, D., Black, M.J.: Expressive body capture: 3d hands, face, and body from a single image. In: CVPR (2019)
[38] Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988 (2022)
[39] Qiu, Z., Yang, Q., Wang, J., Wang, X., Xu, C., Fu, D., Yao, K., Han, J., Ding, E., Wang, J.: Learning structure-guided diffusion model for 2d human pose estimation. arXiv preprint arXiv:2306.17074 (2023)
[40] Rempe, D., Birdal, T., Hertzmann, A., Yang, J., Sridhar, S., Guibas, L.J.: Humor: 3d human motion model for robust pose estimation. In: ICCV (2021)
[41] Särkkä, S., Solin, A.: Applied stochastic differential equations, vol. 10. Cambridge University Press (2019)
[42] Shafir, Y., Tevet, G., Kapon, R., Bermano, A.H.: Human motion diffusion as a generative prior. arXiv preprint arXiv:2303.01418 (2023)
[43] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S.: Deep unsupervised learning using nonequilibrium thermodynamics. In: ICML (2015)
[44] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)
[45] Song, J., Vahdat, A., Mardani, M., Kautz, J.: Pseudoinverse-guided diffusion models for inverse problems. In: ICLR (2022)
[46] Song, Y., Durkan, C., Murray, I., Ermon, S.: Maximum likelihood training of score-based diffusion models. NeurIPS (2021)
[47] Song, Y., Ermon, S.: Generative modeling by estimating gradients of the data distribution. NeurIPS (2019)
[48] Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020)
[49] Tiwari, G., Antić, D., Lenssen, J.E., Sarafianos, N., Tung, T., Pons-Moll, G.: Pose-ndf: Modeling human pose manifolds with neural distance fields. In: ECCV (2022)
[50] Vincent, P.: A connection between score matching and denoising autoencoders. Neural computation (2011)
[51] Von Marcard, T., Henschel, R., Black, M.J., Rosenhahn, B., Pons-Moll, G.: Recovering accurate 3d human pose in the wild using imus and a moving camera. In: ECCV (2018)
[52] Wang, H., Du, X., Li, J., Yeh, R.A., Shakhnarovich, G.: Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: CVPR (2023)
[53] Wang, Z., Lu, C., Wang, Y., Bao, F., Li, C., Su, H., Zhu, J.: Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. arXiv preprint arXiv:2305.16213 (2023)
[54] Wu, J., Gao, X., Liu, X., Shen, Z., Zhao, C., Feng, H., Liu, J., Ding, E.: Hd-fusion: Detailed text-to-3d generation leveraging multiple noise estimation. arXiv preprint arXiv:2307.16183 (2023)
[55] Xu, Y., Zhang, J., Zhang, Q., Tao, D.: ViTPose: Simple vision transformer baselines for human pose estimation. NeurIPS (2022)
[56] Zhao, M., Liu, M., Ren, B., Dai, S., Sebe, N.: Modiff: Action-conditioned 3d motion generation with denoising diffusion probabilistic models. arXiv preprint arXiv:2301.03949 (2023)
[57] Zhou, Y., Barnes, C., Lu, J., Yang, J., Li, H.: On the continuity of rotation representations in neural networks. In: CVPR (2019)
[58] Zhu, J., Zhuang, P.: Hifa: High-fidelity text-to-3d with advanced diffusion guidance. arXiv preprint arXiv:2305.18766 (2023)

Appendix for DPoser: Diffusion Model as Robust 3D Human Pose Prior

In this appendix, we first briefly recap the parameterization of diffusion models and their connection to score functions in Sec. A, followed by the perspective of Score Distillation Sampling (SDS) to understand our DPoser regularization in Sec. B. We detail the experimental setup and nuances in Sec. C and dissect various training aspects of DPoser in Sec. D. The exploration of extended optimization techniques is discussed in Sec. E, and considerations for truncated timestep scheduling in image domains are presented in Sec. F. Additional qualitative results are showcased in Sec. G. Lastly, potential negative impacts such as biases in data and ethical concerns in application are considered in Sec. H.

A Parameterization of Score-based Diffusion Models

In the seminal work by Song et al. [48], it is demonstrated that both score-based generative models [47] and diffusion probabilistic models [16] can be understood as discretized versions of stochastic differential equations (SDEs) defined by score functions. This unification allows the training objective to be interpreted either as learning a time-dependent denoiser or as learning a sequence of score functions that describe increasingly noisy versions of the data.

We begin by revisiting the training objective for score-based models [47] to elucidate the link with diffusion models [16]. Consider the transition kernel of the forward diffusion process $p_{0t}(\mathbf{x}_{t}|\mathbf{x}_{0})=\mathcal{N}(\mathbf{x}_{t};\alpha_{t}\mathbf{x}_{0},\sigma_{t}^{2}\mathbf{I})$ . Our goal is to learn score functions $\nabla_{\mathbf{x}_{t}}\log p_{t}(\mathbf{x}_{t})$ through a neural network $s_{\theta}(\mathbf{x}_{t};t)$ , by minimizing the L2 loss as follows (we omit the expectation operator for conciseness):

$$\mathbb{E}\left[w(t)\,\|s_{\theta}(\mathbf{x}_{t};t)-\nabla_{\mathbf{x}_{t}}\log p_{t}(\mathbf{x}_{t})\|_{2}^{2}\right]. \tag{14}$$

Here, $\mathbf{x}_{t}=\alpha_{t}\mathbf{x}_{0}+\sigma_{t}\epsilon$ , where $\epsilon\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ .

Based on denoising score matching [50], minimizing the objective in Eq. (14) is equivalent to minimizing the following tractable term:

$$\mathbb{E}\left[w(t)\,\|s_{\theta}(\mathbf{x}_{t};t)-\nabla_{\mathbf{x}_{t}}\log p_{0t}(\mathbf{x}_{t}\mid\mathbf{x}_{0})\|_{2}^{2}\right]. \tag{15}$$

To link this with the noise predictor $\epsilon_{\theta}(\mathbf{x}_{t};t)$ in diffusion models, we can employ the reparameterization $s_{\theta}(\mathbf{x}_{t};t)=-\frac{\epsilon_{\theta}(\mathbf{x}_{t};t)}{\sigma_{t}}$ . Then, Eq. (15) can be simplified as follows:

$$
\begin{aligned}
& w(t)\left\|-\frac{\epsilon_{\theta}(\mathbf{x}_{t};t)}{\sigma_{t}}-\nabla_{\mathbf{x}_{t}}\log p_{0t}(\mathbf{x}_{t}\mid\mathbf{x}_{0})\right\|_{2}^{2} \\
={}& w(t)\left\|-\frac{\epsilon_{\theta}(\mathbf{x}_{t};t)}{\sigma_{t}}+\frac{\mathbf{x}_{t}-\alpha_{t}\mathbf{x}_{0}}{\sigma_{t}^{2}}\right\|_{2}^{2} \\
={}& w(t)\left\|-\frac{\epsilon_{\theta}(\mathbf{x}_{t};t)}{\sigma_{t}}+\frac{\sigma_{t}\epsilon}{\sigma_{t}^{2}}\right\|_{2}^{2} \\
={}& \frac{w(t)}{\sigma_{t}^{2}}\left\|\epsilon_{\theta}(\mathbf{x}_{t};t)-\epsilon\right\|_{2}^{2}
\end{aligned}
\tag{16}
$$

The resulting form of Eq. (16) aligns precisely with the noise prediction form of diffusion models [16] (refer to Eq. (4) in the main text). This implies that by training $\epsilon_{\theta}(\mathbf{x}_{t};t)$ in a diffusion model context, we simultaneously obtain an estimate of the score function, $\nabla_{\mathbf{x}_{t}}\log p_{t}(\mathbf{x}_{t})\approx-\frac{\epsilon_{\theta}(\mathbf{x}_{t};t)}{\sigma_{t}}$ .
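The identity behind Eq. (16) can be checked numerically. The sketch below is only an illustration under stated assumptions: the values of $\alpha_{t}$, $\sigma_{t}$, and $w(t)$ are arbitrary, the 63-dimensional toy vectors stand in for poses, and `eps_pred` is a random placeholder for a network output rather than a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noise level; in practice alpha_t and sigma_t come from the chosen SDE.
alpha_t, sigma_t, w_t = 0.8, 0.6, 1.0

x0 = rng.normal(size=(4, 63))        # toy "poses" (assumed 63-D vectors)
eps = rng.normal(size=x0.shape)
xt = alpha_t * x0 + sigma_t * eps    # forward perturbation

# Score of the Gaussian transition kernel N(x_t; alpha_t x_0, sigma_t^2 I).
cond_score = -(xt - alpha_t * x0) / sigma_t**2

# Placeholder for a network output, used only to check the algebraic identity.
eps_pred = rng.normal(size=x0.shape)
score_pred = -eps_pred / sigma_t

# Score-matching form (Eq. (15)) versus noise-prediction form (Eq. (16)).
score_loss = w_t * np.sum((score_pred - cond_score) ** 2, axis=-1)
noise_loss = w_t / sigma_t**2 * np.sum((eps_pred - eps) ** 2, axis=-1)
```

Both losses agree for every sample, and the conditional score reduces to $-\epsilon/\sigma_{t}$, which is why a noise predictor can be read as a scaled score estimate.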

B View DPoser as Score Distillation Sampling

Strategy	HMR	Pose Completion	Motion Denoising
	PA-MPJPE $\downarrow$	MPJPE ( $S=10$ ) $\downarrow$	MPVPE $\downarrow$	MPJPE $\downarrow$
1 step	56.05	35.37/74.01/26.47	38.21	19.87
5 steps	56.16	36.59/80.82/31.22	40.22	21.21
10 steps	56.18	36.78/82.59/32.32	40.69	21.34

Table S-1: Efficacy of different denoising steps in DPoser’s optimization.

Interestingly, the gradient of DPoser (Eq. (10) in the main text) coincides with Score Distillation Sampling (SDS) [38, 52], which can be interpreted as aiming to minimize the following KL divergence:

$$\mathrm{KL}\big(p_{0t}\left(\mathbf{x}_{t}\mid\mathbf{x}_{0}\right)\parallel p_{t}^{\mathtt{SDE}}\left(\mathbf{x}_{t};\theta\right)\big), \tag{17}$$

where $p_{t}^{\mathtt{SDE}}\left(\mathbf{x}_{t};\theta\right)$ denotes the marginal distribution whose score function is estimated by $\epsilon_{\theta}(\mathbf{x}_{t};t)$ . For the specific case where $t\to 0$ , this term encourages the Dirac distribution $\delta(\mathbf{x}_{0})$ (i.e., the optimized variable) to gravitate toward the learned data distribution $p_{0}^{\mathtt{SDE}}\left(\mathbf{x}_{0};\theta\right)$ , while the Gaussian perturbation in Eq. (17) softens the constraint. Building on this understanding, we can borrow advanced techniques from SDS [38, 52], a rapidly evolving area ripe for methodological innovations [53, 54, 58]. To explore this, we experiment with a multi-step denoising strategy adapted from HiFA [58], substituting it for our original one-step denoising process. This alternative, however, yields worse results across all evaluation metrics, as shown in Tab. S-1. A plausible explanation is that our proposed truncated timestep scheduling already handles low noise levels (i.e., small $t$ ) effectively, removing the need for more denoising steps. In addition, iterative denoising in each optimization step may cause error accumulation, leading to inaccurate gradients.

C Experimental Details

This section elaborates on the specifics of our pose completion and motion denoising experiments.

C.1 Pose Completion

For partial observations $\mathbf{y}$ , the measurement operator $\mathcal{A}$ is modeled as a mask matrix $M\in\mathbb{R}^{d\times n}$ . Based on our optimization framework (Algorithm 1 in the main text), we define the task-specific loss, $L_{\text{comp}}$ , as follows:

$$L_{\text{comp}}=\|M\mathbf{x}_{0}-\mathbf{y}\|_{2}^{2}. \tag{18}$$

Here, $\mathbf{x}_{0}$ denotes the complete body pose $\theta$ that we aim to recover, where the unseen parts are initialized as random noise. In the following ablation studies, unless otherwise specified, the evaluation is performed using 10 hypotheses on the AMASS [32] dataset with left leg occlusion.
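A minimal sketch of the data term in Eq. (18) is given below. It assumes a 63-dimensional pose vector, a row-selection mask that hides the first 9 entries, and an initialization that copies the visible entries from $\mathbf{y}$; these choices and all function names are illustrative and not taken from the DPoser implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 63                               # assumed pose dimension
visible = np.ones(n, dtype=bool)
visible[:9] = False                  # hypothetical: the first 9 entries are occluded
M = np.eye(n)[visible]               # (d, n) row-selection mask matrix

x_true = rng.normal(size=n)
y = M @ x_true                       # partial observation

def completion_loss(x0):
    """L_comp = ||M x0 - y||_2^2, as in Eq. (18)."""
    r = M @ x0 - y
    return float(r @ r)

def completion_grad(x0):
    """Gradient of L_comp with respect to x0: 2 M^T (M x0 - y)."""
    return 2.0 * M.T @ (M @ x0 - y)

# Visible entries copied from y, occluded entries filled with random noise.
x_init = M.T @ y + (~visible) * rng.normal(size=n)
```

The gradient of this term is exactly zero on the occluded entries, so the data term alone cannot move them; in the optimization framework, those entries are shaped only by the prior.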

C.2 Motion Denoising (Noisy Input)

Methods	AMASS [32]		HPS [15]
	20mm	100mm	20mm	100mm
No prior	15.33	51.48	16.26	50.87
VPoser [37]	15.20	49.10	17.24	46.69
Pose-NDF [49]	13.84	46.10	15.62	47.50
DPoser (ours)	13.64	33.18	13.45	35.32

Table S-2: Performance comparison of motion denoising under varying noise scales. MPJPE is reported after denoising.

Adhering to Pose-NDF settings [49], we aim to refine noisy joint positions $J_{\text{obs}}^{t}$ over $N$ frames to obtain clean poses $\theta^{t}$ , initialized from mean poses in SMPL with small noise. We formulate the task-specific loss combining an observation fidelity term $L_{\text{obs}}$ and a temporal consistency term $L_{\text{temp}}$ :

$$L_{\text{obs}}=\sum_{t=0}^{N-1}\|M_{J}(\theta^{t},\beta_{0})-J_{\text{obs}}^{t}\|_{2}^{2}, \tag{19}$$

$$L_{\text{temp}}=\sum_{t=1}^{N-1}\|M_{J}(\theta^{t-1},\beta_{0})-M_{J}(\theta^{t},\beta_{0})\|_{2}^{2}, \tag{20}$$

where $M_{J}$ denotes the 3D joint positions regressed from SMPL [31] and $\beta_{0}$ denotes the constant mean shape parameters.
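The two terms in Eqs. (19) and (20) can be sketched as follows. The callable `joints_fn` is a hypothetical stand-in for $M_{J}(\cdot,\beta_{0})$; here it simply reshapes a 66-dimensional vector into 22 "joints" so the example runs without the SMPL model, and the relative weighting of the two terms is left out because it is not specified in this section.

```python
import numpy as np

def motion_losses(joints_fn, thetas, joints_obs):
    """thetas: (N, D) per-frame poses; joints_obs: (N, J, 3) noisy joints.
    joints_fn maps one pose (D,) to joints (J, 3), standing in for M_J(., beta_0)."""
    joints = np.stack([joints_fn(th) for th in thetas])   # (N, J, 3)
    l_obs = np.sum((joints - joints_obs) ** 2)            # Eq. (19)
    l_temp = np.sum((joints[:-1] - joints[1:]) ** 2)      # Eq. (20)
    return l_obs, l_temp

def toy_joints(theta):
    """Hypothetical replacement for the SMPL joint regressor."""
    return theta.reshape(22, 3)

rng = np.random.default_rng(0)
static = np.tile(rng.normal(size=66), (60, 1))            # 60 identical frames
obs = np.stack([toy_joints(th) for th in static])
l_obs, l_temp = motion_losses(toy_joints, static, obs)
```

For a static sequence that matches its observations, both terms vanish; any frame-to-frame jump raises the temporal term, which is what discourages jitter in the recovered motion.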

Complementing the comparative analysis in Tab. 5 of the main text, we extend our evaluation to scenarios with varying noise levels. This extended examination, detailed in Tab. S-2, shows that DPoser outperforms state-of-the-art (SOTA) pose priors, especially under high noise, demonstrating its resilience to noise.

C.3 Motion Denoising (Partial Input)

This task focuses on reconstructing clean poses, $\theta^{t}$ , from partially observed joint positions, $J_{\text{obs}}^{t}$ , across $N$ frames, employing a known mask matrix to identify visible joints. The optimization objective mirrors that of motion denoising (Sec. C.2), but incorporates a mask in Eq. (19) to specifically target visible parts, ensuring that only these segments guide the recovery process.
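One way to write the masked observation term described above is shown below. This is a sketch rather than the exact form used in the experiments: the per-frame binary visibility vector $\mathbf{m}^{t}$ (one entry per joint, 1 for visible and 0 for occluded) is a symbol introduced here for illustration.

```latex
L_{\text{obs}}^{\text{mask}}=\sum_{t=0}^{N-1}\left\|\mathbf{m}^{t}\odot\left(M_{J}(\theta^{t},\beta_{0})-J_{\text{obs}}^{t}\right)\right\|_{2}^{2}
```

With $\mathbf{m}^{t}$ set to all ones, this reduces to Eq. (19).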

We conducted experiments on the AMASS dataset [32] to assess our model’s performance on this task with two types of occlusions: legs and left arm. The quantitative results of these experiments are detailed in Tab. S-3, and the accompanying visualizations are provided in Sec. G.

In leg occlusion scenarios, the AMASS dataset primarily showcases straight poses, offering minimal diversity. This scenario permits decent outcomes without incorporating a pose prior, since the optimization’s starting point closely aligns with these prevalent poses. However, VPoser’s mean-centered characteristic hinders its ability to faithfully replicate the visible areas. On the other hand, Pose-NDF falls short in enhancing the occluded parts. DPoser accurately handles visible parts and guides occluded ones for more realistic poses. For left arm occlusions, which involve more varied movements, DPoser markedly surpasses other methods, underlining its adaptability and precision in handling diverse motion patterns.

Methods	Occlusion	MPJPE			MPVPE
		Vis.	Occ.	All.	All.
No prior	Legs	0.26	14.72	5.52	5.45
VPoser	Legs	1.75	14.29	6.31	7.38
PoseNDF	Legs	0.25	15.71	5.87	5.64
DPoser (ours)	Legs	0.28	12.24	4.63	3.65
No prior	Left Arm	0.26	24.87	4.74	9.91
VPoser	Left Arm	1.21	13.23	3.40	7.68
PoseNDF	Left Arm	0.25	17.70	3.42	7.86
DPoser (ours)	Left Arm	0.27	7.80	1.64	3.81

Table S-3: Comparative analysis of methods for motion denoising with different occlusions (Legs and Left Arm) on the AMASS dataset. Errors (in cm) are evaluated in terms of MPJPE across visible (Vis.), occluded (Occ.), and all joints, along with MPVPE for all vertices.

D Ablated DPoser’s Training

Normalization	HMR	Pose Completion	Motion Denoising
	PA-MPJPE $\downarrow$	MPJPE ( $S=10$ ) $\downarrow$	MPVPE $\downarrow$	MPJPE $\downarrow$
w/o norm	57.88	45.37/102.28/41.08	44.82	24.04
min-max	59.17	47.41/107.00/43.42	42.70	21.29
z-score	56.49	34.37/72.47/26.32	38.57	20.24

Table S-4: Evaluation of DPoser’s performance under different normalization methods, specifically for the axis-angle rotation representation.

Representation	HMR	Pose Completion	Motion Denoising
	PA-MPJPE $\downarrow$	MPJPE ( $S=10$ ) $\downarrow$	MPVPE $\downarrow$	MPJPE $\downarrow$
axis-angle	56.05	34.76/72.41/26.09	38.21	19.87
6D rotations	57.54	40.89/81.43/27.31	38.44	20.12

Table S-5: Comparative performance of rotation representations under z-score normalization across multiple tasks and metrics.

This section dissects the impact of different rotation representations and normalization techniques on DPoser’s performance. Initially, we examine axis-angle representation, comparing various normalization strategies: min-max scaling, z-score normalization, and no normalization. Our findings, summarized in Tab. S-4, indicate that z-score normalization is generally the most effective. Subsequently, using this optimal normalization, we explore 6D rotations [57] as an alternative. As evidenced by Tab. S-5, axis-angle representation offers superior performance. This preference can be attributed to the effective modeling capabilities of diffusion models, along with the inherent advantages of axis-angle in capturing bounded joint rotations for regression tasks like human mesh recovery.
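The two normalization schemes compared in Tab. S-4 can be sketched as follows. This is an assumption-laden illustration: statistics are computed per dimension, the min-max target range of $[-1, 1]$ is a guess, and the toy data are random vectors rather than AMASS poses.

```python
import numpy as np

def fit_zscore(poses, eps=1e-8):
    """Per-dimension mean and standard deviation over the training poses."""
    return poses.mean(axis=0), poses.std(axis=0) + eps

def zscore(poses, mean, std):
    return (poses - mean) / std

def inv_zscore(z, mean, std):
    """Map normalized poses back to the original axis-angle space."""
    return z * std + mean

def minmax(poses, lo, hi):
    """Min-max scaling to [-1, 1] (the target range is an assumption)."""
    return 2.0 * (poses - lo) / (hi - lo) - 1.0

rng = np.random.default_rng(0)
poses = rng.normal(loc=0.3, scale=0.5, size=(1000, 63))   # toy axis-angle vectors
mean, std = fit_zscore(poses)
z = zscore(poses, mean, std)
scaled = minmax(poses, poses.min(axis=0), poses.max(axis=0))
```

Z-score normalization gives every dimension zero mean and unit variance, which matches the standard Gaussian that the forward diffusion process drives the data toward, whereas min-max scaling is set by the extreme values in each dimension.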

Inspired by HuMoR [40], we experiment with integrating the SMPL body model [31] as a regularization term during training. Alongside the prediction of additive noise, as outlined in Equation (4) in the main text, we employ a 10-step DDIM sampler [44] to recover a “clean” version of the pose, denoted as $\tilde{\mathbf{x}}_{0}$ , from the diffused $\mathbf{x}_{t}$ . The regularization loss aims to minimize the discrepancy between the original and recovered poses under the SMPL body model $M$ :

$$L_{\mathrm{reg}}=\|M_{J}(\tilde{\mathbf{x}}_{0},\beta_{0})-M_{J}(\mathbf{x}_{0},\beta_{0})\|_{2}^{2}+\|M_{V}(\tilde{\mathbf{x}}_{0},\beta_{0})-M_{V}(\mathbf{x}_{0},\beta_{0})\|_{2}^{2}. \tag{21}$$

Here, $\beta_{0}$ represents the mean shape parameters in SMPL. To account for denoising errors, we scale the regularization loss by $\mathrm{log}(1+\frac{\alpha_{t}}{\sigma_{t}})$ , thereby increasing the weight for samples with smaller $t$ values (less noise).

Fig. S-1 visualizes the impact of this regularization on MPJPE during the training, specifically for pose completion tasks with occlusion of both legs.

We observe that weighted regularization offers slight performance gains in the early training process, while the absence of weighting introduces instability and deterioration in results. Despite these insights, the computational cost of incorporating the SMPL model—especially for our large batch size of 1280—makes the training approximately 8 times slower. Therefore, we opted not to include this regularization in our main experiments.

E Extended DPoser’s Optimization

Methods	Occ. left leg	Occ. legs	Occ. arms	Occ. trunk
ScoreSDE [48]	48.73/106.32/41.30	74.68/128.32/37.27	66.89/127.86/48.15	16.69/34.54/12.21
DPS [6]	40.51/104.32/54.57	64.26/113.46/33.71	60.63/119.85/42.78	15.10/33.90/13.27
MCG [7]	49.04/106.37/41.07	74.90/128.53/37.40	66.17/127.72/48.15	16.69/34.66/12.23
DPoser(ours)	35.37/74.01/26.47	59.25/96.77/24.55	51.27/81.76/20.04	13.95/28.57/9.85

Table S-6: Comparative evaluation of diffusion-based solvers for pose completion on the AMASS dataset [32] (number of hypotheses $S=10$ ).

In addressing pose-centric tasks as inverse problems, we propose a versatile optimization framework, which employs variational diffusion sampling as its foundational approach [33]. Our exploration extends to an array of diffusion-based methodologies for solving these complex inverse problems. Among the techniques considered are ScoreSDE [48], MCG [7], and DPS [6]. These methods augment standard generative processes with observational data, either by employing gradient-based guidance or back-projection techniques. We compare these methods with our DPoser for pose completion tasks. Our findings, captured in Tab. S-6, reveal that DPoser outperforms the competitors under most occlusion conditions. Consequently, DPoser emerges not merely as a universally applicable solution to pose-related tasks, but also as an exceptionally efficient one.

It is worth mentioning that methods rooted in generative frameworks [48, 7, 6, 21] can be difficult to apply more broadly to pose-centric tasks. For instance, in blind inverse problems, where certain parameters in $\mathcal{A}$ (e.g., camera models in HMR) are unknown, generative methods are less straightforward to implement. ZeDO [18], a recent study focusing on the 2D-3D lifting task, adopts the ScoreSDE [48] framework and refines camera translations by solving an optimization sub-problem after each generative step. However, directly porting this strategy to HMR is non-trivial, owing to the added complexity of optimizing body shape parameters, which our DPoser model does not currently cover. Although some state-of-the-art techniques [5, 35] offer solutions by jointly modeling the operator $\mathcal{A}$ and the data distribution, a full discussion of this subject is beyond the scope of this paper and remains an open question for future work.

F Truncated Timestep Scheduling on Images

Exploring truncated timestep scheduling for image-based tasks, we find that its suitability for human poses does not translate well to images. Initial timesteps are critical in image domains for generating foundational perceptual content.

In our study, we employed a $256\times 256$ unconditional diffusion model [11] trained on ImageNet [10] with variational diffusion sampling [33] for image inpainting. Comparing standard (timesteps 990 to 0) and truncated scheduling (timesteps 495 to 0), both with 100 steps, the experiments confirmed that truncation compromises image quality (Fig. S-2). The standard approach preserved perceptual content, while truncation produced disjointed patches, misaligned with the original image context.
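The two schedules compared in this experiment can be written down explicitly. The sketch below assumes linearly spaced, decreasing timesteps, which is one natural reading of "990 to 0" and "495 to 0" with 100 steps; the actual spacing used in the experiments is not stated here, and `timestep_schedule` is a hypothetical helper.

```python
import numpy as np

def timestep_schedule(t_start, t_end=0, num_steps=100):
    """Linearly spaced, decreasing integer timesteps from t_start down to t_end."""
    return np.linspace(t_start, t_end, num_steps).round().astype(int)

standard = timestep_schedule(990)    # covers the full noise range
truncated = timestep_schedule(495)   # skips the high-noise half
```

Under this reading, the truncated schedule spends the same 100 optimization steps on the lower-noise half of the diffusion timeline, with a finer spacing between consecutive timesteps.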

These results affirm that truncated timestep scheduling excels in pose data where key information emerges in later stages but falls short in image tasks where early timesteps are essential. This scheduling is thus bespoke to the characteristics of human pose estimation and is unsuitable for image processes that rely on the full diffusion timeline for content fidelity.

G More Qualitative Results

We show more qualitative results for pose generation (Fig. S-3), pose completion (Fig. S-4), human mesh recovery (Fig. S-5) and motion denoising (Fig. S-6, Fig. S-7).

H Potential Negative Impacts

• Bias and Fairness Concerns: Human pose prior learning models may inadvertently encode biases present in the training data, leading to biased predictions or discriminatory outcomes. This can perpetuate existing societal biases and inequalities, particularly if the training data is not representative or balanced across diverse demographics.

• Ethical Considerations: The use of human pose prior learning models in applications such as surveillance, security, or healthcare raises ethical concerns regarding individual privacy, autonomy, and consent. There are debates about the appropriate use of such technologies and the potential for unintended consequences or misuse.

• Dependency on Data Quality: Human pose prior learning models heavily rely on the quality and diversity of the training data. Poorly annotated or biased datasets can negatively impact the performance and reliability of these models, leading to inaccurate or unreliable predictions.