License: CC BY 4.0
arXiv:2312.05541v2 [cs.CV] 23 Mar 2024


DPoser: Diffusion Model as
Robust 3D Human Pose Prior

Junzhe Lu¹, **g Lin², Hongkun Dou¹, Ailing Zeng³, Yue Deng¹, Yulun Zhang⁴, and Haoqian Wang²
¹ Beihang University   ² Tsinghua University   ³ International Digital Economy Academy (IDEA)   ⁴ ETH Zürich
https://dposer.github.io
Abstract

This work aims to construct a robust human pose prior, which remains a persistent challenge due to biomechanical constraints and diverse human movements. Traditional priors like VAEs and NDFs often exhibit shortcomings in realism and generalization, notably with unseen noisy poses. To address these issues, we introduce DPoser, a robust and versatile human pose prior built upon diffusion models. DPoser regards various pose-centric tasks as inverse problems and employs variational diffusion sampling for efficient solving. Accordingly, designed with optimization frameworks, DPoser seamlessly benefits human mesh recovery, pose generation, pose completion, and motion denoising tasks. Furthermore, due to the disparity between the articulated poses and structured images, we propose truncated timestep scheduling to enhance the effectiveness of DPoser. Our approach demonstrates considerable enhancements over common uniform scheduling used in image domains, boasting improvements of 5.4%, 17.2%, and 3.8% across human mesh recovery, pose completion, and motion denoising, respectively. Comprehensive experiments demonstrate the superiority of DPoser over existing state-of-the-art pose priors across multiple tasks.

Corresponding authors: Yulun Zhang and Haoqian Wang.

Keywords:
Human Pose Prior, Diffusion Model

1 Introduction

Accurate modeling of human pose is a fundamental research topic that can benefit various applications, from human-robot interaction to augmented and virtual reality experiences. Many real-world applications rely on a prior distribution of valid human poses to perform tasks like body model fitting, motion capture, and gesture recognition. The complexity of human biomechanics, coupled with the extensive kinematic variability in movement patterns, presents a significant challenge in constructing a robust and realistic human pose prior.

Previous efforts to model human pose priors have mainly employed techniques such as Gaussian Mixture Models (GMMs) [1], Variational Autoencoders (VAEs) [37], and Neural Distance Fields (NDFs) [49]. Each technique, however, faces its own set of limitations. GMMs, for instance, might lead to the generation of implausible poses due to their unbounded nature. VAEs, restricted by their Gaussian assumptions, tend to generate average poses that may not accurately capture the full spectrum of human actions. Meanwhile, NDFs have shown promise in 3D surface modeling but struggle with generalizing across the complex, high-dimensional landscape of human pose manifolds. These limitations highlight a pressing need for a more comprehensive and dependable approach to modeling human pose priors, an endeavor this work seeks to address.

Recently, diffusion models [16, 48, 11, 20] have gained traction for their prowess in capturing complex, high-dimensional data distributions and enabling versatile sampling techniques. They have been applied to generating lifelike human motion sequences [56, 42] and to functioning as multi-hypothesis pose estimators from 2D inputs [17, 8]. However, these models are designed for specific generation tasks or tailored to work with conditional input data, which limits their applicability in broader contexts. The potential of diffusion models as a universal human pose prior remains largely untapped, and how to optimize with them effectively across diverse tasks remains an open question.

Figure 1: An overview of DPoser’s versatility and performance across multiple pose-related tasks. Built on diffusion models, DPoser serves as a robust and adaptable pose prior. Shown are scenarios in (a) pose generation, (b) human mesh recovery, (c) motion denoising, and (d) pose completion. DPoser consistently outstrips existing priors like VPoser [37] in performance benchmarks.

In this work, we propose DPoser, a novel approach that leverages a time-dependent denoiser learned from expansive motion capture datasets to construct a robust human pose prior. We regard various pose-centric tasks as inverse problems and suggest the integration of DPoser via variational diffusion sampling techniques [33] as a regularization component within optimization frameworks like SMPLify [1]. Furthermore, our investigations reveal that significant pose-related information during diffusion is predominantly located at the latter stages of the diffusion trajectory. This finding inspired us to develop a novel truncated timestep scheduling strategy for optimization. Our method outperforms the standard uniform scheduling, showing gains of 5.4%, 17.2%, and 3.8% in human mesh recovery, pose completion, and motion denoising, respectively.

In summary, our main contributions are as follows:

  • We introduce DPoser, a novel framework based on diffusion models to craft a robust and flexible human pose prior, geared for seamless integration across diverse pose-related tasks via test-time optimization.

  • We analyze the impact of diffusion timesteps in the pose domain and propose truncated scheduling for more efficient optimization.

  • Through extensive experiments, we establish that DPoser outshines state-of-the-art (SOTA) pose priors in a variety of downstream tasks.

2 Related Work

2.1 Human Pose Priors

Human body models such as SMPL [31] serve as powerful tools for parameterizing both pose and shape, thereby offering a comprehensive framework for describing human gestures. Within the SMPL model, body poses are captured using rotation matrices or joint angles linked to a kinematic skeleton. Adjusting these parameters enables the representation of a diverse range of human actions. Nonetheless, feeding unrealistic poses into these models can result in non-viable human figures, primarily because plausible human poses are confined within a complex, high-dimensional manifold due to biomechanical constraints.

Various strategies [1, 37, 49, 9] have been put forward to build human pose priors. Generative frameworks like GMMs, VAEs [22], and Generative Adversarial Networks (GANs) [13] have shown promise in encapsulating the multifaceted pose distribution, facilitating advancements in tasks like human mesh recovery [19, 12]. Further, some studies have delved into conditional pose priors tailored to specific tasks, incorporating extra information such as image features [39, 3], 2D joint coordinates [8], or sequences of preceding poses [28, 40]. Our initiative leans towards an unconditional pose prior approach, training DPoser on extensive motion capture data without relying on additional inputs like images or text, aiming for a versatile application across various pose-related scenarios.

2.2 Diffusion Models for Pose-centric Tasks

Diffusion models [47, 48, 16, 44] have emerged as powerful tools for capturing intricate data distributions, aligning particularly well with the demands of multi-hypothesis estimation in ambiguous human poses. Notable works include DiffPose [17], which leverages a Gaussian Mixture Model-guided forward diffusion process [36] and employs a Graph Convolutional Network (GCN) [23] architecture conditioned on 2D pose sequences for 3D pose estimation via the learned reverse process (i.e., generation). In a similar vein, DiffusionPose [39] and GFPose [8] employ the generation-based pipeline but take different approaches in conditioning. Further, ZeDO [18] concentrates on 2D-to-3D pose lifting, while Diff-HMR [3] and DiffHand [24] explore estimating SMPL parameters and hand mesh vertices, respectively. BUDDI [34] stands out for using diffusion models to capture the joint distribution of interacting individuals and leveraging SDS loss [38, 52] for optimization during testing phases.

While DPoser shares a similar optimization implementation with BUDDI, it sets itself apart by introducing a wider perspective of inverse problems and equipping it with an innovative timestep scheduling strategy tailored to the characteristics of human poses. Unlike other approaches [18, 39, 8, 17] that primarily focus on 3D location-based representation, DPoser takes on the more demanding task of modeling SMPL-based rotation pose representation. This adds complexity due to the intricacies involved in representing rotations, positioning DPoser as a more versatile solution within the realm of pose-centric tasks.

3 Methods

3.1 Preliminary: Score-based Diffusion Models

Diffusion models [43, 47, 48, 16] operationalize generative processes by inverting a predefined forward diffusion process, typically formulated as a linear stochastic differential equation (SDE). Formally, the data trajectory $\{\mathbf{x}(t)\in\mathbb{R}^{n}\}_{t\in[0,1]}$ follows the forward SDE given by:

$$\mathrm{d}\mathbf{x}=\mu(t)\mathbf{x}\,\mathrm{d}t+g(t)\,\mathrm{d}\mathbf{w}, \tag{1}$$

where $\mu(t)\mathbf{x}\in\mathbb{R}^{n}$ and $g(t)\in\mathbb{R}$ represent the drift and diffusion coefficients, while $\mathbf{w}$ is a standard Wiener process.

The affine drift coefficients ensure analytically tractable Gaussian perturbation kernels, denoted by $p_{0t}(\mathbf{x}_{t}\mid\mathbf{x})=\mathcal{N}(\mathbf{x}_{t};\alpha_{t}\mathbf{x},\sigma_{t}^{2}\mathbf{I})$, where the exact coefficients $\alpha_{t},\sigma_{t}$ can be obtained with standard techniques [41]. Using appropriately designed $\alpha_{t}$ and $\sigma_{t}$, this allows the data distribution $\mathbf{x}_{0}\sim p_{\mathrm{data}}$ to morph into a tractable isotropic Gaussian distribution $\mathbf{x}_{1}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ via forward diffusion.

To recover the data distribution $p_{\mathrm{data}}$ from the Gaussian distribution $\mathcal{N}(\mathbf{0},\mathbf{I})$, we can simulate the corresponding reverse SDE of Eq. (1) [48]:

$$\mathrm{d}\mathbf{x}=\left[\mu(t)\mathbf{x}-g(t)^{2}\nabla_{\mathbf{x}_{t}}\log p_{t}(\mathbf{x}_{t})\right]\mathrm{d}t+g(t)\,\mathrm{d}\bar{\mathbf{w}}. \tag{2}$$

The so-called score function [29], $\nabla_{\mathbf{x}_{t}}\log p_{t}(\mathbf{x}_{t})$, serves as an unknown term in Eq. (2) and can be approximated by a neural network parameterized as $\epsilon_{\phi}(\mathbf{x}_{t};t)\approx-\sigma_{t}\nabla_{\mathbf{x}_{t}}\log p_{t}(\mathbf{x}_{t})$ (this parameterization is obtained from the deep connection between the noise prediction in diffusion models and score function estimation in score-based models; we provide a brief recap in the Appendix). To learn the score functions, employing denoising score matching techniques [50], we perturb the data points with noise as per:

$$\mathbf{x}_{t}=\alpha_{t}\mathbf{x}_{0}+\sigma_{t}\epsilon,\quad\epsilon\sim\mathcal{N}(\mathbf{0},\mathbf{I}). \tag{3}$$

Subsequently, feeding $\mathbf{x}_{t}$ and $t$ as input, we train the time-dependent noise predictor $\epsilon_{\phi}(\mathbf{x}_{t};t)$ using an L2-loss defined as [16]:

$$\mathbb{E}_{\mathbf{x}_{0}\sim p_{\mathrm{data}},\,\epsilon\sim\mathcal{N}(\mathbf{0},\mathbf{I}),\,t\sim\mathcal{U}[0,1]}\left[w(t)\,\|\epsilon-\epsilon_{\phi}(\mathbf{x}_{t};t)\|_{2}^{2}\right], \tag{4}$$

where $w(t)$ denotes a positive weighting function.
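As a small illustration of why the noise-prediction target in Eq. (4) matches the score parameterization, the sketch below perturbs a toy vector with Eq. (3) and checks that $-\sigma_{t}$ times the score of the Gaussian perturbation kernel equals the injected noise. The vector `x0`, the coefficient values, and the helper names are illustrative assumptions, not part of the method.

```python
import random

def perturb(x0, alpha_t, sigma_t, eps):
    """Eq. (3): x_t = alpha_t * x0 + sigma_t * eps, element-wise."""
    return [alpha_t * a + sigma_t * e for a, e in zip(x0, eps)]

def conditional_score(x_t, x0, alpha_t, sigma_t):
    """Gradient w.r.t. x_t of log N(x_t; alpha_t * x0, sigma_t^2 I)."""
    return [-(b - alpha_t * a) / sigma_t ** 2 for a, b in zip(x0, x_t)]

def weighted_l2(eps, eps_pred, w_t):
    """Single-sample integrand of Eq. (4)."""
    return w_t * sum((e - p) ** 2 for e, p in zip(eps, eps_pred))

rng = random.Random(0)
x0 = [0.3, -1.2, 0.7]            # toy stand-in for a pose vector (assumption)
alpha_t, sigma_t = 0.8, 0.5      # arbitrary illustrative coefficients
eps = [rng.gauss(0.0, 1.0) for _ in x0]
x_t = perturb(x0, alpha_t, sigma_t, eps)

# The regression target of the noise predictor is -sigma_t times the
# conditional score, which recovers the injected noise eps.
target = [-sigma_t * s for s in conditional_score(x_t, x0, alpha_t, sigma_t)]
loss_at_target = weighted_l2(eps, target, w_t=sigma_t ** 2)
```

A predictor that outputs `target` therefore attains (numerically) zero loss on this sample, which is the sense in which minimizing Eq. (4) amounts to score estimation.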

Upon successful training, the score functions can be estimated and used to solve the reverse SDE (Eq. (2)). Through techniques like Euler-Maruyama discretization, we can generate novel samples by simulating the reverse SDE.
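The sampling procedure can be sketched on a one-dimensional toy problem. The example below is a minimal sketch under stated assumptions: the data distribution is a Gaussian, so the score of $p_{t}$ is available in closed form and stands in for a trained network, and the SDE uses a constant noise rate `BETA` (that is, $\mu(t)=-\beta/2$ and $g(t)=\sqrt{\beta}$) rather than any schedule from this paper.

```python
import math
import random

BETA = 10.0                       # constant noise rate (illustrative assumption)
DATA_MEAN, DATA_STD = 2.0, 0.5    # toy 1-D "data" distribution

def alpha_sigma(t):
    """Kernel coefficients for mu(t) = -BETA/2 and g(t) = sqrt(BETA)."""
    alpha = math.exp(-0.5 * BETA * t)
    sigma = math.sqrt(1.0 - math.exp(-BETA * t))
    return alpha, sigma

def score(x, t):
    """Exact score of p_t, Gaussian because the toy data is Gaussian."""
    alpha, sigma = alpha_sigma(t)
    var_t = (alpha * DATA_STD) ** 2 + sigma ** 2
    return -(x - alpha * DATA_MEAN) / var_t

def reverse_sde_sample(rng, num_steps=400):
    """Integrate Eq. (2) from t=1 down to t=0 with Euler-Maruyama steps."""
    dt = 1.0 / num_steps
    x = rng.gauss(0.0, 1.0)                      # x_1 ~ N(0, 1)
    for i in range(num_steps):
        t = 1.0 - i * dt
        drift = -0.5 * BETA * x - BETA * score(x, t)   # mu(t) x - g(t)^2 score
        # Time runs backwards, so the drift is subtracted; fresh noise is added.
        x = x - drift * dt + math.sqrt(BETA * dt) * rng.gauss(0.0, 1.0)
    return x

def sample_stats(num_samples=1000, seed=0):
    rng = random.Random(seed)
    xs = [reverse_sde_sample(rng) for _ in range(num_samples)]
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return mean, std

mean, std = sample_stats()
print(f"sample mean={mean:.3f}, sample std={std:.3f}")
```

Up to discretization and Monte Carlo error, the sample mean and standard deviation should approach `DATA_MEAN` and `DATA_STD`; with a learned noise predictor, `score` would be replaced by $-\epsilon_{\phi}(\mathbf{x}_{t};t)/\sigma_{t}$.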

3.2 Learning Pose Prior with Unconditional Diffusion Models

SMPL-based pose representation. To build a flexible 3D human pose prior, we propose to utilize the SMPL body model [31], which can be viewed as a differentiable function $[J,V]=M(\theta,\beta)$ that maps body joint angles $\theta\in\mathbb{R}^{3\times 21}$ and shape parameters $\beta\in\mathbb{R}^{10}$ to mesh vertices $V\in\mathbb{R}^{3\times 6890}$ and joint positions $J\in\mathbb{R}^{3\times 22}$. Our target is to model the distribution of joint angles $p(\theta)$.

Training of unconditional diffusion models. To this end, we adopt an unconditional diffusion model to learn the pose representation $\theta$. This approach aligns with a task-agnostic strategy, focusing solely on the distribution of 3D poses. We employ sub-VP SDEs as outlined in [48], which have demonstrated efficacy in sampling quality, for constructing our diffusion model. Specifically, our chosen forward SDE (Eq. (1)) is given by:

$$\mathrm{d}\mathbf{x}=-\frac{1}{2}\xi(t)\mathbf{x}\,\mathrm{d}t+\sqrt{\xi(t)\left(1-e^{-2\int_{0}^{t}\xi(s)\mathrm{d}s}\right)}\,\mathrm{d}\mathbf{w}, \tag{5}$$

where $\xi(t)$ denotes linearly scheduled noise scales. The coefficients needed in Eq. (3) can be obtained as $\alpha_{t}=e^{-\frac{1}{2}\int_{0}^{t}\xi(s)\mathrm{d}s}$ and $\sigma_{t}=1-e^{-\int_{0}^{t}\xi(s)\mathrm{d}s}$.
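To make these coefficients concrete, the following sketch evaluates $\alpha_{t}$, $\sigma_{t}$, and the diffusion coefficient of Eq. (5) for a linear schedule $\xi(t)=\xi_{\min}+t(\xi_{\max}-\xi_{\min})$, whose integral has a closed form. The endpoint values `XI_MIN` and `XI_MAX` are assumptions chosen for illustration; the text above does not state the values used for DPoser.

```python
import math

XI_MIN, XI_MAX = 0.1, 20.0   # assumed endpoints of the linear schedule

def xi(t):
    """Linear noise scale xi(t) on t in [0, 1]."""
    return XI_MIN + t * (XI_MAX - XI_MIN)

def xi_integral(t):
    """Closed form of the integral of xi(s) over [0, t]."""
    return XI_MIN * t + 0.5 * (XI_MAX - XI_MIN) * t * t

def alpha(t):
    """alpha_t = exp(-1/2 * integral of xi)."""
    return math.exp(-0.5 * xi_integral(t))

def sigma(t):
    """sigma_t = 1 - exp(-integral of xi), the sub-VP standard deviation."""
    return 1.0 - math.exp(-xi_integral(t))

def diffusion_coeff(t):
    """g(t) from Eq. (5)."""
    return math.sqrt(xi(t) * (1.0 - math.exp(-2.0 * xi_integral(t))))

for t in (0.0, 0.25, 0.5, 1.0):
    print(f"t={t:.2f}  alpha={alpha(t):.4f}  sigma={sigma(t):.4f}  "
          f"g={diffusion_coeff(t):.4f}")
```

At $t=0$ the sample is untouched ($\alpha_{0}=1$, $\sigma_{0}=0$), and as $t\to 1$ the signal coefficient decays towards zero while $\sigma_{t}$ approaches one. The sub-VP standard deviation never exceeds $\sqrt{1-\alpha_{t}^{2}}$, the value a variance-preserving SDE with the same schedule would give.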

During training, we initiate with a clean data point $\mathbf{x}_{0}$ (essentially, our pose representation $\theta$) and introduce noise to generate samples $\mathbf{x}_{t}$ according to the forward process detailed in Eq. (3). Then we apply the objective in Eq. (4) to train the noise predictor $\epsilon_{\phi}(\mathbf{x}_{t};t)$ with weights $w(t)=\sigma_{t}^{2}$ as suggested in [48].

3.3 Optimization Leveraging Diffusion Priors

The acquired score functions or noise predictors, denoted as $\epsilon_{\phi}(\mathbf{x}_{t};t)$, permit the direct generation of plausible poses through Eq. (2). Yet, the broader integration of diffusion priors into general optimization frameworks remains an open avenue. We address this by reframing pose-related tasks as inverse problems and applying variational diffusion sampling techniques [33] for efficient resolution.

Inverse problem formulation. Consider an original signal $\mathbf{x}_{0}$. Inverse problems can be encapsulated by Eq. (6) as:

$$\mathbf{y}=\mathcal{A}(\mathbf{x}_{0})+\mathbf{n},\quad\mathbf{y},\mathbf{n}\in\mathbb{R}^{d},~\mathbf{x}_{0}\in\mathbb{R}^{n}, \tag{6}$$

where $\mathcal{A}$ symbolizes the measurement operator and $\mathbf{n}$ constitutes noise, assumed to be white Gaussian $\mathcal{N}(\mathbf{0},\sigma_{n}^{2}\mathbf{I})$. In the context targeted in this study, $\mathbf{x}_{0}$ always refers to body poses in SMPL [31]. This formulation allows us to approach various pose-centric tasks by adapting $\mathcal{A}$ and interpreting $\mathbf{y}$ accordingly:

  • Pose completion: Here, $\mathcal{A}$ serves as a mask matrix to simulate partially observed poses, with $\mathbf{y}$ being the incomplete pose data.

  • Motion denoising: In this scenario, $\mathcal{A}$ applies SMPL’s forward kinematics, treating $\mathbf{y}$ as the observed noisy 3D joints.

  • Human mesh recovery: $\mathcal{A}$ integrates SMPL’s forward kinematics and camera projection to relate $\mathbf{y}$ to 2D joint observations in images.

The aim is to recover the original signal $\mathbf{x}_{0}$, where, within the Bayesian framework, our objective shifts to sampling from the posterior distribution $p(\mathbf{x}_{0}\mid\mathbf{y})$.
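The simplest of the three operators, the pose-completion mask, can be sketched in a few lines. The example below is illustrative only: it assumes the pose is flattened joint by joint into a list of $21\times 3$ axis-angle entries, and the helper names and the choice of visible joints are hypothetical. The other two operators require the SMPL body model and are not reproduced here.

```python
NUM_JOINTS, DIMS = 21, 3   # matches theta in R^{3 x 21}, flattened joint-major

def make_mask_operator(visible_joints):
    """Return A(x0) that keeps only the entries belonging to visible joints."""
    keep = [j * DIMS + d for j in sorted(visible_joints) for d in range(DIMS)]

    def measure(x0):
        return [x0[i] for i in keep]

    return measure

def data_term(y, measure, x0):
    """Squared residual ||y - A(x0)||^2 between observation and candidate."""
    return sum((yi - ai) ** 2 for yi, ai in zip(y, measure(x0)))

true_pose = [0.01 * i for i in range(NUM_JOINTS * DIMS)]
measure = make_mask_operator(visible_joints=range(0, 12))   # joints 12..20 hidden
y = measure(true_pose)                                      # noise-free observation

candidate = list(true_pose)
candidate[-1] += 5.0    # corrupt a hidden joint: the data term cannot detect it
hidden_error_cost = data_term(y, measure, candidate)
```

Because `hidden_error_cost` stays at zero even for an implausible hidden joint, the data term alone leaves the occluded part of the pose unconstrained, which is where the prior $p(\mathbf{x}_{0})$ enters the posterior.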

Solving inverse problems with diffusion models. Various techniques [14, 21, 6, 7, 45, 33] have been explored to simulate this posterior sampling process based on unconditional diffusion priors $p(\mathbf{x}_{0};\phi)$. Among them, the sampling-based scheme is widely explored and applied in tasks like image restoration. These methods incorporate the observation information $\mathbf{y}$ into the generation process of $\mathbf{x}_{0}$ through techniques like gradient guidance [6, 7] and back projection [48, 21, 7]. However, such methods rooted in generation are inconvenient for handling diverse pose-related tasks. To navigate these challenges, we adopt variational diffusion sampling [33] to build general optimization frameworks. Specifically, it employs a variational distribution $q(\mathbf{x}_{0}\mid\mathbf{y}):=\mathcal{N}(\mu,\sigma^{2}\mathbf{I})$ and aims to minimize the Kullback-Leibler (KL) divergence between this variational distribution and the true posterior, mathematically expressed as $KL\big(q(\mathbf{x}_{0}\mid\mathbf{y})\parallel p(\mathbf{x}_{0}\mid\mathbf{y})\big)$. Further, under the assumption of zero variance ($\sigma\approx 0$), the optimization problem of seeking $\mathbf{x}_{0}$ (i.e., $\mu$) can be formulated as minimizing [46, 33]:

$$\|\mathbf{y}-\mathcal{A}(\mathbf{x}_{0})\|^{2}+w_{t}\left(\mathtt{sg}[\epsilon_{\phi}(\mathbf{x}_{t};t)-\epsilon]\right)^{\top}\mathbf{x}_{0}, \tag{7}$$

where $w_{t}$ denotes the loss weights and $\epsilon$ is sampled from the standard Gaussian distribution. Here, $\mathtt{sg}$ signifies the stopped-gradient operator, indicating that backpropagation through the trained diffusion models is not required. The optimization procedure initiates by selecting a timestep $t$ and applying a perturbation to the target $\mathbf{x}_{0}$ as per Eq. (3), resulting in $\mathbf{x}_{t}$. Subsequently, the gradients $[\epsilon_{\phi}(\mathbf{x}_{t};t)-\epsilon]$ are applied to the optimization variable $\mathbf{x}_{0}$. In a nutshell, this framework [33] provides a flexible yet robust strategy for employing diffusion priors in generic optimization problems, serving as a cornerstone for our work.
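The procedure can be sketched as a plain gradient-descent loop on a two-dimensional toy problem. Everything specific here is an assumption made for illustration: the prior is a Gaussian whose noise predictor is known in closed form (standing in for the trained $\epsilon_{\phi}$), a single fixed timestep is used, $\mathcal{A}$ observes only the first coordinate, and the step size and weight are arbitrary.

```python
import random

PRIOR_MEAN = [1.0, -0.5]      # toy Gaussian prior N(PRIOR_MEAN, PRIOR_STD^2 I)
PRIOR_STD = 0.2
ALPHA_T, SIGMA_T = 0.9, 0.3   # coefficients of one fixed timestep (assumed)

def eps_phi(x_t):
    """Exact noise predictor for the toy prior: -sigma_t times the score of p_t."""
    var_t = (ALPHA_T * PRIOR_STD) ** 2 + SIGMA_T ** 2
    return [SIGMA_T * (x - ALPHA_T * m) / var_t for x, m in zip(x_t, PRIOR_MEAN)]

def fit(y_observed, w_t=0.1, lr=0.1, steps=1000, seed=0):
    """Gradient descent on Eq. (7); the mask A observes coordinate 0 only."""
    rng = random.Random(seed)
    x0 = [0.0, 0.0]
    for _ in range(steps):
        eps = [rng.gauss(0.0, 1.0) for _ in x0]
        x_t = [ALPHA_T * a + SIGMA_T * e for a, e in zip(x0, eps)]   # Eq. (3)
        # Stop-gradient: (eps_phi - eps) is treated as a constant vector, so
        # the regularizer's gradient w.r.t. x0 is simply w_t * (eps_phi - eps).
        reg_grad = [w_t * (p - e) for p, e in zip(eps_phi(x_t), eps)]
        # Gradient of the data term ||y - A(x0)||^2; the hidden coordinate gets 0.
        data_grad = [2.0 * (x0[0] - y_observed), 0.0]
        x0 = [a - lr * (d + r) for a, d, r in zip(x0, data_grad, reg_grad)]
    return x0

estimate = fit(y_observed=1.2)
print(estimate)
```

In this toy setting the observed coordinate settles close to the measurement, while the unobserved coordinate, which the data term never touches, is drawn towards the prior mean by the diffusion term alone.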

Figure 2: Overview of the DPoser Methodology. Panel (a) presents three tasks: human mesh recovery, pose completion, and motion denoising, with omissions like camera optimization for clarity. Panel (b) demonstrates the DPoser regularization process, introducing noise to the current pose and applying a one-step denoiser to achieve a denoised pose. $L_{\text{DPoser}}$ is computed between the denoised and current pose. Panel (c) outlines the optimization process from initial to fitted poses via loss minimization.

Introducing DPoser regularization. To shed more light on the working mechanism, we propose to reformulate the regularization term as:

\begin{align}
L_{\mathrm{DPoser}} &= w_{t}\,\|\mathbf{x}_{0}-\mathtt{sg}[\hat{\mathbf{x}}_{0}(t)]\|_{2}^{2},\quad\text{where} \tag{8}\\
\hat{\mathbf{x}}_{0}(t) &= \frac{\mathbf{x}_{t}-\sigma_{t}\epsilon_{\phi}(\mathbf{x}_{t};t)}{\alpha_{t}}. \tag{9}
\end{align}

Here, $\hat{\mathbf{x}}_{0}(t)$ functions as a precise one-step denoising prediction using the diffusion model $\epsilon_{\phi}(\mathbf{x}_{t};t)$. This approach effectively encourages the current pose $\mathbf{x}_{0}$ towards a denoised, plausible pose distribution, employing a straightforward L2-loss within the DPoser regularization framework. Further, the theoretical foundation of our regularization demonstrates its alignment with the gradient direction of variational diffusion sampling (Eq. (7)).

Proof: Differentiating Eq. (8) with respect to $\mathbf{x}_{0}$ yields:

\begin{align}
\nabla_{\mathbf{x}_{0}}L_{\mathrm{DPoser}} &= 2w_{t}\left(\mathbf{x}_{0}-\hat{\mathbf{x}}_{0}(t)\right) \notag\\
&= 2w_{t}\left(\frac{\mathbf{x}_{t}-\sigma_{t}\epsilon}{\alpha_{t}}-\frac{\mathbf{x}_{t}-\sigma_{t}\epsilon_{\phi}(\mathbf{x}_{t};t)}{\alpha_{t}}\right) \notag\\
&= 2w_{t}\frac{\sigma_{t}}{\alpha_{t}}\left(\epsilon_{\phi}(\mathbf{x}_{t};t)-\epsilon\right) \notag\\
&\propto \left(\epsilon_{\phi}(\mathbf{x}_{t};t)-\epsilon\right). \tag{10}
\end{align}

Thus, $L_{\mathrm{DPoser}}$ offers a more intuitive view of variational diffusion sampling. When combined with task-specific loss functions, this regularization term improves the plausibility of the resulting poses.
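The identity in Eq. (10) can be checked numerically. The sketch below is illustrative only: `eps_phi` is a hypothetical stand-in for the trained noise predictor, and the values of `alpha_t`, `sigma_t`, and `w_t` are arbitrary assumptions. It compares the closed-form gradient with finite differences of the loss, holding $\mathbf{\hat{x}}_0(t)$ fixed as the stop-gradient in Algorithm 1 prescribes.

```python
import numpy as np

rng = np.random.default_rng(0)

def eps_phi(x_t, t):
    # Hypothetical stand-in for the trained noise predictor; any fixed function works here.
    return np.tanh(x_t) * (1.0 + t)

def dposer_loss(x0, x0_hat, w_t):
    # L_DPoser = w_t * ||x0 - sg[x0_hat]||^2, with x0_hat held constant (stop-gradient).
    return w_t * np.sum((x0 - x0_hat) ** 2)

# Assumed values, chosen only for illustration.
alpha_t, sigma_t, w_t, t = 0.9, 0.3, 0.5, 0.1
x0 = rng.normal(size=6)
eps = rng.normal(size=6)

x_t = alpha_t * x0 + sigma_t * eps                      # forward diffusion
x0_hat = (x_t - sigma_t * eps_phi(x_t, t)) / alpha_t    # one-step denoiser

# Closed form from Eq. (10).
grad_closed = 2 * w_t * sigma_t / alpha_t * (eps_phi(x_t, t) - eps)

# Central finite differences on the loss with x0_hat frozen.
h = 1e-6
grad_fd = np.array([
    (dposer_loss(x0 + h * e, x0_hat, w_t) - dposer_loss(x0 - h * e, x0_hat, w_t)) / (2 * h)
    for e in np.eye(6)
])
```

Because the loss is quadratic in $\mathbf{x}_0$ once $\mathbf{\hat{x}}_0(t)$ is frozen, `grad_closed` and `grad_fd` should agree up to floating-point error.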

DPoser across pose-related tasks. DPoser excels in versatility, enabling its seamless application in a spectrum of human pose-related tasks. Its adaptability is especially evident in our human mesh recovery approach, as depicted in Fig. 2. For an exhaustive examination of DPoser’s utility across tasks like pose completion and motion denoising, we direct the reader to our Appendix.

Human mesh recovery aims to estimate human pose and shape from a single image. Here, we adapt the optimization objective of the SMPLify framework [1], integrating DPoser as a regularization term $L_{\mathrm{DPoser}}$ and omitting the interpenetration error term for simplicity. The modified objective, over the pose $\theta$ and shape $\beta$ parameters of the SMPL model [31], is defined as:

$$L(\theta, \beta) = L_J + w_\theta L_\theta + w_\beta L_\beta + w_\alpha L_{\mathrm{DPoser}}. \tag{11}$$

The reprojection loss $L_J$, acting as the data fidelity measure, is defined by:

$$L_J = \sum_{i \in \text{Joints}} \lambda_i \, \rho\left(\Pi_C\left(M_J(\theta, \beta)_i\right) - J^{\text{est}}_i\right), \tag{12}$$

where $M_J(\theta, \beta)$ computes the 3D joint coordinates through SMPL’s forward kinematics. The function $\Pi_C$ projects these 3D coordinates into the 2D image plane using the camera parameters. $J^{\text{est}}$ denotes the 2D keypoints estimated by an off-the-shelf 2D pose estimator (in our case, ViTPose [55]), with $\lambda_i$ the confidence score of joint $i$. The Geman-McClure error function $\rho$ provides a robust measure of the discrepancy in 2D joint locations.
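A minimal sketch of the reprojection term in Eq. (12) follows. It is not the authors' implementation: the pinhole `project` function stands in for $\Pi_C$, the Geman-McClure form $\rho(e) = \sigma^2 e^2 / (\sigma^2 + e^2)$ is the one commonly used in SMPLify-style fitting code and is assumed here, and `sigma=100.0` is an arbitrary illustrative value. The 3D joints are passed in directly rather than computed by SMPL.

```python
import numpy as np

def gmof(residual, sigma):
    # Geman-McClure penalty (assumed form): sigma^2 * e^2 / (sigma^2 + e^2), elementwise.
    sq = residual ** 2
    return sigma ** 2 * sq / (sigma ** 2 + sq)

def project(joints_3d, focal, center):
    # Pinhole projection standing in for Pi_C; assumes camera coordinates with z > 0.
    return focal * joints_3d[:, :2] / joints_3d[:, 2:3] + center

def reprojection_loss(joints_3d, joints_2d_est, conf, focal, center, sigma=100.0):
    # Confidence-weighted robust distance between projected and detected 2D joints.
    residual = project(joints_3d, focal, center) - joints_2d_est   # shape (J, 2)
    return np.sum(conf[:, None] * gmof(residual, sigma))
```

The penalty saturates at $\sigma^2$ per coordinate, so a single badly detected keypoint cannot dominate the objective.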

To mitigate overfitting, which often leads to unrealistic poses when only the reprojection loss is minimized, several regularization terms are introduced. Specifically, alongside our body prior $L_{\mathrm{DPoser}}$, the bending term $L_\theta$ is incorporated to penalize excessive bending at the elbows and knees, formulated as $L_\theta = \sum_{i \in \text{(elbows, knees)}} \exp(\boldsymbol{\theta}_i)$. Additionally, the shape regularization term $L_\beta = \|\beta\|_2^2$ is employed to keep the body shape within plausible bounds. The weights of the prior terms are denoted $w_\theta$, $w_\beta$, and $w_\alpha$, respectively.
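The two hand-crafted priors and the weighted sum of Eq. (11) can be sketched as below. The function names and the way the bending angles are passed in are assumptions made for illustration; the sign convention that decides which bending direction is penalized is left to the caller.

```python
import numpy as np

def bending_prior(theta_bend):
    # L_theta = sum_i exp(theta_i) over the elbow and knee bending angles.
    return np.sum(np.exp(theta_bend))

def shape_prior(beta):
    # L_beta = ||beta||_2^2
    return np.sum(beta ** 2)

def total_objective(l_j, theta_bend, beta, l_dposer, w_theta, w_beta, w_alpha):
    # Eq. (11): L = L_J + w_theta * L_theta + w_beta * L_beta + w_alpha * L_DPoser
    return (l_j
            + w_theta * bending_prior(theta_bend)
            + w_beta * shape_prior(beta)
            + w_alpha * l_dposer)
```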

Given the structure of $L_{\mathrm{DPoser}}$ (Eq. (8)), a crucial question is how to select the diffusion timestep $t$ during the iterative optimization. The next section addresses this with our truncated timestep scheduling strategy.

3.4 Test-time Truncated Timestep Scheduling

Motivation from pose generation. Adapting techniques from the image domain to pose data requires a nuanced understanding of the differences between the two. Previous image-based research [4] shows that initial timesteps (larger $t$) correspond to the perceptual content, while later timesteps refine details. Pose data, however, lacks this structured layering and spatial redundancy, indicating a need for a tailored timestep approach in the diffusion process.

Figure 3: Illustration of the rationale behind our proposed truncated timestep scheduling. We employ the deterministic DDIM sampler [44] with limited steps and assess the quality of generated poses using the Self-Intersection percentage (SI).

As depicted in Fig. 3, we find that pose generation does not benefit from the early timesteps as image generation does. The significant stages of pose refinement occur at smaller $t$, specifically when $t \leq 0.3$. A uniform distribution of timesteps, as tested in (b) with only five steps, proves less effective for pose data. In contrast, allocating these steps toward the latter end of the diffusion process, as in (c), yields significantly better samples, implying that the critical information is not evenly distributed but concentrated toward the end.

Truncated timestep scheduling. Based on these insights, we propose a shift from standard uniform timestep sampling to a truncated strategy, especially for pose data. By focusing on the last timesteps, particularly between 0.2 and 0.0, we target the interval rich in pose-specific information. Specifically, based on the linear descending scheduling, the truncated timestep $t$ for each optimization step can be expressed as:

$$t = t_{\text{max}} - \frac{(t_{\text{max}} - t_{\text{min}}) \times \text{iter}}{N - 1}, \tag{13}$$

where $N$ denotes the total number of optimization iterations, and iter signifies the current iteration. This formulation is integral to our proposed optimization framework, which is summarized in Algorithm 1. In practice, the truncated range $[t_{\text{max}}, t_{\text{min}}]$ is typically set to $[0.2, 0.05]$.
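As a small illustration of Eq. (13), the helper below lists the timestep used at each iteration. The function name is hypothetical; the defaults follow the range $[0.2, 0.05]$ stated above.

```python
def truncated_timesteps(n_iters, t_max=0.2, t_min=0.05):
    # Linear descent from t_max (iter = 0) to t_min (iter = N - 1), as in Eq. (13).
    if n_iters < 2:
        raise ValueError("n_iters must be at least 2 so that N - 1 is nonzero")
    return [t_max - (t_max - t_min) * i / (n_iters - 1) for i in range(n_iters)]
```

For example, `truncated_timesteps(4)` yields values close to 0.2, 0.15, 0.1, and 0.05, so every optimization step queries the model in the low-noise regime.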

Algorithm 1 Test-time Optimization with DPoser
1: A trained diffusion model $\epsilon_\phi(\mathbf{x}_t; t)$, task-specific loss $L_{\text{task}}$, range of diffusion timesteps $[t_{\text{max}}, t_{\text{min}}]$, number of optimization iterations $N$.
2: Initialization of SMPL body pose parameters $\mathbf{x}_0$
3: for $\text{iter} = 0, 1, \ldots, N-1$ do
4:     $t \leftarrow t_{\text{max}} - \frac{(t_{\text{max}} - t_{\text{min}}) \times \text{iter}}{N-1}$ ▷ Timestep scheduling
5:     Sample $\epsilon \sim \mathcal{N}(0, I)$
6:     $\mathbf{x}_t \leftarrow \alpha_t \mathbf{x}_0 + \sigma_t \epsilon$ ▷ Forward diffusion
7:     $\mathbf{\hat{x}}_0(t) \leftarrow \frac{\mathbf{x}_t - \sigma_t \epsilon_\phi(\mathbf{x}_t; t)}{\alpha_t}$ ▷ One-step denoiser
8:     $L_{\text{DPoser}} \leftarrow w_t \lVert \mathbf{x}_0 - \text{sg}[\mathbf{\hat{x}}_0(t)] \rVert_2^2$ ▷ DPoser regularization
9:     $L_{\text{total}} \leftarrow L_{\text{task}} + L_{\text{DPoser}}$
10:     Update $\mathbf{x}_0$ via backpropagation on $L_{\text{total}}$
11: end for
12: return $\mathbf{x}_0$
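The loop in Algorithm 1 can be sketched on a toy problem. Everything specific below is an assumption made for illustration, not the authors' setup: the noise schedule `alpha_sigma`, the weight `w_t = 1`, the plain gradient-descent update, and the analytic noise predictor, which is exact for toy data $\mathbf{x}_0 \sim \mathcal{N}(\mu, s^2 I)$ and stands in for the trained network. The task loss is an L2 loss on the visible entries, mimicking pose completion.

```python
import numpy as np

def alpha_sigma(t):
    # Assumed variance-preserving schedule; the trained model's schedule may differ.
    return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)

def make_gaussian_eps(mu, s):
    # Analytic noise predictor for toy data x0 ~ N(mu, s^2 I); stands in for eps_phi.
    def eps_phi(x_t, t):
        a, sig = alpha_sigma(t)
        return sig * (x_t - a * mu) / (a ** 2 * s ** 2 + sig ** 2)
    return eps_phi

def optimize_with_dposer(x0, eps_phi, task_grad, n_iters=200,
                         t_max=0.2, t_min=0.05, w_t=1.0, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    x0 = x0.copy()
    for it in range(n_iters):
        t = t_max - (t_max - t_min) * it / (n_iters - 1)   # timestep scheduling
        a, sig = alpha_sigma(t)
        eps = rng.normal(size=x0.shape)                     # sample noise
        x_t = a * x0 + sig * eps                            # forward diffusion
        x0_hat = (x_t - sig * eps_phi(x_t, t)) / a          # one-step denoiser
        # x0_hat is held constant (stop-gradient), so the prior gradient is 2 w_t (x0 - x0_hat).
        grad = task_grad(x0) + 2.0 * w_t * (x0 - x0_hat)
        x0 -= lr * grad                                     # plain gradient step (assumed optimizer)
    return x0

mu = np.array([0.5, -0.3, 0.8, 0.1])
observed = mu + np.array([0.05, -0.08, 0.0, 0.0])
visible = np.array([1.0, 1.0, 0.0, 0.0])
task_grad = lambda x: 2.0 * visible * (x - observed)        # gradient of the L2 loss on visible entries

init = np.where(visible > 0, observed, np.random.default_rng(1).normal(size=4))
result = optimize_with_dposer(init, make_gaussian_eps(mu, 0.1), task_grad)
```

In this toy setting the hidden entries, initialized with random noise, are drawn toward the mode of the prior, while the visible entries stay close to the observation.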

4 Experiments

In this section, we showcase the robustness and versatility of DPoser across a spectrum of pose-centric tasks, including pose generation, human mesh recovery, pose completion, and motion denoising. Due to the page limit, we leave experimental details and more qualitative assessments in the Appendix.

4.1 Experimental Setup

Implementation details. We train our DPoser model on the AMASS dataset [32], adhering to the same training partition as previous works [37, 49]. The model employs axis-angle representation for joint rotations, which we normalize to have zero mean and unit variance. The architecture consists of a fully connected neural network with approximately 8.28M parameters. It draws inspiration from GFPose [8] but omits conditional input pathways for our unconditional setting. To stabilize training, we use an exponential moving average with a decay factor of 0.9999, as advised by [48]. We optimize with Adam, a learning rate of $2 \times 10^{-4}$, and a batch size of 1280. Training for 800,000 iterations takes roughly 8 hours on a single NVIDIA RTX 3090 Ti GPU.
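Two of the details above, input normalization and the exponential moving average of the weights, are simple enough to sketch. The helper names are hypothetical, and the per-dimension statistics are an assumption, since the text does not say whether the normalization is per dimension or global.

```python
import numpy as np

def normalize(poses):
    # poses: (num_samples, D) axis-angle vectors; returns normalized data and the statistics.
    # Assumes every dimension has nonzero variance.
    mean, std = poses.mean(axis=0), poses.std(axis=0)
    return (poses - mean) / std, mean, std

def ema_update(ema_params, params, decay=0.9999):
    # One EMA step: the averaged weights move a fraction (1 - decay) toward the current weights.
    return decay * ema_params + (1.0 - decay) * params
```

The stored `mean` and `std` are needed again at test time to map optimized poses back to axis-angle space.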

Evaluation metrics. To comprehensively evaluate our models across various tasks, following Pose-NDF [49], we adopt task-specific metrics:

  • Pose Generation: Diversity and fidelity are evaluated using Average Pairwise Distance (APD) and Self-Intersection rates (SI), respectively.

  • Human Mesh Recovery: The Procrustes-aligned Mean Per Joint Position Error (PA-MPJPE) measures the accuracy of recovered human meshes.

  • Pose Completion: The Mean Per Joint Position Error (MPJPE) for masked body joints serves as the metric, focusing on the inferred occluded parts.

  • Motion Denoising: Both MPJPE and the Mean Per-Vertex Position Error (MPVPE) are calculated to assess the denoising effectiveness.

All errors are reported in millimeter units.
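For reference, MPJPE and its Procrustes-aligned variant can be sketched as follows. This is a generic implementation of the standard definitions, not the evaluation code used in the paper; both functions assume a single pose given as a `(J, 3)` array of joint positions.

```python
import numpy as np

def mpjpe(pred, gt):
    # Mean Euclidean distance over joints; pred and gt have shape (J, 3).
    return np.mean(np.linalg.norm(pred - gt, axis=-1))

def procrustes_align(pred, gt):
    # Similarity transform (scale, rotation, translation) that best maps pred onto gt.
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    u, s, vt = np.linalg.svd(p.T @ g)
    d = np.ones(3)
    d[-1] = np.sign(np.linalg.det(u @ vt))   # guard against reflections
    rot_t = u @ np.diag(d) @ vt              # transposed rotation, for row-vector points
    scale = np.sum(s * d) / np.sum(p ** 2)
    return scale * p @ rot_t + mu_g

def pa_mpjpe(pred, gt):
    # MPJPE after removing global scale, rotation, and translation.
    return mpjpe(procrustes_align(pred, gt), gt)
```

PA-MPJPE therefore measures only the articulated pose error, which is why it is the metric of choice when the camera and global orientation are themselves estimated.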

4.2 Pose Generation

Figure 4: Qualitative comparison of generated human poses. Panels: (a) GAN-S [9]; (b) DPoser (ours); (c) DPoser (ours)*; (d) GMM [1]; (e) Pose-NDF [49]; (f) VPoser [37]. Panel (b) illustrates naturalistic poses aligned with real-world data, whereas (c) shows poses that, despite superior metrics, lack natural appearance. *We use a DDIM sampler [44] with only 10 steps.
| Sample source | APD ↑ | SI ↓ |
|---|---|---|
| Real-world (AMASS) [32] | 15.44 | 0.79 |
| GMM [1] | 16.28 | 1.54 |
| VPoser [37] | 10.75 | 1.51 |
| Pose-NDF [49] | 18.75 | 1.97 |
| GAN-S [9] | 15.68 | 1.27 |
| DPoser (ours) | 14.28 | 1.21 |
| DPoser (ours)* | 19.03 | 1.13 |

Table 1: Comparative analysis of pose generation metrics. The discrepancy between visual impressions and APD/SI metrics is discussed, with reference to Fig. 4. *Indicates the use of a reduced 10-step sampler.

We first examine DPoser by generating samples from the learned manifold. Using a standard Euler-Maruyama discretization with 1000 steps, we assess both the diversity and realism of the generated poses (Fig. 4). While DPoser’s outputs are visually diverse and realistic, poses generated by competing methods such as GMM [1] and Pose-NDF [49] fall short in naturalism, and VPoser [37] exhibits limited diversity.

Interestingly, quantitative metrics such as APD and SI (Tab. 1) do not always corroborate our qualitative findings. For instance, a 10-step DDIM sampler [44], suboptimal by design, exceeds real-world data [32] in APD (19.03 vs. 15.44), which we attribute to the generation of exaggerated poses. These findings underscore the need for an evaluation strategy that combines quantitative metrics with qualitative observations.

4.3 Human Mesh Recovery

| Initialization | No fitting | GMM [1] | VPoser [37] | Pose-NDF [49] | GAN-S [9] | DPoser (ours) |
|---|---|---|---|---|---|---|
| From scratch | 108.57 | 58.32 | 58.08 | 57.87 | 57.26 | 56.05 |
| CLIFF [25] | 56.62 | 51.02 | 49.39 | 49.50 | 49.58 | 49.05 |

Table 2: Performance comparison of human mesh recovery on the EHF dataset [37] using two initialization methods. PA-MPJPE is reported as the metric.
Figure 5: Human mesh recovery. (a) Fitting from scratch. *Ground truth for the EHF dataset is annotated in SMPL-X [37], which extends SMPL [31] with fully articulated hands and an expressive face. (b) Initialization using the CLIFF [25] prediction.

Figure 6: Qualitative evaluations of human mesh recovery leveraging DPoser as pose prior. Panels: (a) EHF dataset; (b) MSCOCO dataset; (c) 3DPW dataset; (d) UBody dataset. CLIFF [25] serves as the optimization initializer.

We probe the efficacy of DPoser in human mesh recovery (HMR), focusing on estimating human pose and shape from monocular images. We conduct experiments on the EHF dataset [37] and benchmark our method against existing SOTA priors. Our optimization-based framework incorporates two initialization paradigms: (1) a baseline initialization that utilizes mean pose values and a default camera setup, and (2) an advanced initialization scheme that leverages CLIFF [25], a pre-trained regression-based model tailored for HMR. Moreover, GAN-S [9] implementations require a GAN-inversion phase to convert initial poses into their latent representations, which is notably time-consuming.

Tab. 2 and Fig. 5 showcase the comparative performance of DPoser, highlighting its exceptional ability in HMR tasks. Notably, when fitting from scratch, it surpasses established SOTA priors like GAN-S [9] and Pose-NDF [49] and rivals the specific regression-based model [25]. The integration of CLIFF as initialization further amplifies DPoser’s performance, underscoring its efficiency and the benefits of employing refined starting conditions. Fig. 6 further confirms DPoser’s superior efficacy and adaptability across multiple datasets including EHF [37], MSCOCO [27], 3DPW [51], and UBody [26].

4.4 Pose Completion

In practical scenarios such as those in the UBody dataset [26] (see Fig. 6(d)), HMR algorithms often struggle with occlusions, which leads to incomplete 3D pose estimates. Our goal here is to recover full 3D poses from partially observed data, initializing the occluded parts with random noise. DPoser refines these initially implausible poses into feasible ones, while an L2 loss on the visible parts ensures data consistency.

| Initialization | VPoser | Pose-NDF | DPoser |
|---|---|---|---|
| Zeros | 180.90 | 157.50 | 73.92 |
| 10 mm noise | 181.86 | 172.50 | 74.69 |
| 100 mm noise | 180.25 | 511.51 | 74.19 |

Table 3: Pose completion on the AMASS [32] dataset (left leg under occlusion, single hypothesis) using various initialization strategies. DPoser demonstrates its effectiveness across all conditions.

In parallel, we employ a comparable optimization strategy for both Pose-NDF [49] and VPoser [37]. Notably, Tab. 3 reveals that Pose-NDF struggles with poorly initialized poses unseen during its training phase. To mitigate this issue, we have to initialize the occluded poses near zero (close to rest pose) for Pose-NDF to prevent optimization divergence. Additionally, as a task-specific baseline, we adapt the original VPoser model into CVPoser by incorporating conditional inputs within its VAE framework [22]. This modification enables the encoder and decoder to process additional partial poses, facilitating end-to-end conditional sampling.

| Methods | Occ. left leg | Occ. legs | Occ. arms | Occ. trunk |
|---|---|---|---|---|
| Pose-NDF ($S=1$) [49] | 158.21 | 159.19 | 201.00 | 75.42 |
| Pose-NDF ($S=5$) | 147.66/158.11/7.62 | 151.86/159.21/5.33 | 196.36/200.92/3.30 | 70.88/75.39/3.25 |
| Pose-NDF ($S=10$) | 144.38/158.06/8.31 | 149.38/159.14/5.90 | 194.79/200.87/3.63 | 69.45/75.38/3.54 |
| VPoser ($S=1$) [37] | 180.78 | 198.18 | 159.86 | 37.75 |
| VPoser ($S=5$) | 167.92/181.30/10.53 | 178.77/198.15/14.51 | 148.17/159.65/8.64 | 31.83/37.79/4.54 |
| VPoser ($S=10$) | 162.82/181.09/12.21 | 172.83/198.31/16.30 | 144.53/159.80/9.69 | 30.06/37.78/4.99 |
| CVPoser ($S=10$)† | 71.66/145.52/51.68 | 90.49/148.30/38.46 | 83.02/136.82/36.47 | 18.77/37.83/13.12 |
| DPoser (ours) ($S=1$) | 74.48 | 97.39 | 81.49 | 28.58 |
| DPoser (ours) ($S=5$) | 42.64/73.85/24.36 | 67.70/97.06/22.29 | 58.52/82.37/18.33 | 17.11/28.59/8.92 |
| DPoser (ours) ($S=10$) | 35.37/74.01/26.47 | 59.25/96.77/24.55 | 51.27/81.76/20.04 | 13.95/28.57/9.85 |

Table 4: Performance metrics (min/mean/std of MPJPE across multiple hypotheses) for pose completion under varying occlusion scenarios. $S$ denotes the number of hypotheses. †Task-specific baseline trained with partial poses as conditional input.
Figure 7: Qualitative evaluations of pose completion. Three hypotheses are drawn for each method. DPoser uniquely offers multiple plausible solutions for partially observed poses, a scenario where competitors often struggle due to limited generalization.

Given the inherent uncertainties within this task, we generate multiple solutions and evaluate them based on their minimum, mean, and standard deviation errors against the ground truth. As illustrated in Tab. 4, DPoser exhibits superior performance across different occlusion scenarios compared to existing pose priors and even the task-specific CVPoser, highlighting its effectiveness in pose completion. The qualitative evaluations are presented in Fig. 7. Here, we observe that DPoser can generate a multitude of plausible poses, a capability lacking in VPoser [37]. Pose-NDF [49], meanwhile, struggles with generalizing to unseen noisy poses and making plausible adjustments from its rest pose initialization.
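The min/mean/std summary over hypotheses can be sketched for a single test pose as below. The function name is hypothetical, the poses are assumed to be given as joint positions, and the use of the population standard deviation (NumPy's default) is an assumption, since the text does not specify the estimator.

```python
import numpy as np

def hypothesis_stats(hyps, gt, occluded):
    # hyps: (S, J, 3) completed poses; gt: (J, 3); occluded: boolean mask of shape (J,).
    per_joint = np.linalg.norm(hyps[:, occluded] - gt[occluded], axis=-1)   # (S, J_occ)
    errors = per_joint.mean(axis=1)                                         # MPJPE per hypothesis
    return errors.min(), errors.mean(), errors.std()
```

A low minimum with a sizeable standard deviation indicates that the prior proposes diverse completions, at least one of which lands near the ground truth.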

4.5 Motion Denoising

Though not initially designed for temporal tasks, DPoser shows remarkable proficiency in motion denoising. The task aims to estimate clean body poses from noisy 3D joint positions in motion capture sequences. Adhering to the setup outlined in HuMoR [40], we utilize 60-frame sequences from the AMASS [32] dataset and artificially introduce Gaussian noise with a standard deviation of 40 mm to the 3D joint positions. Moreover, we conduct experiments on HPS datasets [15] without additional training to validate the generalization.

As presented in Tab. 5, DPoser sets a new standard in motion denoising, outperforming even specialized motion priors like HuMoR [40]. To further confirm the robustness of DPoser, we conduct evaluations under varying conditions to gauge DPoser’s denoising capabilities. The results, detailed in Tab. 6, reveal that DPoser consistently achieves significant reductions in MPJPE, maintaining robust performance under extreme noise conditions.
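As a sanity check on the "before" errors in Tab. 6, the expected MPJPE of pure noise can be computed in closed form. Assuming the noise is added independently to each of the three coordinates with the stated standard deviation, the per-joint error follows a chi distribution with three degrees of freedom, whose mean is $2\sqrt{2/\pi}$ times the standard deviation.

```python
import math

def expected_noise_mpjpe(std_mm):
    # Mean norm of a 3D isotropic Gaussian vector with per-coordinate std `std_mm`.
    return std_mm * 2.0 * math.sqrt(2.0 / math.pi)

expected = {std: expected_noise_mpjpe(std) for std in (20.0, 40.0, 100.0)}
```

The resulting values, roughly 31.9, 63.8, and 159.6 mm, are close to the 31.93, 63.81, and 159.78 mm reported before denoising, which is consistent with the per-coordinate reading of the noise level.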

4.6 Ablation Study

| Methods | AMASS [32] | HPS [15] |
|---|---|---|
| No prior | 24.19 | 23.67 |
| VPoser [37] | 23.42 | 22.78 |
| Pose-NDF [49] | 22.13 | 21.60 |
| MVAE [28] | 26.80 | N/A |
| HuMoR [40] | 22.69 | N/A |
| DPoser (ours) | 19.87 | 20.54 |

Table 5: Performance metrics (MPJPE) for motion denoising under 40 mm noise.

| Noise std (mm) | AMASS [32] | HPS [15] |
|---|---|---|
| 20.00 | 31.93/13.64 | 31.93/13.45 |
| 40.00 | 63.81/19.87 | 63.81/20.54 |
| 100.00 | 159.78/33.18 | 159.78/35.32 |

Table 6: DPoser in motion denoising under varying noise scales. MPJPE is reported as before/after applying DPoser denoising.
| Timestep scheduling | HMR: PA-MPJPE ↓ | Pose completion: MPJPE ($S=10$) ↓ | Motion denoising: MPVPE ↓ | Motion denoising: MPJPE ↓ |
|---|---|---|---|---|
| Random | 58.84 | 86.23/121.57/23.16 | 43.33 | 23.87 |
| Fixed | 56.55 | 36.99/71.68/23.41 | 45.69 | 22.54 |
| Uniform | 59.28 | 42.72/75.70/21.84 | 39.72 | 20.80 |
| Truncated | 56.05 | 35.37/74.01/26.47 | 38.21 | 19.87 |

Table 7: Evaluation of timestep scheduling strategies on key pose-related tasks, highlighting the superior efficacy of the proposed truncated scheduling.

In our ablation study, we first examine the impact of truncated timestep scheduling on DPoser’s performance, contrasting our proposed strategy with three established alternatives: random, fixed, and uniform scheduling [34, 33, 6, 48]. As Tab. 7 shows, our strategy attains the lowest error on every task; for pose completion this holds for the minimum MPJPE over hypotheses, whereas fixed scheduling yields a slightly lower mean. Additionally, we study training aspects of DPoser, such as rotation representations and the integration of an auxiliary loss akin to HuMoR [40]. Using the same trained prior, we also compare DPoser with SOTA diffusion-based solvers [48, 7, 6] on pose completion, where it shows superior versatility and performance. Detailed findings and analyses from these ablation studies are presented in the Appendix.
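The relative improvements over uniform scheduling quoted in the abstract (5.4%, 17.2%, and 3.8%) can be reproduced from Tab. 7. The snippet below assumes they refer to PA-MPJPE for HMR, the minimum MPJPE for pose completion, and MPVPE for motion denoising, which is the reading that matches the reported numbers.

```python
def relative_improvement(baseline, ours):
    # Percentage reduction in error relative to the baseline.
    return 100.0 * (baseline - ours) / baseline

hmr = relative_improvement(59.28, 56.05)          # uniform vs. truncated, PA-MPJPE
completion = relative_improvement(42.72, 35.37)   # uniform vs. truncated, min MPJPE (S=10)
denoising = relative_improvement(39.72, 38.21)    # uniform vs. truncated, MPVPE
```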

5 Conclusion

We introduce DPoser, to the best of our knowledge the first unconditional diffusion-based pose prior, tailored for a wide range of pose-related tasks. Designed for flexibility, DPoser can be implemented as a straightforward L2-loss regularizer and enhanced by our truncated timestep scheduling for test-time optimization. Comprehensive experiments substantiate DPoser’s superior performance over existing state-of-the-art pose priors.

Limitation and future work. While our framework benefits from variational diffusion sampling [33], it also shares its limitations, such as the mode-seeking behavior. Future research could look into enhancing solution diversity via techniques like particle-based variational inference [30, 53]. Furthermore, within the broader context of inverse problems we have framed, a plethora of existing methods [45, 2, 5, 35] could be adapted to leverage our diffusion-based prior. Exploring these methods holds great potential for future progress.

Ethical Considerations. For a discussion on the potential negative impacts of our work, please refer to the Appendix.

References

  • [1] Bogo, F., Kanazawa, A., Lassner, C., Gehler, P., Romero, J., Black, M.J.: Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. In: ECCV (2016)
  • [2] Boys, B., Girolami, M., Pidstrigach, J., Reich, S., Mosca, A., Akyildiz, O.D.: Tweedie moment projected diffusions for inverse problems. arXiv preprint arXiv:2310.06721 (2023)
  • [3] Cho, H., Kim, J.: Generative approach for probabilistic human mesh recovery using diffusion models. In: ICCV (2023)
  • [4] Choi, J., Lee, J., Shin, C., Kim, S., Kim, H., Yoon, S.: Perception prioritized training of diffusion models. In: CVPR (2022)
  • [5] Chung, H., Kim, J., Kim, S., Ye, J.C.: Parallel diffusion models of operator and image for blind inverse problems. In: CVPR (2023)
  • [6] Chung, H., Kim, J., Mccann, M.T., Klasky, M.L., Ye, J.C.: Diffusion posterior sampling for general noisy inverse problems. arXiv preprint arXiv:2209.14687 (2022)
  • [7] Chung, H., Sim, B., Ryu, D., Ye, J.C.: Improving diffusion models for inverse problems using manifold constraints. NeurIPS (2022)
  • [8] Ci, H., Wu, M., Zhu, W., Ma, X., Dong, H., Zhong, F., Wang, Y.: Gfpose: Learning 3d human pose prior with gradient fields. In: CVPR (2023)
  • [9] Davydov, A., Remizova, A., Constantin, V., Honari, S., Salzmann, M., Fua, P.: Adversarial parametric pose prior. In: CVPR (2022)
  • [10] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. pp. 248–255. Ieee (2009)
  • [11] Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis. NeurIPS (2021)
  • [12] Georgakis, G., Li, R., Karanam, S., Chen, T., Košecká, J., Wu, Z.: Hierarchical kinematic human mesh recovery. In: ECCV (2020)
  • [13] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM (2020)
  • [14] Graikos, A., Malkin, N., Jojic, N., Samaras, D.: Diffusion models as plug-and-play priors. NeurIPS (2022)
  • [15] Guzov, V., Mir, A., Sattler, T., Pons-Moll, G.: Human poseitioning system (hps): 3d human pose estimation and self-localization in large scenes from body-mounted sensors. In: CVPR (2021)
  • [16] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. NeurIPS (2020)
  • [17] Holmquist, K., Wandt, B.: Diffpose: Multi-hypothesis human pose estimation using diffusion models. In: ICCV (2023)
  • [18] Jiang, Z., Zhou, Z., Li, L., Chai, W., Yang, C.Y., Hwang, J.N.: Back to optimization: Diffusion-based zero-shot 3d human pose estimation. arXiv preprint arXiv:2307.03833 (2023)
  • [19] Kanazawa, A., Black, M.J., Jacobs, D.W., Malik, J.: End-to-end recovery of human shape and pose. In: CVPR (2018)
  • [20] Karras, T., Aittala, M., Aila, T., Laine, S.: Elucidating the design space of diffusion-based generative models. NeurIPS (2022)
  • [21] Kawar, B., Elad, M., Ermon, S., Song, J.: Denoising diffusion restoration models. NeurIPS (2022)
  • [22] Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)
  • [23] Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
  • [24] Li, L., Zhuo, L., Zhang, B., Bo, L., Chen, C.: Diffhand: End-to-end hand mesh reconstruction via diffusion models. arXiv preprint arXiv:2305.13705 (2023)
  • [25] Li, Z., Liu, J., Zhang, Z., Xu, S., Yan, Y.: Cliff: Carrying location information in full frames into human pose and shape estimation. In: ECCV (2022)
  • [26] Lin, J., Zeng, A., Wang, H., Zhang, L., Li, Y.: One-stage 3d whole-body mesh recovery with component aware transformer. In: CVPR (2023)
  • [27] Lin, T.Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Zitnick, C.L., Dollár, P.: Microsoft coco: common objects in context (2014). arXiv preprint arXiv:1405.0312 (2019)
  • [28] Ling, H.Y., Zinno, F., Cheng, G., Van De Panne, M.: Character controllers using motion vaes. TOG (2020)
  • [29] Liu, Q., Lee, J., Jordan, M.: A kernelized stein discrepancy for goodness-of-fit tests. In: ICML (2016)
  • [30] Liu, Q., Wang, D.: Stein variational gradient descent: A general purpose bayesian inference algorithm. NeurIPS (2016)
  • [31] Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: Smpl: A skinned multi-person linear model. ACM Transactions on Graphics 34(6) (2015)
  • [32] Mahmood, N., Ghorbani, N., Troje, N.F., Pons-Moll, G., Black, M.J.: Amass: Archive of motion capture as surface shapes. In: ICCV (2019)
  • [33] Mardani, M., Song, J., Kautz, J., Vahdat, A.: A variational perspective on solving inverse problems with diffusion models. arXiv preprint arXiv:2305.04391 (2023)
  • [34] Müller, L., Ye, V., Pavlakos, G., Black, M., Kanazawa, A.: Generative proxemics: A prior for 3d social interaction from images. arXiv preprint arXiv:2306.09337 (2023)
  • [35] Murata, N., Saito, K., Lai, C.H., Takida, Y., Uesaka, T., Mitsufuji, Y., Ermon, S.: Gibbsddrm: A partially collapsed gibbs sampler for solving blind inverse problems with denoising diffusion restoration. arXiv preprint arXiv:2301.12686 (2023)
  • [36] Nachmani, E., Roman, R.S., Wolf, L.: Non gaussian denoising diffusion models. arXiv preprint arXiv:2106.07582 (2021)
  • [37] Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A., Tzionas, D., Black, M.J.: Expressive body capture: 3d hands, face, and body from a single image. In: CVPR (2019)
  • [38] Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988 (2022)
  • [39] Qiu, Z., Yang, Q., Wang, J., Wang, X., Xu, C., Fu, D., Yao, K., Han, J., Ding, E., Wang, J.: Learning structure-guided diffusion model for 2d human pose estimation. arXiv preprint arXiv:2306.17074 (2023)
  • [40] Rempe, D., Birdal, T., Hertzmann, A., Yang, J., Sridhar, S., Guibas, L.J.: Humor: 3d human motion model for robust pose estimation. In: ICCV (2021)
  • [41] Särkkä, S., Solin, A.: Applied stochastic differential equations, vol. 10. Cambridge University Press (2019)
  • [42] Shafir, Y., Tevet, G., Kapon, R., Bermano, A.H.: Human motion diffusion as a generative prior. arXiv preprint arXiv:2303.01418 (2023)
  • [43] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S.: Deep unsupervised learning using nonequilibrium thermodynamics. In: ICML (2015)
  • [44] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)
  • [45] Song, J., Vahdat, A., Mardani, M., Kautz, J.: Pseudoinverse-guided diffusion models for inverse problems. In: ICLR (2022)
  • [46] Song, Y., Durkan, C., Murray, I., Ermon, S.: Maximum likelihood training of score-based diffusion models. NeurIPS (2021)
  • [47] Song, Y., Ermon, S.: Generative modeling by estimating gradients of the data distribution. NeurIPS (2019)
  • [48] Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020)
  • [49] Tiwari, G., Antić, D., Lenssen, J.E., Sarafianos, N., Tung, T., Pons-Moll, G.: Pose-ndf: Modeling human pose manifolds with neural distance fields. In: ECCV (2022)
  • [50] Vincent, P.: A connection between score matching and denoising autoencoders. Neural computation (2011)
  • [51] Von Marcard, T., Henschel, R., Black, M.J., Rosenhahn, B., Pons-Moll, G.: Recovering accurate 3d human pose in the wild using imus and a moving camera. In: ECCV (2018)
  • [52] Wang, H., Du, X., Li, J., Yeh, R.A., Shakhnarovich, G.: Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: CVPR (2023)
  • [53] Wang, Z., Lu, C., Wang, Y., Bao, F., Li, C., Su, H., Zhu, J.: Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. arXiv preprint arXiv:2305.16213 (2023)
  • [54] Wu, J., Gao, X., Liu, X., Shen, Z., Zhao, C., Feng, H., Liu, J., Ding, E.: Hd-fusion: Detailed text-to-3d generation leveraging multiple noise estimation. arXiv preprint arXiv:2307.16183 (2023)
  • [55] Xu, Y., Zhang, J., Zhang, Q., Tao, D.: ViTPose: Simple vision transformer baselines for human pose estimation. In: Advances in Neural Information Processing Systems (2022)
  • [56] Zhao, M., Liu, M., Ren, B., Dai, S., Sebe, N.: Modiff: Action-conditioned 3d motion generation with denoising diffusion probabilistic models. arXiv preprint arXiv:2301.03949 (2023)
  • [57] Zhou, Y., Barnes, C., Lu, J., Yang, J., Li, H.: On the continuity of rotation representations in neural networks. In: CVPR (2019)
  • [58] Zhu, J., Zhuang, P.: Hifa: High-fidelity text-to-3d with advanced diffusion guidance. arXiv preprint arXiv:2305.18766 (2023)

Appendix for DPoser: Diffusion Model as Robust 3D Human Pose Prior

In this appendix, we first briefly recap the parameterization of diffusion models and their connection to score functions in Sec. A, followed by the perspective of Score Distillation Sampling (SDS) to understand our DPoser regularization in Sec. B. We detail the experimental setup and nuances in Sec. C and dissect various training aspects of DPoser in Sec. D. The exploration of extended optimization techniques is discussed in Sec. E, and considerations for truncated timestep scheduling in image domains are presented in Sec. F. Additional qualitative results are showcased in Sec. G. Lastly, potential negative impacts such as biases in data and ethical concerns in application are considered in Sec. H.

A Parameterization of Score-based Diffusion Models

In the seminal work by Song et al. [48], it is demonstrated that both score-based generative models [47] and diffusion probabilistic models [16] can be understood as discretized versions of stochastic differential equations (SDEs) defined by score functions. This unification allows the training objective to be interpreted either as learning a time-dependent denoiser or as learning a sequence of score functions that describe increasingly noisy versions of the data.

We begin by revisiting the training objective for score-based models [47] to elucidate the link with diffusion models [16]. Consider the transition kernel of the forward diffusion process $p_{0t}(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \alpha_t \mathbf{x}_0, \sigma_t^2 \mathbf{I})$. Our goal is to learn the score functions $\nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t)$ with a neural network $s_\theta(\mathbf{x}_t; t)$ by minimizing the following weighted L2 loss (in the later derivation, we omit the expectation operator for conciseness):

$$\mathbb{E}\left[ w(t) \left\| s_\theta(\mathbf{x}_t; t) - \nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t) \right\|_2^2 \right]. \tag{14}$$

Here, $\mathbf{x}_t = \alpha_t \mathbf{x}_0 + \sigma_t \epsilon$, where $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.

Based on denoising score matching [50], minimizing the objective in Eq. (14) is equivalent to minimizing the following tractable term:

$$\mathbb{E}\left[ w(t) \left\| s_\theta(\mathbf{x}_t; t) - \nabla_{\mathbf{x}_t} \log p_{0t}(\mathbf{x}_t \mid \mathbf{x}_0) \right\|_2^2 \right]. \tag{15}$$

To link this with the noise predictor $\epsilon_\theta(\mathbf{x}_t; t)$ in diffusion models, we can employ the reparameterization $s_\theta(\mathbf{x}_t; t) = -\frac{\epsilon_\theta(\mathbf{x}_t; t)}{\sigma_t}$. Then, Eq. (15) can be simplified as follows:

$$
\begin{aligned}
& w(t) \left\| -\frac{\epsilon_\theta(\mathbf{x}_t; t)}{\sigma_t} - \nabla_{\mathbf{x}_t} \log p_{0t}(\mathbf{x}_t \mid \mathbf{x}_0) \right\|_2^2 \\
={} & w(t) \left\| -\frac{\epsilon_\theta(\mathbf{x}_t; t)}{\sigma_t} + \frac{\mathbf{x}_t - \alpha_t \mathbf{x}_0}{\sigma_t^2} \right\|_2^2 \\
={} & w(t) \left\| -\frac{\epsilon_\theta(\mathbf{x}_t; t)}{\sigma_t} + \frac{\sigma_t \epsilon}{\sigma_t^2} \right\|_2^2 \\
={} & \frac{w(t)}{\sigma_t^2} \left\| \epsilon_\theta(\mathbf{x}_t; t) - \epsilon \right\|_2^2
\end{aligned}
\tag{16}
$$

The resulting form of Eq. (16) aligns precisely with the noise prediction form of diffusion models [16] (refer to Eq. (4) in the main text). This implies that by training $\epsilon_\theta(\mathbf{x}_t; t)$ in a diffusion model context, we simultaneously get a handle on the score function, approximated as $\nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t) \approx -\frac{\epsilon_\theta(\mathbf{x}_t; t)}{\sigma_t}$.
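The equivalence between the score-matching and noise-prediction forms can be checked numerically. The following is a minimal sketch, not the authors' code: the variance-preserving choice $\alpha_t = \sqrt{1 - \sigma_t^2}$, the weight $w(t) = 1$, the toy vector `x0`, and the arbitrary "prediction" `eps_pred` are all assumptions made for illustration.

```python
import math
import random


def forward_diffuse(x0, alpha_t, sigma_t, rng):
    """Sample x_t = alpha_t * x0 + sigma_t * eps with eps ~ N(0, I)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [alpha_t * a + sigma_t * e for a, e in zip(x0, eps)]
    return x_t, eps


def conditional_score(x_t, x0, alpha_t, sigma_t):
    """Score of the Gaussian kernel p_0t(x_t | x0): -(x_t - alpha_t * x0) / sigma_t^2."""
    return [-(xt - alpha_t * a) / sigma_t ** 2 for xt, a in zip(x_t, x0)]


def sq_norm(v):
    """Squared Euclidean norm of a vector."""
    return sum(c * c for c in v)


rng = random.Random(0)
sigma_t = 0.3
alpha_t = math.sqrt(1.0 - sigma_t ** 2)  # assumption: variance-preserving schedule
x0 = [0.2, -1.0, 0.5]                    # toy "pose" vector, illustration only

x_t, eps = forward_diffuse(x0, alpha_t, sigma_t, rng)
target_score = conditional_score(x_t, x0, alpha_t, sigma_t)  # equals -eps / sigma_t

eps_pred = [0.1, -0.4, 0.9]                      # arbitrary stand-in for eps_theta(x_t; t)
score_pred = [-e / sigma_t for e in eps_pred]    # reparameterization s_theta = -eps_theta / sigma_t

w_t = 1.0  # assumption: unit weighting
score_loss = w_t * sq_norm([s - g for s, g in zip(score_pred, target_score)])            # Eq. (15) integrand
noise_loss = w_t / sigma_t ** 2 * sq_norm([p - e for p, e in zip(eps_pred, eps)])        # Eq. (16)
```

Up to floating-point error, `score_loss` and `noise_loss` agree for any choice of `eps_pred`, which is the content of the derivation above.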

B View DPoser as Score Distillation Sampling

| Strategy | HMR: PA-MPJPE ↓ | Pose Completion: MPJPE ($S=10$) ↓ | Motion Denoising: MPVPE ↓ | Motion Denoising: MPJPE ↓ |
|---|---|---|---|---|
| 1 step | 56.05 | 35.37/74.01/26.47 | 38.21 | 19.87 |
| 5 steps | 56.16 | 36.59/80.82/31.22 | 40.22 | 21.21 |
| 10 steps | 56.18 | 36.78/82.59/32.32 | 40.69 | 21.34 |

Table S-1: Efficacy of different denoising steps in DPoser’s optimization.

Interestingly, the gradient of DPoser (Eq. (10) in the main text) coincides with Score Distillation Sampling (SDS) [38, 52], which can be interpreted as aiming to minimize the following KL divergence:

$$\mathrm{KL}\big( p_{0t}(\mathbf{x}_t \mid \mathbf{x}_0) \,\|\, p_t^{\mathtt{SDE}}(\mathbf{x}_t; \theta) \big), \tag{17}$$

where $p_t^{\mathtt{SDE}}(\mathbf{x}_t; \theta)$ denotes the marginal distribution whose score function is estimated by $\epsilon_\theta(\mathbf{x}_t; t)$. For the specific case where $t \to 0$, this term encourages the Dirac distribution $\delta(\mathbf{x}_0)$ (i.e., the optimized variable) to gravitate toward the learned data distribution $p_0^{\mathtt{SDE}}(\mathbf{x}_0; \theta)$, while the Gaussian perturbation in Eq. (17) softens the constraint. Building on this understanding, we can borrow advanced techniques from SDS [38, 52], a rapidly evolving area ripe for methodological innovations [53, 54, 58]. To explore this, we experiment with a multi-step denoising strategy adapted from HiFA [58], substituting it for our original one-step denoising process. This alternative, however, yields worse results across all evaluation metrics, as shown in Tab. S-1. A plausible explanation is that our proposed truncated timestep scheduling already operates at low noise levels (i.e., small $t$), which removes the need for additional denoising steps. In addition, iterative denoising within each optimization step may accumulate errors, leading to inaccurate gradients.
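The "gravitate toward the learned data distribution" behaviour can be illustrated on a toy problem. The sketch below is not DPoser itself: it assumes a 1D Gaussian data distribution $\mathcal{N}(\mu, s^2)$ so that the ideal noise predictor is available in closed form, uses the SDS gradient in the form given in DreamFusion [38], $w(t)\,(\epsilon_\theta(\mathbf{x}_t; t) - \epsilon)\,\partial \mathbf{x}_t / \partial \mathbf{x}_0$, with $w(t) = 1$, and restricts $\sigma_t$ to a small range as a loose analogue of low-noise scheduling. All names and constants are hypothetical.

```python
import math
import random

MU, S = 2.0, 0.5  # hypothetical 1D data distribution N(MU, S^2)


def eps_star(x_t, alpha_t, sigma_t):
    """Exact noise predictor for Gaussian data: eps = -sigma_t * score of p_t."""
    var_t = alpha_t ** 2 * S ** 2 + sigma_t ** 2  # p_t = N(alpha_t * MU, var_t)
    score = -(x_t - alpha_t * MU) / var_t
    return -sigma_t * score


def sds_gradient(x0, sigma_t, rng, batch=32):
    """Monte Carlo estimate of (eps_star(x_t) - eps) * d x_t / d x0, with w(t) = 1."""
    alpha_t = math.sqrt(1.0 - sigma_t ** 2)  # assumption: variance-preserving schedule
    total = 0.0
    for _ in range(batch):
        eps = rng.gauss(0.0, 1.0)
        x_t = alpha_t * x0 + sigma_t * eps
        total += (eps_star(x_t, alpha_t, sigma_t) - eps) * alpha_t
    return total / batch


rng = random.Random(0)
x0 = -1.0  # optimized variable, started far from the data mean
for _ in range(500):
    sigma_t = rng.uniform(0.1, 0.5)  # low noise levels only
    x0 -= 0.05 * sds_gradient(x0, sigma_t, rng)
```

In this toy setting the expected gradient is proportional to `x0 - MU`, so plain gradient descent pulls `x0` toward the mode of the data distribution; the injected noise only adds variance to the update.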

C Experimental Details

This section elaborates on the specifics of our pose completion and motion denoising experiments.

C.1 Pose Completion

For partial observations $\mathbf{y}$, the measurement operator $\mathcal{A}$ is modeled as a mask matrix $M \in \mathbb{R}^{d \times n}$. Based on our optimization framework (Algorithm 1 in the main text), we define the task-specific loss, $L_{\text{comp}}$, as follows:

$$L_{\text{comp}} = \left\| M \mathbf{x}_0 - \mathbf{y} \right\|_2^2. \tag{18}$$

Here, $\mathbf{x}_0$ denotes the complete body pose $\theta$ that we try to recover, where the unseen parts are initialized as random noise. In the following ablation studies, unless otherwise specified, the evaluation is performed using 10 hypotheses on the AMASS [32] dataset with left leg occlusion.
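The completion loss in Eq. (18) can be sketched in a few lines. This is an illustrative toy rather than the paper's implementation: the 6-dimensional vector, the choice of visible indices, and the helper names are hypothetical, and copying the observed entries into the initial `x0` is an assumption (the text only states that unseen parts start from random noise).

```python
import random


def selection_matrix(visible_idx, n):
    """Build a d x n mask matrix M with one row per visible entry."""
    return [[1.0 if j == i else 0.0 for j in range(n)] for i in visible_idx]


def apply_mask(M, x):
    """Matrix-vector product M @ x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def completion_loss(M, x0, y):
    """L_comp = ||M x0 - y||_2^2, as in Eq. (18)."""
    return sum((a - b) ** 2 for a, b in zip(apply_mask(M, x0), y))


rng = random.Random(0)
n = 6
x_true = [0.1, -0.3, 0.7, 0.0, 0.4, -0.2]  # toy full pose vector
visible = [0, 1, 4]                        # hypothetical visible entries (d = 3)

M = selection_matrix(visible, n)
y = apply_mask(M, x_true)                  # partial observation

# Initialization: unseen entries are random noise; visible entries are copied
# from y (the copy is an assumption of this sketch).
x0 = [rng.gauss(0.0, 1.0) for _ in range(n)]
for row, i in enumerate(visible):
    x0[i] = y[row]

loss_at_init = completion_loss(M, x0, y)
loss_at_zero = completion_loss(M, [0.0] * n, y)
```

Because the loss only sees the visible entries, it is zero at this initialization regardless of the occluded values; the occluded part is constrained only by the pose prior.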

C.2 Motion Denoising (Noisy Input)

| Methods | AMASS [32], 20mm | AMASS [32], 100mm | HPS [15], 20mm | HPS [15], 100mm |
|---|---|---|---|---|
| No prior | 15.33 | 51.48 | 16.26 | 50.87 |
| VPoser [37] | 15.20 | 49.10 | 17.24 | 46.69 |
| Pose-NDF [49] | 13.84 | 46.10 | 15.62 | 47.50 |
| DPoser (ours) | 13.64 | 33.18 | 13.45 | 35.32 |

Table S-2: Performance comparison of motion denoising under varying noise scales. MPJPE is reported after denoising.

Adhering to Pose-NDF settings [49], we aim to refine noisy joint positions $J_{\text{obs}}^t$ over $N$ frames to obtain clean poses $\theta^t$, initialized from mean poses in SMPL with small noise. We formulate the task-specific loss by combining an observation fidelity term $L_{\text{obs}}$ and a temporal consistency term $L_{\text{temp}}$:

$$L_{\text{obs}} = \sum_{t=0}^{N-1} \left\| M_J(\theta^t, \beta_0) - J_{\text{obs}}^t \right\|_2^2, \tag{19}$$

$$L_{\text{temp}} = \sum_{t=1}^{N-1} \left\| M_J(\theta^{t-1}, \beta_0) - M_J(\theta^t, \beta_0) \right\|_2^2, \tag{20}$$

where $M_J$ denotes the 3D joint positions regressed from SMPL [31] and $\beta_0$ denotes the constant mean shape parameters.
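The two terms in Eqs. (19) and (20) can be sketched as follows. This is a simplified illustration: `joints_from_pose` is a hypothetical stand-in for the SMPL joint regressor $M_J$ that simply treats each "pose" as a list of 3D joint positions, so the sketch shows only how the sums are formed, not how SMPL is evaluated.

```python
def joints_from_pose(theta, beta0=None):
    """Hypothetical stand-in for M_J(theta, beta0): returns theta as 3D joints."""
    return theta


def sq_dist(joints_a, joints_b):
    """Squared L2 distance between two sets of 3D joints."""
    return sum((a - b) ** 2
               for ja, jb in zip(joints_a, joints_b)
               for a, b in zip(ja, jb))


def obs_loss(thetas, j_obs, beta0=None):
    """L_obs: sum over frames of ||M_J(theta^t, beta0) - J_obs^t||^2 (Eq. (19))."""
    return sum(sq_dist(joints_from_pose(th, beta0), jo)
               for th, jo in zip(thetas, j_obs))


def temp_loss(thetas, beta0=None):
    """L_temp: sum over consecutive frames of squared joint displacement (Eq. (20))."""
    joints = [joints_from_pose(th, beta0) for th in thetas]
    return sum(sq_dist(joints[t - 1], joints[t]) for t in range(1, len(joints)))


# Toy sequences with N = 3 frames and 2 joints each.
static = [[[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]] for _ in range(3)]
moving = [[[float(t), 0.0, 0.0], [1.0 + t, 0.0, 0.0]] for t in range(3)]

# Noisy observation: one coordinate of one joint perturbed in the first frame.
noisy_obs = [[list(j) for j in frame] for frame in static]
noisy_obs[0][0][0] += 0.1

l_obs = obs_loss(static, noisy_obs)
l_temp_static = temp_loss(static)
l_temp_moving = temp_loss(moving)
```

The temporal term vanishes for a motionless sequence and grows with frame-to-frame joint displacement, which is what lets it smooth out per-frame noise while the observation term keeps the result close to the measurements.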

In complement to the comparative analysis presented in Table 4 of our main text, we extend our evaluation to scenarios with varying noise levels. The results, detailed in Tab. S-2, show that DPoser outperforms the compared pose priors at both noise scales, with the largest margins under high noise (100mm), demonstrating its resilience to noise.

C.3 Motion Denoising (Partial Input)

This task focuses on reconstructing clean poses, $\theta^t$, from partially observed joint positions, $J_{\text{obs}}^t$, across $N$ frames, employing a known mask matrix to identify visible joints. The optimization objective mirrors that of motion denoising (Sec. C.2), but incorporates a mask in Eq. (19) to specifically target visible parts, ensuring that only these segments guide the recovery process.
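One way to write the masked observation term is given below. The notation is introduced here for illustration and is not taken from the paper: $m \in \{0, 1\}^{J}$ is assumed to be a binary visibility indicator over the $J$ joints (1 for visible, 0 for occluded), broadcast over the three coordinates of each joint, and $\odot$ denotes element-wise multiplication.

```latex
L_{\text{obs}}^{\text{mask}}
  = \sum_{t=0}^{N-1}
    \left\| m \odot \left( M_J(\theta^t, \beta_0) - J_{\text{obs}}^t \right) \right\|_2^2
```

With $m$ set to all ones this reduces to Eq. (19); occluded joints contribute nothing to the data term and are shaped only by the temporal term and the pose prior.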

We conducted experiments on the AMASS dataset [32] to assess our model’s performance on this task with two types of occlusions: legs and left arm. The quantitative results of these experiments are detailed in Tab. S-3, and the accompanying visualizations are provided in Sec. G.

In leg occlusion scenarios, the AMASS dataset primarily showcases straight poses, offering minimal diversity. This scenario permits decent outcomes without incorporating a pose prior, since the optimization’s starting point closely aligns with these prevalent poses. However, VPoser’s mean-centered characteristic hinders its ability to faithfully replicate the visible areas. On the other hand, Pose-NDF falls short in enhancing the occluded parts. DPoser accurately handles visible parts and guides occluded ones for more realistic poses. For left arm occlusions, which involve more varied movements, DPoser markedly surpasses other methods, underlining its adaptability and precision in handling diverse motion patterns.

| Methods | Occlusion | MPJPE (Vis.) | MPJPE (Occ.) | MPJPE (All) | MPVPE (All) |
|---|---|---|---|---|---|
| No prior | Legs | 0.26 | 14.72 | 5.52 | 5.45 |
| VPoser | Legs | 1.75 | 14.29 | 6.31 | 7.38 |
| Pose-NDF | Legs | 0.25 | 15.71 | 5.87 | 5.64 |
| DPoser (ours) | Legs | 0.28 | 12.24 | 4.63 | 3.65 |
| No prior | Left Arm | 0.26 | 24.87 | 4.74 | 9.91 |
| VPoser | Left Arm | 1.21 | 13.23 | 3.40 | 7.68 |
| Pose-NDF | Left Arm | 0.25 | 17.70 | 3.42 | 7.86 |
| DPoser (ours) | Left Arm | 0.27 | 7.80 | 1.64 | 3.81 |

Table S-3: Comparative analysis of methods for motion denoising with different occlusions (Legs and Left Arm) on the AMASS dataset. Errors (in cm) are evaluated in terms of MPJPE across visible (Vis.), occluded (Occ.), and all joints, along with MPVPE for all vertices.
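The split of MPJPE into visible, occluded, and all joints can be sketched as below. This is a generic illustration of the metric (mean Euclidean distance per joint) under assumed toy data; it omits any alignment step and is not claimed to match the exact evaluation protocol behind Tab. S-3. The joint indices and coordinates are hypothetical.

```python
import math


def mpjpe(pred, gt, joint_ids=None):
    """Mean per-joint position error over the selected joints (all joints by default)."""
    ids = list(range(len(gt))) if joint_ids is None else list(joint_ids)
    return sum(math.dist(pred[j], gt[j]) for j in ids) / len(ids)


# Toy skeleton with 4 joints, coordinates in cm.
gt = [[0.0, 0.0, 0.0], [0.0, 10.0, 0.0], [5.0, 0.0, 0.0], [5.0, -10.0, 0.0]]
pred = [[0.0, 0.0, 0.0], [0.0, 10.0, 0.0], [8.0, 4.0, 0.0], [5.0, -10.0, 0.0]]

visible, occluded = [0, 1], [2, 3]  # hypothetical visibility split

err_vis = mpjpe(pred, gt, visible)    # visible joints are reproduced exactly here
err_occ = mpjpe(pred, gt, occluded)   # one occluded joint is off by 5 cm
err_all = mpjpe(pred, gt)             # average over all four joints
```

Reporting the three numbers separately distinguishes a prior that distorts the observed joints (high visible error) from one that fails to infer the hidden ones (high occluded error).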

D Ablated DPoser’s Training

| Normalization | HMR: PA-MPJPE ↓ | Pose Completion: MPJPE ($S=10$) ↓ | Motion Denoising: MPVPE ↓ | Motion Denoising: MPJPE ↓ |
|---|---|---|---|---|
| w/o norm | 57.88 | 45.37/102.28/41.08 | 44.82 | 24.04 |
| min-max | 59.17 | 47.41/107.00/43.42 | 42.70 | 21.29 |
| z-score | 56.49 | 34.37/72.47/26.32 | 38.57 | 20.24 |

Table S-4: Evaluation of DPoser’s performance under different normalization methods, specifically for the axis-angle rotation representation.

| Representation | HMR: PA-MPJPE ↓ | Pose Completion: MPJPE ($S=10$) ↓ | Motion Denoising: MPVPE ↓ | Motion Denoising: MPJPE ↓ |
|---|---|---|---|---|
| axis-angle | 56.05 | 34.76/72.41/26.09 | 38.21 | 19.87 |
| 6D rotations | 57.54 | 40.89/81.43/27.31 | 38.44 | 20.12 |

Table S-5: Comparative performance of rotation representations under z-score normalization across multiple tasks and metrics.

This section dissects the impact of different rotation representations and normalization techniques on DPoser’s performance. Initially, we examine axis-angle representation, comparing various normalization strategies: min-max scaling, z-score normalization, and no normalization. Our findings, summarized in Tab. S-4, indicate that z-score normalization is generally the most effective. Subsequently, using this optimal normalization, we explore 6D rotations [57] as an alternative. As evidenced by Tab. S-5, axis-angle representation offers superior performance. This preference can be attributed to the effective modeling capabilities of diffusion models, along with the inherent advantages of axis-angle in capturing bounded joint rotations for regression tasks like human mesh recovery.
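The two normalization schemes compared in Tab. S-4 can be sketched as follows. This is a generic per-dimension implementation under stated assumptions: statistics are computed over a toy set of pose vectors, min-max scaling maps to $[-1, 1]$ (the target range used in the paper is not given in this appendix), and constant dimensions are left unscaled to avoid division by zero. The helper names are hypothetical.

```python
import statistics


def fit_zscore(poses):
    """Per-dimension mean and (population) standard deviation."""
    columns = list(zip(*poses))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) or 1.0 for c in columns]  # guard constant dimensions
    return means, stds


def zscore(pose, means, stds):
    """Normalize one pose vector to zero mean and unit variance per dimension."""
    return [(p - m) / s for p, m, s in zip(pose, means, stds)]


def inverse_zscore(z, means, stds):
    """Map a normalized vector back to the original parameter space."""
    return [v * s + m for v, m, s in zip(z, means, stds)]


def fit_minmax(poses):
    """Per-dimension minimum and maximum."""
    columns = list(zip(*poses))
    return [min(c) for c in columns], [max(c) for c in columns]


def minmax(pose, lows, highs):
    """Scale each dimension to [-1, 1] (assumed target range)."""
    return [2.0 * (p - lo) / ((hi - lo) or 1.0) - 1.0
            for p, lo, hi in zip(pose, lows, highs)]


# Toy "dataset" of 3-dimensional axis-angle-like vectors; the last dimension is constant.
poses = [[0.1, -1.2, 0.4], [0.3, -0.8, 0.4], [-0.2, -1.0, 0.4], [0.6, -0.6, 0.4]]

means, stds = fit_zscore(poses)
normalized = [zscore(p, means, stds) for p in poses]
restored = [inverse_zscore(z, means, stds) for z in normalized]

lows, highs = fit_minmax(poses)
scaled = [minmax(p, lows, highs) for p in poses]
```

In such a setup the diffusion model would be trained on the normalized vectors, and samples or optimized poses would be mapped back with `inverse_zscore` before being passed to the body model.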

Inspired by HuMoR [40], we experiment with integrating the SMPL body model [31] as a regularization term during training. Alongside the prediction of additive noise, as outlined in Equation (4) in the main text, we employ a 10-step DDIM sampler [44] to recover a “clean” version of the pose, denoted as $\tilde{\mathbf{x}}_0$, from the diffused $\mathbf{x}_t$. The regularization loss aims to minimize the discrepancy between the original and recovered poses under the SMPL body model $M$:

$$L_{\mathrm{reg}} = \left\| M_J(\tilde{\mathbf{x}}_0, \beta_0) - M_J(\mathbf{x}_0, \beta_0) \right\|_2^2 + \left\| M_V(\tilde{\mathbf{x}}_0, \beta_0) - M_V(\mathbf{x}_0, \beta_0) \right\|_2^2. \tag{21}$$

Here, $\beta_0$ represents the mean shape parameters in SMPL. To account for denoising errors, we scale the regularization loss by $\log\left(1 + \frac{\alpha_t}{\sigma_t}\right)$, thereby increasing the weight for samples with smaller $t$ values (less noise).
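The effect of the weighting $\log(1 + \alpha_t / \sigma_t)$ can be seen by evaluating it along a noise schedule. The sketch below assumes a simple variance-preserving schedule with $\sigma_t = t$ and $\alpha_t = \sqrt{1 - t^2}$ for $t \in (0, 1)$; this schedule is hypothetical and chosen only for illustration, not the one used to train DPoser.

```python
import math


def vp_coefficients(t):
    """Hypothetical variance-preserving schedule: sigma_t = t, alpha_t = sqrt(1 - t^2)."""
    sigma_t = t
    alpha_t = math.sqrt(1.0 - t ** 2)
    return alpha_t, sigma_t


def reg_weight(alpha_t, sigma_t):
    """Weight applied to the regularization loss: log(1 + alpha_t / sigma_t)."""
    return math.log(1.0 + alpha_t / sigma_t)


timesteps = [0.1, 0.3, 0.5, 0.7, 0.9]
weights = [reg_weight(*vp_coefficients(t)) for t in timesteps]
```

Since $\alpha_t / \sigma_t$ is a signal-to-noise ratio that shrinks as $t$ grows, the weights decrease monotonically with $t$: heavily noised samples, whose recovered $\tilde{\mathbf{x}}_0$ is least reliable, contribute least to the regularizer.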

Fig. S-1 visualizes the impact of this regularization on MPJPE during the training, specifically for pose completion tasks with occlusion of both legs.

Figure S-1: MPJPE evolution in DPoser training for pose completion, assessed on AMASS [32] with 10 hypotheses under legs occlusion scenarios.

We observe that weighted regularization offers slight performance gains in the early training process, while the absence of weighting introduces instability and deterioration in results. Despite these insights, the computational cost of incorporating the SMPL model—especially for our large batch size of 1280—makes the training approximately 8 times slower. Therefore, we opted not to include this regularization in our main experiments.

E Extended DPoser’s Optimization

| Methods | Occ. left leg | Occ. legs | Occ. arms | Occ. trunk |
|---|---|---|---|---|
| ScoreSDE [48] | 48.73/106.32/41.30 | 74.68/128.32/37.27 | 66.89/127.86/48.15 | 16.69/34.54/12.21 |
| DPS [6] | 40.51/104.32/54.57 | 64.26/113.46/33.71 | 60.63/119.85/42.78 | 15.10/33.90/13.27 |
| MCG [7] | 49.04/106.37/41.07 | 74.90/128.53/37.40 | 66.17/127.72/48.15 | 16.69/34.66/12.23 |
| DPoser (ours) | 35.37/74.01/26.47 | 59.25/96.77/24.55 | 51.27/81.76/20.04 | 13.95/28.57/9.85 |

Table S-6: Comparative evaluation of diffusion-based solvers for pose completion on the AMASS dataset [32] (hypotheses number $S=10$).

In addressing pose-centric tasks as inverse problems, we propose a versatile optimization framework built on variational diffusion sampling [33]. We also explore other diffusion-based methods for solving inverse problems, namely ScoreSDE [48], MCG [7], and DPS [6]. These methods augment the standard generative process with observational data, either through gradient-based guidance or through back-projection. We compare these methods with DPoser on pose completion. The results in Tab. S-6 show that DPoser outperforms the competitors under most occlusion conditions. DPoser is therefore not only broadly applicable to pose-related tasks but also more effective than these solvers in this setting.

It is worth mentioning that methods rooted in generative frameworks [48, 7, 6, 21] can be difficult to apply more broadly to pose-centric tasks. For instance, in blind inverse problems, where certain parameters of $\mathcal{A}$ (e.g., camera models in HMR) are unknown, generative methods are less straightforward to implement. ZeDO [18], a recent study focusing on the 2D-3D lifting task, adopts the ScoreSDE [48] framework and refines camera translations by solving an optimization sub-problem after each generative step. However, directly porting this strategy to HMR is non-trivial, owing to the added complexity of body shape parameter optimization, which our DPoser model does not currently cover. Although some state-of-the-art techniques [5, 35] offer solutions by jointly modeling the operator $\mathcal{A}$ and the data distribution, a full discussion of this subject is beyond the scope of this paper and remains an open question for future work.

F Truncated Timestep Scheduling on Images

Figure S-2: Image inpainting using standard (a) and truncated (b) timestep scheduling. The process evolution is shown over iterations with the middle row depicting the log-magnitude spectrum and the bottom row the phase spectrum.

Exploring truncated timestep scheduling for image-based tasks, we find that its suitability for human poses does not translate well to images. In image domains, the initial timesteps of the sampling process (i.e., large $t$, high noise levels) are critical for generating foundational perceptual content.

In our study, we employed a 256x256 unconditional diffusion model [11] trained on ImageNet [10] with variational diffusion sampling [33] for image inpainting. Comparing standard (timesteps 990 to 0) and truncated scheduling (timesteps 495 to 0), both with 100 steps, the experiments confirmed that truncation compromises image quality (Fig. S-2). The standard approach preserved perceptual content, while truncation produced disjointed patches, misaligned with the original image context.

These results affirm that truncated timestep scheduling excels in pose data where key information emerges in later stages but falls short in image tasks where early timesteps are essential. This scheduling is thus bespoke to the characteristics of human pose estimation and is unsuitable for image processes that rely on the full diffusion timeline for content fidelity.

G More Qualitative Results

We show more qualitative results for pose generation (Fig. S-3), pose completion (Fig. S-4), human mesh recovery (Fig. S-5) and motion denoising (Fig. S-6, Fig. S-7).

H Potential Negative Impacts

  • Bias and Fairness Concerns: Human pose prior learning models may inadvertently encode biases present in the training data, leading to biased predictions or discriminatory outcomes. This can perpetuate existing societal biases and inequalities, particularly if the training data is not representative or balanced across diverse demographics.

  • Ethical Considerations: The use of human pose prior learning models in applications such as surveillance, security, or healthcare raises ethical concerns regarding individual privacy, autonomy, and consent. There are debates about the appropriate use of such technologies and the potential for unintended consequences or misuse.

  • Dependency on Data Quality: Human pose prior learning models heavily rely on the quality and diversity of the training data. Poorly annotated or biased datasets can negatively impact the performance and reliability of these models, leading to inaccurate or unreliable predictions.

Figure S-3: Pose generation. DPoser can generate diverse and realistic poses.
Figure S-4: Pose completion. (a) Left leg under occlusion. (b) Trunk under occlusion.
Figure S-5: Human mesh recovery. (a) Initialization using mean poses and default camera. *Ground truth for the EHF dataset is annotated in SMPL-X [37], which extends SMPL [31] with fully articulated hands and an expressive face. (b) Initialization using the CLIFF [25] prediction.
Figure S-6: Motion denoising with noisy observations. (a) Gaussian noise with 40mm standard deviation. (b) Gaussian noise with 100mm standard deviation. We visualize every 20th frame of the sequence.
Figure S-7: Motion denoising with partial observations. (a) Legs under occlusion. (b) Left arm under occlusion. We visualize every 20th frame of the sequence.